Alink Tutorial (Java Edition)

Section 27.4 Speaker Recognition

The original dataset records the speaker of each audio clip. By treating the speaker as the classification label of the audio data, the speaker recognition problem becomes the familiar multi-class classification problem.

27.4.1 Softmax Model


Following the same approach as in the earlier speech emotion recognition task, we first try a simple Softmax model to quickly obtain experimental results, which serve as a baseline for the deeper study that follows.

The MFCC features are in tensor (Tensor) format with shape (num_window, num_mfcc, num_channel), whereas the Softmax classifier requires vector-format input features. For converting the MFCC tensor into vector features, we choose the following two approaches (a sketch of what these conversions compute follows the list):

  • Use the mean of the MFCC tensor as the feature. The resulting vector has dimension num_mfcc * num_channel.
  • Generate extended features.
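
To make the two options concrete, the following sketch (plain Java, not Alink code; the method name and array layout are assumptions for illustration) shows what such a conversion computes: the tensor is collapsed along its num_window axis, so every (mfcc, channel) position is reduced to a single statistic. In the pipelines below this conversion is performed by Alink's TensorToVector component.

// Illustration only (not Alink code): collapse an MFCC tensor of shape
// (numWindow, numMfcc, numChannel) along the window axis into a vector of
// length numMfcc * numChannel, using MEAN, MIN or MAX as the statistic.
static float[] poolOverWindows(float[][][] mfcc, String method) {
    int numWindow = mfcc.length;
    int numMfcc = mfcc[0].length;
    int numChannel = mfcc[0][0].length;
    float[] result = new float[numMfcc * numChannel];
    for (int m = 0; m < numMfcc; m++) {
        for (int c = 0; c < numChannel; c++) {
            float agg = "MEAN".equals(method) ? 0f : mfcc[0][m][c];
            for (int w = 0; w < numWindow; w++) {
                float v = mfcc[w][m][c];
                if ("MEAN".equals(method)) {
                    agg += v / numWindow;
                } else if ("MIN".equals(method)) {
                    agg = Math.min(agg, v);
                } else {
                    agg = Math.max(agg, v);
                }
            }
            result[m * numChannel + c] = agg;
        }
    }
    return result;
}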



27.4.1.1 Using the MFCC Tensor Mean as the Feature


In this section we configure the tensor-to-vector conversion by calling the setConvertMethod method of the TensorToVector component and setting it to "MEAN". The code is as follows:

new Pipeline()
    .add(
        new ExtractMfccFeature()
            .setSelectedCol("audio_data")
            .setSampleRate(AUDIO_SAMPLE_RATE)
            .setOutputCol("mfcc")
            .setReservedCols("speaker")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MEAN)
            .setOutputCol("mfcc")
    )
    .add(
        new Softmax()
            .setVectorCol("mfcc")
            .setLabelCol("speaker")
            .setPredictionCol("pred")
    )
    .fit(train_set)
    .transform(test_set)
    .link(
        new EvalMultiClassBatchOp()
            .setLabelCol("speaker")
            .setPredictionCol("pred")
            .lazyPrintMetrics()
    );
BatchOperator.execute();
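
The snippet above assumes that train_set and test_set (BatchOperator instances holding the audio_data and speaker columns) were prepared earlier in this chapter. As a reminder, a random split of that kind can be produced with Alink's SplitBatchOp, roughly as in the following sketch; the allData variable and the 0.8 fraction are assumptions for illustration.

// Sketch only: split a labeled BatchOperator (assumed here to be called allData)
// into train and test sets. The main output of SplitBatchOp is the sampled
// fraction; the remaining rows are available as its first side output.
SplitBatchOp splitter = new SplitBatchOp().setFraction(0.8);
splitter.linkFrom(allData);
BatchOperator<?> train_set = splitter;
BatchOperator<?> test_set = splitter.getSideOutput(0);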

The model evaluation results are shown below; the Accuracy is 0.9333.

-------------------------------- Metrics: --------------------------------
Accuracy:0.9333	Macro F1:0.9318	Micro F1:0.9333	Kappa:0.9095	
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         27|      0|        1|           0|
|     wangzhe|          1|     39|        0|           0|
|   liuchanhg|          2|      1|       21|           0|
|ZhaoZuoxiang|          0|      3|        0|          25|
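
As a quick sanity check, the Accuracy can be read off the confusion matrix: the diagonal counts the correctly classified samples, and the test set contains 120 samples in total.

Accuracy = (27 + 39 + 21 + 25) / 120 = 112 / 120 ≈ 0.9333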

27.4.1.2 Extended Feature Generation


Extending the feature construction of the previous section, in addition to the "MEAN" vector we also compute the "MIN" vector and the "MAX" vector, and then concatenate these three vectors into a single feature vector (three times the dimension of the "MEAN" vector alone). The code is as follows:

new Pipeline()
    .add(
        new ExtractMfccFeature()
            .setSelectedCol("audio_data")
            .setSampleRate(AUDIO_SAMPLE_RATE)
            .setOutputCol("mfcc")
            .setReservedCols("speaker")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MEAN)
            .setOutputCol("mfcc_mean")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MIN)
            .setOutputCol("mfcc_min")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MAX)
            .setOutputCol("mfcc_max")
    )
    .add(
        new VectorAssembler()
            .setSelectedCols("mfcc_mean", "mfcc_min", "mfcc_max")
            .setOutputCol("mfcc")
    )
    .add(
        new Softmax()
            .setVectorCol("mfcc")
            .setLabelCol("speaker")
            .setPredictionCol("pred")
    )
    .fit(train_set)
    .transform(test_set)
    .link(
        new EvalMultiClassBatchOp()
            .setLabelCol("speaker")
            .setPredictionCol("pred")
            .lazyPrintMetrics()
    );
BatchOperator.execute();

The model evaluation results are shown below. Compared with using only the "MEAN" vector as the feature, the Accuracy improves to 0.9667.

-------------------------------- Metrics: --------------------------------
Accuracy:0.9667	Macro F1:0.963	Micro F1:0.9667	Kappa:0.9546	
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         27|      0|        0|           0|
|     wangzhe|          0|     42|        0|           0|
|   liuchanhg|          3|      0|       22|           0|
|ZhaoZuoxiang|          0|      1|        0|          25|

27.4.2 CNN Model


The MFCC feature extracted from each audio clip is a tensor (Tensor) with shape (90, 128, 1).

The CNN model structure is defined as follows:

_________________________________________________________________  
Layer (type)                 Output Shape              Param #     
=================================================================  
mfcc (InputLayer)            [(None, 90, 128, 1)]      0           
_________________________________________________________________  
reshape (Reshape)            (None, 90, 128)           0           
_________________________________________________________________  
conv1d (Conv1D)              (None, 90, 256)           164096      
_________________________________________________________________  
conv1d_1 (Conv1D)            (None, 90, 128)           163968      
_________________________________________________________________  
dropout (Dropout)            (None, 90, 128)           0           
_________________________________________________________________  
max_pooling1d (MaxPooling1D) (None, 11, 128)           0           
_________________________________________________________________  
conv1d_2 (Conv1D)            (None, 11, 128)           82048       
_________________________________________________________________  
conv1d_3 (Conv1D)            (None, 11, 128)           82048       
_________________________________________________________________  
flatten (Flatten)            (None, 1408)              0           
_________________________________________________________________  
logits (Dense)               (None, 4)                 5636        
=================================================================  
Total params: 497,796                                              
Trainable params: 497,796                                          
Non-trainable params: 0                                            
_________________________________________________________________  
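
The parameter counts in the summary can be verified by hand: a Conv1D layer with kernel size 5 has (5 × input_channels + 1) × filters parameters, MaxPooling1D with pool_size=8 shortens the sequence from 90 to ⌊90 / 8⌋ = 11 steps, and Flatten yields 11 × 128 = 1408 values.

conv1d   : (5 × 128 + 1) × 256 = 164,096
conv1d_1 : (5 × 256 + 1) × 128 = 163,968
conv1d_2 : (5 × 128 + 1) × 128 =  82,048
conv1d_3 : (5 × 128 + 1) × 128 =  82,048
logits   : (1408 + 1) × 4      =   5,636
Total                          = 497,796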


The code that builds this CNN model structure with the KerasSequentialClassifier component is shown below. Note that setLayers only lists the layers up to Flatten(); the final logits (Dense) layer in the summary above, which outputs the four speaker classes, is added by the component itself based on the label column.

BatchOperator.setParallelism(1);

new Pipeline()
    .add(
        new ExtractMfccFeature()
            .setSelectedCol("audio_data")
            .setSampleRate(AUDIO_SAMPLE_RATE)
            .setOutputCol("mfcc")
            .setReservedCols("speaker")
            .setNumThreads(12)
    )
    .add(
        new KerasSequentialClassifier()
            .setTensorCol("mfcc")
            .setLabelCol("speaker")
            .setPredictionCol("pred")
            .setLayers(
                "Reshape((90, 128))",
                "Conv1D(256, 5, padding='same', activation='relu')",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Dropout(0.1)",
                "MaxPooling1D(pool_size=8)",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Flatten()"
            )
            .setNumEpochs(50)
            .setSaveCheckpointsEpochs(3.0)
            .setValidationSplit(0.1)
            .setSaveBestOnly(true)
            .setBestMetric("sparse_categorical_accuracy")
    )
    .fit(train_set)
    .transform(test_set)
    .link(
        new EvalMultiClassBatchOp()
            .setLabelCol("speaker")
            .setPredictionCol("pred")
            .lazyPrintMetrics()
    );

BatchOperator.execute();


The run results are as follows; the Accuracy metric is 0.95.

-------------------------------- Metrics: --------------------------------
Accuracy:0.95	Macro F1:0.9492	Micro F1:0.95	Kappa:0.9318	
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         30|      0|        1|           0|
|     wangzhe|          0|     40|        0|           2|
|   liuchanhg|          0|      2|       21|           0|
|ZhaoZuoxiang|          0|      1|        0|          23|