The original dataset records the speaker of each audio clip. By treating the speaker as the classification label of the audio data, the speaker-recognition problem maps onto the familiar multi-class classification problem.
Following the same approach as the earlier speech emotion recognition task, we first try a simple Softmax model to obtain results quickly; these serve as the baseline for the deeper experiments that follow.
The MFCC feature is in tensor format, with shape (num_window, num_mfcc, num_channel), whereas the Softmax classifier requires its input features in vector format. To convert the MFCC tensor into a feature vector, we take the two approaches described below.
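Before walking through the two approaches, it may help to see what such a reduction does to the tensor shape. The sketch below is an illustration only, not Alink's implementation; it assumes the "MEAN" method averages the tensor over its window axis, so a (num_window, num_mfcc, 1) tensor collapses into a num_mfcc-dimensional vector.

```java
// Illustrative sketch (not Alink's TensorToVector implementation):
// averaging a (numWindow, numMfcc, 1) MFCC tensor over the window
// axis yields a numMfcc-dimensional vector.
public class MfccMeanSketch {
    public static double[] meanOverWindows(double[][][] mfcc) {
        int numWindow = mfcc.length;
        int numMfcc = mfcc[0].length;
        double[] vec = new double[numMfcc];
        for (double[][] window : mfcc) {
            for (int j = 0; j < numMfcc; j++) {
                vec[j] += window[j][0];   // single channel
            }
        }
        for (int j = 0; j < numMfcc; j++) {
            vec[j] /= numWindow;
        }
        return vec;
    }

    public static void main(String[] args) {
        // two windows, three MFCC coefficients, one channel
        double[][][] mfcc = {
            {{1.0}, {2.0}, {3.0}},
            {{3.0}, {4.0}, {5.0}}
        };
        System.out.println(java.util.Arrays.toString(meanOverWindows(mfcc)));
        // prints [2.0, 3.0, 4.0]
    }
}
```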
In this section, we set the tensor-to-vector conversion by calling the setConvertMethod method of the TensorToVector component with "MEAN". The code is as follows:
new Pipeline()
.add(
new ExtractMfccFeature()
.setSelectedCol("audio_data")
.setSampleRate(AUDIO_SAMPLE_RATE)
.setOutputCol("mfcc")
.setReservedCols("speaker")
)
.add(
new TensorToVector()
.setSelectedCol("mfcc")
.setConvertMethod(ConvertMethod.MEAN)
.setOutputCol("mfcc")
)
.add(
new Softmax()
.setVectorCol("mfcc")
.setLabelCol("speaker")
.setPredictionCol("pred")
)
.fit(train_set)
.transform(test_set)
.link(
new EvalMultiClassBatchOp()
.setLabelCol("speaker")
.setPredictionCol("pred")
.lazyPrintMetrics()
);
BatchOperator.execute();
The model evaluation results are shown below; the Accuracy is 0.9333.
-------------------------------- Metrics: --------------------------------
Accuracy:0.9333    Macro F1:0.9318    Micro F1:0.9333    Kappa:0.9095
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         27|      0|        1|           0|
|     wangzhe|          1|     39|        0|           0|
|   liuchanhg|          2|      1|       21|           0|
|ZhaoZuoxiang|          0|      3|        0|          25|
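The headline numbers can be cross-checked against the confusion matrix using the standard definitions: Accuracy is the trace of the matrix divided by the sample count, and Macro F1 is the unweighted average of the per-class F1 scores. A minimal check (rows are predictions, columns are true labels, in the order zhaoquanyin, wangzhe, liuchanhg, ZhaoZuoxiang):

```java
// Sanity check of the reported metrics from the confusion matrix,
// using the standard multi-class definitions.
public class MetricsCheck {
    // fraction of samples on the diagonal
    public static double accuracy(int[][] cm) {
        int correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++) {
            for (int j = 0; j < cm.length; j++) {
                total += cm[i][j];
                if (i == j) correct += cm[i][j];
            }
        }
        return (double) correct / total;
    }

    // unweighted mean of per-class F1 scores
    public static double macroF1(int[][] cm) {
        int n = cm.length;
        double sum = 0;
        for (int k = 0; k < n; k++) {
            int rowSum = 0, colSum = 0;
            for (int j = 0; j < n; j++) {
                rowSum += cm[k][j];   // predicted as class k
                colSum += cm[j][k];   // truly class k
            }
            double precision = (double) cm[k][k] / rowSum;
            double recall = (double) cm[k][k] / colSum;
            sum += 2 * precision * recall / (precision + recall);
        }
        return sum / n;
    }

    public static void main(String[] args) {
        int[][] cm = {
            {27, 0, 1, 0},
            {1, 39, 0, 0},
            {2, 1, 21, 0},
            {0, 3, 0, 25}
        };
        System.out.printf("Accuracy=%.4f MacroF1=%.4f%n", accuracy(cm), macroF1(cm));
        // Accuracy=0.9333 MacroF1=0.9318
    }
}
```

Both values agree with the printed metrics (112 of 120 test samples lie on the diagonal).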
Extending the feature construction of the previous section, we compute not only the "MEAN" vector but also the "MIN" and "MAX" vectors, and then concatenate these three vectors into the feature vector. The code is as follows:
new Pipeline()
.add(
new ExtractMfccFeature()
.setSelectedCol("audio_data")
.setSampleRate(AUDIO_SAMPLE_RATE)
.setOutputCol("mfcc")
.setReservedCols("speaker")
)
.add(
new TensorToVector()
.setSelectedCol("mfcc")
.setConvertMethod(ConvertMethod.MEAN)
.setOutputCol("mfcc_mean")
)
.add(
new TensorToVector()
.setSelectedCol("mfcc")
.setConvertMethod(ConvertMethod.MIN)
.setOutputCol("mfcc_min")
)
.add(
new TensorToVector()
.setSelectedCol("mfcc")
.setConvertMethod(ConvertMethod.MAX)
.setOutputCol("mfcc_max")
)
.add(
new VectorAssembler()
.setSelectedCols("mfcc_mean", "mfcc_min", "mfcc_max")
.setOutputCol("mfcc")
)
.add(
new Softmax()
.setVectorCol("mfcc")
.setLabelCol("speaker")
.setPredictionCol("pred")
)
.fit(train_set)
.transform(test_set)
.link(
new EvalMultiClassBatchOp()
.setLabelCol("speaker")
.setPredictionCol("pred")
.lazyPrintMetrics()
);
BatchOperator.execute();
The evaluation results are shown below. Compared with using the "MEAN" vector alone as the feature, the Accuracy improves to 0.9667.
-------------------------------- Metrics: --------------------------------
Accuracy:0.9667    Macro F1:0.963    Micro F1:0.9667    Kappa:0.9546
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         27|      0|        0|           0|
|     wangzhe|          0|     42|        0|           0|
|   liuchanhg|          3|      0|       22|           0|
|ZhaoZuoxiang|          0|      1|        0|          25|
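The mean/min/max feature construction above can be sketched in plain Java. This is an illustration under the same assumption as before (each reduction collapses the window axis), not Alink's implementation: the three reduced vectors are concatenated, so a (num_window, num_mfcc, 1) tensor becomes a 3 * num_mfcc feature vector, which is what VectorAssembler produces from the three columns.

```java
// Illustrative sketch of the MEAN + MIN + MAX feature construction:
// each reduction collapses the window axis, and the three resulting
// vectors are concatenated into one feature vector.
public class MeanMinMaxSketch {
    public static double[] meanMinMax(double[][][] mfcc) {
        int numMfcc = mfcc[0].length;
        double[] mean = new double[numMfcc];
        double[] min = new double[numMfcc];
        double[] max = new double[numMfcc];
        java.util.Arrays.fill(min, Double.POSITIVE_INFINITY);
        java.util.Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[][] window : mfcc) {
            for (int j = 0; j < numMfcc; j++) {
                double v = window[j][0];      // single channel
                mean[j] += v;
                min[j] = Math.min(min[j], v);
                max[j] = Math.max(max[j], v);
            }
        }
        double[] out = new double[3 * numMfcc];
        for (int j = 0; j < numMfcc; j++) {
            out[j] = mean[j] / mfcc.length;   // mean block
            out[numMfcc + j] = min[j];        // min block
            out[2 * numMfcc + j] = max[j];    // max block
        }
        return out;
    }

    public static void main(String[] args) {
        // two windows, two MFCC coefficients, one channel
        double[][][] mfcc = {{{1.0}, {4.0}}, {{3.0}, {2.0}}};
        System.out.println(java.util.Arrays.toString(meanMinMax(mfcc)));
        // prints [2.0, 3.0, 1.0, 2.0, 3.0, 4.0]
    }
}
```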
The MFCC feature extracted from each audio clip is a tensor with shape (90, 128, 1).
The CNN model structure is defined as follows:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
mfcc (InputLayer)            [(None, 90, 128, 1)]      0
_________________________________________________________________
reshape (Reshape)            (None, 90, 128)           0
_________________________________________________________________
conv1d (Conv1D)              (None, 90, 256)           164096
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 90, 128)           163968
_________________________________________________________________
dropout (Dropout)            (None, 90, 128)           0
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 11, 128)           0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 11, 128)           82048
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 11, 128)           82048
_________________________________________________________________
flatten (Flatten)            (None, 1408)              0
_________________________________________________________________
logits (Dense)               (None, 4)                 5636
=================================================================
Total params: 497,796
Trainable params: 497,796
Non-trainable params: 0
_________________________________________________________________
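The parameter counts in this summary can be verified by hand: a Conv1D layer has kernel_size * in_channels * filters weights plus filters biases, and a Dense layer has in_dim * out_dim weights plus out_dim biases. The short check below reproduces the numbers above.

```java
// Verifying the Keras model summary's parameter counts.
public class ParamCountCheck {
    // Conv1D: kernel * inChannels * filters weights, plus one bias per filter
    public static int conv1d(int kernel, int inCh, int filters) {
        return kernel * inCh * filters + filters;
    }
    // Dense: inDim * outDim weights, plus one bias per output unit
    public static int dense(int inDim, int outDim) {
        return inDim * outDim + outDim;
    }

    public static void main(String[] args) {
        int c1 = conv1d(5, 128, 256);    // 164096
        int c2 = conv1d(5, 256, 128);    // 163968
        int c3 = conv1d(5, 128, 128);    // 82048
        int c4 = conv1d(5, 128, 128);    // 82048
        // MaxPooling1D(pool_size=8) shrinks 90 steps to floor(90/8) = 11,
        // so Flatten yields 11 * 128 = 1408 features before the logits layer.
        int logits = dense(11 * 128, 4); // 5636
        System.out.println(c1 + c2 + c3 + c4 + logits); // 497796
    }
}
```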
The code that sets this CNN structure with the KerasSequentialClassifier component is as follows:
BatchOperator.setParallelism(1);
new Pipeline()
.add(
new ExtractMfccFeature()
.setSelectedCol("audio_data")
.setSampleRate(AUDIO_SAMPLE_RATE)
.setOutputCol("mfcc")
.setReservedCols("speaker")
.setNumThreads(12)
)
.add(
new KerasSequentialClassifier()
.setTensorCol("mfcc")
.setLabelCol("speaker")
.setPredictionCol("pred")
.setLayers(
"Reshape((90, 128))",
"Conv1D(256, 5, padding='same', activation='relu')",
"Conv1D(128, 5, padding='same', activation='relu')",
"Dropout(0.1)",
"MaxPooling1D(pool_size=8)",
"Conv1D(128, 5, padding='same', activation='relu')",
"Conv1D(128, 5, padding='same', activation='relu')",
"Flatten()"
)
.setNumEpochs(50)
.setSaveCheckpointsEpochs(3.0)
.setValidationSplit(0.1)
.setSaveBestOnly(true)
.setBestMetric("sparse_categorical_accuracy")
)
.fit(train_set)
.transform(test_set)
.link(
new EvalMultiClassBatchOp()
.setLabelCol("speaker")
.setPredictionCol("pred")
.lazyPrintMetrics()
);
BatchOperator.execute();
The results are shown below; the Accuracy is 0.95.
-------------------------------- Metrics: --------------------------------
Accuracy:0.95    Macro F1:0.9492    Micro F1:0.95    Kappa:0.9318
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         30|      0|        1|           0|
|     wangzhe|          0|     40|        0|           2|
|   liuchanhg|          0|      2|       21|           0|
|ZhaoZuoxiang|          0|      1|        0|          23|