The original dataset records the speaker of each audio clip. By treating the speaker as the classification label of each audio sample, the speaker recognition problem maps onto the familiar multi-class classification problem.
Following the same approach as the earlier speech emotion recognition task, we first try a simple Softmax model to obtain quick experimental results, which will serve as the baseline for further study.
The MFCC feature is in tensor (Tensor) format, with shape (num_window, num_mfcc, num_channel), whereas the Softmax classifier expects its input features in vector format. To convert the MFCC tensor into a feature vector, we try the two schemes described below.
In this section we set the tensor-to-vector conversion by calling the setConvertMethod method of the TensorToVector component with the value "MEAN". The code is as follows:
new Pipeline()
    .add(
        new ExtractMfccFeature()
            .setSelectedCol("audio_data")
            .setSampleRate(AUDIO_SAMPLE_RATE)
            .setOutputCol("mfcc")
            .setReservedCols("speaker")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MEAN)
            .setOutputCol("mfcc")
    )
    .add(
        new Softmax()
            .setVectorCol("mfcc")
            .setLabelCol("speaker")
            .setPredictionCol("pred")
    )
    .fit(train_set)
    .transform(test_set)
    .link(
        new EvalMultiClassBatchOp()
            .setLabelCol("speaker")
            .setPredictionCol("pred")
            .lazyPrintMetrics()
    );
BatchOperator.execute();
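Conceptually, the "MEAN" conversion collapses a variable-sized MFCC tensor into a fixed-length vector. The following NumPy sketch (an illustration only, not Alink's internal implementation) assumes the reduction is taken over the window and channel axes, leaving one value per MFCC coefficient:

```python
import numpy as np

# Toy MFCC tensor: (num_window, num_mfcc, num_channel) = (90, 128, 1)
mfcc = np.random.rand(90, 128, 1)

# "MEAN" conversion (assumed semantics): average over the window and
# channel axes, producing one value per MFCC coefficient.
vec_mean = mfcc.mean(axis=(0, 2))
print(vec_mean.shape)  # (128,)
```

Whatever the number of windows, the resulting vector length is fixed, which is exactly what the Softmax classifier requires.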
The model evaluation results are shown below; the Accuracy is 0.9333.
-------------------------------- Metrics: --------------------------------
Accuracy:0.9333    Macro F1:0.9318    Micro F1:0.9333    Kappa:0.9095
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         27|      0|        1|           0|
|     wangzhe|          1|     39|        0|           0|
|   liuchanhg|          2|      1|       21|           0|
|ZhaoZuoxiang|          0|      3|        0|          25|
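The Accuracy figure can be checked directly against the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of test samples.

```python
# Confusion matrix copied from the evaluation output
# (rows: predicted speaker, columns: real speaker)
cm = [
    [27, 0, 1, 0],   # zhaoquanyin
    [1, 39, 0, 0],   # wangzhe
    [2, 1, 21, 0],   # liuchanhg
    [0, 3, 0, 25],   # ZhaoZuoxiang
]
correct = sum(cm[i][i] for i in range(4))
total = sum(sum(row) for row in cm)
print(correct, total, round(correct / total, 4))  # 112 120 0.9333
```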
Extending the feature construction of the previous section, in addition to the "MEAN" vector we also compute the "MIN" and "MAX" vectors, and then concatenate the three vectors into a single feature vector. The code is as follows:
new Pipeline()
    .add(
        new ExtractMfccFeature()
            .setSelectedCol("audio_data")
            .setSampleRate(AUDIO_SAMPLE_RATE)
            .setOutputCol("mfcc")
            .setReservedCols("speaker")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MEAN)
            .setOutputCol("mfcc_mean")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MIN)
            .setOutputCol("mfcc_min")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MAX)
            .setOutputCol("mfcc_max")
    )
    .add(
        new VectorAssembler()
            .setSelectedCols("mfcc_mean", "mfcc_min", "mfcc_max")
            .setOutputCol("mfcc")
    )
    .add(
        new Softmax()
            .setVectorCol("mfcc")
            .setLabelCol("speaker")
            .setPredictionCol("pred")
    )
    .fit(train_set)
    .transform(test_set)
    .link(
        new EvalMultiClassBatchOp()
            .setLabelCol("speaker")
            .setPredictionCol("pred")
            .lazyPrintMetrics()
    );
BatchOperator.execute();
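The effect of the three conversions plus VectorAssembler is to triple the feature dimension: each 128-dimensional statistic vector is concatenated into one 384-dimensional feature. A NumPy sketch of the idea (assuming, as before, that each statistic reduces over the window and channel axes):

```python
import numpy as np

mfcc = np.random.rand(90, 128, 1)  # (num_window, num_mfcc, num_channel)

# Reduce with three statistics and concatenate, mirroring the
# MEAN/MIN/MAX TensorToVector steps followed by VectorAssembler.
feat = np.concatenate([
    mfcc.mean(axis=(0, 2)),
    mfcc.min(axis=(0, 2)),
    mfcc.max(axis=(0, 2)),
])
print(feat.shape)  # (384,)
```

The MIN and MAX vectors capture the extremes of each coefficient across windows, information that averaging alone discards.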
The model evaluation results are shown below. Compared with using the "MEAN" vector alone as the feature, the Accuracy improves to 0.9667.
-------------------------------- Metrics: --------------------------------
Accuracy:0.9667    Macro F1:0.963    Micro F1:0.9667    Kappa:0.9546
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         27|      0|        0|           0|
|     wangzhe|          0|     42|        0|           0|
|   liuchanhg|          3|      0|       22|           0|
|ZhaoZuoxiang|          0|      1|        0|          25|
The MFCC feature extracted from each audio clip is a tensor (Tensor) with shape (90, 128, 1).
The CNN model structure is defined as follows:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
mfcc (InputLayer)            [(None, 90, 128, 1)]      0
_________________________________________________________________
reshape (Reshape)            (None, 90, 128)           0
_________________________________________________________________
conv1d (Conv1D)              (None, 90, 256)           164096
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 90, 128)           163968
_________________________________________________________________
dropout (Dropout)            (None, 90, 128)           0
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 11, 128)           0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 11, 128)           82048
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 11, 128)           82048
_________________________________________________________________
flatten (Flatten)            (None, 1408)              0
_________________________________________________________________
logits (Dense)               (None, 4)                 5636
=================================================================
Total params: 497,796
Trainable params: 497,796
Non-trainable params: 0
_________________________________________________________________
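The parameter counts in the summary can be verified by hand. A Conv1D layer has kernel_size x in_channels x filters weights plus one bias per filter; MaxPooling1D with pool_size=8 shortens the sequence from 90 to floor(90 / 8) = 11; and the final Dense layer maps the 11 x 128 = 1408 flattened features to the 4 speaker classes.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # weights + one bias per filter
    return kernel_size * in_channels * filters + filters

print(conv1d_params(5, 128, 256))  # 164096  (conv1d)
print(conv1d_params(5, 256, 128))  # 163968  (conv1d_1)
print(conv1d_params(5, 128, 128))  # 82048   (conv1d_2, conv1d_3)
print(90 // 8 * 128)               # 1408    (flattened size after pooling)
print(1408 * 4 + 4)                # 5636    (logits Dense layer)
print(164096 + 163968 + 2 * 82048 + 5636)  # 497796 total params
```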
Using the KerasSequentialClassifier component, the CNN model structure is set up with the following code:
BatchOperator.setParallelism(1);

new Pipeline()
    .add(
        new ExtractMfccFeature()
            .setSelectedCol("audio_data")
            .setSampleRate(AUDIO_SAMPLE_RATE)
            .setOutputCol("mfcc")
            .setReservedCols("speaker")
            .setNumThreads(12)
    )
    .add(
        new KerasSequentialClassifier()
            .setTensorCol("mfcc")
            .setLabelCol("speaker")
            .setPredictionCol("pred")
            .setLayers(
                "Reshape((90, 128))",
                "Conv1D(256, 5, padding='same', activation='relu')",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Dropout(0.1)",
                "MaxPooling1D(pool_size=8)",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Flatten()"
            )
            .setNumEpochs(50)
            .setSaveCheckpointsEpochs(3.0)
            .setValidationSplit(0.1)
            .setSaveBestOnly(true)
            .setBestMetric("sparse_categorical_accuracy")
    )
    .fit(train_set)
    .transform(test_set)
    .link(
        new EvalMultiClassBatchOp()
            .setLabelCol("speaker")
            .setPredictionCol("pred")
            .lazyPrintMetrics()
    );
BatchOperator.execute();
The run results are shown below; the Accuracy is 0.95.
-------------------------------- Metrics: --------------------------------
Accuracy:0.95    Macro F1:0.9492    Micro F1:0.95    Kappa:0.9318
|   Pred\Real|zhaoquanyin|wangzhe|liuchanhg|ZhaoZuoxiang|
|------------|-----------|-------|---------|------------|
| zhaoquanyin|         30|      0|        1|           0|
|     wangzhe|          0|     40|        0|           2|
|   liuchanhg|          0|      2|       21|           0|
|ZhaoZuoxiang|          0|      1|        0|          23|