Alink Tutorial (Java Edition)

Section 27.3 Emotion Recognition

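From speech, we can sense the speaker's emotional state at the time. The CASIA dataset labels six emotions: angry, happy, fear, sad, surprise, and neutral. From a machine learning perspective, this is a multi-class classification problem over speech features, with each emotion label treated as one class.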

27.3.1 Softmax Model

We start with a simple model, Softmax, which produces experimental results quickly and serves as a baseline for further study.
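
As a reminder of what a Softmax classifier computes at prediction time, here is a minimal sketch in plain Java. It is not Alink's implementation: the class name, the helper methods, and the example scores are all hypothetical, and the scores stand in for the linear outputs (weights times features plus intercept) that a trained model would produce.

```java
// Minimal sketch of softmax scoring; not Alink's implementation.
class SoftmaxSketch {
    // Converts raw class scores into probabilities that sum to 1.
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        double[] probs = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Subtracting the maximum keeps exp() from overflowing.
            probs[i] = Math.exp(scores[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    // Returns the index of the largest value, i.e. the predicted class.
    static int argmax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"angry", "fear", "happy", "neutral", "sad", "surprise"};
        // Hypothetical scores for one utterance, one per emotion class.
        double[] scores = {0.2, 1.5, -0.3, 0.1, 2.4, 0.0};
        double[] probs = softmax(scores);
        System.out.println("predicted: " + labels[argmax(probs)]);
    }
}
```

The predicted class is simply the one with the highest probability, which is why the model needs a fixed-length feature vector as input.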

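The MFCC features obtained in the previous section are in tensor format with shape (num_window, num_mfcc, num_channel), whereas the Softmax classifier expects its input features as a vector. The natural approach is to add a data-processing step that converts the tensor to a vector, which connects the whole recognition pipeline.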

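There are several ways to convert the MFCC tensor into a feature vector:

Use all MFCC features. Flatten the tensor directly into a vector of dimension num_window * num_mfcc * num_channel.
Use the mean of the MFCC tensor as the feature. The resulting vector has dimension num_mfcc * num_channel.
Generate extended features.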

The following subsections experiment with each of these three approaches.
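
To make the first two options concrete, the sketch below flattens a tensor and averages it over the window axis in plain Java. It does not use Alink; the class and method names are hypothetical, and the row-major element order in flatten is an assumption of this sketch rather than a statement about TensorToVector.

```java
// Plain-Java illustration of the FLATTEN and MEAN conversions; not Alink code.
class TensorToVectorSketch {
    // Flattens a (numWindow, numMfcc, numChannel) tensor in row-major order.
    static double[] flatten(double[][][] t) {
        int w = t.length, m = t[0].length, c = t[0][0].length;
        double[] out = new double[w * m * c];
        int k = 0;
        for (double[][] window : t) {
            for (double[] coeffs : window) {
                for (double v : coeffs) {
                    out[k++] = v;
                }
            }
        }
        return out;
    }

    // Averages over the window axis, leaving numMfcc * numChannel values.
    static double[] meanOverWindows(double[][][] t) {
        int w = t.length, m = t[0].length, c = t[0][0].length;
        double[] out = new double[m * c];
        for (double[][] window : t) {
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < c; j++) {
                    out[i * c + j] += window[i][j] / w;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 90 windows, 128 coefficients, 1 channel: the MFCC shape used in this section.
        double[][][] mfcc = new double[90][128][1];
        System.out.println(flatten(mfcc).length);         // 90 * 128 * 1
        System.out.println(meanOverWindows(mfcc).length); // 128 * 1
    }
}
```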

27.3.1.1 Using All MFCC Features

        new Pipeline()
            .add(
                new ExtractMfccFeature()
                    .setSelectedCol("audio_data")
                    .setSampleRate(AUDIO_SAMPLE_RATE)
                    .setOutputCol("mfcc")
                    .setReservedCols("emotion")
            )
            .add(
                new TensorToVector()
                    .setSelectedCol("mfcc")
                    .setConvertMethod(ConvertMethod.FLATTEN)
                    .setOutputCol("mfcc")
            )
            .add(
                new Softmax()
                    .setVectorCol("mfcc")
                    .setLabelCol("emotion")
                    .setPredictionCol("pred")
                    .enableLazyPrintModelInfo()
            )
            .fit(train_set)
            .transform(test_set)
            .link(
                new EvalMultiClassBatchOp()
                    .setLabelCol("emotion")
                    .setPredictionCol("pred")
                    .lazyPrintMetrics()
            );
        BatchOperator.execute();

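Here, the TensorToVector component is used with the conversion method set to "FLATTEN" to convert the MFCC tensor into a vector. Note: "FLATTEN" is also the default value of the ConvertMethod parameter of TensorToVector, so the tensor's entries are flattened into a vector even if this parameter is not set.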

The model information is shown below; there are 11520 features.

----------------------------- model meta info -----------------------------
{hasInterception: true, model name: softmax, num feature: 11520, vector colName: mfcc}
--------------------------- model label values ---------------------------
[angry, fear, happy, neutral, sad, surprise]
---------------------------- model weight info ----------------------------
|intercept|          1|          2|         3|          4|          5|          6|          7|          8|... ...|
|---------|-----------|-----------|----------|-----------|-----------|-----------|-----------|-----------|-------|
| -34.5953|-0.00080044| 0.00820478|0.00316187| 0.00411262| 0.00276282| 0.00354308| 0.01158329| 0.00164364|... ...|
|  17.6373| 0.00123709|-0.00063152|0.00250845|-0.00185701| 0.00384927|-0.00105143| 0.00095794|-0.00277940|... ...|
| -11.3275|-0.00269183|-0.00161488|0.00122599|-0.00180487|-0.00736903| 0.00260885|-0.00096520|-0.00993957|... ...|
|   3.1986| 0.00031571| 0.00416582|0.00092497|-0.00053651|-0.00067051| 0.00419201| 0.00340648| 0.00455901|... ...|
|  ... ...|    ... ...|    ... ...|   ... ...|    ... ...|    ... ...|    ... ...|    ... ...|    ... ...|... ...|

The evaluation metrics are as follows; Accuracy is 0.5667.

-------------------------------- Metrics: --------------------------------
Accuracy:0.5667	Macro F1:0.567	Micro F1:0.5667	Kappa:0.4804	
|Pred\Real|surprise|sad|neutral|happy|fear|angry|
|---------|--------|---|-------|-----|----|-----|
| surprise|      16|  2|      2|    2|   2|    5|
|      sad|       0|  7|      0|    2|  11|    2|
|  neutral|       0|  2|     12|    2|   3|    0|
|    happy|       0|  1|      0|    9|   1|    3|
|     fear|       2|  4|      1|    0|  13|    1|
|    angry|       0|  0|      1|    3|   0|   11|
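
The Accuracy and Kappa values can be recomputed from the confusion matrix above. The sketch below is plain Java and independent of Alink; the class and method names are hypothetical, and Kappa is computed with the standard Cohen's kappa formula, which is assumed (not confirmed by the tutorial) to be what EvalMultiClassBatchOp reports.

```java
// Recomputes Accuracy and Cohen's kappa from a confusion matrix; not Alink code.
class ConfusionMatrixSketch {
    // Fraction of samples on the diagonal (prediction equals the true label).
    static double accuracy(int[][] cm) {
        long correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++) {
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
                if (i == j) {
                    correct += cm[i][j];
                }
            }
        }
        return (double) correct / total;
    }

    // Cohen's kappa: observed agreement corrected for chance agreement.
    static double kappa(int[][] cm) {
        int n = cm.length;
        long total = 0;
        long[] rowSum = new long[n];
        long[] colSum = new long[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                rowSum[i] += cm[i][j];
                colSum[j] += cm[i][j];
                total += cm[i][j];
            }
        }
        double expected = 0.0;
        for (int i = 0; i < n; i++) {
            expected += (double) rowSum[i] * colSum[i] / ((double) total * total);
        }
        double observed = accuracy(cm);
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Rows are predictions, columns are true labels, both in the order
        // surprise, sad, neutral, happy, fear, angry (copied from the table above).
        int[][] cm = {
            {16, 2, 2, 2, 2, 5},
            {0, 7, 0, 2, 11, 2},
            {0, 2, 12, 2, 3, 0},
            {0, 1, 0, 9, 1, 3},
            {2, 4, 1, 0, 13, 1},
            {0, 0, 1, 3, 0, 11}
        };
        System.out.println("Accuracy: " + accuracy(cm));
        System.out.println("Kappa: " + kappa(cm));
    }
}
```

With 68 of the 120 test samples on the diagonal, the accuracy is 68 / 120, which rounds to the reported 0.5667.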

27.3.1.2 Using the MFCC Tensor Mean as Features

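This section changes how the tensor is converted to a vector: the setConvertMethod method of the TensorToVector component is called with "MEAN". The code is as follows: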

        new Pipeline()
            .add(
                new ExtractMfccFeature()
                    .setSelectedCol("audio_data")
                    .setSampleRate(AUDIO_SAMPLE_RATE)
                    .setOutputCol("mfcc")
                    .setReservedCols("emotion")
            )
            .add(
                new TensorToVector()
                    .setSelectedCol("mfcc")
                    .setConvertMethod(ConvertMethod.MEAN)
                    .setOutputCol("mfcc")
            )
            .add(
                new Softmax()
                    .setVectorCol("mfcc")
                    .setLabelCol("emotion")
                    .setPredictionCol("pred")
                    .enableLazyPrintModelInfo()
            )
            .fit(train_set)
            .transform(test_set)
            .link(
                new EvalMultiClassBatchOp()
                    .setLabelCol("emotion")
                    .setPredictionCol("pred")
                    .lazyPrintMetrics()
            );
        BatchOperator.execute();

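The model information is shown below. Compared with using all MFCC features, the number of features is about two orders of magnitude smaller: only 128.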

----------------------------- model meta info -----------------------------
{hasInterception: true, model name: softmax, num feature: 128, vector colName: mfcc}
--------------------------- model label values ---------------------------
[angry, fear, happy, neutral, sad, surprise]
---------------------------- model weight info ----------------------------
|intercept|          1|          2|          3|          4|          5|         6|          7|          8|... ...|
|---------|-----------|-----------|-----------|-----------|-----------|----------|-----------|-----------|-------|
|  -5.6638|-0.36002888|-0.01290270|-0.26917412| 0.17632239|-0.67092116|1.01069863| 0.32080440| 0.48781825|... ...|
| -38.2398| 0.49965361| 0.50165944| 1.27197943| 0.57981832| 0.98934310|1.15021675| 0.22005272| 0.82378645|... ...|
|  -3.1645|-0.70812102|-0.29866862|-0.45359318|-0.06860177|-0.82726736|0.58255975|-1.13395863|-0.01848925|... ...|
| -24.9511| 0.31540821| 0.18518077| 0.84455710|-0.34935091|-1.48367114|1.10341540|-0.77014354| 1.87553987|... ...|
|  ... ...|    ... ...|    ... ...|    ... ...|    ... ...|    ... ...|   ... ...|    ... ...|    ... ...|... ...|

The evaluation metrics are as follows. Compared with using all MFCC features (Accuracy 0.5667), Accuracy rises to 0.675.

-------------------------------- Metrics: --------------------------------
Accuracy:0.675	Macro F1:0.6676	Micro F1:0.675	Kappa:0.6066	
|Pred\Real|surprise|sad|neutral|happy|fear|angry|
|---------|--------|---|-------|-----|----|-----|
| surprise|      10|  1|      1|    2|   1|    1|
|      sad|       1| 10|      0|    0|   4|    2|
|  neutral|       1|  1|     13|    1|   1|    0|
|    happy|       3|  0|      0|   12|   1|    5|
|     fear|       0|  4|      1|    0|  22|    0|
|    angry|       3|  0|      1|    3|   1|   14|

27.3.1.3 Generating Extended Features

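This approach extends the feature construction of the previous section: in addition to the "MEAN" vector, it also computes a "MIN" vector and a "MAX" vector, and then concatenates the three into a single feature vector. The code is as follows: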

new Pipeline()
    .add(
        new ExtractMfccFeature()
            .setSelectedCol("audio_data")
            .setSampleRate(AUDIO_SAMPLE_RATE)
            .setOutputCol("mfcc")
            .setReservedCols("emotion")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MEAN)
            .setOutputCol("mfcc_mean")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MIN)
            .setOutputCol("mfcc_min")
    )
    .add(
        new TensorToVector()
            .setSelectedCol("mfcc")
            .setConvertMethod(ConvertMethod.MAX)
            .setOutputCol("mfcc_max")
    )
    .add(
        new VectorAssembler()
            .setSelectedCols("mfcc_mean", "mfcc_min", "mfcc_max")
            .setOutputCol("mfcc")
    )
    .add(
        new Softmax()
            .setVectorCol("mfcc")
            .setLabelCol("emotion")
            .setPredictionCol("pred")
            .enableLazyPrintModelInfo()
    )
    .fit(train_set)
    .transform(test_set)
    .link(
        new EvalMultiClassBatchOp()
            .setLabelCol("emotion")
            .setPredictionCol("pred")
            .lazyPrintMetrics()
    );
BatchOperator.execute();

The model information is shown below. There are 384 features, three times as many as when "MEAN" alone is used.

----------------------------- model meta info -----------------------------
{hasInterception: true, model name: softmax, num feature: 384, vector colName: mfcc}
--------------------------- model label values ---------------------------
[angry, fear, happy, neutral, sad, surprise]
---------------------------- model weight info ----------------------------
|intercept|          1|          2|          3|          4|          5|         6|          7|          8|... ...|
|---------|-----------|-----------|-----------|-----------|-----------|----------|-----------|-----------|-------|
|-993.7941|-0.28947547|-1.45203046|-0.02378904| 0.25450998|-1.25171357|2.64734179| 0.40452117| 0.16525270|... ...|
| 153.4819|-0.14299162| 1.47363676| 0.36610210| 1.16231386| 1.85410150|2.72255783| 0.06344456| 1.15060338|... ...|
|-645.1383|-1.08450032|-1.76002308|-0.78953660|-0.90654128|-1.44797040|1.18982205|-2.95229584|-0.83164726|... ...|
|-382.2944| 0.95421092|-0.63094497| 0.31065023|-1.90128762|-4.42426000|0.77819039|-4.19630732| 0.79924466|... ...|
|  ... ...|    ... ...|    ... ...|    ... ...|    ... ...|    ... ...|   ... ...|    ... ...|    ... ...|... ...|
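
The feature count of 384 follows from concatenating three 128-dimensional vectors. As a rough illustration, the plain-Java sketch below reduces a window-by-coefficient matrix with mean, min, and max and joins the results. It does not call Alink; the class, enum, and method names are hypothetical stand-ins for what TensorToVector and VectorAssembler do in the pipeline.

```java
// Plain-Java illustration of mean/min/max reduction plus concatenation; not Alink code.
class ExtendedFeatureSketch {
    enum Stat { MEAN, MIN, MAX }

    // Reduces a (numWindow, numFeature) matrix over the window axis.
    static double[] reduce(double[][] frames, Stat stat) {
        int n = frames[0].length;
        double[] out = new double[n];
        for (int j = 0; j < n; j++) {
            double acc = frames[0][j];
            for (int i = 1; i < frames.length; i++) {
                double v = frames[i][j];
                switch (stat) {
                    case MEAN: acc += v; break;
                    case MIN:  acc = Math.min(acc, v); break;
                    case MAX:  acc = Math.max(acc, v); break;
                }
            }
            out[j] = (stat == Stat.MEAN) ? acc / frames.length : acc;
        }
        return out;
    }

    // Concatenates vectors end to end into one feature vector.
    static double[] concat(double[]... parts) {
        int len = 0;
        for (double[] p : parts) {
            len += p.length;
        }
        double[] out = new double[len];
        int pos = 0;
        for (double[] p : parts) {
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] frames = new double[90][128]; // 90 windows, 128 coefficients
        double[] features = concat(
            reduce(frames, Stat.MEAN),
            reduce(frames, Stat.MIN),
            reduce(frames, Stat.MAX));
        System.out.println(features.length); // 3 * 128
    }
}
```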

The evaluation metrics are as follows. Of the three feature constructions, this one gives the highest Accuracy, 0.6917.

-------------------------------- Metrics: --------------------------------
Accuracy:0.6917	Macro F1:0.6768	Micro F1:0.6917	Kappa:0.6274	
|Pred\Real|surprise|sad|neutral|happy|fear|angry|
|---------|--------|---|-------|-----|----|-----|
| surprise|      13|  0|      0|    4|   1|    2|
|      sad|       0|  8|      0|    0|   5|    0|
|  neutral|       0|  3|     16|    4|   0|    0|
|    happy|       4|  1|      0|    8|   2|    4|
|     fear|       0|  4|      0|    1|  22|    0|
|    angry|       1|  0|      0|    1|   0|   16|

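Note: across different splits of the training and test sets, the Accuracy of the model using all MFCC features is not always lower than that of the model using the MFCC tensor mean. In most cases, the model using the extended features achieves the higher Accuracy.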

27.3.2 CNN Model

This section tries convolutional neural network (CNN) models.

27.3.2.1 One-Dimensional Convolution Model

The one-dimensional convolutional network designed for this problem has the following structure:

_________________________________________________________________ 
Layer (type)                 Output Shape              Param #    
================================================================= 
mfcc (InputLayer)            [(None, 90, 128, 1)]      0          
_________________________________________________________________ 
reshape (Reshape)            (None, 90, 128)           0          
_________________________________________________________________ 
conv1d (Conv1D)              (None, 90, 256)           164096     
_________________________________________________________________ 
conv1d_1 (Conv1D)            (None, 90, 128)           163968     
_________________________________________________________________ 
dropout (Dropout)            (None, 90, 128)           0          
_________________________________________________________________ 
max_pooling1d (MaxPooling1D) (None, 11, 128)           0          
_________________________________________________________________ 
conv1d_2 (Conv1D)            (None, 11, 128)           82048      
_________________________________________________________________ 
conv1d_3 (Conv1D)            (None, 11, 128)           82048      
_________________________________________________________________ 
flatten (Flatten)            (None, 1408)              0          
_________________________________________________________________ 
logits (Dense)               (None, 6)                 8454       
================================================================= 
Total params: 500,614                                             
Trainable params: 500,614                                         
Non-trainable params: 0                                           
_________________________________________________________________

With the KerasSequentialClassifier component, this model can be expressed concisely, as the following code shows.

BatchOperator.setParallelism(1);

new Pipeline()
    .add(
        new ExtractMfccFeature()
            .setSelectedCol("audio_data")
            .setSampleRate(AUDIO_SAMPLE_RATE)
            .setOutputCol("mfcc")
            .setReservedCols("emotion")
            .setNumThreads(12)
    )
    .add(
        new KerasSequentialClassifier()
            .setTensorCol("mfcc")
            .setLabelCol("emotion")
            .setPredictionCol("pred")
            .setLayers(
                "Reshape((90, 128))",
                "Conv1D(256, 5, padding='same', activation='relu')",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Dropout(0.1)",
                "MaxPooling1D(pool_size=8)",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Conv1D(128, 5, padding='same', activation='relu')",
                "Flatten()"
            )
            .setOptimizer("Adam(lr=0.001,decay=4e-5)")
            .setBatchSize(32)
            .setIntraOpParallelism(1)
            .setNumEpochs(50)
            .setSaveCheckpointsEpochs(3.0)
            .setValidationSplit(0.1)
            .setSaveBestOnly(true)
            .setBestMetric("sparse_categorical_accuracy")
    )
    .fit(train_set)
    .transform(test_set)
    .link(
        new EvalMultiClassBatchOp()
            .setLabelCol("emotion")
            .setPredictionCol("pred")
            .lazyPrintMetrics()
    );

BatchOperator.execute();
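
The parameter counts and output lengths in the model summary can be reproduced from the layer definitions passed to setLayers. The sketch below is plain Java with hypothetical class and method names; it uses the usual Keras counting rules (kernel_size * in_channels * filters + filters for Conv1D, inputs * outputs + outputs for Dense) and assumes that MaxPooling1D keeps its default stride and 'valid' padding, so the pooled length is rounded down.

```java
// Reproduces the Param # column of the 1D CNN summary; not Alink or Keras code.
class Conv1dParamSketch {
    // One kernel of size (kernelSize x inChannels) per filter, plus one bias per filter.
    static int conv1dParams(int kernelSize, int inChannels, int filters) {
        return kernelSize * inChannels * filters + filters;
    }

    // One weight per (input, output) pair, plus one bias per output.
    static int denseParams(int inputs, int outputs) {
        return inputs * outputs + outputs;
    }

    // Pooling with stride equal to poolSize and no padding floors the length.
    static int pooledLength(int length, int poolSize) {
        return length / poolSize;
    }

    public static void main(String[] args) {
        int conv1 = conv1dParams(5, 128, 256); // conv1d
        int conv2 = conv1dParams(5, 256, 128); // conv1d_1
        int conv3 = conv1dParams(5, 128, 128); // conv1d_2
        int conv4 = conv1dParams(5, 128, 128); // conv1d_3
        int steps = pooledLength(90, 8);       // length after max_pooling1d
        int flat = steps * 128;                // size after flatten
        int logits = denseParams(flat, 6);     // output layer, six emotion classes
        int total = conv1 + conv2 + conv3 + conv4 + logits;
        System.out.println("steps=" + steps + ", flat=" + flat + ", total=" + total);
    }
}
```

Most of the roughly 500 thousand parameters sit in the first two convolution layers, because they see the widest channel dimensions.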

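When training a deep model on a single machine, it is recommended to set the parallelism to 1. To make full use of CPU resources and speed up computation, set the thread-count parameters of the components instead.

Deep learning components, such as KerasSequentialClassifier and BertTextClassifier, accept the IntraOpParallelism parameter.
Alink's inference-type components all support the NumThreads parameter, the number of compute threads. In the example above, NumThreads of ExtractMfccFeature is set to 12; computing with multiple threads in parallel greatly reduces the overall running time.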

The evaluation metrics of this model are shown below; its Accuracy is higher than that of the Softmax models.

-------------------------------- Metrics: --------------------------------
Accuracy:0.725	Macro F1:0.7334	Micro F1:0.725	Kappa:0.6688	
|Pred\Real|surprise|sad|neutral|happy|fear|angry|
|---------|--------|---|-------|-----|----|-----|
| surprise|      15|  0|      0|    2|   1|    3|
|      sad|       1| 10|      0|    0|  10|    0|
|  neutral|       0|  0|     16|    0|   0|    0|
|    happy|       0|  1|      0|   14|   0|    5|
|     fear|       2|  3|      0|    1|  18|    0|
|    angry|       2|  0|      0|    2|   0|   14|

27.3.2.2 Two-Dimensional Convolution Model

The two-dimensional convolutional network designed for this problem has the following structure:

_________________________________________________________________  
Layer (type)                 Output Shape              Param #     
=================================================================  
mfcc (InputLayer)            [(None, 90, 128, 1)]      0           
_________________________________________________________________  
conv2d (Conv2D)              (None, 90, 128, 32)       320         
_________________________________________________________________  
conv2d_1 (Conv2D)            (None, 90, 128, 32)       9248        
_________________________________________________________________  
average_pooling2d (AveragePo (None, 30, 43, 32)        0           
_________________________________________________________________  
conv2d_2 (Conv2D)            (None, 10, 15, 64)        18496       
_________________________________________________________________  
conv2d_3 (Conv2D)            (None, 4, 5, 64)          36928       
_________________________________________________________________  
conv2d_4 (Conv2D)            (None, 2, 2, 64)          36928       
_________________________________________________________________  
average_pooling2d_1 (Average (None, 1, 1, 64)          0           
_________________________________________________________________  
conv2d_5 (Conv2D)            (None, 1, 1, 128)         73856       
_________________________________________________________________  
flatten (Flatten)            (None, 128)               0           
_________________________________________________________________  
logits (Dense)               (None, 6)                 774         
=================================================================  
Total params: 176,550                                              
Trainable params: 176,550                                          
Non-trainable params: 0                                            
_________________________________________________________________
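
The parameter counts in this summary are consistent with 3 x 3 convolution kernels. That kernel size is an inference from the numbers, not something stated in the summary itself, so the plain-Java sketch below (hypothetical class and method names) treats it as an assumption and simply checks that the standard Conv2D rule, kernel_height * kernel_width * in_channels * filters + filters, reproduces the Param # column.

```java
// Checks the Param # column of the 2D CNN summary under an assumed 3 x 3 kernel.
class Conv2dParamSketch {
    // One kernel of size (kh x kw x inChannels) per filter, plus one bias per filter.
    static int conv2dParams(int kh, int kw, int inChannels, int filters) {
        return kh * kw * inChannels * filters + filters;
    }

    // One weight per (input, output) pair, plus one bias per output.
    static int denseParams(int inputs, int outputs) {
        return inputs * outputs + outputs;
    }

    public static void main(String[] args) {
        int k = 3; // assumed kernel size, inferred from the parameter counts
        int[] params = {
            conv2dParams(k, k, 1, 32),   // conv2d
            conv2dParams(k, k, 32, 32),  // conv2d_1
            conv2dParams(k, k, 32, 64),  // conv2d_2
            conv2dParams(k, k, 64, 64),  // conv2d_3
            conv2dParams(k, k, 64, 64),  // conv2d_4
            conv2dParams(k, k, 64, 128), // conv2d_5
            denseParams(128, 6)          // logits
        };
        int total = 0;
        for (int p : params) {
            total += p;
        }
        System.out.println("total=" + total);
    }
}
```

Compared with the one-dimensional model, this network has roughly a third as many parameters, since its early layers use far fewer channels.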