Alink Tutorial (Python Edition)

Section 26.2 Building a Binary Classification Model

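Following the same idea as Section 25.2, we first treat each pixel as a feature and try the widely used logistic regression model to see how well it classifies color images. We then experiment with the classic model for image classification: the convolutional neural network (CNN).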



26.2.1 Logistic Regression Model

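We try a logistic regression model that treats each pixel as a feature: the TensorToVector component converts the tensor-format image data into vectors, LogisticRegression is then trained on those vectors, and the model metrics are computed.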

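def lr(train_set, test_set):
    # Flatten each image tensor into a vector, train logistic regression on it,
    # then evaluate the predictions on the test set.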
    Pipeline()\
        .add(\
            TensorToVector()\
                .setSelectedCol("tensor")\
                .setReservedCols(["label"])\
        )\
        .add(\
            LogisticRegression()\
                .setVectorCol("tensor")\
                .setLabelCol("label")\
                .setPredictionCol(PREDICTION_COL)\
                .setPredictionDetailCol(PREDICTION_DETAIL_COL)\
        )\
        .fit(train_set)\
        .transform(test_set)\
        .link(\
            EvalBinaryClassBatchOp()\
                .setLabelCol("label")\
                .setPredictionDetailCol(PREDICTION_DETAIL_COL)\
                .lazyPrintMetrics()\
        )
    
    BatchOperator.execute()
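
To make "each pixel as a feature" concrete, here is a minimal sketch in plain Python of what flattening an image tensor means. It assumes the 32 x 32 x 3 input shape listed in the CNN summary of Section 26.2.2. The helper `flatten_tensor` and the zero-filled image are hypothetical, and the element order used by TensorToVector itself may differ.

```python
def flatten_tensor(tensor):
    """Flatten a nested height x width x channel list into one flat list."""
    return [value for row in tensor for pixel in row for value in pixel]


# A dummy 32 x 32 RGB image filled with zeros (hypothetical data).
height, width, channels = 32, 32, 3
image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]

features = flatten_tensor(image)
print(len(features))  # 32 * 32 * 3 = 3072 features per image
```

Under this assumption the logistic regression model sees 3072 independent inputs per image and has no notion of which pixels are neighbors.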


The evaluation metrics of the LR model are shown below; the accuracy is 0.6164.

-------------------------------- Metrics: --------------------------------
Auc:0.6496	Accuracy:0.6164	Precision:0.6264	Recall:0.6022	F1:0.6141	LogLoss:0.6812
|Pred\Real|dog|cat|
|---------|---|---|
|      dog|763|455|
|      cat|504|778|
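
The printed metrics can be checked against the confusion matrix by hand. The sketch below assumes that "dog" is the positive class and that rows are predicted labels and columns are real labels, which is the reading that reproduces the printed Precision and Recall. The function name `binary_metrics` is hypothetical and is not part of Alink.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for one confusion matrix."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts read from the table above (assumption: "dog" is the positive class).
lr_metrics = binary_metrics(tp=763, fp=455, fn=504, tn=778)
for name, value in lr_metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, the values agree with the Accuracy, Precision, Recall and F1 figures in the metrics line.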



26.2.2 CNN Model


The CNN model structure is defined as follows:

_________________________________________________________________ 
Layer (type)                 Output Shape              Param #    
================================================================= 
tensor (InputLayer)          [(None, 32, 32, 3)]       0          
_________________________________________________________________ 
conv2d (Conv2D)              (None, 30, 30, 32)        896        
_________________________________________________________________ 
max_pooling2d (MaxPooling2D) (None, 15, 15, 32)        0          
_________________________________________________________________ 
conv2d_1 (Conv2D)            (None, 13, 13, 64)        18496      
_________________________________________________________________ 
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 64)          0          
_________________________________________________________________ 
flatten (Flatten)            (None, 2304)              0          
_________________________________________________________________ 
dropout (Dropout)            (None, 2304)              0          
_________________________________________________________________ 
logits (Dense)               (None, 1)                 2305       
================================================================= 
Total params: 21,697                                              
Trainable params: 21,697                                          
Non-trainable params: 0                                           
_________________________________________________________________ 
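
The output shapes and parameter counts in this summary follow from simple arithmetic. The sketch below reproduces them, assuming 3 x 3 kernels with stride 1 and no padding, and 2 x 2 non-overlapping pooling, which is what the printed shapes imply. The helper functions are hypothetical and only mirror the calculation; they are not Keras or Alink APIs.

```python
def conv2d(shape, filters, kernel=3):
    """Stride-1 convolution without padding: return (output shape, params)."""
    h, w, c = shape
    out = (h - kernel + 1, w - kernel + 1, filters)
    params = kernel * kernel * c * filters + filters  # weights + biases
    return out, params


def max_pool(shape, pool=2):
    """Non-overlapping pooling; odd sizes are rounded down."""
    h, w, c = shape
    return (h // pool, w // pool, c)


shape = (32, 32, 3)                    # input image
shape, p1 = conv2d(shape, 32)          # conv2d
shape = max_pool(shape)                # max_pooling2d
shape, p2 = conv2d(shape, 64)          # conv2d_1
shape = max_pool(shape)                # max_pooling2d_1
flat = shape[0] * shape[1] * shape[2]  # flatten
p3 = flat * 1 + 1                      # logits: one output unit plus bias
total = p1 + p2 + p3
print(shape, flat, (p1, p2, p3), total)
```

Flatten and Dropout add no parameters, so the total is the sum of the two convolution layers and the final dense layer.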

We use KerasSequentialClassifierTrainBatchOp to train the model and save it to the file MODEL_CNN_FILE. The corresponding code is as follows:

    # Train only when no saved model file exists yet (requires `import os`).
    if not os.path.exists(DATA_DIR + MODEL_CNN_FILE):
        train_set\
            .link(
                KerasSequentialClassifierTrainBatchOp()\
                    .setTensorCol("tensor")\
                    .setLabelCol("label")\
                    .setLayers([
                        "Conv2D(32, kernel_size=(3, 3), activation='relu')",
                        "MaxPooling2D(pool_size=(2, 2))",
                        "Conv2D(64, kernel_size=(3, 3), activation='relu')",
                        "MaxPooling2D(pool_size=(2, 2))",
                        "Flatten()",
                        "Dropout(0.5)"
                    ])\
                    .setNumEpochs(50)\
                    .setSaveCheckpointsEpochs(2.0)\
                    .setValidationSplit(0.1)\
                    .setSaveBestOnly(True)\
                    .setBestMetric("auc")\
            )\
            .link(
                AkSinkBatchOp()\
                    .setFilePath(DATA_DIR + MODEL_CNN_FILE)\
            )
        BatchOperator.execute()


Next, we load the trained model, run predictions on the test set, and evaluate the binary classification results.

    KerasSequentialClassifierPredictBatchOp()\
        .setPredictionCol(PREDICTION_COL)\
        .setPredictionDetailCol(PREDICTION_DETAIL_COL)\
        .setReservedCols(["relative_path", "label"])\
        .linkFrom(
            AkSourceBatchOp().setFilePath(DATA_DIR + MODEL_CNN_FILE),
            test_set
        )\
        .lazyPrint(10)\
        .lazyPrintStatistics()\
        .link(
            EvalBinaryClassBatchOp()\
                .setLabelCol("label")\
                .setPredictionDetailCol(PREDICTION_DETAIL_COL)\
                .lazyPrintMetrics()
        )
    BatchOperator.execute()

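The evaluation results are shown below and are clearly better than those of the logistic regression model: accuracy rises from 0.6164 to 0.8672 and AUC from 0.6496 to 0.951. Because training time should not be too long in this experiment, the number of training epochs is set to 50; readers who want a better model can adjust the training parameters. In addition, the next section introduces the use of pre-trained models, which can help us achieve better results in less time.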

Summary: 
|      colName|count|missing|sum|mean|variance|min|max|
|-------------|-----|-------|---|----|--------|---|---|
|relative_path| 2500|      0|NaN| NaN|     NaN|NaN|NaN|
|        label| 2500|      0|NaN| NaN|     NaN|NaN|NaN|
|         pred| 2500|      0|NaN| NaN|     NaN|NaN|NaN|
|    pred_info| 2500|      0|NaN| NaN|     NaN|NaN|NaN|

-------------------------------- Metrics: --------------------------------
Auc:0.951	Accuracy:0.8672	Precision:0.9057	Recall:0.812	F1:0.8563	LogLoss:0.3023
|Pred\Real|dog| cat|
|---------|---|----|
|      dog|989| 103|
|      cat|229|1179|