Components covered in this document

Grouped DBSCAN Model (GroupDbscanModelBatchOp)

Java class name: com.alibaba.alink.operator.batch.clustering.GroupDbscanModelBatchOp

Python class name: GroupDbscanModelBatchOp

Description

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical methods, it defines a cluster as the maximal set of density-connected points, groups regions of sufficiently high density into clusters, and can discover clusters of arbitrary shape in spatial data that contains noise.
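GroupDbscanModelBatchOp runs this algorithm independently within each group defined by the grouping columns (see the parameters below). To make the roles of the two key parameters concrete, the following minimal plain-Python sketch shows the DBSCAN neighborhood-expansion idea. It is illustrative only, not Alink's implementation; the names region_query, epsilon, and min_points are chosen for this sketch.

import math

def region_query(points, i, epsilon):
    # All indices whose Euclidean distance to points[i] is at most epsilon
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= epsilon]

def dbscan(points, epsilon, min_points):
    # Returns one label per point; -1 marks noise
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, epsilon)
        if len(neighbors) < min_points:
            labels[i] = -1          # not a core point; provisionally noise
            continue
        cluster_id += 1             # start a new cluster from this core point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:     # noise reached from a core point becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, epsilon)
            if len(j_neighbors) >= min_points:   # expand only through core points
                seeds.extend(j_neighbors)
    return labels

# Two dense regions plus one isolated point
points = [(2.0, 3.0), (2.1, 3.1), (2.2, 3.2), (200.1, 300.1), (200.2, 300.2), (50.0, 50.0)]
print(dbscan(points, epsilon=0.6, min_points=2))   # e.g. [0, 0, 0, 1, 1, -1]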

Distance Metrics

| Value | Description |
| --- | --- |
| EUCLIDEAN | Euclidean distance |
| COSINE | Cosine distance (of the angle between vectors) |
| CITYBLOCK | City-block distance, also known as Manhattan distance |
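For reference, the three metrics above can be computed for two feature vectors x and y as in the sketch below. It is plain Python, independent of Alink's internals, and the cosine variant assumes the common definition 1 - cosine similarity.

import math

def euclidean(x, y):
    # EUCLIDEAN: straight-line distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    # COSINE: assumed here to be 1 - cosine similarity of the angle between x and y
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms

def cityblock(x, y):
    # CITYBLOCK (Manhattan): sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean((2.0, 3.0), (2.1, 3.1)))   # ~0.141
print(cityblock((2.0, 3.0), (2.1, 3.1)))   # ~0.2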

Parameters

| Name | Description | Type | Required? | Valid Values | Default |
| --- | --- | --- | --- | --- | --- |
| epsilon | Neighborhood distance threshold | Double | | | |
| featureCols | Names of the feature columns | String[] | Required | Selected columns must be of type [BIGDECIMAL, BIGINTEGER, BYTE, DOUBLE, FLOAT, INTEGER, LONG, SHORT] | |
| groupCols | Names of the grouping columns (multiple columns allowed) | String[] | Required | | |
| minPoints | Threshold on the number of samples in a point's neighborhood | Integer | | | |
| predictionCol | Name of the prediction result column | String | | | |
| distanceType | Distance measure used for clustering | String | | "EUCLIDEAN", "COSINE", "CITYBLOCK", "HAVERSINE", "JACCARD" | "EUCLIDEAN" |
| groupMaxSamples | Maximum number of samples per group | Integer | | | 2147483647 |
| skip | Whether to skip a group when it exceeds the maximum sample count | Boolean | | | false |
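The group-related parameters control how each group is handled: DBSCAN runs once per distinct combination of the grouping columns, and groupMaxSamples together with skip decides what happens to oversized groups. A configuration sketch follows; the last three setter names (setDistanceType, setGroupMaxSamples, setSkip) are assumed from the parameter names above via Alink's usual paramName-to-setParamName convention and should be checked against the Alink version in use.

from pyalink.alink import *

# Sketch only: the last three setters are assumed from the parameter names above.
op = GroupDbscanModelBatchOp()\
    .setGroupCols(["group"])\
    .setFeatureCols(["c1", "c2"])\
    .setMinPoints(4)\
    .setEpsilon(0.6)\
    .setDistanceType("EUCLIDEAN")\
    .setGroupMaxSamples(10000)\
    .setSkip(True)   # skip (rather than cluster) any group with more than 10000 samples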

Code Examples

Python Code

from pyalink.alink import *

import pandas as pd

useLocalEnv(1)

df = pd.DataFrame([
    [0, "id_1", 2.0, 3.0],
    [0, "id_2", 2.1, 3.1],
    [0, "id_18", 2.4, 3.2],
    [0, "id_15", 2.8, 3.2],
    [0, "id_12", 2.1, 3.1],
    [0, "id_3", 200.1, 300.1],
    [0, "id_4", 200.2, 300.2],
    [0, "id_8", 200.6, 300.6],

    [1, "id_5", 200.3, 300.3],
    [1, "id_6", 200.4, 300.4],
    [1, "id_7", 200.5, 300.5],
    [1, "id_16", 300., 300.2],
    [1, "id_9", 2.1, 3.1],
    [1, "id_10", 2.2, 3.2],
    [1, "id_11", 2.3, 3.3],
    [1, "id_13", 2.4, 3.4],
    [1, "id_14", 2.5, 3.5],
    [1, "id_17", 2.6, 3.6],
    [1, "id_19", 2.7, 3.7],
    [1, "id_20", 2.8, 3.8],
    [1, "id_21", 2.9, 3.9],

    [2, "id_20", 2.8, 3.8]])

source = BatchOperator.fromDataframe(df, schemaStr='group string, id string, c1 double, c2 double')

groupDbscan = GroupDbscanModelBatchOp()\
    .setGroupCols(["group"])\
    .setFeatureCols(["c1", "c2"])\
    .setMinPoints(4)\
    .setEpsilon(0.6)\
    .linkFrom(source)

groupDbscan.print()

Java Code

import org.apache.flink.types.Row;

import com.alibaba.alink.operator.batch.BatchOperator;
import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
import com.alibaba.alink.operator.batch.clustering.GroupDbscanModelBatchOp;

import org.junit.Test;

import java.util.Arrays;
import java.util.List;

public class GroupDbscanModelBatchOpTest {

	@Test
	public void testGroupDbscanModelBatchOp() throws Exception {
		List<Row> trainData = Arrays.asList(
			Row.of(0, "id_1", 2.0, 3.0),
			Row.of(0, "id_2", 2.1, 3.1),
			Row.of(0, "id_18", 2.4, 3.2),
			Row.of(0, "id_15", 2.8, 3.2),
			Row.of(0, "id_12", 2.1, 3.1),
			Row.of(0, "id_3", 200.1, 300.1),
			Row.of(0, "id_4", 200.2, 300.2),
			Row.of(0, "id_8", 200.6, 300.6),

			Row.of(1, "id_5", 200.3, 300.3),
			Row.of(1, "id_6", 200.4, 300.4),
			Row.of(1, "id_7", 200.5, 300.5),
			Row.of(1, "id_16", 300., 300.2),
			Row.of(1, "id_9", 2.1, 3.1),
			Row.of(1, "id_10", 2.2, 3.2),
			Row.of(1, "id_11", 2.3, 3.3),
			Row.of(1, "id_13", 2.4, 3.4),
			Row.of(1, "id_14", 2.5, 3.5),
			Row.of(1, "id_17", 2.6, 3.6),
			Row.of(1, "id_19", 2.7, 3.7),
			Row.of(1, "id_20", 2.8, 3.8),
			Row.of(1, "id_21", 2.9, 3.9),

			Row.of(2, "id_20", 2.8, 3.8)
		);

		MemSourceBatchOp inputOp = new MemSourceBatchOp(trainData,
			new String[] {"group", "id", "c1", "c2"});
		GroupDbscanModelBatchOp op = new GroupDbscanModelBatchOp()
			.setGroupCols("group")
			.setFeatureCols("c1", "c2")
			.setMinPoints(4)
			.setEpsilon(0.6)
			.linkFrom(inputOp);
		op.print();
	}
}

Results

| group | cluster_id | count | c1 | c2 |
| --- | --- | --- | --- | --- |
| 1 | 0 | 9 | 2.5000 | 3.5000 |
| 0 | 0 | 5 | 2.2800 | 3.1200 |

Each row describes one cluster found within a group: count is the number of samples assigned to it, and c1, c2 are its center coordinates. Points that do not satisfy the minPoints/epsilon density requirement (the sparse points in groups 0 and 1, and the single point in group 2) are treated as noise and do not appear in the output.
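As a quick sanity check on the output above, the center reported for group 0 equals the mean of the five group-0 points in the dense region around (2, 3), which suggests c1 and c2 are simply the per-cluster means of the feature columns:

# The five group-0 points that form the dense region around (2, 3)
cluster = [(2.0, 3.0), (2.1, 3.1), (2.4, 3.2), (2.8, 3.2), (2.1, 3.1)]
c1 = sum(p[0] for p in cluster) / len(cluster)
c2 = sum(p[1] for p in cluster) / len(cluster)
print(round(c1, 4), round(c2, 4))   # 2.28 3.12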