Alink Tutorial (Java Edition)

Section 31.4 Viewing the Embeddings


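The three Embedding algorithms introduced earlier all produce models in the same data format. An Embedding model is a data table with two columns: the first column, "node", holds the ID or name of a node, and the second column, "vec", holds the corresponding vector. Below we use Alink's data display and statistics features to get a more concrete picture of the Embedding models. The code is as follows: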

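// Inspect each of the three saved Embedding models in turn.
for (String embedding_model_file :
	new String[] {DEEPWALK_EMBEDDING, NODE2VEC_EMBEDDING, METAPATH2VEC_EMBEDDING}
) {

	System.out.println("\n\n< " + embedding_model_file + " >\n");

	new AkSourceBatchOp()
		.setFilePath(DATA_DIR + embedding_model_file)
		// Print the first 3 rows and the table-level statistics.
		.lazyPrint(3)
		.lazyPrintStatistics()
		.link(
			// Per-component statistics of the "vec" column.
			new VectorSummarizerBatchOp()
				.setSelectedCol("vec")
				.lazyPrintVectorSummary()
		);
	// Trigger the job so that the lazy print operations above produce output.
	BatchOperator.execute();
}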

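The output is shown below. Because the vectors have 100 dimensions, the full output is too long to display, so only some of the vector components are shown here; likewise, the statistics for most of the components are omitted.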

< deepwalk_embedding.ak >

node|vec
----|---
A113014|0.20347502827644348 0.20719735324382782 0.15886491537094116 …
A130208|0.10066881775856018 -0.018122194334864616 0.07961387932300568 …
A115857|0.23014286160469055 0.1857871264219284 0.04826408624649048 …

Summary: 
|colName| count|missing|sum|mean|variance|min|max|
|-------|------|-------|---|----|--------|---|---|
|   node|134060|      0|NaN| NaN|     NaN|NaN|NaN|
|    vec|134060|      0|NaN| NaN|     NaN|NaN|NaN|

DenseVectorSummary:
| id| count|        sum|   mean|variance|stdDev|    min|   max|    normL1|  normL2|
|---|------|-----------|-------|--------|------|-------|------|----------|--------|
|  0|134060| 20265.8801| 0.1512|  0.0081|0.0897|-1.4333|2.3575|20505.1363| 64.3648|
|  1|134060|  9384.3044|   0.07|  0.0133|0.1155|-1.9001|2.4042|12489.9207| 49.4474|
|  2|134060|   4222.792| 0.0315|  0.0091|0.0954|-1.5937|2.5049| 9220.0963| 36.7889|
……
| 98|134060| 11835.2852| 0.0883|  0.0062| 0.079|-1.4292|2.7774|12684.5801| 43.3693|
| 99|134060| 23962.9638| 0.1787|  0.0098| 0.099|-2.8285|1.9443|24136.9401| 74.8116|


< node2vec_embedding.ak >

node|vec
----|---
A113711|-0.15412482619285583 -0.08229376375675201 -0.02651112899184227 …
A123879|0.0010191870387643576 -0.04368223249912262 -0.05337075889110565 …
A116726|-0.06363832205533981 -0.09956340491771698 -0.032154712826013565 …

Summary: 
|colName| count|missing|sum|mean|variance|min|max|
|-------|------|-------|---|----|--------|---|---|
|   node|134060|      0|NaN| NaN|     NaN|NaN|NaN|
|    vec|134060|      0|NaN| NaN|     NaN|NaN|NaN|

DenseVectorSummary:
| id| count|        sum|   mean|variance|stdDev|    min|   max|    normL1|  normL2|
|---|------|-----------|-------|--------|------|-------|------|----------|--------|
|  0|134060|-13715.1514|-0.1023|  0.0126|0.1125|-2.4012|2.1025| 15797.808| 55.6677|
|  1|134060| -6800.0121|-0.0507|  0.0052| 0.072|-2.1189|1.8698| 8728.7526| 32.2599|
|  2|134060|  -8679.904|-0.0647|  0.0095|0.0974|-2.3427|1.7201|11836.4094| 42.8284|
……
| 98|134060|  5132.3967| 0.0383|  0.0183|0.1351|-2.3089| 1.924| 13858.637| 51.4196|
| 99|134060|  6438.4191|  0.048|  0.0064|0.0798|-1.4985|2.0007| 8678.3764| 34.1042|


< metapath2vec_embedding.ak >

node|vec
----|---
A121637|0.0186984371393919 -0.09224890917539597 -0.08401679992675781 …
A15146|0.13034813106060028 -0.2506742477416992 -0.22918404638767242 …
A116936|-0.0010067853145301342 -0.0728369653224945 -0.22371923923492432 …

Summary: 
|colName| count|missing|sum|mean|variance|min|max|
|-------|------|-------|---|----|--------|---|---|
|   node|134056|      0|NaN| NaN|     NaN|NaN|NaN|
|    vec|134056|      0|NaN| NaN|     NaN|NaN|NaN|

DenseVectorSummary:
| id| count|        sum|   mean|variance|stdDev|    min|   max|    normL1|  normL2|
|---|------|-----------|-------|--------|------|-------|------|----------|--------|
|  0|134056|  4744.9496| 0.0354|  0.0041|0.0639| -1.736|1.8371| 7348.2225| 26.7527|
|  1|134056|-21129.0538|-0.1576|  0.0104| 0.102| -1.951|1.4241|21787.3009| 68.7476|
|  2|134056|-20525.5559|-0.1531|  0.0075|0.0868|-2.3799|1.5316|20734.5735| 64.4387|
……
| 98|134056|  6953.8466| 0.0519|  0.0043|0.0656|-1.8078|2.5805| 8156.3795| 30.6233|
| 99|134056| 32186.9595| 0.2401|  0.0083|0.0909| -0.952|1.9069|32207.9876| 93.9998|

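The statistics show that the Embedding vectors produced by the three algorithms are similar component by component. In the rows shown, the means are small in magnitude (between about -0.16 and 0.24), the standard deviations are also small (about 0.06 to 0.14), and the minimum and maximum values span comparable ranges (roughly -2.9 to 2.8).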
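
To make the columns of the DenseVectorSummary table more concrete, the following is a minimal plain-Java sketch of how such per-component statistics could be computed from "vec" strings. It is not Alink's implementation: the class name `VecSummarySketch` and the methods `parseDenseVector` and `summarize` are hypothetical, the input vectors are toy values rather than data from the Embedding models, and it assumes that "vec" is a space-separated list of numbers (as in the printed rows above) and that the variance uses the n-1 divisor. With about 134,000 rows, the choice between n and n-1 makes no visible difference at four decimal places.

```java
import java.util.Arrays;
import java.util.List;

class VecSummarySketch {

    /** Parses a space-separated dense vector string such as "0.2 -0.1 0.05". */
    static double[] parseDenseVector(String vec) {
        String[] parts = vec.trim().split("\\s+");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return values;
    }

    /**
     * Computes per-component statistics over all vectors.
     * Row j describes component j; the columns are
     * sum, mean, variance, stdDev, min, max, normL1, normL2.
     */
    static double[][] summarize(List<double[]> vectors) {
        int n = vectors.size();
        int dim = vectors.get(0).length;
        double[][] stats = new double[dim][];
        for (int j = 0; j < dim; j++) {
            double sum = 0.0, sumSq = 0.0, sumAbs = 0.0;
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] v : vectors) {
                double x = v[j];
                sum += x;
                sumSq += x * x;
                sumAbs += Math.abs(x);
                min = Math.min(min, x);
                max = Math.max(max, x);
            }
            double mean = sum / n;
            // Assumption: sample variance with the n-1 divisor.
            double variance = n > 1 ? Math.max(0.0, (sumSq - sum * mean) / (n - 1)) : 0.0;
            stats[j] = new double[] {
                sum, mean, variance, Math.sqrt(variance), min, max, sumAbs, Math.sqrt(sumSq)
            };
        }
        return stats;
    }

    public static void main(String[] args) {
        // Toy vectors for illustration only.
        List<double[]> vectors = Arrays.asList(
            parseDenseVector("1.0 -2.0"),
            parseDenseVector("3.0 0.0"),
            parseDenseVector("5.0 2.0"));
        double[][] stats = summarize(vectors);
        for (int j = 0; j < stats.length; j++) {
            System.out.println(j + ": " + Arrays.toString(stats[j]));
        }
    }
}
```

Read this way, normL1 is the sum of the absolute values of one component over all nodes and normL2 is the square root of the sum of its squares. This matches the table above, where normL1 is never smaller than the absolute value of sum.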