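The three Embedding algorithms introduced earlier all produce models in the same data format. An Embedding model is a data table with two columns: the first, "node", holds the node ID or name, and the second, "vec", holds the corresponding vector. Below we use Alink's data display and statistics functions to get a more concrete picture of the Embedding models. The code is as follows: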
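// Inspect each of the three saved Embedding models in turn.
for (String embedding_model_file :
    new String[] {DEEPWALK_EMBEDDING, NODE2VEC_EMBEDDING, METAPATH2VEC_EMBEDDING}) {

    System.out.println("\n\n< " + embedding_model_file + " >\n");

    new AkSourceBatchOp()
        .setFilePath(DATA_DIR + embedding_model_file)
        .lazyPrint(3)              // show the first 3 rows
        .lazyPrintStatistics()     // table-level column statistics
        .link(
            new VectorSummarizerBatchOp()
                .setSelectedCol("vec")
                .lazyPrintVectorSummary()  // per-component vector statistics
        );

    // Run the job; the lazy print operations above produce their output here.
    BatchOperator.execute();
}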
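The output is shown below. Because the vectors have 100 dimensions and the full output is too long, only some components of each vector are shown; likewise, most of the per-component statistics are omitted.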
< deepwalk_embedding.ak >

node|vec
----|---
A113014|0.20347502827644348 0.20719735324382782 0.15886491537094116 …
A130208|0.10066881775856018 -0.018122194334864616 0.07961387932300568 …
A115857|0.23014286160469055 0.1857871264219284 0.04826408624649048 …

Summary:
|colName| count|missing|sum|mean|variance|min|max|
|-------|------|-------|---|----|--------|---|---|
| node|134060| 0|NaN| NaN| NaN|NaN|NaN|
| vec|134060| 0|NaN| NaN| NaN|NaN|NaN|

DenseVectorSummary:
| id| count| sum| mean|variance|stdDev| min| max| normL1| normL2|
|---|------|-----------|-------|--------|------|-------|------|----------|--------|
| 0|134060| 20265.8801| 0.1512| 0.0081|0.0897|-1.4333|2.3575|20505.1363| 64.3648|
| 1|134060| 9384.3044| 0.07| 0.0133|0.1155|-1.9001|2.4042|12489.9207| 49.4474|
| 2|134060| 4222.792| 0.0315| 0.0091|0.0954|-1.5937|2.5049| 9220.0963| 36.7889|
……
| 98|134060| 11835.2852| 0.0883| 0.0062| 0.079|-1.4292|2.7774|12684.5801| 43.3693|
| 99|134060| 23962.9638| 0.1787| 0.0098| 0.099|-2.8285|1.9443|24136.9401| 74.8116|

< node2vec_embedding.ak >

node|vec
----|---
A113711|-0.15412482619285583 -0.08229376375675201 -0.02651112899184227 …
A123879|0.0010191870387643576 -0.04368223249912262 -0.05337075889110565 …
A116726|-0.06363832205533981 -0.09956340491771698 -0.032154712826013565 …

Summary:
|colName| count|missing|sum|mean|variance|min|max|
|-------|------|-------|---|----|--------|---|---|
| node|134060| 0|NaN| NaN| NaN|NaN|NaN|
| vec|134060| 0|NaN| NaN| NaN|NaN|NaN|

DenseVectorSummary:
| id| count| sum| mean|variance|stdDev| min| max| normL1| normL2|
|---|------|-----------|-------|--------|------|-------|------|----------|--------|
| 0|134060|-13715.1514|-0.1023| 0.0126|0.1125|-2.4012|2.1025| 15797.808| 55.6677|
| 1|134060| -6800.0121|-0.0507| 0.0052| 0.072|-2.1189|1.8698| 8728.7526| 32.2599|
| 2|134060| -8679.904|-0.0647| 0.0095|0.0974|-2.3427|1.7201|11836.4094| 42.8284|
……
| 98|134060| 5132.3967| 0.0383| 0.0183|0.1351|-2.3089| 1.924| 13858.637| 51.4196|
| 99|134060| 6438.4191| 0.048| 0.0064|0.0798|-1.4985|2.0007| 8678.3764| 34.1042|

< metapath2vec_embedding.ak >

node|vec
----|---
A121637|0.0186984371393919 -0.09224890917539597 -0.08401679992675781 …
A15146|0.13034813106060028 -0.2506742477416992 -0.22918404638767242 …
A116936|-0.0010067853145301342 -0.0728369653224945 -0.22371923923492432 …

Summary:
|colName| count|missing|sum|mean|variance|min|max|
|-------|------|-------|---|----|--------|---|---|
| node|134056| 0|NaN| NaN| NaN|NaN|NaN|
| vec|134056| 0|NaN| NaN| NaN|NaN|NaN|

DenseVectorSummary:
| id| count| sum| mean|variance|stdDev| min| max| normL1| normL2|
|---|------|-----------|-------|--------|------|-------|------|----------|--------|
| 0|134056| 4744.9496| 0.0354| 0.0041|0.0639| -1.736|1.8371| 7348.2225| 26.7527|
| 1|134056|-21129.0538|-0.1576| 0.0104| 0.102| -1.951|1.4241|21787.3009| 68.7476|
| 2|134056|-20525.5559|-0.1531| 0.0075|0.0868|-2.3799|1.5316|20734.5735| 64.4387|
……
| 98|134056| 6953.8466| 0.0519| 0.0043|0.0656|-1.8078|2.5805| 8156.3795| 30.6233|
| 99|134056| 32186.9595| 0.2401| 0.0083|0.0909| -0.952|1.9069|32207.9876| 93.9998|
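The statistics show that the Embedding vectors produced by the three algorithms are distributed similarly across components. Among the components shown, the means are all small in magnitude (between about -0.16 and 0.24), the standard deviations lie between roughly 0.06 and 0.14, and the minimum and maximum values all fall within about ±3.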
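To make the columns of the DenseVectorSummary table more concrete, the following is a minimal plain-Java sketch of how such per-component statistics could be computed from space-separated "vec" strings. It is not Alink's implementation: the class `VectorColumnSummary` and its methods are hypothetical names introduced here, and the use of the sample variance (dividing by n - 1) is an assumption. Reading normL1 as the sum of absolute values and normL2 as the square root of the sum of squares is consistent with the figures printed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Alink.
public class VectorColumnSummary {

    // Order of the values in each row returned by summarize().
    static final String[] STAT_NAMES =
        {"count", "sum", "mean", "variance", "stdDev", "min", "max", "normL1", "normL2"};

    // Parses a space-separated dense vector such as "0.1 -0.2 0.3".
    static double[] parseDenseVector(String text) {
        String[] parts = text.trim().split("\\s+");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return values;
    }

    // Returns one row of statistics per vector component, in STAT_NAMES order.
    static double[][] summarize(List<double[]> vectors) {
        if (vectors.isEmpty()) {
            return new double[0][];
        }
        int n = vectors.size();
        int dim = vectors.get(0).length;
        double[] sum = new double[dim];
        double[] sumSq = new double[dim];
        double[] absSum = new double[dim];
        double[] min = new double[dim];
        double[] max = new double[dim];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // Single pass: accumulate the running totals for every component.
        for (double[] v : vectors) {
            if (v.length != dim) {
                throw new IllegalArgumentException("all vectors must have the same dimension");
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
                sumSq[i] += v[i] * v[i];
                absSum[i] += Math.abs(v[i]);
                min[i] = Math.min(min[i], v[i]);
                max[i] = Math.max(max[i], v[i]);
            }
        }

        double[][] stats = new double[dim][];
        for (int i = 0; i < dim; i++) {
            double mean = sum[i] / n;
            // Assumption: sample variance (n - 1); clamped at 0 against rounding error.
            double variance = n > 1 ? Math.max(0.0, (sumSq[i] - sum[i] * mean) / (n - 1)) : 0.0;
            stats[i] = new double[] {
                n, sum[i], mean, variance, Math.sqrt(variance),
                min[i], max[i], absSum[i], Math.sqrt(sumSq[i])
            };
        }
        return stats;
    }

    public static void main(String[] args) {
        // First three components of the three deepwalk rows printed above.
        List<double[]> vectors = new ArrayList<>();
        vectors.add(parseDenseVector("0.20347502827644348 0.20719735324382782 0.15886491537094116"));
        vectors.add(parseDenseVector("0.10066881775856018 -0.018122194334864616 0.07961387932300568"));
        vectors.add(parseDenseVector("0.23014286160469055 0.1857871264219284 0.04826408624649048"));

        double[][] stats = summarize(vectors);
        System.out.println("id " + String.join(" ", STAT_NAMES));
        for (int i = 0; i < stats.length; i++) {
            StringBuilder row = new StringBuilder(String.valueOf(i));
            for (double value : stats[i]) {
                row.append(' ').append(String.format("%.4f", value));
            }
            System.out.println(row);
        }
    }
}
```

Because only three rows and three components are used, the printed numbers will not match the table above, which summarizes all rows of each model; the sketch only illustrates what each column measures.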