Alink Tutorial (Java Edition)

Section 31.2 Sample Data

The Database and Information System (DBIS) dataset covers 464 conferences, the top 5000 authors, and the corresponding 72,902 papers. For more about the dataset and its download link, see

https://ericdongyx.github.io/metapath2vec/m2v.html

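We will also use an author label dataset, which can likewise be found and downloaded through the link above. It is listed there as "GS-Labeled results for AMiner Data", and the downloaded file is named label.zip. It labels each of 246,678 authors with one of 8 research areas. The 8 areas are: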

1. Computing Systems

2. Theoretical Computer Science

3. Computer Networks & Wireless Communication

4. Computer Graphics

5. Human Computer Interaction

6. Computational Linguistics

7. Computer Vision & Pattern Recognition

8. Databases & Information Systems

After net_dbis.zip is downloaded and unzipped, it yields 5 files. Their names and contents are as follows:

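File name|Description
---------|-----------
id_author.txt|Two columns, author ID and name, separated by '\t'
id_conf.txt|Two columns, conference ID and name, separated by '\t'
paper_author.txt|Two columns, paper ID and author ID, separated by '\t'
paper_conf.txt|Two columns, paper ID and conference ID, separated by '\t'
paper.txt|Two columns, paper ID and title; the paper ID is padded with trailing spaces, and the title starts at the 13th character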
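
Unlike the other four files, paper.txt is fixed-width rather than tab-separated, so it cannot be split on a delimiter. As a rough illustration of that layout, here is a minimal plain-Java sketch that cuts one line at the 12-character boundary. The class name `PaperLineParser`, the constant `ID_WIDTH`, and the sample line are hypothetical and are not taken from the dataset or from Alink.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class PaperLineParser {
    // Assumption from the file description: the ID field fills the first 12 characters.
    static final int ID_WIDTH = 12;

    /** Splits one fixed-width line into (paper ID, title). */
    static Map.Entry<String, String> parse(String line) {
        if (line.length() < ID_WIDTH) {
            // Short line: treat everything as the ID and leave the title empty.
            return new SimpleEntry<>(line.trim(), "");
        }
        String id = line.substring(0, ID_WIDTH).trim(); // drop the space padding
        String title = line.substring(ID_WIDTH);        // 13th character onward
        return new SimpleEntry<>(id, title);
    }

    public static void main(String[] args) {
        // Made-up sample line in the described layout.
        String line = String.format("%-12s%s", "42", "A Sample Title");
        Map.Entry<String, String> parsed = parse(line);
        System.out.println(parsed.getKey() + "|" + parsed.getValue());
    }
}
```

Note that Java's `substring` uses 0-based offsets, so the boundary appears as index 12 here, whereas SQL's `SUBSTRING` is 1-based and expresses the same cut as position 13.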

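Later, we will treat authors, conferences, and papers as nodes, and the paper-author and paper-conference associations as edges, which together form a graph. This requires an ID that is unique across all graph nodes. We take a simple approach: prepend the prefixes A, C, and P to the original IDs of authors, conferences, and papers respectively, which yields unique node IDs. The code for reading the data and unifying the IDs is shown below. The basic idea is to read the raw data with Alink Source components and then process it with SQL built-in functions through the select method.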

// Paper-author edges: prefix the IDs with 'P' and 'A'.
static BatchOperator <?> paper_author = new CsvSourceBatchOp()
	.setFilePath(DATA_DIR + "paper_author.txt")
	.setSchemaStr("paper_id string, author_id string")
	.setFieldDelimiter("\t")
	.select("CONCAT('P', paper_id) AS paper_id, CONCAT('A', author_id) AS author_id");

// Paper-conference edges: prefix the IDs with 'P' and 'C'.
static BatchOperator <?> paper_conf = new CsvSourceBatchOp()
	.setFilePath(DATA_DIR + "paper_conf.txt")
	.setSchemaStr("paper_id string, conf_id string")
	.setFieldDelimiter("\t")
	.select("CONCAT('P', paper_id) AS paper_id, CONCAT('C', conf_id) AS conf_id");

// Author nodes: prefixed ID plus the author name.
static BatchOperator <?> id_author = new CsvSourceBatchOp()
	.setFilePath(DATA_DIR + "id_author.txt")
	.setSchemaStr("author_id string, author string")
	.setFieldDelimiter("\t")
	.select("CONCAT('A', author_id) AS author_id, author");

// Conference nodes: prefixed ID plus the conference name.
static BatchOperator <?> id_conf = new CsvSourceBatchOp()
	.setFilePath(DATA_DIR + "id_conf.txt")
	.setSchemaStr("conf_id string, conf string")
	.setFieldDelimiter("\t")
	.select("CONCAT('C', conf_id) AS conf_id, conf");

// Paper nodes: paper.txt is fixed-width, so read each line as text and cut it.
// Characters 1-12 hold the space-padded ID; the title starts at character 13.
static BatchOperator <?> paper = new TextSourceBatchOp()
	.setFilePath(DATA_DIR + "paper.txt")
	.select("CONCAT('P', TRIM(SUBSTRING(text FROM 1 FOR 12))) AS paper_id, "
		+ "SUBSTRING(text FROM 13) AS paper_name");
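
To see why the type prefix matters, consider what happens when an author and a paper share the same raw number. The following plain-Java sketch is independent of Alink and uses made-up IDs. The class `NodeIdDemo` and its helpers `nodeId` and `edges` are hypothetical names introduced only for illustration. It mimics the prefixing done by the select calls above and counts the resulting distinct nodes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NodeIdDemo {
    /** Builds a graph-wide node ID by prepending a one-letter type prefix. */
    static String nodeId(char typePrefix, String rawId) {
        return typePrefix + rawId;
    }

    /** Turns (paperId, otherId) pairs into edges between prefixed node IDs. */
    static List<String[]> edges(String[][] pairs, char otherPrefix) {
        List<String[]> result = new ArrayList<>();
        for (String[] pair : pairs) {
            result.add(new String[] {nodeId('P', pair[0]), nodeId(otherPrefix, pair[1])});
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up raw IDs: paper 7 and author 7 share the same number.
        String[][] paperAuthor = {{"7", "7"}, {"7", "8"}};
        String[][] paperConf = {{"7", "3"}};

        List<String[]> all = new ArrayList<>(edges(paperAuthor, 'A'));
        all.addAll(edges(paperConf, 'C'));

        Set<String> nodes = new HashSet<>();
        for (String[] e : all) {
            nodes.add(e[0]);
            nodes.add(e[1]);
            System.out.println(e[0] + " -- " + e[1]);
        }
        // Without prefixes, paper 7 and author 7 would collapse into one node.
        System.out.println("distinct nodes: " + nodes.size());
    }
}
```

With the prefixes in place, the sample yields four distinct nodes (P7, A7, A8, C3). If the raw IDs were used directly, paper 7 and author 7 would be indistinguishable.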


To get a more intuitive feel for the data, we print a few rows from each table with the following code:

paper_author.lazyPrint(3, "< paper_author >");
paper_conf.lazyPrint(3, "< paper_conf >");
id_author.lazyPrint(3, "< id_author >");
id_conf.lazyPrint(3, "< id_conf >");
paper.lazyPrint(3, "< paper >");

BatchOperator.execute();

The output is as follows:

< paper_author >
paper_id|author_id
--------|---------
P128744|A55668
P128744|A55670
P128744|A82448
< paper_conf >
paper_id|conf_id
--------|-------
P868583|C4524
P868584|C4524
P868585|C4524
< id_author >
author_id|author
---------|------
A568253|aLourdesY.Collantes
A363709|aJohnR.Rumble
A363708|aRichardA.Johnson
< id_conf >
conf_id|conf
-------|----
C4789|vVLDBJ.
C3258|vISWCWorkshoponTrust,Security,andReputationontheSemanticWeb
C4149|vIWRIDL
< paper >
paper_id|paper_name
--------|----------
P869186|Designing and Writing Online Documentation: Hypermedia for Self-Supporting Products, Second Edition, by William Horton.
P869187|Designer Selves: Construction of Technologically Mediated Identity within Graphical, Multiuser Virtual Environments.
P869188|Dissipative Structure Theory, Synergetics, and Their Implications for the Management of Information Systems.