Section 31.2 Sample Data

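The Database and Information System (DBIS) dataset covers 464 conferences, the top 5,000 authors, and the corresponding 72,902 papers. For more details about the dataset and the download link, see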

https://ericdongyx.github.io/metapath2vec/m2v.html

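We will also use an author-label dataset, which can be found and downloaded from the same link. It is listed there as "GS-Labeled results for AMiner Data"; the downloaded file is named label.zip, and it labels each of 246,678 authors with one of 8 fields. The 8 fields are: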

1. Computing Systems

2. Theoretical Computer Science

3. Computer Networks & Wireless Communication

4. Computer Graphics

5. Human Computer Interaction

6. Computational Linguistics

7. Computer Vision & Pattern Recognition

8. Databases & Information Systems

After net_dbis.zip is downloaded and unzipped, it contains 5 files. Their names and contents are as follows:

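File name	Contents
id_author.txt	Two columns: author ID and author name, separated by '\t'
id_conf.txt	Two columns: conference ID and conference name, separated by '\t'
paper_author.txt	Two columns: paper ID and author ID, separated by '\t'
paper_conf.txt	Two columns: paper ID and conference ID, separated by '\t'
paper.txt	Two columns: paper ID and title; the paper ID is padded with trailing spaces, and the title starts at the 13th character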

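Later on, we will treat authors, conferences, and papers as nodes, and the paper-author and paper-conference associations as edges, which together form the graph structure. This requires a unified ID at the graph-node level. Here we use a simple approach: we prepend the prefixes A, C, and P to the original IDs of authors, conferences, and papers, respectively, so that every node ID is unique. The code for reading the data and unifying the IDs is shown below. The basic idea is to read the raw data with Alink source components and then process it through the select method, using built-in SQL functions.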

static BatchOperator <?> paper_author = new CsvSourceBatchOp()
	.setFilePath(DATA_DIR + "paper_author.txt")
	.setSchemaStr("paper_id string, author_id string")
	.setFieldDelimiter("\t")
	.select("CONCAT('P', paper_id) AS paper_id, CONCAT('A', author_id) AS author_id");

static BatchOperator <?> paper_conf = new CsvSourceBatchOp()
	.setFilePath(DATA_DIR + "paper_conf.txt")
	.setSchemaStr("paper_id string, conf_id string")
	.setFieldDelimiter("\t")
	.select("CONCAT('P', paper_id) AS paper_id, CONCAT('C', conf_id) AS conf_id");

static BatchOperator <?> id_author = new CsvSourceBatchOp()
	.setFilePath(DATA_DIR + "id_author.txt")
	.setSchemaStr("author_id string, author string")
	.setFieldDelimiter("\t")
	.select("CONCAT('A', author_id) AS author_id, author");

static BatchOperator <?> id_conf = new CsvSourceBatchOp()
	.setFilePath(DATA_DIR + "id_conf.txt")
	.setSchemaStr("conf_id string, conf string")
	.setFieldDelimiter("\t")
	.select("CONCAT('C', conf_id) AS conf_id, conf");

static BatchOperator <?> paper = new TextSourceBatchOp()
	.setFilePath(DATA_DIR + "paper.txt")
	.select("CONCAT('P', TRIM(SUBSTRING(text FROM 1 FOR 12))) AS paper_id, "
		+ "SUBSTRING(text FROM 13) AS paper_name");
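To make the SQL expressions above more concrete, the following is a minimal plain-Java sketch of the same two transformations applied to single values, without Alink. The class name NodeIdSketch and the helpers nodeId and parsePaperLine are hypothetical names introduced only for this illustration, and the sample line is made up. Note that SQL SUBSTRING positions are 1-based, so SUBSTRING(text FROM 1 FOR 12) corresponds to substring(0, 12) in Java, and SUBSTRING(text FROM 13) corresponds to substring(12). The sketch assumes every line of paper.txt is at least 12 characters long.

```java
import java.util.Arrays;

public class NodeIdSketch {

    // Hypothetical helper: prepend a type prefix ('A', 'C' or 'P') to a raw ID,
    // mirroring CONCAT('P', paper_id) and its siblings in the select clauses.
    public static String nodeId(char typePrefix, String rawId) {
        return typePrefix + rawId.trim();
    }

    // Hypothetical helper: split one fixed-width line of paper.txt.
    // The first 12 characters hold the space-padded paper ID;
    // the title starts at the 13th character (index 12 in Java).
    // Assumes the line has at least 12 characters.
    public static String[] parsePaperLine(String line) {
        String rawId = line.substring(0, 12).trim();
        String title = line.substring(12);
        return new String[] {nodeId('P', rawId), title};
    }

    public static void main(String[] args) {
        // Build a made-up line in the same fixed-width layout as paper.txt.
        String line = String.format("%-12s%s", "869186", "An Example Title");

        System.out.println(Arrays.toString(parsePaperLine(line)));
        // prints [P869186, An Example Title]

        System.out.println(nodeId('A', "55668"));
        // prints A55668
    }
}
```

Because an author, a conference, and a paper may share the same numeric ID in their own files, the type prefix is what keeps the resulting node IDs from colliding once all three kinds of nodes are placed in one graph.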

To get a more intuitive feel for the data, we print a few rows from each table with the following code:

paper_author.lazyPrint(3, "< paper_author >");
paper_conf.lazyPrint(3, "< paper_conf >");
id_author.lazyPrint(3, "< id_author >");
id_conf.lazyPrint(3, "< id_conf >");
paper.lazyPrint(3, "< paper >");

BatchOperator.execute();

The output is as follows:

< paper_author >
paper_id|author_id
--------|---------
P128744|A55668
P128744|A55670
P128744|A82448
< paper_conf >
paper_id|conf_id
--------|-------
P868583|C4524
P868584|C4524
P868585|C4524
< id_author >
author_id|author
---------|------
A568253|aLourdesY.Collantes
A363709|aJohnR.Rumble
A363708|aRichardA.Johnson
< id_conf >
conf_id|conf
-------|----
C4789|vVLDBJ.
C3258|vISWCWorkshoponTrust,Security,andReputationontheSemanticWeb
C4149|vIWRIDL
< paper >
paper_id|paper_name
--------|----------
P869186|Designing and Writing Online Documentation: Hypermedia for Self-Supporting Products, Second Edition, by William Horton.
P869187|Designer Selves: Construction of Technologically Mediated Identity within Graphical, Multiuser Virtual Environments.
P869188|Dissipative Structure Theory, Synergetics, and Their Implications for the Management of Information Systems.
