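The Database and Information System (DBIS) dataset covers 464 conferences, the top 5,000 authors, and the corresponding 72,902 papers. For more details about the dataset and its download links, see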
https://ericdongyx.github.io/metapath2vec/m2v.html
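We will also use an author label dataset, which can be found and downloaded through the same link. It is listed there as "GS-Labeled results for AMiner Data", and the downloaded file is named label.zip. It assigns each of 246,678 authors to one of 8 research areas: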
1. Computing Systems
2. Theoretical Computer Science
3. Computer Networks & Wireless Communication
4. Computer Graphics
5. Human Computer Interaction
6. Computational Linguistics
7. Computer Vision & Pattern Recognition
8. Databases & Information Systems
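After downloading and unzipping net_dbis.zip, you get 5 files. Their names and contents are as follows:
File name | Description |
id_author.txt | Two columns, author id and author name, separated by '\t' |
id_conf.txt | Two columns, conference id and conference name, separated by '\t' |
paper_author.txt | Two columns, paper id and author id, separated by '\t' |
paper_conf.txt | Two columns, paper id and conference id, separated by '\t' |
paper.txt | Two columns, paper id and title; the paper id is padded with trailing spaces, and the title starts at the 13th character |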
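The fixed-width layout of paper.txt can be illustrated without Alink. The following is a minimal plain-Java sketch, assuming (as the table describes) that the id occupies the first 12 characters, padded with spaces, and that the title starts at the 13th character. The class name `PaperLineParser` and the sample line are hypothetical and exist only for illustration.

```java
public class PaperLineParser {
    // Width of the space-padded id column, per the file description above.
    static final int ID_WIDTH = 12;

    /** Splits one line of paper.txt into {id, title}. */
    public static String[] parse(String line) {
        if (line.length() <= ID_WIDTH) {
            // Assumption: a short line holds only an id and no title.
            return new String[] {line.trim(), ""};
        }
        // Java indices are 0-based, so characters 1..12 are substring(0, 12).
        String id = line.substring(0, ID_WIDTH).trim();
        String title = line.substring(ID_WIDTH);
        return new String[] {id, title};
    }

    public static void main(String[] args) {
        // Hypothetical sample line: id padded to 12 characters, then the title.
        String line = String.format("%-12s%s", "869186", "A Sample Title.");
        String[] parsed = parse(line);
        System.out.println(parsed[0] + "|" + parsed[1]);
    }
}
```

Trimming only the id column keeps any spaces inside the title untouched, which is why the two columns are cut by position instead of being split on whitespace.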
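Later, we will treat authors, conferences, and papers as nodes, and the paper-author and paper-conference associations as edges, which together form a graph. This requires an id that is unique across all node types. We use a simple approach: prepend the prefix A, C, or P to the original id of each author, conference, or paper, respectively, which yields a unique node id. The code for reading the data and unifying the ids is shown below. The basic idea is to read the raw data with Alink source components and then process it in the select method using built-in SQL functions.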
import com.alibaba.alink.operator.batch.BatchOperator;
import com.alibaba.alink.operator.batch.source.CsvSourceBatchOp;
import com.alibaba.alink.operator.batch.source.TextSourceBatchOp;

// Edges: paper-author pairs, with type prefixes added to both ids.
static BatchOperator <?> paper_author = new CsvSourceBatchOp()
    .setFilePath(DATA_DIR + "paper_author.txt")
    .setSchemaStr("paper_id string, author_id string")
    .setFieldDelimiter("\t")
    .select("CONCAT('P', paper_id) AS paper_id, CONCAT('A', author_id) AS author_id");

// Edges: paper-conference pairs.
static BatchOperator <?> paper_conf = new CsvSourceBatchOp()
    .setFilePath(DATA_DIR + "paper_conf.txt")
    .setSchemaStr("paper_id string, conf_id string")
    .setFieldDelimiter("\t")
    .select("CONCAT('P', paper_id) AS paper_id, CONCAT('C', conf_id) AS conf_id");

// Nodes: authors.
static BatchOperator <?> id_author = new CsvSourceBatchOp()
    .setFilePath(DATA_DIR + "id_author.txt")
    .setSchemaStr("author_id string, author string")
    .setFieldDelimiter("\t")
    .select("CONCAT('A', author_id) AS author_id, author");

// Nodes: conferences.
static BatchOperator <?> id_conf = new CsvSourceBatchOp()
    .setFilePath(DATA_DIR + "id_conf.txt")
    .setSchemaStr("conf_id string, conf string")
    .setFieldDelimiter("\t")
    .select("CONCAT('C', conf_id) AS conf_id, conf");

// Nodes: papers. Each line is fixed-width: characters 1-12 hold the padded id,
// and the title starts at character 13.
static BatchOperator <?> paper = new TextSourceBatchOp()
    .setFilePath(DATA_DIR + "paper.txt")
    .select("CONCAT('P', TRIM(SUBSTRING(text FROM 1 FOR 12))) AS paper_id, "
        + "SUBSTRING(text FROM 13) AS paper_name");
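To see why the type prefix is needed, consider the case where an author, a conference, and a paper happen to share the same raw id. The sketch below is plain Java with no Alink dependency. The class name `NodeIdDemo`, the helper `nodeId`, and the raw id value are hypothetical, and whether the real files actually contain such overlapping ids is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

public class NodeIdDemo {
    /** Builds a graph-wide node id from a type prefix ('A', 'C' or 'P') and the raw id. */
    public static String nodeId(char typePrefix, String rawId) {
        return typePrefix + rawId;
    }

    public static void main(String[] args) {
        // Hypothetical raw id used by an author, a conference and a paper at once.
        String raw = "4524";

        Set<String> ids = new HashSet<>();
        ids.add(nodeId('A', raw));
        ids.add(nodeId('C', raw));
        ids.add(nodeId('P', raw));

        // Without prefixes the three entities would collapse into one node;
        // with prefixes the set keeps three distinct ids.
        System.out.println(ids.size());
    }
}
```

The prefix also makes the node type recoverable from the id alone, which can be convenient when a later step needs to tell authors, conferences, and papers apart.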
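Printing a few rows gives a more concrete feel for the data. The code is as follows:
// Register a 3-row preview for each table; nothing runs yet.
paper_author.lazyPrint(3, "< paper_author >");
paper_conf.lazyPrint(3, "< paper_conf >");
id_author.lazyPrint(3, "< id_author >");
id_conf.lazyPrint(3, "< id_conf >");
paper.lazyPrint(3, "< paper >");
// Trigger execution so that all lazy prints are produced.
BatchOperator.execute();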
The output is as follows:
< paper_author >
paper_id|author_id
--------|---------
P128744|A55668
P128744|A55670
P128744|A82448
< paper_conf >
paper_id|conf_id
--------|-------
P868583|C4524
P868584|C4524
P868585|C4524
< id_author >
author_id|author
---------|------
A568253|aLourdesY.Collantes
A363709|aJohnR.Rumble
A363708|aRichardA.Johnson
< id_conf >
conf_id|conf
-------|----
C4789|vVLDBJ.
C3258|vISWCWorkshoponTrust,Security,andReputationontheSemanticWeb
C4149|vIWRIDL
< paper >
paper_id|paper_name
--------|----------
P869186|Designing and Writing Online Documentation: Hypermedia for Self-Supporting Products, Second Edition, by William Horton.
P869187|Designer Selves: Construction of Technologically Mediated Identity within Graphical, Multiuser Virtual Environments.
P869188|Dissipative Structure Theory, Synergetics, and Their Implications for the Management of Information Systems.