@rjurney
Last active August 25, 2022 10:41
DBLP Types, Schemas and Example Records

DBLP Training Data

I need to create a network whose edges include a SAME_AS edge type and a NOT_SAME_AS edge type for entity resolution. This network will serve as training data so that @tanmoyio can proceed with training an entity resolution model in #3.

DBLP Datasets

DBLP is a database of scholarly research in computer science.

We use two datasets: the actual DBLP data (an XML dump) and a set of labels for entity resolution of authors.

Collecting and Preparing the Training Data

The DBLP XML and the 50K ER labels are downloaded, parsed and transformed into a graph by the graphlet.dblp.__main__ module, which is run with:

python -m graphlet.dblp

See the example data at: https://gist.github.com/rjurney/5acad373d485272b5c1f4352b1dd0fc6

article Index(['@mdate', '@key', '@publtype', 'title', 'author', 'pages', 'year',
'journal', 'number', 'ee', 'url', 'volume', 'crossref', 'note', 'cdrom',
'editor', 'cite', 'booktitle', 'publnr', 'month', '@cdate',
'publisher'],
dtype='object')
book Index(['@mdate', '@key', 'author', 'title', 'year', 'pages', 'publisher',
'isbn', 'ee', 'school', '@publtype', 'series', 'volume', 'note',
'editor', 'booktitle', 'url', 'crossref', 'month', 'cite', 'cdrom'],
dtype='object')
incollection Index(['@mdate', '@key', 'author', 'title', 'pages', 'year', 'booktitle', 'ee',
'crossref', 'url', '@publtype', 'cite', 'publisher', 'number', 'note',
'cdrom', 'chapter'],
dtype='object')
inproceedings Index(['@mdate', '@key', 'author', 'title', 'booktitle', 'year', 'url',
'crossref', 'ee', 'pages', 'cite', 'cdrom', '@publtype', 'note',
'editor', 'number', 'volume', 'month'],
dtype='object')
mastersthesis Index(['@mdate', '@key', 'author', 'title', 'year', 'school', 'ee', 'note'], dtype='object')
phdthesis Index(['@mdate', '@key', 'author', 'title', 'year', 'school', 'publisher',
'number', 'pages', 'isbn', 'ee', 'month', 'series', 'volume', 'note',
'@publtype'],
dtype='object')
proceedings Index(['@mdate', '@key', 'editor', 'title', 'publisher', 'year', 'isbn', 'ee',
'url', 'booktitle', 'series', 'volume', 'note', 'number', 'pages',
'@publtype', 'author', 'school', 'address', 'journal', 'cite'],
dtype='object')
www Index(['@mdate', '@key', 'author', 'title', 'url', 'note', '@publtype',
'crossref', 'cite', 'ee', 'year', 'editor'],
dtype='object')
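The listing above gives, for each DBLP record type, the set of field names observed across its records. As a rough illustration of how such a per-type field list could be collected, here is a minimal sketch in plain Python. It assumes the parsed records arrive as (record_type, dict) pairs; the function name `fields_by_type` and the toy records are hypothetical and are not taken from graphlet.

```python
from collections import defaultdict


def fields_by_type(records):
    """Map each record type to the field names seen in it, in first-seen order.

    `records` is assumed to be an iterable of (record_type, record_dict) pairs.
    """
    # A dict per type is used as an insertion-ordered set of field names.
    fields = defaultdict(dict)
    for record_type, record in records:
        for name in record:
            fields[record_type].setdefault(name, None)
    return {record_type: list(names) for record_type, names in fields.items()}


# Toy records, made up for illustration only.
sample = [
    ("mastersthesis", {"@mdate": "2002-01-03", "@key": "example/a", "year": "1994"}),
    ("mastersthesis", {"@mdate": "2021-07-17", "@key": "example/b", "school": "Some School"}),
    ("www", {"@mdate": "2021-12-07", "@key": "example/c", "title": "Home Page"}),
]

schema = fields_by_type(sample)
for record_type, names in schema.items():
    print(record_type, names)
```

Fields that appear in only some records of a type still show up in that type's list, which is why the example records contain many NaN values: a record simply has no value for a column that some other record of the same type filled in.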
{
"@cdate": NaN,
"@key": "conf/www/Ericsson07",
"@mdate": "2017-06-05",
"@publtype": NaN,
"author": {
"#text": "Morgan Ericsson",
"@orcid": "0000-0003-1173-5187"
},
"booktitle": NaN,
"cdrom": NaN,
"cite": NaN,
"crossref": NaN,
"editor": NaN,
"ee": "https://doi.org/10.1007/s11280-007-0032-y",
"journal": "World Wide Web",
"month": NaN,
"note": NaN,
"number": "3",
"pages": "279-307",
"publisher": NaN,
"publnr": NaN,
"title": "The Effects of XML Compression on SOAP Performance.",
"url": "db/journals/www/www10.html#Ericsson07",
"volume": NaN,
"year": "2007"
}
{
"@key": "phd/dnb/Curth89",
"@mdate": "2021-07-17",
"@publtype": NaN,
"author": "Michael A. Curth",
"booktitle": NaN,
"cdrom": NaN,
"cite": NaN,
"crossref": NaN,
"editor": NaN,
"ee": "https://d-nb.info/891654135",
"isbn": "978-3-89012-177-2",
"month": NaN,
"note": NaN,
"pages": "1-370",
"publisher": "Eul, Germany",
"school": NaN,
"series": NaN,
"title": "Planspieltechnik und Computer-based-Training zur Schulung von Einkufern im Handel.",
"url": NaN,
"volume": NaN,
"year": "1989"
}
{
"@key": "reference/sp/Parker15",
"@mdate": "2017-05-16",
"@publtype": NaN,
"author": "Lynne Parker",
"booktitle": "Handbook of Computational Intelligence",
"cdrom": NaN,
"chapter": NaN,
"cite": NaN,
"crossref": "reference/sp/2015ci",
"ee": "https://doi.org/10.1007/978-3-662-43505-2_72",
"note": NaN,
"number": NaN,
"pages": "1395-1406",
"publisher": NaN,
"title": "Collective Manipulation and Construction.",
"url": "db/reference/sp/ci2015.html#Parker15",
"year": "2015"
}
{
"@key": "www/org/mitre/future",
"@mdate": "2019-07-30",
"@publtype": NaN,
"author": "Arnon Rosenthal",
"booktitle": "SWEE",
"cdrom": NaN,
"cite": NaN,
"crossref": "conf/swee/1998",
"editor": NaN,
"ee": "http://www.mitre.org/support/swee/rosenthal.html",
"month": NaN,
"note": NaN,
"number": NaN,
"pages": NaN,
"title": "The Future of Classic Data Administration: Objects + Databases + CASE",
"url": "db/conf/swee/swee1998.html",
"volume": NaN,
"year": "1998"
}
{
"@key": "phd/Ylonen94",
"@mdate": "2002-01-03",
"author": "Tatu Ylnen",
"ee": NaN,
"note": NaN,
"school": "Helsinki University of Technology, Department of Computer Science",
"title": "Shadow Paging Is Feasible.",
"year": "1994"
}
{
"@key": "phd/nl/Christiansen2008",
"@mdate": "2021-07-17",
"@publtype": NaN,
"author": "Kenneth Rohde Christiansen",
"ee": "https://d-nb.info/991212878",
"isbn": "978-3-8364-7586-0",
"month": NaN,
"note": NaN,
"number": NaN,
"pages": "1-135",
"publisher": NaN,
"school": "Groningen, Univ.",
"series": NaN,
"title": "Grid-enabling a software product: integration of grid resources in the RCE environment.",
"volume": NaN,
"year": "2008"
}
{
"@key": "conf/coopis/2004-1",
"@mdate": "2019-05-14",
"@publtype": NaN,
"address": NaN,
"author": NaN,
"booktitle": "CoopIS/DOA/ODBASE",
"cite": NaN,
"editor": [
"Robert Meersman",
"Zahir Tari"
],
"ee": "https://doi.org/10.1007/b102173",
"isbn": "3-540-23663-5",
"journal": NaN,
"note": NaN,
"number": NaN,
"pages": NaN,
"publisher": "Springer",
"school": NaN,
"series": {
"#text": "Lecture Notes in Computer Science",
"@href": "db/series/lncs/index.html"
},
"title": "On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE, OTM Confederated International Conferences, Agia Napa, Cyprus, October 25-29, 2004, Proceedings, Part I",
"url": "db/conf/coopis/coopis2004-1.html",
"volume": "3290",
"year": "2004"
}
{
"@key": "homepages/308/1681",
"@mdate": "2021-12-07",
"@publtype": NaN,
"author": "Scott Whiting",
"cite": NaN,
"crossref": NaN,
"editor": NaN,
"ee": NaN,
"note": NaN,
"title": "Home Page",
"url": NaN,
"year": NaN
}
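The example records above show that a person field does not have one fixed shape: "author" appears as a plain string, as an object with "#text" and "@orcid" keys, and as NaN, while "editor" appears as a list of strings. Code that builds author nodes will probably need to normalize these shapes first. The following is a minimal sketch of such a normalizer; the name `author_names` is hypothetical, and handling lists that mix strings and objects is an assumption, since no such record appears in the examples.

```python
import math


def author_names(value):
    """Return a list of plain name strings from a parsed author or editor field."""
    # Missing values show up as NaN (a float) in the example records.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        # Objects carry the name under "#text" next to attributes such as "@orcid".
        text = value.get("#text")
        return [text] if text else []
    if isinstance(value, (list, tuple)):
        # Assumption: list items may themselves be strings or objects.
        names = []
        for item in value:
            names.extend(author_names(item))
        return names
    raise TypeError(f"unexpected author value: {value!r}")


print(author_names("Michael A. Curth"))
print(author_names({"#text": "Morgan Ericsson", "@orcid": "0000-0003-1173-5187"}))
print(author_names(["Robert Meersman", "Zahir Tari"]))
print(author_names(float("nan")))
```

Note that the records are not strict JSON because of the bare NaN tokens. Python's `json.loads` accepts NaN by default and returns a float, which is what the NaN check above relies on.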
[
{
"sameentity": "f",
"samename": "f",
"author1": "Said Hassan Ahmed",
"author2": "Jagdish Chandra Patra",
"key1": "conf/prib/AhmedF07",
"key2": "journals/jcc/PatraS09",
"p1type": "inproceedings",
"p1author": "Said Hassan Ahmed|Tor Flå",
"p1editor": "",
"p1title": "Estimation of Evolutionary Average Hydrophobicity Profile from a Family of Protein Sequences.",
"p1booktitle": "PRIB",
"p1booktitlefull": "",
"p1year": 2007,
"p1address": "",
"p1journal": "",
"p1journalfull": "",
"p1publisher": "",
"p1series": "",
"p1id": 793190,
"p1key": "conf/prib/AhmedF07",
"p2type": "article",
"p2author": "Jagdish Chandra Patra|Onkar Singh",
"p2editor": "",
"p2title": "Artificial neural networks-based approach to design ARIs using QSAR for diabetes mellitus.",
"p2booktitle": "",
"p2booktitlefull": "",
"p2year": 2009,
"p2address": "",
"p2journal": "Journal of Computational Chemistry",
"p2journalfull": "",
"p2publisher": "",
"p2series": "",
"p2id": 2093393,
"p2key": "journals/jcc/PatraS09"
},
{
"sameentity": "t",
"samename": "t",
"author1": "Jwu-E Chen",
"author2": "Jwu-E Chen",
"key1": "conf/vlsid/ChenCC95",
"key2": "journals/tcad/LuoCWCCW08",
"p1type": "inproceedings",
"p1author": "Yung-Yuan Chen|Ching-Hwa Cheng|Jwu-E Chen",
"p1editor": "",
"p1title": "An efficient switching network fault diagnosis for reconfigurable VLSI/WSI array processors.",
"p1booktitle": "VLSI Design",
"p1booktitlefull": "VLSI Design",
"p1year": 1995,
"p1address": "",
"p1journal": "",
"p1journalfull": "",
"p1publisher": "",
"p1series": "",
"p1id": 754984,
"p1key": "conf/vlsid/ChenCC95",
"p2type": "article",
"p2author": "Pei-Wen Luo|Jwu-E Chen|Chin-Long Wey|Liang-Chia Cheng|Ji-Jan Chen|Wen-Ching Wu",
"p2editor": "",
"p2title": "Impact of Capacitance Correlation on Yield Enhancement of Mixed-Signal/Analog Integrated Circuits.",
"p2booktitle": "",
"p2booktitlefull": "",
"p2year": 2008,
"p2address": "",
"p2journal": "IEEE Trans. on CAD of Integrated Circuits and Systems",
"p2journalfull": "",
"p2publisher": "",
"p2series": "",
"p2id": 2235193,
"p2key": "journals/tcad/LuoCWCCW08"
},
{
"sameentity": "t",
"samename": "t",
"author1": "Z. Sun",
"author2": "Z. Sun",
"key1": "conf/prozess/Sun88",
"key2": "conf/isnn/SunZLCS07",
"p1type": "inproceedings",
"p1author": "Z. Sun",
"p1editor": "",
"p1title": "Anwendung graphischer Darstellungen im Rahmen einer Spezifikationssprache für das Requirements Engineering.",
"p1booktitle": "Prozeßrechnersysteme",
"p1booktitlefull": "",
"p1year": 1988,
"p1address": "",
"p1journal": "",
"p1journalfull": "",
"p1publisher": "",
"p1series": "",
"p1id": 648625,
"p1key": "conf/prozess/Sun88",
"p2type": "inproceedings",
"p2author": "Z. Sun|M. J. Zhang|Xiao H. Liao|Wenchuan Cai|Yongduan Song",
"p2editor": "",
"p2title": "Neuro-Adaptive Formation Control of Multi-Mobile Vehicles: Virtual Leader Based Path Planning and Tracking.",
"p2booktitle": "ISNN (1)",
"p2booktitlefull": "",
"p2year": 2007,
"p2address": "",
"p2journal": "",
"p2journalfull": "",
"p2publisher": "",
"p2series": "",
"p2id": 519844,
"p2key": "conf/isnn/SunZLCS07"
},
{
"sameentity": "f",
"samename": "t",
"author1": "Abdul Sattar",
"author2": "Abdul Sattar",
"key1": "conf/pricai/BeaumontTSM04",
"key2": "conf/icip/SattarAS08",
"p1type": "inproceedings",
"p1author": "Matthew Beaumont|John Thornton|Abdul Sattar|Michael J. Maher",
"p1editor": "",
"p1title": "Solving Over-Constrained Temporal Reasoning Problems Using Local Search.",
"p1booktitle": "PRICAI",
"p1booktitlefull": "Pacific Rim International Conference on Artificial Intelligence",
"p1year": 2004,
"p1address": "",
"p1journal": "",
"p1journalfull": "",
"p1publisher": "",
"p1series": "",
"p1id": 645750,
"p1key": "conf/pricai/BeaumontTSM04",
"p2type": "inproceedings",
"p2author": "Abdul Sattar 0003|Yasser Aidarous|Renaud Séguier",
"p2editor": "",
"p2title": "GAGM-AAM: A genetic optimization with Gaussian mixtures for Active Appearance Models.",
"p2booktitle": "ICIP",
"p2booktitlefull": "International Conference on Image Processing",
"p2year": 2008,
"p2address": "",
"p2journal": "",
"p2journalfull": "",
"p2publisher": "",
"p2series": "",
"p2id": 380526,
"p2key": "conf/icip/SattarAS08"
}
]
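Each label record above pairs two author mentions and states in "sameentity" whether they refer to the same person ("t") or not ("f"). One way to turn these labels into the SAME_AS and NOT_SAME_AS edges described at the top is sketched below. This is an assumption about the mapping, not the graphlet implementation: the function name `label_to_edge` is hypothetical, and identifying a node by the pair (publication key, author name) is a design choice made for this example.

```python
def label_to_edge(label):
    """Turn one label record into a (source, target, edge_type) triple."""
    flag = label["sameentity"]
    if flag not in ("t", "f"):
        raise ValueError(f"unexpected sameentity value: {flag!r}")
    edge_type = "SAME_AS" if flag == "t" else "NOT_SAME_AS"
    # Assumption: a node is an author mention, i.e. a name on a specific paper.
    source = (label["key1"], label["author1"])
    target = (label["key2"], label["author2"])
    return source, target, edge_type


# Two of the label records shown above, reduced to the fields used here.
labels = [
    {
        "sameentity": "t",
        "author1": "Jwu-E Chen",
        "author2": "Jwu-E Chen",
        "key1": "conf/vlsid/ChenCC95",
        "key2": "journals/tcad/LuoCWCCW08",
    },
    {
        "sameentity": "f",
        "author1": "Abdul Sattar",
        "author2": "Abdul Sattar",
        "key1": "conf/pricai/BeaumontTSM04",
        "key2": "conf/icip/SattarAS08",
    },
]

edges = [label_to_edge(label) for label in labels]
for source, target, edge_type in edges:
    print(source, edge_type, target)
```

Keying nodes on the mention rather than on the name alone matters for records like the "Abdul Sattar" pair, where "samename" is "t" but "sameentity" is "f": with name-only nodes, the NOT_SAME_AS edge would connect a node to itself.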