I need to create a network whose edges include a SAME_AS edge type and a NOT_SAME_AS edge type for entity resolution. This network will serve as training data so that @tanmoyio can proceed with training an entity resolution model in #3.
DBLP is a database of scholarly research in computer science.
The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.
- DBLP Dataset is available at https://dblp.org/xml/dblp.xml.gz.
- DBLP Dataset 2 by Prof. Dr. Felix Naumann, available in DBLP10k.csv, is a set of 10K labels (5K true, 5K false) for pairs of authors. We use it to train our entity resolution model.
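
As a rough sketch of how the labeled author pairs could become typed edges, something like the following might work. The column names (`author_id_1`, `author_id_2`, `label`) and the sample rows are assumptions made up for illustration; the real DBLP10k.csv header may differ, and this is not the code in graphlet.dblp.

```python
import csv
import io

# Hypothetical header and rows; the real DBLP10k.csv layout may differ.
SAMPLE_CSV = """author_id_1,author_id_2,label
a1,a2,True
a3,a4,False
"""


def labels_to_edges(rows):
    """Turn labeled author pairs into typed edge dicts."""
    edges = []
    for row in rows:
        # Treat "true"/"1" (case-insensitive) as a positive match label.
        is_match = row["label"].strip().lower() in {"true", "1"}
        edges.append(
            {
                "src": row["author_id_1"],
                "dst": row["author_id_2"],
                "edge_type": "SAME_AS" if is_match else "NOT_SAME_AS",
            }
        )
    return edges


edges = labels_to_edges(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for edge in edges:
    print(edge)
```

True labels map to SAME_AS and false labels to NOT_SAME_AS, so both classes end up as explicit edges in the graph rather than as a missing edge meaning "not the same".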
The DBLP XML and the 10K ER labels are downloaded, parsed and transformed into a graph by graphlet.dblp.__main__, which runs via:
python -m graphlet.dblp
See the example data at: https://gist.github.com/rjurney/5acad373d485272b5c1f4352b1dd0fc6
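
For anyone curious what the parsing step might look like, here is a minimal sketch of streaming publication records out of a gzipped DBLP-style XML file with the standard library. It is not the actual graphlet.dblp implementation: the sample XML, the `RECORD_TAGS` set and the `iter_records` helper are hypothetical, and as far as I know the real dump relies on a DTD for character entities, which this sketch does not handle.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Tiny in-memory stand-in for dblp.xml.gz (assumed structure, made-up records).
SAMPLE_XML = b"""<dblp>
<article key="journals/x/1"><author>Ada Lovelace</author><author>Alan Turing</author><title>A Paper</title></article>
<inproceedings key="conf/y/2"><author>Alan Turing</author><title>Another Paper</title></inproceedings>
</dblp>"""

# Hypothetical subset of record types to keep.
RECORD_TAGS = {"article", "inproceedings"}


def iter_records(fileobj):
    """Yield (key, title, authors) per publication record without loading the whole file."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag in RECORD_TAGS:
            authors = [a.text for a in elem.findall("author")]
            yield elem.get("key"), elem.findtext("title"), authors
            elem.clear()  # drop the record's children once it has been emitted


compressed = io.BytesIO(gzip.compress(SAMPLE_XML))
with gzip.open(compressed, "rb") as f:
    records = list(iter_records(f))

for record in records:
    print(record)
```

Streaming with `iterparse` keeps memory use bounded on a large file, since each record is cleared right after it is emitted.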