Computationally Curated Corpus of Open Access Publications from All Disciplines
- 2,737,377 Unique Documents
- 3,916,477 Unique Authors
- 13,197 Journals
- 5,187 Publishers
- 70 Languages
Entries compiled from arXiv, PubMed OA, DOAJ, and a selection provided by OCLC.
Download English-only Corpus (ENGLISH)
version 1.2; July 20, 2014; 2,039,079 entries
(1.4GB compressed, 4.5GB uncompressed)
- v1.3: Updated links to point to Dropbox
- v1.2: Fixed URL character case; improved duplicate handling using a connected-components algorithm, reducing total articles from 2,788,559 to 2,737,377 (98.1%).
- v1.1: First stable release.
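The v1.2 note mentions duplicate handling with a connected-components algorithm. The exact procedure is not described here, so the following is only a minimal sketch of the general idea, assuming duplicates are detected as pairs of article IDs and that one representative is kept per connected group. The function name `dedupe` and the input `duplicate_pairs` are hypothetical.

```python
def dedupe(ids, duplicate_pairs):
    """Keep one representative ID per connected group of duplicates.

    ids: iterable of article IDs (hypothetical input).
    duplicate_pairs: iterable of (id_a, id_b) pairs flagged as duplicates
                     by some upstream comparison (assumed, not from the source).
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Merge the two groups; the smaller ID becomes the representative.
            parent[max(ra, rb)] = min(ra, rb)

    # One entry per connected component.
    return sorted({find(i) for i in ids})


if __name__ == "__main__":
    print(dedupe(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")]))
```

Grouping by connected component means that if A matches B and B matches C, all three collapse into one record even when A and C were never compared directly.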
Performed on ENGLISH.jul20.2014.pytxt.gz on November 6th, 2014. Tab-delimited file.
Download concept_cluster_ENGLISH.nov9.2014.tab.gz (327MB compressed)
Download 2000 sample (1MB)
Download labels only (93MB)
Download cluster IDs only (59MB)
Expert annotated sample, 498 documents (<1MB)
Numpy Matrix of Top 100 most similar Articles per Article (v1.0, Aug 5, 2014)
Download: Top 100 Edge List per Article as Numpy Matrix
Top 100 Edge Weight (B=10) as Numpy Matrix
Associated Article IDs in enumerated order
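As a rough illustration of how the three files above might fit together, the sketch below assumes the edge list is an integer array of shape (n_articles, 100) holding row indices of neighbors, the weights are a float array of the same shape, and the article ID list is in the same row order. These shapes, the function name `top_neighbors`, and its parameters are assumptions; loading is left out because the on-disk storage format is not stated here.

```python
import numpy as np


def top_neighbors(row, edges, weights, article_ids, k=5):
    """Return the k highest-weight neighbors of one article as (id, weight) pairs.

    edges[row, j]   : assumed to be the row index of the j-th neighbor
    weights[row, j] : assumed to be the similarity weight of that edge
    article_ids     : assumed to map row index -> article ID
    """
    # Sort this row's columns by descending weight and keep the first k.
    order = np.argsort(-weights[row])[:k]
    return [(article_ids[int(edges[row, j])], float(weights[row, j])) for j in order]


if __name__ == "__main__":
    # Tiny synthetic stand-in for the real matrices (2 neighbors instead of 100).
    edges = np.array([[1, 2], [0, 2], [0, 1]])
    weights = np.array([[0.2, 0.9], [0.5, 0.1], [0.3, 0.3]])
    ids = ["A", "B", "C"]
    print(top_neighbors(0, edges, weights, ids, k=1))
```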
Processed Concept List per Article ID (v1.2, July 20, 2014)
Download: Concept List per Article
Random subset of 1,502,473 of 2,039,079 articles (73.7%) from ENGLISH.jul20.2014 (v1.2).
(209MB compressed, 589MB uncompressed)
The .pytxt extension indicates a plain-text file with one record per line, each written in Python object (pickle) notation.
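A minimal sketch of reading such a file follows. It assumes each line is a Python literal (for example a dict) that `ast.literal_eval` can parse, and that the file is gzip-compressed UTF-8 text; neither detail is confirmed here, and the function name `iter_records` is hypothetical. If the lines turn out to be a different serialization, the parsing call would need to change.

```python
import ast
import gzip


def iter_records(path):
    """Yield one parsed record per non-empty line of a gzipped .pytxt file.

    Assumption: each line is a Python literal (dict, list, tuple, ...).
    ast.literal_eval only evaluates literals, so it does not execute code.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield ast.literal_eval(line)
```

Streaming line by line keeps memory use flat, which matters for a file that is several gigabytes uncompressed.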
300 NLP-generated Concept Token Results with manual annotations
Top K=5 Most Similar Documents by Topic for B=1, B=10, and B=Infinity
(W0, W1, and W2 respectively)
Sorted OAAC Article IDs as used in network
n=1459114 publication records from OAAC
Cluster labels are ordered from leaf to root. All addresses end in "0", the root node.
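To make the leaf-to-root ordering concrete, here is a small sketch that reverses an address into root-to-leaf order and finds the shared ancestry of two addresses. The ":" separator, the example addresses, and both function names are hypothetical; the actual delimiter used in the files is not stated here.

```python
def root_to_leaf(address, sep=":"):
    """Split a leaf -> root address and return it in root -> leaf order."""
    parts = address.split(sep)
    if parts[-1] != "0":
        # Every address is expected to end at the root node "0".
        raise ValueError("address does not end at root node 0: %r" % address)
    return parts[::-1]


def common_ancestors(a, b, sep=":"):
    """Return the shared root -> leaf prefix of two cluster addresses."""
    shared = []
    for x, y in zip(root_to_leaf(a, sep), root_to_leaf(b, sep)):
        if x != y:
            break
        shared.append(x)
    return shared


if __name__ == "__main__":
    print(root_to_leaf("17:4:0"))
    print(common_ancestors("17:4:0", "9:4:0"))
```

The length of the shared prefix gives a simple measure of how close two clusters sit in the hierarchy.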
Unlabeled, hierarchical cluster names (for network NBM_1459114n_15K_0.001E_1W)
Cluster Concept Labels for Leaves (for network NBM_1459114n_15K_0.001E_1W)