@andrewdyates
Last active July 31, 2021 16:04

The Open Access Academia Corpus

Computationally Curated Corpus of Open Access Publications from All Disciplines

  • 2,737,377 Unique Documents
  • 3,916,477 Unique Authors
  • 13,197 Journals
  • 5,187 Publishers
  • 70 Languages

Entries compiled from arXiv, PubMed OA, DOAJ, and a selection provided by OCLC.

Download English-only Corpus (ENGLISH)
version 1.2; July 20, 2014; 2,039,079 entries
(1.4GB compressed, 4.5GB uncompressed)


Change Log

  • v1.3: Updated links to point to Dropbox
  • v1.2: Fixed URL character case and improved duplicate handling with a connected-component algorithm, reducing the total from 2,788,559 to 2,737,377 articles (98.2% of the previous total).
  • v1.1: First stable release.
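The duplicate handling mentioned for v1.2 can be pictured as follows: treat each record as a node, link any two records that share an identifier, and keep one record per connected component. The sketch below is an illustration only, not the code used to build the corpus. The record layout and the choice of shared keys (such as a lowercased URL or a DOI) are assumptions.

```python
def find(parent, x):
    # Walk up to the root, compressing the path along the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def dedupe(records):
    """Keep one record per connected component of shared keys.

    `records` is a list of (record_id, keys) pairs, where `keys` is a set of
    hypothetical identifiers (for example a lowercased URL or a DOI).
    """
    parent = list(range(len(records)))
    first_seen = {}  # key -> index of the first record carrying that key
    for i, (_, keys) in enumerate(records):
        for key in keys:
            if key in first_seen:
                # Merge the two components, keeping the smaller index as root.
                ra, rb = find(parent, i), find(parent, first_seen[key])
                parent[max(ra, rb)] = min(ra, rb)
            else:
                first_seen[key] = i
    # Each component's root is its smallest index, so keep exactly those.
    return [records[i][0] for i in range(len(records)) if find(parent, i) == i]
```

Using connected components rather than pairwise matching means that if record A matches B on a URL and B matches C on a DOI, all three collapse into one entry even though A and C share nothing directly.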

Clustering Results

Performed on ENGLISH.jul20.2014.pytxt.gz on November 6th, 2014. Tab-delimited file.

Download concept_cluster_ENGLISH.nov9.2014.tab.gz (327MB compressed)
Download 2000 sample (1MB)

Download labels only (93MB)
Download cluster IDs only (59MB)

Expert-annotated sample, 498 documents (<1MB)


Additional Resources

Numpy Matrix of Top 100 most similar Articles per Article (v1.0, Aug 5, 2014)
Download: Top 100 Edge List per Article as Numpy Matrix
Top 100 Edge Weight (B=10) as Numpy Matrix
Associated Article IDs in enumerated order

Processed Concept List per Article ID (v1.2, July 20, 2014)
Download: Concept List per Article
Random subset of 1,502,473 of 2,039,079 articles (73.7%) from ENGLISH.jul20.2014 (v1.2).
(209MB compressed, 589MB uncompressed)

The .pytxt extension indicates a plain-text file with one record per line, each written in Python object (pickle) notation.
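A minimal sketch of reading such a file, assuming each line is a Python literal (for example a dict) that `ast.literal_eval` can parse and that the text is UTF-8. Both points are assumptions, as are the field names in the sample. If the records turn out to be true pickle streams, this approach will not apply.

```python
import ast
import gzip
import io


def iter_records(lines):
    """Yield one Python object per non-empty line of text."""
    for line in lines:
        line = line.strip()
        if line:
            # literal_eval parses literals only; it never executes code.
            yield ast.literal_eval(line)


def iter_pytxt_gz(path):
    """Stream records from a gzip-compressed .pytxt file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_records(handle)


# Hypothetical two-record sample; the real field names may differ.
sample = io.StringIO("{'id': 1, 'title': 'A'}\n{'id': 2, 'title': 'B'}\n")
records = list(iter_records(sample))
```

Streaming line by line keeps memory use flat, which matters for a file that is several gigabytes uncompressed.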


Results of Manual Review

300 NLP-generated Concept Token Results with manual annotations

Top K=5 Most Similar Documents by Topic for B=1, B=10, and B=Infinity
(W0, W1, and W2 respectively)


Links to Raw Clustering Results

Sorted OAAC Article IDs as used in network
n=1459114 publication records from OAAC

Hierarchical Cluster Labels

Cluster labels are ordered from leaf to root. Every address ends in "0", the root node.
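As a rough illustration of working with leaf-to-root addresses, the sketch below splits an address into node names and checks that it ends at the root. The "." separator and the example address "17.4.0" are assumptions made for the example. Check the actual files for the delimiter they use.

```python
def parse_address(address, sep="."):
    """Split a leaf-to-root address into a list of node names."""
    nodes = address.split(sep)
    if nodes[-1] != "0":
        raise ValueError("address does not end at the root node '0': %r" % address)
    return nodes


def root_to_leaf(address, sep="."):
    """Return the same path reversed, starting at the root."""
    return parse_address(address, sep)[::-1]


def depth(address, sep="."):
    """Number of edges between the leaf and the root."""
    return len(parse_address(address, sep)) - 1
```

Under these assumptions, `root_to_leaf("17.4.0")` gives the path from the root down to the leaf, which is convenient for grouping documents by their top-level cluster.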

Unlabeled, hierarchical cluster names (for network NBM_1459114n_15K_0.001E_1W)

Cluster Concept Labels for Leaves (for network NBM_1459114n_15K_0.001E_1W)
