andrewdyates/OA_Corpus.markdown

## OA_Corpus.markdown

      
The Open Access Academia Corpus

Computationally Curated Corpus of Open Access Publications from All Disciplines

2,737,377 Unique Documents
3,916,477 Unique Authors
13,197 Journals
5,187 Publishers
70 Languages

Entries compiled from arXiv, PubMed OA, DOAJ, and a selection provided by OCLC.
Download English-only Corpus (ENGLISH)

version 1.2; July 20, 2014; 2,039,079 entries

(1.4GB compressed, 4.5GB uncompressed)

Change Log


v1.3: Updated links to point to Dropbox
v1.2: Fixed URL character case; improved duplicate handling with a connected-components algorithm, reducing the total from 2,788,559 to 2,737,377 articles (98.1% of the previous total).
v1.1: First stable release.
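
The v1.2 entry mentions duplicate handling with a connected-components algorithm but does not describe the implementation. As a minimal sketch of the general idea, assuming duplicates have already been detected as pairs of record IDs, a union-find structure can group records so that each connected component counts as one unique article. The function names and the sample IDs below are hypothetical, not taken from the corpus tooling.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's group, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def duplicate_groups(record_ids, duplicate_pairs):
    """Group record IDs into connected components of the 'is a duplicate of' graph.

    record_ids: iterable of hashable IDs (hypothetical input format).
    duplicate_pairs: iterable of (id_a, id_b) pairs flagged as duplicates.
    """
    record_ids = list(record_ids)
    parent = {rid: rid for rid in record_ids}
    for a, b in duplicate_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components
    groups = defaultdict(list)
    for rid in record_ids:
        groups[find(parent, rid)].append(rid)
    return list(groups.values())


# Illustrative data only: a1, a2, and a3 are chained duplicates.
ids = ["a1", "a2", "a3", "a4", "a5"]
pairs = [("a1", "a2"), ("a2", "a3")]
groups = duplicate_groups(ids, pairs)
unique_count = len(groups)  # one article per component
```

Because a1 and a3 are linked only through a2, a pairwise check would miss that all three are the same article; grouping by component catches it.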


Clustering Results

Clustering was performed on ENGLISH.jul20.2014.pytxt.gz on November 6th, 2014. Results are provided as a tab-delimited file.
Download concept_cluster_ENGLISH.nov9.2014.tab.gz (327MB compressed)

Download 2000 sample (1MB)
Download labels only (93MB)

Download cluster IDs only (59MB)
Expert annotated sample, 498 documents (<1MB)
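
The clustering downloads above are described as gzip-compressed, tab-delimited files, so they can be streamed row by row without decompressing to disk. The sketch below assumes UTF-8 text and makes no assumption about what the columns contain, since the column layout is not documented here; `iter_tab_rows` is a hypothetical helper name.

```python
import csv
import gzip


def iter_tab_rows(path):
    """Yield each row of a gzip-compressed, tab-delimited file as a list of strings."""
    # UTF-8 is an assumption; adjust the encoding if the file says otherwise.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary text.
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
```

Wrapping the generator in `itertools.islice(iter_tab_rows(path), 5)` is a cheap way to inspect the first few rows of the 327MB file before committing to a full pass.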

Additional Resources

NumPy Matrix of the Top 100 Most Similar Articles per Article (v1.0, Aug 5, 2014)

Download: Top 100 Edge List per Article as NumPy Matrix

Top 100 Edge Weight (B=10) as NumPy Matrix

Associated Article IDs in enumerated order
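
The three files above appear designed to be used together: two matrices with one row per article, plus the article IDs in the same enumerated order. The layout is not documented here, so the sketch below rests on assumptions: row i of both matrices describes the i-th article ID, each edge entry is the row index of a neighboring article, and the weight matrix holds the matching similarity at the same position. The helper names are hypothetical. The code uses only indexing and slicing, so it should work on nested lists or on arrays loaded with NumPy.

```python
def build_index(article_ids):
    """Map each article ID to its row number in the matrices."""
    return {article_id: row for row, article_id in enumerate(article_ids)}


def top_neighbors(article_id, article_ids, index, edges, weights, k=5):
    """Return up to k (neighbor_id, weight) pairs for one article.

    Assumes edges[row][j] is the row index of the j-th neighbor and
    weights[row][j] is its edge weight (an assumed layout, not a documented one).
    """
    row = index[article_id]
    pairs = zip(edges[row][:k], weights[row][:k])
    return [(article_ids[int(j)], float(w)) for j, w in pairs]


# Tiny illustrative data with two neighbors per article instead of 100.
article_ids = ["A", "B", "C"]
edges = [[1, 2], [0, 2], [0, 1]]
weights = [[0.9, 0.4], [0.9, 0.2], [0.4, 0.2]]
index = build_index(article_ids)
nearest = top_neighbors("A", article_ids, index, edges, weights, k=1)
```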
Processed Concept List per Article ID (v1.2, July 20, 2014)

Download: Concept List per Article

Random subset of 1,502,473 of 2,039,079 articles (73.7%) from ENGLISH.jul20.2014 (v1.2).

(209MB compressed, 589MB uncompressed)
The .pytxt extension indicates a plain-text format with one record per line, each written in Python object (pickle) notation.
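
A minimal sketch of reading a .pytxt file, assuming each line is a Python literal (a dict, list, tuple, string, or number) that `ast.literal_eval` can parse. That is one reading of "plain text python object notation". If the lines turn out to be in a pickle text protocol instead, the `pickle` module would be needed. `iter_pytxt_records` is a hypothetical helper name, and UTF-8 is an assumed encoding.

```python
import ast
import gzip


def iter_pytxt_records(path):
    """Yield one parsed Python object per non-empty line of a .pytxt or .pytxt.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                # literal_eval parses literals only; unlike eval, it never runs code.
                yield ast.literal_eval(line)
```

Streaming with a generator keeps memory use flat, which matters for a file that is 4.5GB uncompressed.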

Results of Manual Review

300 NLP-generated Concept Token Results with manual annotations
Top K=5 Most Similar Documents by Topic for B=1, B=10, and B=Infinity

(W0, W1, and W2 respectively)

Links to Raw Clustering Results

Sorted OAAC Article IDs as used in network

n=1459114 publication records from OAAC
Hierarchical Cluster Labels

Cluster labels are ordered by leaf -> root. All addresses end in "0", the root node.
Unlabeled, hierarchical cluster names (for network NBM_1459114n_15K_0.001E_1W)
Cluster Concept Labels for Leaves (for network NBM_1459114n_15K_0.001E_1W)
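
Since cluster addresses run from leaf to root and always end at the root node "0", two documents can be compared by reversing their addresses and walking down from the root until the paths diverge. The sketch below assumes an address is a string of node IDs joined by a separator. The "." separator and the sample addresses are hypothetical, because the actual file format is not described here.

```python
def parse_address(address, sep="."):
    """Split a leaf -> root address into node IDs and check it ends at root "0"."""
    nodes = address.split(sep)
    if nodes[-1] != "0":
        raise ValueError(f"address does not end at the root node: {address!r}")
    return nodes


def shared_ancestors(address_a, address_b, sep="."):
    """Return the cluster nodes common to both addresses, ordered root -> leaf."""
    path_a = parse_address(address_a, sep)[::-1]  # reverse to root -> leaf
    path_b = parse_address(address_b, sep)[::-1]
    shared = []
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        shared.append(node_a)
    return shared


# Hypothetical addresses: both documents sit under cluster 37, inside cluster 5.
common = shared_ancestors("412.37.5.0", "98.37.5.0")
```

The length of the shared path gives a rough measure of how closely two documents are clustered: a longer shared path means they split apart deeper in the hierarchy.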