Heda Wang wangheda

## build_reference_file.py
# coding: utf-8
# python2.7

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import os.path
import random

## gist:568b11ba97a958604e54

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              3 stars
            
          
                wangheda
                / gist:568b11ba97a958604e54
            
            
              Last active
              December 10, 2021 23:42
            
              
                Downloading full CiteSeerX dataset
              
          
I recently needed the CiteSeerX dataset for my research and had been following the instructions [on this blog][1].
Unfortunately, those instructions no longer work after an update to OAIHarvester. Here is my version, which does work.


Download [OAIHarvester2][2] from oclc.org; the latest version was 2-0.1.12 at the time of writing.


Run the following command to download the full CiteSeerX dataset:

    java -classpath .:harvester2.jar:log4j-1.2.12.jar:xalan.jar:xercesImpl.jar:xml-apis.jar ORG.oclc.oai.harvester2.app.RawWrite -out citeseerx_alldata.xml http://citeseerx.ist.psu.edu/oai2
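
If you would rather launch the harvest from a script, the command above can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names `build_harvest_command` and `run_harvest` are made up for illustration, and it assumes that `java` is on your `PATH` and that the jars listed in the classpath sit in the current directory.

```python
import os
import subprocess

# Jar names copied from the classpath in the command above.
HARVESTER_JARS = [
    "harvester2.jar",
    "log4j-1.2.12.jar",
    "xalan.jar",
    "xercesImpl.jar",
    "xml-apis.jar",
]


def build_harvest_command(out_path, base_url):
    """Return the argument list for an OAIHarvester2 RawWrite run.

    Hypothetical helper: it only reassembles the command shown above.
    """
    # os.pathsep is ":" on Linux/macOS and ";" on Windows.
    classpath = os.pathsep.join(["."] + HARVESTER_JARS)
    return [
        "java",
        "-classpath", classpath,
        "ORG.oclc.oai.harvester2.app.RawWrite",
        "-out", out_path,
        base_url,
    ]


def run_harvest(out_path, base_url):
    """Run the harvest; raises CalledProcessError on a non-zero exit code."""
    subprocess.check_call(build_harvest_command(out_path, base_url))
```

Calling `run_harvest("citeseerx_alldata.xml", "http://citeseerx.ist.psu.edu/oai2")` should then be equivalent to typing the command by hand. Passing the arguments as a list avoids any shell quoting issues.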