Recently I needed the CiteSeerX dataset for my research, and I had been following the instructions on this blog.
Sadly, after an update to OAIHarvester, those instructions no longer work. Here is my version that does.
- Download OAIHarvester2 from oclc.org. The latest release at the time of writing is 0.1.12 (harvester2-0.1.12.tar.gz).
- Run the following command from the unpacked directory to download the full CiteSeerX dataset:
java -classpath .:harvester2.jar:log4j-1.2.12.jar:xalan.jar:xercesImpl.jar:xml-apis.jar ORG.oclc.oai.harvester2.app.RawWrite -out citeseerx_alldata.xml http://citeseerx.ist.psu.edu/oai2
The following script automates both steps:
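#!/bin/bash
# Stop at the first failing step so java is never run from the wrong directory.
set -e
# Fetch and unpack OAIHarvester2 0.1.12.
wget http://pubserv.oclc.org/oaiharvester2/jars/dist/harvester2-0.1.12.tar.gz
tar zxf harvester2-0.1.12.tar.gz
cd harvester2-0.1.12
# Harvest every CiteSeerX record into a single XML file.
java -classpath .:harvester2.jar:log4j-1.2.12.jar:xalan.jar:xercesImpl.jar:xml-apis.jar ORG.oclc.oai.harvester2.app.RawWrite -out citeseerx_alldata.xml http://citeseerx.ist.psu.edu/oai2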
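Once the harvest finishes, a quick sanity check can help confirm that the output actually contains records. The sketch below is only an illustration: `count_records` is a helper name made up for this example, and it assumes the harvested file stores each record as an unprefixed `<record>` element, as standard OAI-PMH ListRecords responses do. If the output uses a namespace prefix, the pattern would need adjusting.

```shell
#!/bin/bash
# count_records FILE: print how many <record> opening tags FILE contains.
# Assumption: records appear as unprefixed <record> elements, as in standard
# OAI-PMH ListRecords responses. count_records is a hypothetical helper.
count_records() {
    # grep -o prints each match on its own line, so wc -l counts matches
    # even when many records share one physical line of XML.
    # tr strips the padding that some wc implementations add.
    grep -o '<record>' "$1" | wc -l | tr -d ' '
}

# Report only when the harvest output is present in the current directory.
if [ -f citeseerx_alldata.xml ]; then
    echo "records harvested: $(count_records citeseerx_alldata.xml)"
fi
```

Run it from the harvester2-0.1.12 directory, where the script above writes citeseerx_alldata.xml. A count of 0 suggests the harvest failed or was cut off before any records arrived.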