@wangheda
Last active December 10, 2021 23:42
Downloading full CiteSeerX dataset

I recently needed the CiteSeerX dataset for my research, and I had been following the instructions on this blog.

Sadly, OAIHarvester has since been updated and those instructions no longer work. Here is my version that works.

  1. Download OAIHarvester2 from oclc.org. The latest version was 0.1.12 (harvester2-0.1.12) when I wrote this.

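  2. From inside the extracted harvester2-0.1.12 directory (the jars on the classpath are referenced by relative name), run the following command to download the full CiteSeerX dataset: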

    java -classpath .:harvester2.jar:log4j-1.2.12.jar:xalan.jar:xercesImpl.jar:xml-apis.jar ORG.oclc.oai.harvester2.app.RawWrite -out citeseerx_alldata.xml  http://citeseerx.ist.psu.edu/oai2
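
A full harvest takes a long time, so it may be worth confirming that the endpoint answers before starting. The sketch below is not part of the original instructions: it assumes `curl` is installed, and the function names `oai_request_url` and `check_endpoint` are made up for illustration. It relies only on the standard OAI-PMH `Identify` verb, which any compliant repository should answer with a short XML description of itself.

```shell
#!/bin/bash
# Hypothetical helpers, not part of OAIHarvester2.

# Build an OAI-PMH request URL from a base URL and a verb (e.g. Identify).
oai_request_url() {
    local base_url="$1" verb="$2"
    printf '%s?verb=%s\n' "$base_url" "$verb"
}

# Ask the repository to identify itself; curl exits non-zero on HTTP errors.
check_endpoint() {
    local base_url="$1"
    curl --fail --silent --show-error "$(oai_request_url "$base_url" Identify)"
}
```

After sourcing these functions, `check_endpoint http://citeseerx.ist.psu.edu/oai2` should print an XML response if the service is reachable. If it prints an error instead, the long-running java command is unlikely to succeed either.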
    

Here are the same steps as an automated script:

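#!/bin/bash
# Stop at the first failing step so java never runs against a missing download.
set -e
# Fetch and unpack the harvester distribution.
wget http://pubserv.oclc.org/oaiharvester2/jars/dist/harvester2-0.1.12.tar.gz
tar zxf harvester2-0.1.12.tar.gz
# The jars on the classpath below are referenced relative to this directory.
cd harvester2-0.1.12
java -classpath .:harvester2.jar:log4j-1.2.12.jar:xalan.jar:xercesImpl.jar:xml-apis.jar ORG.oclc.oai.harvester2.app.RawWrite -out citeseerx_alldata.xml http://citeseerx.ist.psu.edu/oai2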
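
Once the harvest finishes, a rough record count is a cheap sanity check on the output file. The following is a minimal sketch under one assumption: each harvested record appears in `citeseerx_alldata.xml` as a literal `<record>` opening tag with no attributes, as in standard OAI-PMH responses. The function name `count_records` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count literal <record> opening tags in a harvested file.
# grep -o prints one line per match, so several records on one line still count.
count_records() {
    local xml_file="$1"
    grep -o '<record>' "$xml_file" | wc -l | tr -d ' '
}
```

Running `count_records citeseerx_alldata.xml` in the harvester directory prints the number of matches. A count of zero, or one far smaller than expected, suggests the harvest stopped early.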