Heda Wang wangheda

  • Alibaba
  • Beijing, China
wangheda / build_reference_file.py
Created October 9, 2017 05:47
Converting caption annotations into MSCOCO-style reference file (for validation in the image captioning task on challenger.ai)
# coding: utf-8
# python2.7
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import os.path
import random
wangheda / gist:568b11ba97a958604e54
Last active December 10, 2021 23:42
Downloading full CiteSeerX dataset

Lately I have needed the CiteSeerX dataset for my research, and I had been following the instructions [on this blog][1].

Sadly, after an update to OAIHarvester, those instructions no longer work. Here is my version that does.

  1. Download [OAIHarvester2][2] from oclc.org; the latest version was 2-0.1.12 at the time of writing.

  2. Run the following command to download the full CiteSeerX dataset:

    java -classpath .:harvester2.jar:log4j-1.2.12.jar:xalan.jar:xercesImpl.jar:xml-apis.jar ORG.oclc.oai.harvester2.app.RawWrite -out citeseerx_alldata.xml http://citeseerx.ist.psu.edu/oai2
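
For anyone curious what the harvester is doing over the wire, here is a rough Python sketch of the OAI-PMH `ListRecords` paging loop. This is not OAIHarvester2's code, just my reading of the protocol: the function names (`build_list_records_url`, `parse_page`, `harvest`) are made up for illustration, and using `oai_dc` as the metadata prefix is an assumption (it is the format every OAI-PMH repository has to support). I have not checked that the endpoint still answers, so treat the network part as untested.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
# Endpoint taken from the java command above.
BASE_URL = "http://citeseerx.ist.psu.edu/oai2"


def build_list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments besides verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one response."""
    root = ET.fromstring(xml_text)
    # Only header identifiers live in the OAI namespace; dc:identifier does not match.
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = None
    if token_el is not None and token_el.text and token_el.text.strip():
        token = token_el.text.strip()
    return identifiers, token


def harvest(base_url=BASE_URL):
    """Yield record identifiers page by page until no resumption token is returned."""
    token = None
    while True:
        url = build_list_records_url(base_url, token)
        with urllib.request.urlopen(url) as resp:
            identifiers, token = parse_page(resp.read())
        for identifier in identifiers:
            yield identifier
        if not token:
            break
```

Calling `harvest()` would walk the whole repository one page at a time, which is slow for a dataset this size, so the java command above is still what I would use for the full download.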