@d2207197
Last active August 29, 2015 14:10

README

Original Data

    <record>
        <header>
           <identifier>oai:CiteSeerX.psu:10.1.1.1.1484</identifier>
           <datestamp>2009-05-24</datestamp>
        </header>
        <metadata>
          <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarc
          <dc:title>Winner-Take-All Network Utilising Pseudoinverse Reconstruction Subnets Demonstrates Robustness on the Handprinted Character Recognition Problem</dc:title>
          <dc:creator>J. Körmendy-rácz</dc:creator>
          <dc:creator>S. Szabó</dc:creator>
          <dc:creator>J. Lörincz</dc:creator>
          <dc:creator>G. Antal</dc:creator>
          <dc:creator>G. Kovács</dc:creator>
          <dc:creator>A. Lörincz</dc:creator>
          <dc:subject>Correspondence and offprint requests to</dc:subject>
          <dc:subject>J. Kormendy-Rácz</dc:subject>
          <dc:description>Wittmeyer’s pseudoinverse iterative algorithm is formulated&#13; as a dynamic connectionist Data Compression and Reconstruction (DCR) network, and subnets of this type are supplemented by the winner-take-all paradigm. The winner is selected upon the goodness-of-fit of
          <dc:contributor>The Pennsylvania State University CiteSeerX Archives</dc:contributor>
          <dc:publisher>Springer</dc:publisher>
          <dc:date>2009-05-24</dc:date>
          <dc:date>2007-11-19</dc:date>
          <dc:date>1999</dc:date>
          <dc:format>application/pdf</dc:format>
          <dc:type>text</dc:type>
          <dc:identifier>http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1484</dc:identifier>
          <dc:source>http://people.inf.elte.hu/lorincz/Files/publications/WTA_NCA.pdf</dc:source>
          <dc:language>en</dc:language>
          <dc:rights>Metadata may be used without restrictions as long as the oai identifier remains attached to it.</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
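
A record of this shape can be read with Python's standard `xml.etree.ElementTree`. The sketch below is illustrative only: `SAMPLE` is a hypothetical, shortened record (not the harvested one), `parse_record` is a name introduced here, and it assumes the `<record>` element carries no default namespace, as in the snippet above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the sample record above.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
}

# Hypothetical, shortened record used only for illustration.
SAMPLE = """<record>
  <header>
    <identifier>oai:CiteSeerX.psu:10.1.1.1.1484</identifier>
    <datestamp>2009-05-24</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:title>Example title</dc:title>
      <dc:creator>A. Author</dc:creator>
      <dc:creator>B. Author</dc:creator>
      <dc:description>First sentence. Second sentence.</dc:description>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Return the identifier, title, creators and description of one record."""
    root = ET.fromstring(xml_text)
    dc = root.find("metadata/oai_dc:dc", NS)
    return {
        "identifier": root.findtext("header/identifier"),
        "title": dc.findtext("dc:title", namespaces=NS),
        "creators": [e.text for e in dc.findall("dc:creator", NS)],
        "description": dc.findtext("dc:description", namespaces=NS),
    }
```

Only `dc:description` is used by the rest of the pipeline; the other fields are shown to make the record layout concrete.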

Process Flow & File Description

  1. Program OAIHarvester2 DEMO: data downloader (see the demo link and its instructions)

    execute:

     $ java -classpath .:oaiharvester.jar:xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml
    
  2. Data 7.8GB citeseerx_alldata.xml: original raw data

  3. Program extract_dc:descriptions.sh: extract dc:descriptions from citeseerx_alldata.xml

    execute:

     $ ./extract_dc:descriptions.sh citeseerx_alldata.xml > citeseerx_descriptions.txt
    
  4. Data 2.6GB citeseerx_descriptions.txt: extracted descriptions

  5. Program line_tokenizer.py: sentence tokenizer; splits each description into sentences, one per line

    execute:

     $ cat  citeseerx_descriptions.txt |  parallel  -j 16 --keep-order --spreadstdin --block 20m ./line_tokenizer.py  > citeseerx_descriptions_sents.txt
    
  6. Data 2.6GB citeseerx_descriptions_sents.txt: sentences from descriptions

  7. Program geniatagger: tags each sentence

    execute:

     $ cat citeseerx_descriptions_sents.txt | parallel -j 16 --keep-order --spreadstdin --block 20m geniatagger > citeseerx_descriptions_sents_genia.txt
    
  8. Data 9.4GB citeseerx_descriptions_sents_genia.txt: geniatagger tagged sentences
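
Steps 5 and 7 both rely on `parallel --spreadstdin --block 20m --keep-order`: standard input is cut into blocks at line boundaries, each block is fed to a separate job, and the outputs are printed in input order. The sketch below mimics that idea in Python as a rough illustration, not a reimplementation of GNU parallel. `split_into_blocks`, `process_block` and `run_pipeline` are hypothetical names, block sizes are counted in characters rather than bytes, and `process_block` is a stand-in for the real per-block command.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, block_size):
    """Group lines into blocks of roughly block_size characters.

    A line is never split across two blocks.
    """
    block, size = [], 0
    for line in lines:
        block.append(line)
        size += len(line)
        if size >= block_size:
            yield block
            block, size = [], 0
    if block:
        yield block


def process_block(block):
    # Hypothetical stand-in for the per-block command
    # (line_tokenizer.py or geniatagger in the steps above).
    return [line.upper() for line in block]


def run_pipeline(lines, block_size=20, jobs=4):
    """Process blocks concurrently and return the output in input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Executor.map yields results in submission order, which plays
        # the role of --keep-order.
        results = pool.map(process_block, split_into_blocks(lines, block_size))
        return [line for block in results for line in block]
```

Because blocks are split on line boundaries, this scheme only works when each line can be processed independently, which is why the descriptions and sentences are kept one per line.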

Editor

顏孜羲 joe@nlplab.cc

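extract_dc:descriptions.sh

    #!/bin/bash
    # Print the text of every <dc:description> line in the given files, with the
    # surrounding tags stripped. -h keeps grep from prefixing file names.
    grep -h '<dc:description>' "$@" | sed -e 's|^\s*<dc:description>||' -e 's|</dc:description>\s*$||'
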
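
For readers who prefer Python, the same line-based filter can be sketched as below. This is an approximation of the shell pipeline, not the script used to produce the data; `extract_descriptions` is a hypothetical name. Like the shell version, it only sees the line that contains the opening tag and leaves XML entities such as `&#13;` untouched.

```python
import re

# Leading whitespace plus the opening tag, and the closing tag plus
# trailing whitespace, matching the two sed substitutions.
OPEN_TAG = re.compile(r"^\s*<dc:description>")
CLOSE_TAG = re.compile(r"</dc:description>\s*$")


def extract_descriptions(lines):
    """Yield the text of each line that contains a <dc:description> tag."""
    for line in lines:
        if "<dc:description>" in line:
            text = OPEN_TAG.sub("", line.rstrip("\n"))
            yield CLOSE_TAG.sub("", text)
```

A description that spans several lines in the XML would be cut after its first line by both versions.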
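line_tokenizer.py

    #!/usr/bin/env python
    """Split each input line into sentences and print one sentence per line."""
    import fileinput

    import nltk.data

    # Requires the NLTK Punkt models (tokenizers/punkt/english.pickle).
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    for line in fileinput.input():
        for sentence in sent_detector.tokenize(line.strip()):
            print(sentence)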
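
The input and output contract of the tokenizer (one description per input line, one sentence per output line) can be tried without NLTK using the sketch below. The regex splitter is a deliberately naive stand-in for Punkt, not the method used here: it breaks after every `.`, `!` or `?` followed by whitespace and will mis-split abbreviations such as "e.g.". `naive_split` and `tokenize_lines` are hypothetical names.

```python
import re

# Split after sentence-ending punctuation that is followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text):
    """Naive sentence splitter; a stand-in for the Punkt tokenizer."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


def tokenize_lines(lines):
    """Yield one sentence at a time, preserving input order."""
    for line in lines:
        for sentence in naive_split(line):
            yield sentence
```

Empty input lines produce no output, so the sentence file can have a different number of lines than the description file.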