d2207197/README_citeseerx.md

## README_citeseerx.md

      
    README

Original Data

    <record>
        <header>
           <identifier>oai:CiteSeerX.psu:10.1.1.1.1484</identifier>
           <datestamp>2009-05-24</datestamp>
        </header>
        <metadata>
          <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarc
          <dc:title>Winner-Take-All Network Utilising Pseudoinverse Reconstruction Subnets Demonstrates Robustness on the Handprinted Character Recognition Problem</dc:title>
          <dc:creator>J. Körmendy-rácz</dc:creator>
          <dc:creator>S. Szabó</dc:creator>
          <dc:creator>J. Lörincz</dc:creator>
          <dc:creator>G. Antal</dc:creator>
          <dc:creator>G. Kovács</dc:creator>
          <dc:creator>A. Lörincz</dc:creator>
          <dc:subject>Correspondence and offprint requests to</dc:subject>
          <dc:subject>J. Kormendy-Rácz</dc:subject>
          <dc:description>Wittmeyer’s pseudoinverse iterative algorithm is formulated&#13; as a dynamic connectionist Data Compression and Reconstruction (DCR) network, and subnets of this type are supplemented by the winner-take-all paradigm. The winner is selected upon the goodness-of-fit of
          <dc:contributor>The Pennsylvania State University CiteSeerX Archives</dc:contributor>
          <dc:publisher>Springer</dc:publisher>
          <dc:date>2009-05-24</dc:date>
          <dc:date>2007-11-19</dc:date>
          <dc:date>1999</dc:date>
          <dc:format>application/pdf</dc:format>
          <dc:type>text</dc:type>
          <dc:identifier>http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1484</dc:identifier>
          <dc:source>http://people.inf.elte.hu/lorincz/Files/publications/WTA_NCA.pdf</dc:source>
          <dc:language>en</dc:language>
          <dc:rights>Metadata may be used without restrictions as long as the oai identifier remains attached to it.</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
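
Reading fields back out of a record like the one above can be sketched in Python with the standard library. This is only an illustration, not part of the original pipeline: `SAMPLE` is a shortened, hand-made stand-in for a harvested record (it omits the `xsi:schemaLocation` attribute, whose prefix is not declared inside the record as shown), and `dc_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:dc attribute of the sample record.
DC = "{http://purl.org/dc/elements/1.1/}"

# Shortened, hand-made stand-in for one harvested record (assumption).
SAMPLE = """<record>
  <header>
    <identifier>oai:CiteSeerX.psu:10.1.1.1.1484</identifier>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:creator>S. Szabó</dc:creator>
      <dc:creator>G. Antal</dc:creator>
      <dc:date>1999</dc:date>
      <dc:language>en</dc:language>
    </oai_dc:dc>
  </metadata>
</record>"""


def dc_fields(record_xml, field):
    """Return the text of every dc:<field> element in one record."""
    root = ET.fromstring(record_xml)
    return [el.text or "" for el in root.iter(DC + field)]


print(dc_fields(SAMPLE, "creator"))
print(dc_fields(SAMPLE, "language"))
```

Parsing a whole record at once is fine for a single sample; for the full dump a streaming parser such as `ET.iterparse` would likely be the more practical choice.
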
Process Flow & File Description


Program OAIHarvester2 demo: data downloader that dumps the records served by the CiteSeerX OAI endpoint
execute:
 $ java -classpath .:oaiharvester.jar:xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml


Data 7.8GB citeseerx_alldata.xml: original raw data


Program extract_dc:descriptions.sh: extracts the text of each dc:description element from citeseerx_alldata.xml, one description per line
execute:
 $ ./extract_dc:descriptions.sh citeseerx_alldata.xml > citeseerx_descriptions.txt


Data 2.6GB citeseerx_descriptions.txt: extracted descriptions
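
For readers who prefer Python to the grep/sed pipeline, the same line-based extraction could look roughly like the sketch below. It mirrors the shell script's behaviour as closely as possible; `extract_descriptions` and the `sample` lines are hypothetical names and data used only for illustration.

```python
import re

# Mirror the two sed substitutions: leading whitespace plus the opening tag,
# and the closing tag plus trailing whitespace.
OPEN_TAG = re.compile(r"^\s*<dc:description>")
CLOSE_TAG = re.compile(r"</dc:description>\s*$")


def extract_descriptions(lines):
    """Yield the description text from each line that has the opening tag."""
    for line in lines:
        if "<dc:description>" in line:  # same filter as the grep step
            text = OPEN_TAG.sub("", line.rstrip("\n"))
            yield CLOSE_TAG.sub("", text)


# Hand-written sample lines (assumption), not taken from the real dump.
sample = [
    "  <dc:title>Some title</dc:title>\n",
    "  <dc:description>First abstract.</dc:description>\n",
    "  <dc:description>Second one.</dc:description>  \n",
]
for description in extract_descriptions(sample):
    print(description)
```

Like the shell version, this works line by line: if a description spans several lines, only the line carrying the opening tag is kept.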


Program line_tokenizer.py: sentence tokenizer (NLTK Punkt), writes one sentence per line
execute:
 $ cat citeseerx_descriptions.txt | parallel -j 16 --keep-order --spreadstdin --block 20m ./line_tokenizer.py > citeseerx_descriptions_sents.txt


Data 2.6GB citeseerx_descriptions_sents.txt: sentences from descriptions
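
The parallel options used above split stdin into blocks of whole lines (`--spreadstdin --block 20m`), hand each block to one of 16 jobs (`-j 16`), and emit the results in input order (`--keep-order`). A rough Python analogue of that idea is sketched below. It is not how GNU parallel is implemented: threads are used instead of separate processes to keep the sketch short, and `process_block` is a hypothetical stand-in for `./line_tokenizer.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def blocks_of_lines(lines, block_size):
    """Group whole lines into blocks of roughly block_size characters."""
    block, size = [], 0
    for line in lines:
        block.append(line)
        size += len(line)
        if size >= block_size:
            yield block
            block, size = [], 0
    if block:
        yield block


def process_block(block):
    """Hypothetical worker standing in for ./line_tokenizer.py."""
    return [line.strip().upper() for line in block]


def run_in_order(lines, block_size, jobs):
    """Process blocks concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Executor.map yields results in submission order, like --keep-order.
        results = pool.map(process_block, blocks_of_lines(lines, block_size))
        return [out for block_out in results for out in block_out]


print(run_in_order(["a\n", "b\n", "c\n"], block_size=2, jobs=4))
```

Because blocks are cut at line boundaries, a description is never split across two workers, which is why the tokenizer can safely treat each input line independently.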


Program geniatagger: adds base form, part-of-speech, chunk, and named-entity tags to each token
execute:
 $ cat citeseerx_descriptions_sents.txt | parallel -j 16 --keep-order --spreadstdin --block 20m geniatagger > citeseerx_descriptions_sents_genia.txt


Data 9.4GB citeseerx_descriptions_sents_genia.txt: geniatagger tagged sentences
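
Reading the tagged file back could look like the sketch below. It assumes geniatagger's usual output layout: one token per line with five tab-separated columns (word, base form, part-of-speech tag, chunk tag, named-entity tag) and a blank line between sentences. Check this against the actual file before relying on it. `parse_genia` and the `sample` lines are hypothetical and hand-written, not real tagger output.

```python
def parse_genia(lines):
    """Group tagger output lines into sentences of token dictionaries."""
    # Assumed column order of geniatagger output.
    columns = ("word", "base", "pos", "chunk", "ne")
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:  # last sentence when the input lacks a trailing blank line
        yield sentence


# Hand-written lines in the assumed format.
sample = [
    "Subnets\tSubnet\tNNS\tB-NP\tO\n",
    "compete\tcompete\tVBP\tB-VP\tO\n",
    "\n",
]
for tokens in parse_genia(sample):
    print([(t["word"], t["pos"]) for t in tokens])
```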


Editor

顏孜羲 joe@nlplab.cc

  
## extract_dc:descriptions.sh
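#!/bin/bash
# Print the text of every <dc:description> element, one per line.
# Only lines that contain the opening tag are kept.
grep '<dc:description>' "$@" | sed -e 's/^\s*<dc:description>//' -e 's|</dc:description>\s*$||'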

## line_tokenizer.py
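#!/usr/bin/env python
# Read lines from stdin (or from files named on the command line) and
# print one sentence per line.
import fileinput

import nltk.data

# Pre-trained Punkt sentence splitter for English.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
for line in fileinput.input():
    for sentence in sent_detector.tokenize(line.strip()):
        print(sentence)  # parenthesized form runs on both Python 2 and 3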