@evanmiltenburg
Last active May 25, 2016 10:39

Training a Dutch parser

Steps

  1. Get the text data and decompress it (later steps read wikicorpus.txt): wget http://kyoto.let.vu.nl/~miltenburg/public_data/wikicorpus/corpus/wikicorpus.txt.gz ; gunzip wikicorpus.txt.gz
  2. Get the wang2vec code, which is used to train the word vectors: wget https://github.com/wlin12/wang2vec/archive/master.zip
  3. Run unzip master.zip ; rm master.zip
  4. Build the word vector code: Run cd wang2vec-master/ ; make ; cd ..
  5. Train CBOW vectors: Run ./wang2vec-master/word2vec -train wikicorpus.txt -output cbow.vectors -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -iter 5 -cap 0 >> training.log 2>&1 &
  6. Train structured skip-ngram vectors: Run ./wang2vec-master/word2vec -train wikicorpus.txt -output structured_ngram.vectors -type 3 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -iter 5 -cap 0 >> training_ssg.log 2>&1 &
  7. Get the code for the parser: Run wget https://github.com/elikip/bist-parser/archive/b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip
  8. Unzip the code: Run unzip b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip ; rm b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip
  9. And rename the folder: Run mv bist-parser-b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa bist_parser
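  10. Get universal dependencies data (the URL is quoted so that the shell does not treat & as a background operator): wget 'https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1699/ud-treebanks-v1.3.tgz?sequence=1&isAllowed=y'
  11. Rename (wget keeps the query string in the file name): mv 'ud-treebanks-v1.3.tgz?sequence=1&isAllowed=y' ud-treebanks-v1.3.tgz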
  12. Unzip and remove: tar zxvf ud-treebanks-v1.3.tgz ; rm ud-treebanks-v1.3.tgz
  13. Make directories for parsing results: mkdir bist_parser/barchybrid/results_cbow ; mkdir bist_parser/barchybrid/results_ssg ; mkdir bist_parser/bmstparser/results_cbow ; mkdir bist_parser/bmstparser/results_ssg
  14. Remove all non-Dutch treebank data (requires GNU parallel and a grep that supports -P): cd ud-treebanks-v1.3/ ; ls | grep -vP "UD_Dut.*" | parallel rm -r ; cd ..
  15. Copy the training script to the subfolders: cp train_parser.sh bist_parser/barchybrid/ ; cp train_parser.sh bist_parser/bmstparser/
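
Step 14 deletes directories recursively, so it may be worth trying the filter on scratch data first. The sketch below is one possible alternative that needs neither GNU parallel nor grep -P. It is not part of the original steps. It assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do). The directory names it creates are placeholders, not the real contents of ud-treebanks-v1.3/. Unlike the grep pattern in step 14, the -name test only keeps entries whose names start with UD_Dut.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory standing in for ud-treebanks-v1.3/,
# so the filter can be tried without touching real data.
treebank_dir="$(mktemp -d)"
mkdir "$treebank_dir/UD_Dutch" "$treebank_dir/UD_Dutch-LassySmall" \
      "$treebank_dir/UD_English" "$treebank_dir/UD_German"

# Delete every direct child whose name does not start with "UD_Dut".
# -mindepth 1 skips the directory itself; -maxdepth 1 stops find from
# descending into the children it is about to remove.
find "$treebank_dir" -mindepth 1 -maxdepth 1 ! -name 'UD_Dut*' -exec rm -r {} +

# List what is left, one name per line, sorted.
remaining="$(ls "$treebank_dir" | sort)"
printf '%s\n' "$remaining"
```

To apply the same filter to the real data, point treebank_dir at ud-treebanks-v1.3 instead of the scratch directory and drop the mkdir line.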

To do:
