evanmiltenburg/dutchparser.md

## dutchparser.md

Training a Dutch parser

Steps


1. Get the text data and decompress it (the training commands below read the uncompressed `wikicorpus.txt`): `wget http://kyoto.let.vu.nl/~miltenburg/public_data/wikicorpus/corpus/wikicorpus.txt.gz ; gunzip wikicorpus.txt.gz`
2. Get the code for the structured n-grams: `wget https://github.com/wlin12/wang2vec/archive/master.zip`
3. Unzip it and remove the archive: `unzip master.zip ; rm master.zip`
4. Build the word vector code: `cd wang2vec-master/ ; make ; cd ..`
5. Train CBOW vectors (runs in the background and logs to `training.log`): `./wang2vec-master/word2vec -train wikicorpus.txt -output cbow.vectors -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -iter 5 -cap 0 >> training.log 2>&1 &`
6. Train structured skip-ngram vectors (runs in the background and logs to `training_ssg.log`): `./wang2vec-master/word2vec -train wikicorpus.txt -output structured_ngram.vectors -type 3 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -iter 5 -cap 0 >> training_ssg.log 2>&1 &`
7. Get the code for the parser, pinned to a specific commit: `wget https://github.com/elikip/bist-parser/archive/b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip`
8. Unzip the code and remove the archive: `unzip b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip ; rm b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip`
9. Rename the folder: `mv bist-parser-b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa bist_parser`
10. Get the Universal Dependencies data. Quote the URL so the shell does not interpret `?` and `&`, and use `-O` to set the output filename directly: `wget -O ud-treebanks-v1.3.tgz "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1699/ud-treebanks-v1.3.tgz?sequence=1&isAllowed=y"`
11. Unpack and remove the archive: `tar zxvf ud-treebanks-v1.3.tgz ; rm ud-treebanks-v1.3.tgz`
12. Make directories for parsing results: `mkdir bist_parser/barchybrid/results_cbow ; mkdir bist_parser/barchybrid/results_ssg ; mkdir bist_parser/bmstparser/results_cbow ; mkdir bist_parser/bmstparser/results_ssg`
13. Remove all non-Dutch treebank data (this command requires GNU parallel): `cd ud-treebanks-v1.3/ ; ls | grep -vP "UD_Dut.*" | parallel rm -r ; cd ..`
14. Copy the training script to the subfolders: `cp train_parser.sh bist_parser/barchybrid/ ; cp train_parser.sh bist_parser/bmstparser/`
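
The treebank cleanup step above relies on GNU parallel. If that tool is not installed, the same effect can probably be had with `find` alone. The following is a minimal sketch, not the author's method: the function name `keep_only_prefix` is hypothetical, and it assumes a `find` that supports `-mindepth` and `-maxdepth` (GNU and BSD versions do).

```shell
# Remove every top-level entry in a directory whose name does not start
# with the given prefix. Uses only find and rm, so GNU parallel is not needed.
# keep_only_prefix is a hypothetical helper name, not part of any tool above.
keep_only_prefix() {
    dir=$1
    prefix=$2
    # -mindepth 1 skips the directory itself; -maxdepth 1 stays at the top level.
    find "$dir" -mindepth 1 -maxdepth 1 ! -name "${prefix}*" -exec rm -r {} +
}
```

With the layout from the steps above, the call would be `keep_only_prefix ud-treebanks-v1.3 UD_Dut`. Like the original command, it deletes non-matching files as well as non-matching directories.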

To do:

Train the parsers, following the instructions in the bist-parser repository at the pinned commit: https://github.com/elikip/bist-parser/tree/b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa
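
Because both word2vec commands in the steps are started in the background with `&`, it may be worth confirming that the vector files have actually been written before training starts. The sketch below assumes the output uses the usual word2vec text format, whose first line is `<vocabulary size> <dimension>`; that wang2vec writes the same header is an assumption here, not something confirmed by this document. The function name `check_vectors` is hypothetical.

```shell
# Check that a vector file exists, is non-empty, and declares the expected
# dimension in its header line.
# Assumption: the first line is "<vocab_size> <dimension>" (word2vec text format).
# check_vectors is a hypothetical helper name.
check_vectors() {
    file=$1
    expected_dim=$2
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file"
        return 1
    fi
    # Read the two whitespace-separated header fields from the first line.
    read -r vocab dim < "$file"
    if [ "$dim" != "$expected_dim" ]; then
        echo "unexpected dimension in $file: $dim"
        return 1
    fi
    echo "ok: $file ($vocab words, $dim dimensions)"
}
```

For the files produced above, the calls would be `check_vectors cbow.vectors 50` and `check_vectors structured_ngram.vectors 50`, matching the `-size 50` setting. A passing check only shows that the header was written; the log files remain the better indicator that training has finished.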