@wecing
Created September 6, 2012 00:59
Tools & Data

Tools

(You can save the "line", "shell", and "regular expression" parts for later.)

~/bin

You can create a folder anywhere (preferably under your home directory), with any name you like (usually "bin"), to hold your own shell commands. After creating the folder, add this line to ~/.profile.local (create it if it doesn't exist):

PATH=$PATH:~/bin

Then put your own scripts into that folder (~/bin here); after logging out and back in, you will be able to run any command in the folder from anywhere.

Similarly, you can run "decatur-pipeline" from anywhere (instead of typing "/path/to/decatur/pipeline/decatur-pipeline") once you put this line into your ~/.profile.local:

PATH=$PATH:/path/to/decatur/pipeline

The two lines above in fact append paths to the PATH environment variable, so together they have the same effect as:

PATH=$PATH:~/bin:/path/to/decatur/pipeline

And similarly, for SRILM:

PATH=$PATH:/g/ssli/software/pkgs/SRILM-devel/bin
PATH=$PATH:/g/ssli/software/pkgs/SRILM-devel/bin/i686-m64

decatur

You could put these two scripts under ~/bin:

cn-decatur:

#!/bin/bash
SEG=stanford

if [ $# -eq 1 ]; then
	SEG=$1
elif [ $# -gt 1 ]; then
	echo "Usage: $0 [segmenter] < infile > outfile"
	exit 1
fi

decatur-pipeline --language zh --mode mt --segmenter "$SEG"

en-decatur:

#!/bin/bash
decatur-pipeline --language en --mode mt

Then you can run something like "cn-decatur < input.txt > output.txt" or "cn-decatur ldc < input.txt > output.txt" to process input.txt with decatur and write the result to output.txt.

But, very importantly, you have to make the scripts executable before running them:

chmod +x cn-decatur en-decatur

I don't know which version of decatur you are using, but if it complains about missing libraries, you may need to add this line to your .profile.local:

export PERL5LIB=/homes/binz/apps/lib/site_perl:$PERL5LIB

line

Sometimes you need to get a specific line of a file. Use this script:

line:

#!/bin/bash

if [ $# -lt 2 ]; then
    echo "Usage: $(basename "$0") [LINE]... FILE"
    exit 1
fi

args=("$@")
file=${args[$#-1]}

for (( i=0; i<$#-1; i++ )); do
    num=${args[${i}]}
    sed -n "${num}p" "${file}"
done

Then you can use "line 24 35 16 input.txt" to see the 24th, 35th, and 16th lines of input.txt. (Don't forget to chmod +x it first.)

stanford tagger

The Stanford tagger is a POS (part-of-speech) tagger; with it you can figure out which words in a file are nouns, verbs, and so on.

cn-tagger:

#!/bin/bash

TAGGER_DIR=/g/ssli/software/pkgs/stanford-postagger-full-2012-03-09

if [ $# -ne 1 ]
then
    echo 'Usage:' $(basename ${0}) 'FILE'
    exit 1
else
    java -classpath ${TAGGER_DIR}/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ${TAGGER_DIR}/models/chinese.tagger -textFile "$1"
fi

Similarly, en-tagger:

#!/bin/bash

TAGGER_DIR=/g/ssli/software/pkgs/stanford-postagger-full-2012-03-09

TAGGER_MODEL=english-caseless-left3words-distsim.tagger 

if [ $# -ne 1 ]
then
    echo 'Usage:' $(basename ${0}) 'FILE'
    exit 1
else
    java -classpath ${TAGGER_DIR}/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ${TAGGER_DIR}/models/${TAGGER_MODEL} -textFile "$1" -sentenceDelimiter newline -tokenize false
fi

Sample usage:

cn-tagger hello.txt > hello.txt.tagged

And about the meaning of the tags:

Chinese:

[image: Chinese tag set reference (not preserved)]

English:

[image: English tag set reference (not preserved)]

Shell:

Pipe ("|"): use the output of the previous program as the input of the next one.

# show all files with "txt" in the filename; same as "ls *txt*".
ls | grep 'txt'

Redirect ("<" and ">"): ">" writes the output of a command to a file; "<" feeds the content of a file to a program as its input.

cn-decatur < input.txt > output.txt
cn-tagger input.txt > output.txt

xargs: use the output of the previous command as command-line arguments for the next program.

# show all lines containing "hello" in all .txt files under the current directory
ls *.txt | xargs grep 'hello'

# be careful when some files have spaces in their names; you would have to pass
# extra arguments to xargs (or use find with -print0 and xargs -0) to make it work.

grep: show all lines in some files that match a pattern or contain a specific string.

grep 'hello' input.txt
grep -n 'hello' input.txt # -n means showing the matching line number
grep -E '^[0-9]+' input.txt # regular expression

for loops:

for N in {1..7}; do echo $N; done

Regular expressions in Python

Examples:

import re

s = 'Tel: 000-123-8974'
P = re.compile(r'(?<=Tel: )([0-9]+)-([0-9]+)-([0-9]+)')
m = P.search(s)

print(m.group(0)) # "000-123-8974"
print(m.group(1)) # "000"
print(m.group(2)) # "123"
print(m.group(3)) # "8974"

# same as:
#   P2 = re.compile(r'[0-9]+')
#   t = P2.findall(s)
t = re.findall(r'[0-9]+', s)
# t is a list: ['000', '123', '8974']

Data

Output of the systems

(OMG I wrote so many words in the previous section)

Output of the translation systems is now on SRI's server (SFTP; you could use FileZilla), and Jing is the one who generated it. The URL is mpserv.speech.sri.com; the username and password are "uw" and "5169a36f1c". The data is under /home/boltftp/upload/sri/MT/p1-dev, and its filename is combo.subset.tgz.

But the alignment Jing provided is messed up because of the pre-translated tokens, like "$eng {League of Legends}": sometimes they are treated as a single token ($eng), and sometimes recognized as three words (League of Legends). Jing is now on vacation, so I tried to clean it up myself. The cleaned-up data is still quite messy, but much better than the original. It is under /homes/wangc8/t/ex_error_analysis/formal-informal/data/cleanedup/ (remember to use -r when copying folders with cp).

There's no cleaned-up data for sri3. Mari said we can just ignore it for now (its alignment format is different from the other three as well).

Formal vs. Informal text

I trained two models, formal.ngram and informal.ngram, and put them under /homes/wangc8/t/informality. They are generated from the data in /homes/wangc8/t/informality/build/filtered, which is the cleaned-up training data (it would take some time to find the original copy…). Anyway, the two models are ready to use. I automated the generating process; you can take a look at the Makefile in the same directory.

The training data doesn't contain tokens like "$number" or "$eng", so it's not a perfect match for the test data. And the two models are generated with very simple rules, so they could be tuned to work better as well.

classifying

The basic idea of classifying sentences is to see under which model they have the lower perplexity. To generate a perplexity (ppl) for each sentence, you need to run these commands (make sure the input is segmented):

ngram -lm formal.ngram -ppl input.txt -debug 1 > formal.ppl
ngram -lm informal.ngram -ppl input.txt -debug 1 > informal.ppl

Then you would have to use Python to do the comparison for each sentence. You could take a look at classify.py, which is in the same directory as formal.ngram. The basic idea is to use a regex (regular expression) to extract the perplexity, convert it to a number, do the comparison, and finally write the result to a file.
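A minimal sketch of that comparison (not the actual classify.py): it assumes each per-sentence block in the -debug 1 output contains a "ppl= …" line and that the file ends with one file-level summary; the helper names parse_ppls and classify are made up here, and the exact output format may differ between SRILM versions.

```python
import re

# With -debug 1, ngram prints per-sentence statistics like
# "0 zeroprobs, logprob= -12.3 ppl= 45.6 ppl1= 78.9"; "ppl1=" does
# not match this pattern, so only the "ppl=" values are captured.
PPL_RE = re.compile(r'ppl=\s*([0-9.eE+-]+)')

def parse_ppls(text):
    """Extract per-sentence perplexities; the last match is assumed
    to be the file-level summary and is dropped."""
    vals = [float(x) for x in PPL_RE.findall(text)]
    return vals[:-1]

def classify(formal_text, informal_text):
    """Label each sentence by whichever model gives the lower perplexity."""
    labels = []
    for f, i in zip(parse_ppls(formal_text), parse_ppls(informal_text)):
        labels.append('formal' if f < i else 'informal')
    return labels
```

In practice you would read formal.ppl and informal.ppl from disk and write the labels out line by line, one per sentence.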

You may have to do the POS tagging after classifying, because tagging adds tags to the words, which makes the language models unable to score them. Remember the command-line for loops I showed before -- you need them to run the tagger on each file.


Update: misc

  1. I wrote a script for comparing different segmenters, which is under /homes/wangc8/t/error-analysis-archived/visualize_diff/. Note that the data it uses is not the cleaned-up version. It is very, very (I really want to add 10 more "very"s here) messy; the core function is util.DiffObj.get_diff_parts().

    The algorithm used is dynamic programming. It's a variation of the "longest common subsequence" problem (with state recording). The basic idea is:

    Suppose the two sequences are a and b, and dp[i][j] is the length of the longest common subsequence of a[:i+1] and b[:j+1]. Set all values in dp to -1. Then, assuming the first items of both sequences are the same ("<s>" in this example), we have dp[0][0] = 1; then for each i and j, we have dp[i+1][j] = max(dp[i+1][j], dp[i][j]) and dp[i][j+1] = max(dp[i][j+1], dp[i][j]); and if a[i+1] == b[j+1], we also have:

    dp[i+1][j+1] = max(1 + dp[i][j], dp[i+1][j+1]).

    Note that the order in which you iterate through dp is crucial, and each value in dp can be updated many times.

    BTW, the script shows its output as an HTML file that highlights the differences.

  2. The data used for error analysis of decatur is under /homes/wangc8/t/error_analysis/data. full.data, as its name indicates, is the full data set; the original text (not processed by decatur) is under orig/. Note that these original text files are very big; you can use the line command I wrote above to take a look at specific lines.

    You could read /homes/wangc8/t/error_analysis/mismatching_types/mm_neq.txt for examples of mismatching tokens. 14.56% of sentence pairs have different total numbers of tokens on the chn/eng sides; the frequency with which each token appears in the chn/eng corpora is:

    chn: $url => 701, $number => 105371, $email => 356, $hour => 2633, $date => 52953

    eng: $url => 153, $number => 88914, $email => 8, $hour => 374, $date => 22522

    But the data is generated by a previous version of decatur; the numbers may be different now.

    There are fewer tokens on the English side.

    It's very helpful to start with those sentence pairs that have the same total number of tokens, but not exactly the same tokens, e.g.:

    "… $number … $date …" vs. "… $number … $number …".

  3. I do have the classification & tagging scripts, but they are so messy that even I cannot bear them. In fact, during the last few days working on the project, I was trying to automate the whole process of cleaning up, classification, and POS tagging… and stopped after getting the cleaned-up data. Those scripts don't work with it; they are supposed to be fed the original data Jing gave me.

  4. See this graph for what nouns and verbs are aligned to in formal/informal corpus:

    [image: bar graph of noun/verb alignments in the formal/informal corpora (not preserved)]

    From the graph you can tell that verbs are aligned to verbs less frequently, which seems weird. But as far as I remember, most of these mismatches are either errors of the POS tagger I'm using (the Stanford tagger) or just a flavor of translation. In short, they do make sense. The script I used to generate the graph is bargraph.pl, which I found online; you could use Gnuplot, Sage, or Mathematica to generate better graphs.

  5. I have a pretty simple (and naive) script for removing garbage data from the training data of decatur: /homes/wangc8/t/error-analysis-archived/wtf/garbage_filter/filter.py.
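The LCS recurrence from item 1 can be sketched in Python like this; it's a minimal version without the state recording that util.DiffObj.get_diff_parts() adds on top, using the more common 0-initialized formulation instead of the -1 initialization described above:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    dp[i][j] is the LCS length of the prefixes a[:i] and b[:j],
    so the first row and column stay 0 (empty prefix).
    """
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                # matching items extend the common subsequence by one
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                # otherwise keep the best of dropping a[i] or dropping b[j]
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]
```

For the diff itself you would also record, at each cell, which of the three cases won, and then walk back from dp[len(a)][len(b)] to recover the matched and mismatched spans.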

And about the other scripts…

Some scripts I wrote are not worth reusing -- for example, the ones I used to clean up the data I was given, and the script that counts what types of words nouns are aligned to (its output was used to generate the graph above).

The others are so simple that you could easily rewrite them, like /homes/wangc8/t/ex_error_analysis/old/remove_meta.py, which removes the meta translations (things like "$eng {League of Legends}" (btw I don't play that game)).

In short, anything I haven't mentioned here is something I feel you don't need to care about.
