putheakhem/en-kh.md

## en-kh.md

      
            
          
    Khmer MT


Moses is a statistical machine translation system that lets you automatically train translation models from parallel text.

Before installing Moses, install the following packages

  sudo apt-get install g++ git subversion automake libtool zlib1g-dev libboost-all-dev libbz2-dev liblzma-dev python-dev libtcmalloc-minimal4
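
Before starting the long Boost and Moses builds, it can help to confirm that the command-line tools used in the later steps are on the PATH. The following is a minimal sketch: the helper name `missing_tools` and the list of tools it checks are assumptions made for illustration and are not part of Moses.

```shell
#!/bin/sh
# Print the name of every command given as an argument that is not on the PATH.
# (missing_tools is a hypothetical helper, not part of Moses.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools the later steps in this guide call directly; adjust the list as needed.
missing=$(missing_tools g++ git make wget tar)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All build tools found"
fi
```

This only checks for executables. Development libraries such as zlib1g-dev or libbz2-dev are not covered and would surface as errors during the build instead.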

Create a directory to hold all work related to machine translation.

  mkdir ~/MT

Installing Boost

  cd ~/MT
  wget https://dl.bintray.com/boostorg/release/1.64.0/source/boost_1_64_0.tar.gz
  tar zxvf boost_1_64_0.tar.gz
  cd boost_1_64_0/
  ./bootstrap.sh
  ./b2 -j5 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install || echo FAILURE

Installing Moses

Clone the Moses decoder from GitHub into the directory ~/MT

  cd ~/MT
  git clone https://github.com/moses-smt/mosesdecoder.git
  cd mosesdecoder/

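Build Moses

  ./bjam -j5

To list the options available with bjam, run

  ./bjam --help

To build against the Boost installed above, pass its path explicitly. Use $HOME rather than ~ here, because the shell does not expand ~ in the middle of an argument.

  ./bjam --with-boost=$HOME/MT/boost_1_64_0 -j5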

Install GIZA++


Download and build GIZA++ in ~/MT

  cd ~/MT
  git clone https://github.com/moses-smt/giza-pp.git
  cd giza-pp
  make

Navigate into the mosesdecoder directory and create a tools folder

  cd ~/MT/mosesdecoder
  mkdir tools

Copy the GIZA++ binaries to the tools folder

  cp ../giza-pp/GIZA++-v2/GIZA++ ../giza-pp/GIZA++-v2/snt2cooc.out ../giza-pp/mkcls-v2/mkcls tools/

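
Training fails later if any of the three binaries is missing from the tools folder, so a quick check after copying may save time. This is a sketch only: `check_giza_tools` is a hypothetical helper, and the default path assumes the directory layout used in this guide.

```shell
#!/bin/sh
# Report which of the three binaries copied above are missing or not
# executable in the given directory. Returns non-zero if any is missing.
# (check_giza_tools is a hypothetical helper, not part of Moses.)
check_giza_tools() {
  dir=$1
  status=0
  for bin in GIZA++ snt2cooc.out mkcls; do
    if [ ! -x "$dir/$bin" ]; then
      echo "missing: $dir/$bin"
      status=1
    fi
  done
  return $status
}

# Default location used in this guide; pass another directory to override.
check_giza_tools "${1:-$HOME/MT/mosesdecoder/tools}" && echo "tools folder is complete"
```
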
Installing SRILM


TODO

Training the Translation System


Create a new directory corpus in the main folder ~/MT/


Create a new directory training inside the folder ~/MT/corpus/


Add the parallel data to ~/MT/corpus/training/, for example data.en and data.kh
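
Parallel data must be aligned line by line: line N of data.en has to be the translation of line N of data.kh. A cheap sanity check is to compare line counts before going further. The sketch below assumes nothing about Moses; `check_parallel` is a hypothetical helper and the demo files contain made-up placeholder text.

```shell
#!/bin/sh
# Succeed only if both files exist and have the same number of lines.
# (check_parallel is a hypothetical helper, not part of Moses.)
check_parallel() {
  src=$1
  tgt=$2
  if [ ! -f "$src" ] || [ ! -f "$tgt" ]; then
    echo "file not found"
    return 2
  fi
  # $(( )) strips the padding some wc implementations add.
  n_src=$(($(wc -l < "$src")))
  n_tgt=$(($(wc -l < "$tgt")))
  if [ "$n_src" -eq "$n_tgt" ]; then
    echo "OK: $n_src sentence pairs"
  else
    echo "MISMATCH: $src has $n_src lines, $tgt has $n_tgt"
    return 1
  fi
}

# Demo on two tiny placeholder files.
tmp=$(mktemp -d)
printf 'hello\nthank you\n' > "$tmp/data.en"
printf 'khmer sentence 1\nkhmer sentence 2\n' > "$tmp/data.kh"
check_parallel "$tmp/data.en" "$tmp/data.kh"
rm -r "$tmp"
```

On the real corpus the call would be `check_parallel ~/MT/corpus/training/data.en ~/MT/corpus/training/data.kh`. Equal line counts do not prove the files are correctly aligned, but unequal counts always indicate a problem.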

Pre-Process Corpora

Tokenization


Navigate into the corpus folder: cd ~/MT/corpus/


English Tokenization

  ../mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < training/data.en > data.tok.en

Khmer Tokenization (skipped; the file is copied as-is)

  cp training/data.kh data.tok.kh

Create Truecase model


Navigate into the corpus folder: cd ~/MT/corpus/


English Truecase model

  ../mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus data.tok.en

Khmer Truecase model (skipped: Khmer script has no letter case)

Truecasing


Navigate into the corpus folder: cd ~/MT/corpus/


Truecasing English

  ../mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < data.tok.en > data.true.en

Truecasing Khmer

  cp data.tok.kh data.true.kh

Cleaning of English and Khmer

Remove empty lines and sentence pairs with more than 80 tokens on either side

  ../mosesdecoder/scripts/training/clean-corpus-n.perl data.true en kh data.clean 1 80

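
To make the two numeric arguments (1 and 80) concrete, here is a rough sketch of a length filter in plain shell and awk. It is not the Moses script and only approximates its length check; clean-corpus-n.perl applies further checks, such as a limit on the length ratio between the two sides. `filter_pairs` and the output suffixes `.src` and `.tgt` are hypothetical names, and the sketch assumes the input lines contain no tab characters.

```shell
#!/bin/sh
# Keep sentence pairs whose token counts on BOTH sides lie in [MIN, MAX].
# Usage: filter_pairs SRC TGT MIN MAX OUT_PREFIX
# Writes OUT_PREFIX.src and OUT_PREFIX.tgt.
# (filter_pairs is a hypothetical helper, not the Moses cleaning script.)
filter_pairs() {
  src=$1; tgt=$2; min=$3; max=$4; out=$5
  # paste joins line N of both files with a tab, keeping the pair together.
  paste "$src" "$tgt" | awk -F '\t' -v min="$min" -v max="$max" -v out="$out" '
    {
      ns = split($1, s, " ")   # tokens on the source side
      nt = split($2, t, " ")   # tokens on the target side
      if (ns >= min && ns <= max && nt >= min && nt <= max) {
        print $1 > (out ".src")
        print $2 > (out ".tgt")
      }
    }'
}

# Demo with a maximum of 3 tokens so the effect is visible on tiny input:
# the second pair has an empty source line and the third is too long.
tmp=$(mktemp -d)
printf 'a b\n\na b c d\n' > "$tmp/in.src"
printf 'x\ny\nz\n' > "$tmp/in.tgt"
filter_pairs "$tmp/in.src" "$tmp/in.tgt" 1 3 "$tmp/clean"
cat "$tmp/clean.src" "$tmp/clean.tgt"
rm -r "$tmp"
```

The key point is that a pair is dropped as a unit: if either side fails the check, both lines are removed, so the two output files stay aligned.
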
Training


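Navigate into ~/MT and create a new folder model1. The -lm option below expects a Khmer language model at ~/MT/lm/data.blm.kh, given as an absolute path; this guide does not yet cover building it. The command runs in the background and logs to training.out.

  cd ~/MT/
  mkdir model1
  cd model1
  ../mosesdecoder/scripts/training/train-model.perl -root-dir train \
    -corpus ../corpus/data.clean -f en -e kh \
    -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:3:$HOME/MT/lm/data.blm.kh:8 \
    -external-bin-dir ../mosesdecoder/tools >& training.out &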
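
Because training runs in the background, it is not obvious when it has finished. train-model.perl writes the decoder configuration moses.ini into the model folder under the root directory as its final step, so the presence of that file is a reasonable completion signal. The sketch below assumes the -root-dir value train used above; `training_done` is a hypothetical helper.

```shell
#!/bin/sh
# Succeed if training under the given root directory has produced a
# non-empty moses.ini. (training_done is a hypothetical helper.)
training_done() {
  [ -s "$1/model/moses.ini" ]
}

# "train" is the -root-dir used in this guide; run this from ~/MT/model1
# or pass a different root directory as the first argument.
if training_done "${1:-train}"; then
  echo "training finished: ${1:-train}/model/moses.ini exists"
else
  echo "no moses.ini yet; check training.out for progress or errors"
fi
```

While training is still running, `tail -f training.out` shows the log as it is written.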