oconnoat/cdec.md

## cdec.md

      
# Creating a Spanish-English Translator with CDEC
This tutorial (1) will guide you through most of the steps needed to build a translation system. A working translator needs a few additional steps, which I outline here.
An alternative tutorial (2) takes you through the process of building a realtime decoder, which uses a different sort of training.
Once you have completed tutorial (1), you need to compile the grammar into a single file and call the decoder. This is achieved with the following steps:


1. Concatenate the partial, gzipped grammars together: `cat dev.grammars/*.gz > combined_grammar.gz`

2. Identify the location of the MIRA weights: `mira.dev.lc-tok.es-en.<some-extension>/weights.final`

3. Run the decoder with the weights and the combined grammar. Note that the `cdec.ini` file needs full, absolute paths to work: `$CDEC_HOME/decoder/cdec -c cdec.ini -g combined_grammar.gz -w mira.dev.lc-tok.es-en.<some-extension>/weights.final`
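
The first two steps can also be scripted. The following is a minimal sketch using only the Python standard library, not part of cdec itself. The function names `concatenate_grammars` and `find_weights` are my own, and picking the most recently modified match is an assumption about which tuning run you want. It relies on the fact that several gzip files appended together still form a valid gzip file, which is why the plain `cat` above works.

```python
import glob
import gzip
import os
import shutil


def concatenate_grammars(grammar_dir, output_path):
    """Append every .gz file in grammar_dir to output_path, like `cat *.gz`."""
    parts = sorted(glob.glob(os.path.join(grammar_dir, "*.gz")))
    # Never feed the output file back into itself if it lives in grammar_dir.
    parts = [p for p in parts
             if os.path.abspath(p) != os.path.abspath(output_path)]
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Raw byte copy: a sequence of gzip members is valid gzip.
                shutil.copyfileobj(src, out)
    return parts


def find_weights(pattern="mira.dev.lc-tok.es-en.*/weights.final"):
    """Return the most recently modified weights file matching pattern.

    Returns None when nothing matches. Choosing the newest match is an
    assumption; pick a specific directory if you have several runs.
    """
    matches = glob.glob(pattern)
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)
```

For example, `concatenate_grammars('dev.grammars', 'combined_grammar.gz')` followed by `find_weights()` gives you the two paths that the decoder call in step 3 needs.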


Then, when you type in a Spanish sentence, you get an English translation.
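
If you would rather feed sentences to the command-line decoder from a script than type them in, something like the sketch below may help. It assumes Python 3.5 or later, that the decoder reads one sentence per line on standard input, and that it writes one translation per line on standard output. The helper names `build_cdec_command` and `translate_lines` are hypothetical. Only the `-c`, `-g` and `-w` flags come from the command above. Note that making the command-line paths absolute does not fix relative paths written inside `cdec.ini`; those still need editing by hand.

```python
import os
import subprocess


def build_cdec_command(cdec_home, ini_path, grammar_path, weights_path):
    """Build the argument list for the decoder call, using absolute paths."""
    return [
        os.path.join(cdec_home, "decoder", "cdec"),
        "-c", os.path.abspath(ini_path),
        "-g", os.path.abspath(grammar_path),
        "-w", os.path.abspath(weights_path),
    ]


def translate_lines(command, sentences):
    """Send one sentence per line to the command and return its output lines.

    Assumes the command prints exactly one output line per input line.
    """
    result = subprocess.run(
        command,
        input="\n".join(sentences) + "\n",
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise if the decoder exits with an error
    )
    return result.stdout.splitlines()


# Example usage (requires a built cdec and the files from the steps above):
#   command = build_cdec_command(os.environ["CDEC_HOME"], "cdec.ini",
#                                "combined_grammar.gz", "path/to/weights.final")
#   print(translate_lines(command, ["seguridad nacional"]))
```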
The Python version of this code looks like this:

    #coding: utf8
    import cdec
    import gzip

    # Create and configure a single decoder object.
    # 'nc.klm' is the language model file; adjust the path to your own.
    decoder = cdec.Decoder(formalism='scfg',
            feature_function=['WordPenalty', 'KLanguageModel nc.klm'],
            add_pass_through_rules=True)
    # Set default weights for the language model features
    decoder.weights['LanguageModel_OOV'] = -1
    decoder.weights['LanguageModel'] = 0.1

    # Load the tuned MIRA weights. The decoder is not re-created here:
    # a second cdec.Decoder(...) call would discard the configuration above.
    decoder.read_weights('mira.dev.lc-tok.es-en.20140824-154702/weights.final')
    print('read weights')

    # Read the SCFG from the concatenated, gzipped grammar file
    # (use the path you wrote the combined grammar to)
    with gzip.open('dev.grammars/compound-grammar.gz') as f:
        grammar = f.read()
    print('opened grammar')

    # Translate the sentence; returns a translation hypergraph
    hg = decoder.translate('seguridad nacional', grammar=grammar)
    print('translated')
    # Extract the best hypothesis from the hypergraph
    print(hg.viterbi())
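
To translate more than one sentence, the `translate` and `viterbi` calls above can be wrapped in a small helper. This is only a sketch: `preprocess` and `translate_all` are names I made up, and the lowercasing is an assumption based on the `lc-tok` in the weights directory name, which suggests the system was trained on lowercased, tokenized text. It is a rough stand-in for the tutorial's own preprocessing scripts and does no real tokenization.

```python
def preprocess(sentence):
    """Lowercase and collapse whitespace.

    Assumed stand-in for the lowercasing/tokenization used in training;
    it does not split punctuation from words.
    """
    return " ".join(sentence.lower().split())


def translate_all(decoder, grammar, sentences):
    """Translate each sentence and return the best hypothesis for each."""
    translations = []
    for sentence in sentences:
        # translate() returns a hypergraph; viterbi() extracts the best path.
        hg = decoder.translate(preprocess(sentence), grammar=grammar)
        translations.append(hg.viterbi())
    return translations
```

With the `decoder` and `grammar` objects from the script above, `translate_all(decoder, grammar, ['Seguridad nacional'])` returns a list with one translation per input sentence.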

I am extremely grateful to Chris Hokamp, Andy Way, Sandipan Dandapat, and Jian Zhang for replying to my help request. All errors in the text are my own. Any comments and improvements are welcome.