@oconnoat
Last active November 1, 2018 04:09
CDEC Translator

# Creating a Spanish-English Translator with CDEC

This tutorial (1) will guide you through most of the steps needed to build a translation system. A few additional steps are needed before you can actually translate, and I outline them below. There is also an alternative tutorial (2), which takes you through the process of building a realtime decoder and uses a different sort of training.

Once you have completed tutorial (1) above, you need to compile the grammar and call the decoder. This is achieved with the following steps:

  1. Concatenate the partial, gzipped grammars together: `cat dev.grammars/*.gz > combined_grammar.gz`

  2. Identify the location of the MIRA weights: `mira.dev.lc-tok.es-en.<some-extension>/weights.final`

  3. Run the decoder with the weights and grammar: `$CDEC_HOME/decoder/cdec -c cdec.ini -g my-grammar -w mira.dev.lc-tok.es-en.<some-extension>/weights.final`. Here `my-grammar` stands for the combined grammar file from step 1. Note that the cdec.ini file needs full, absolute paths to work.

Then, when you type in a Spanish sentence, you get an English translation.
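
Step 1 works because the gzip format allows several compressed members to sit back to back in one file, so the partial grammars can be joined byte for byte without decompressing them. As a rough sketch of steps 1 and 3 in Python: the helper names `combine_grammars` and `decoder_command` are hypothetical (they are not part of cdec), and the file names passed in are whatever you used in the tutorial. The sketch does not edit cdec.ini itself, so any paths inside that file still have to be made absolute by hand.

```python
import glob
import os
import shutil


def combine_grammars(grammar_dir, out_path):
    """Concatenate every .gz file in grammar_dir into out_path.

    gzip permits multiple compressed members in a single file, so a
    plain byte-level copy is enough; nothing is decompressed here.
    """
    out_path = os.path.abspath(out_path)
    # Collect the inputs first, and skip the output file in case it
    # lives in the same directory and matches *.gz.
    parts = sorted(
        p for p in glob.glob(os.path.join(grammar_dir, '*.gz'))
        if os.path.abspath(p) != out_path
    )
    with open(out_path, 'wb') as out:
        for part in parts:
            with open(part, 'rb') as src:
                shutil.copyfileobj(src, out)
    return out_path


def decoder_command(cdec_home, ini, grammar, weights):
    """Build the step 3 command line, resolving file arguments to absolute paths."""
    return [
        os.path.join(cdec_home, 'decoder', 'cdec'),
        '-c', os.path.abspath(ini),
        '-g', os.path.abspath(grammar),
        '-w', os.path.abspath(weights),
    ]
```

The list returned by `decoder_command` can be handed to `subprocess.call` to start the decoder, which then reads sentences from standard input just as in step 3.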

The Python version of these steps, using the cdec Python bindings, looks like this:

#coding: utf8
import cdec
import gzip

# Create and configure a decoder object
decoder = cdec.Decoder(formalism='scfg',
        feature_function=['WordPenalty', 'KLanguageModel nc.klm'],
        add_pass_through_rules=True)
# Set weights for the language model features
decoder.weights['LanguageModel_OOV'] = -1
decoder.weights['LanguageModel'] = 0.1

# NOTE: this creates a fresh decoder and replaces the one configured above,
# so the feature functions and weights set there are not used from here on
decoder = cdec.Decoder(formalism='scfg')

# Load the tuned MIRA weights
decoder.read_weights('mira.dev.lc-tok.es-en.20140824-154702/weights.final')
print('read weights')

# Read the SCFG from a gzipped file
with gzip.open('dev.grammars/compound-grammar.gz') as f:
    grammar = f.read()
print('opened grammar')

# Translate the sentence; returns a translation hypergraph
hg = decoder.translate('seguridad nacional', grammar=grammar)
print('translated')
# Extract the best hypothesis from the hypergraph
print(hg.viterbi())
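
To translate several sentences rather than one, the same two calls can be wrapped in a small helper. This is only a sketch: `translate_all` is a hypothetical name, not part of cdec, and it relies on nothing beyond the `translate(sentence, grammar=...)` and `viterbi()` calls already used above, so it takes the decoder and grammar as arguments instead of importing cdec itself.

```python
def translate_all(decoder, grammar, sentences):
    """Return the best hypothesis for each non-empty sentence.

    `decoder` is any object with a translate(sentence, grammar=...)
    method whose result has a viterbi() method, as in the script above.
    """
    results = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank lines
        hg = decoder.translate(sentence, grammar=grammar)
        results.append(hg.viterbi())
    return results
```

With the `decoder` and `grammar` objects from the script above, `translate_all(decoder, grammar, open('input.es'))` would return one translation per non-blank line, where `input.es` is a placeholder file name.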

I am extremely grateful to Chris Hokamp, Andy Way, Sandipan Dandapat, and Jian Zhang for replying to my help request. All errors in the text are my own. Any comments and improvements are welcome.

ghost commented Aug 9, 2015

The original tutorial by Chris Dyer does not mention combining the grammars, although this is obviously a vital step in building a translator. Thank you for pointing that out!

@connectsarthak

Hi there, could you please tell me what my-grammars is in the 3rd step?

@anushreeapk

Your post is super helpful! I followed the steps in http://www.cdec-decoder.org/guide/tutorial.html and then combined my grammars and final weights like you've mentioned above for the europarl (Spanish-English) corpus. My decoder, however, does not translate very well (the translations have a lot of Spanish words and very few English ones) and the BLEU score is pretty low (in fact, random). Is there anything else I should be doing? Any help would be greatly appreciated!
