@ymoslem
Last active April 16, 2024 08:09
Example of translating a file with M2M-100 using CTranslate2
# This example uses M2M-100 models converted to the CTranslate2 format.
# Download CTranslate2 models:
# • M2M-100 418M-parameter model: https://bit.ly/33fM1AO
# • M2M-100 1.2B-parameter model: https://bit.ly/3GYiaed
import ctranslate2
import sentencepiece as spm
# [Modify] Set file paths of the source and target
source_file_path = "source_test.en"
target_file_path = "target_test.ja.mt"
# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = "m2m100_ct2/"
sp_model_path = "m2m100_ct2/sentencepiece.model"
# [Modify] Set language prefixes of the source and target
src_prefix = "__en__"
tgt_prefix = "__ja__"
# [Modify] Set the device and beam size
device = "cpu" # or "cuda" for GPU
beam_size = 5
# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)
# Open the source file
with open(source_file_path, "r", encoding="utf-8") as source:
    lines = source.readlines()
source_sents = [line.strip() for line in lines]
target_prefix = [[tgt_prefix]] * len(source_sents)
# Subword the source sentences
source_sents_subworded = sp.encode(source_sents, out_type=str)
source_sents_subworded = [[src_prefix] + sent for sent in source_sents_subworded]
print("First sentence:", source_sents_subworded[0])
# Translate the source sentences
translator = ctranslate2.Translator(ct_model_path, device=device)
translations = translator.translate_batch(source_sents_subworded, batch_type="tokens", max_batch_size=2024, beam_size=beam_size, target_prefix=target_prefix)
translations = [translation[0]['tokens'] for translation in translations]
# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_prefix):] for sent in translations_desubword]
print("First translation:", translations_desubword[0])
# Save the translations to a file
with open(target_file_path, "w+", encoding="utf-8") as target:
    for line in translations_desubword:
        target.write(line.strip() + "\n")
print("Done! Target file saved at:", target_file_path)

ymoslem commented Feb 16, 2022

M2M-100 Multilingual Neural Machine Translation Model

M2M-100 in CTranslate2 format

CTranslate2 is a fast inference engine for Transformer models. It supports models originally trained with OpenNMT-py, OpenNMT-tf, and FairSeq. It is cross-platform and runs on either CPU or GPU.

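You can download one of the M2M-100 models already converted to the CTranslate2 format. The download links for the 418M-parameter and 1.2B-parameter models are listed in the comments at the top of the script.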

How to convert an M2M-100 model to CTranslate2

Alternatively, you can convert an M2M-100 model to the CTranslate2 format yourself as follows:

  1. Install CTranslate2 and FairSeq. Also install SentencePiece, which you will use during translation.
pip3 install ctranslate2 fairseq sentencepiece
  2. Download one of the FairSeq M2M-100 models available here.
  3. Run this command to convert the FairSeq M2M-100 model to the CTranslate2 format.
ct2-fairseq-converter --model_path $MODEL --data_dir $DictDir --fixed_dictionary $DictFile --output_dir $OUTPUT --quantization int8
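
If you prefer to drive the conversion from Python, the command above can be assembled as an argument list and passed to subprocess. This is only a sketch: the helper name build_convert_command and the paths m2m100.pt, dict.txt, and m2m100_ct2 are hypothetical placeholders, and it assumes the four shell variables stand for the checkpoint file, the dictionary directory, the dictionary file, and the output directory, as the flag names suggest.

```python
import shlex
import subprocess


def build_convert_command(model_path, data_dir, dict_file, output_dir,
                          quantization="int8"):
    """Return the ct2-fairseq-converter argument list (hypothetical helper)."""
    return [
        "ct2-fairseq-converter",
        "--model_path", model_path,        # $MODEL
        "--data_dir", data_dir,            # $DictDir
        "--fixed_dictionary", dict_file,   # $DictFile
        "--output_dir", output_dir,        # $OUTPUT
        "--quantization", quantization,
    ]


# Placeholder paths (assumptions); replace them with your own files.
cmd = build_convert_command("m2m100.pt", ".", "dict.txt", "m2m100_ct2")

# Print the command so it can be checked before running it.
print(shlex.join(cmd))

# Uncomment to run the conversion (requires ctranslate2 and fairseq).
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when a path contains spaces.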

Translation with M2M-100 models

You can use the script in this gist to translate a source file using M2M-100, as follows:

  1. Install CTranslate2, FairSeq, and SentencePiece.
pip3 install ctranslate2 fairseq sentencepiece
  2. Change the paths in the script: the source file (source_file_path), the CTranslate2 model (ct_model_path), and the SentencePiece model (sp_model_path).

  3. M2M-100 uses a source language token and a target language token. The latter is used for prefix-constrained decoding, so that the translation is generated in the specified language. In the script, adjust src_prefix and tgt_prefix accordingly. The list of supported languages and their language codes can be found here.

  4. Run the Python script as usual. It translates the source file and writes the target file to the path specified with target_file_path.
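
To make the role of the two language tokens concrete, here is a small sketch that needs neither the model nor SentencePiece. The helper names (lang_token, prepare_batch, strip_target_token) and the toy subword tokens are assumptions for illustration and do not appear in the gist. The source token is prepended to each tokenized sentence, and the target token is passed separately as a one-token prefix per sentence, which is what the script's target_prefix argument expects.

```python
def lang_token(code):
    """Format a language code as an M2M-100 language token, e.g. en -> __en__."""
    return f"__{code}__"


def prepare_batch(tokenized_sents, src_code, tgt_code):
    """Prepend the source token and build one target prefix per sentence."""
    src = lang_token(src_code)
    tgt = lang_token(tgt_code)
    sources = [[src] + sent for sent in tokenized_sents]
    # A fresh list per sentence, so the prefixes do not share one object.
    target_prefixes = [[tgt] for _ in tokenized_sents]
    return sources, target_prefixes


def strip_target_token(tokens, tgt_code):
    """Drop the leading target language token from a translated token list."""
    tgt = lang_token(tgt_code)
    return tokens[1:] if tokens and tokens[0] == tgt else tokens


# Toy subword tokens standing in for SentencePiece output (assumption).
tokenized = [["▁Hello", "▁world"], ["▁Good", "▁morning"]]
sources, prefixes = prepare_batch(tokenized, "en", "ja")

print(sources[0])
print(prefixes[0])
```

The script removes the target prefix from the decoded string instead; stripping the token before decoding, as strip_target_token does, is an alternative with the same intent.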


Testing M2M-100 with English-to-Japanese

Test Dataset

M2M-100 418M-parameter model

  • Beam Size: 5
  • BLEU: 24.8

M2M-100 1.2B-parameter model

  • Beam Size: 5
  • BLEU: 26.4

Using M2M-100 models with a GUI

You can also use M2M-100 models in DesktopTranslator, a local cross-platform machine translation GUI. It also has stand-alone executables for Mac and Windows.
