@lyger
Created October 27, 2018 08:48
import torch.utils.data as D

from myutils.data import (
    EmbeddingLoader, CompositeConverter, TextDataset,
    ParallelDataset, ParallelCollate, lowercase, tokenize_with_bos_and_eos,
)

# Load the English and German multiCCA embeddings (512, per the file names).
emb_en = EmbeddingLoader('/cl/work/michael-l/multiembed/en.multiCCA.512.embedding', 512)
emb_de = EmbeddingLoader('/cl/work/michael-l/multiembed/de.multiCCA.512.embedding', 512)

# Combine the preprocessing steps: lowercasing and tokenization with BOS/EOS.
converter = CompositeConverter(lowercase, tokenize_with_bos_and_eos)

# Tokenized News Commentary v13 (de-en) text, one dataset per language.
news_en = emb_en.process_dataset(
    TextDataset('/cl/work/michael-l/WMT18_data/newscom/news-commentary-v13.de-en.en.tok', converter)
)
news_de = emb_de.process_dataset(
    TextDataset('/cl/work/michael-l/WMT18_data/newscom/news-commentary-v13.de-en.de.tok', converter)
)

# Pair the two sides and batch them using each language's own collate function.
train_loader = D.DataLoader(
    ParallelDataset(news_en, news_de),
    collate_fn=ParallelCollate(emb_en.get_collate_fn(), emb_de.get_collate_fn()),
    batch_size=128, num_workers=8, shuffle=True,
)