@malithjkmt
Last active August 4, 2017 00:41
Shuffle a parallel corpus without losing the alignment.
# To run: python corpusShuffler -src sourceCorpus.txt -tgt targetCorpus.txt
import argparse
import random
parser = argparse.ArgumentParser(description='## CORPUS SHUFFLER ##')
parser.add_argument(
    '-src', help='source language corpus to shuffle', required=True)
parser.add_argument(
    '-tgt', help='target language corpus to shuffle', required=True)
args = parser.parse_args()
# Read both corpora; the context managers close the files when done.
with open(args.src, 'r') as src, open(args.tgt, 'r') as tgt:
    srcData = src.readlines()
    tgtData = tgt.readlines()
# Re-seeding only yields the same permutation when both lists have the same length.
if len(srcData) != len(tgtData):
    raise SystemExit('source and target corpora must have the same number of lines')
random.seed(7)  # same seed for both files (to keep the alignment)
random.shuffle(srcData)
random.seed(7)  # re-seed so the target gets the identical permutation
random.shuffle(tgtData)
with open(args.src + '_shuffled', 'w') as srcOut, open(args.tgt + '_shuffled', 'w') as tgtOut:
    srcOut.writelines(srcData)
    tgtOut.writelines(tgtData)
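
An alternative sketch, not the approach used in the script above: instead of re-seeding the generator before each shuffle, the two line lists can be zipped into pairs and shuffled once, so each source line travels with its translation. The helper name `shuffle_parallel` and its `seed` parameter are hypothetical, introduced here only for illustration.

```python
import random


def shuffle_parallel(src_lines, tgt_lines, seed=7):
    """Shuffle two aligned lists of lines with one shared permutation.

    Hypothetical helper for illustration; not part of the original script.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError('corpora must have the same number of lines')
    pairs = list(zip(src_lines, tgt_lines))
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(pairs)
    if not pairs:
        return [], []
    src_shuffled, tgt_shuffled = zip(*pairs)
    return list(src_shuffled), list(tgt_shuffled)


if __name__ == '__main__':
    src = ['a\n', 'b\n', 'c\n']
    tgt = ['A\n', 'B\n', 'C\n']
    s, t = shuffle_parallel(src, tgt)
    # Each shuffled source line still sits opposite its counterpart.
    print(all(x.upper() == y for x, y in zip(s, t)))
```

Shuffling pairs makes the alignment explicit in the data structure rather than relying on two shuffles consuming the random stream identically.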