@soaxelbrooke
Last active September 2, 2018 22:16
Counts word frequencies in parallel over fixed-size splits of a file, then combines the per-split counts.
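# Requires wf (install with `cargo install wf`), GNU parallel, and Python with tqdm.
# Usage: pass the input text file as the first argument.
mkdir -p splits wfs
echo 'Splitting file into parts...'
# 200,000 lines per part; -a 5 gives five-letter suffixes (splits/splitaaaaa, ...).
split -a 5 -l 200000 "$1" splits/split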
ls splits/ | parallel 'echo "Counting {}..."; cat splits/{} | wf > wfs/{}_wf.txt'
echo 'Combining split counts...'
python -c 'from tqdm import tqdm; from functools import reduce; from glob import glob; from collections import Counter; of = open("wfs.txt", "w"); wf = reduce(lambda a, b: a + b, (Counter(dict((pair[0], int(pair[1])) for pair in (line.strip().split() for line in open(fpath)))) for fpath in tqdm(glob("wfs/*"))), Counter()); [of.write("{} {}\n".format(key, count)) for key, count in sorted(wf.items(), key=lambda p: -p[1])]'
rm -rf wfs splits
echo 'Word frequencies written to wfs.txt.'
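
The combining step above is packed into a single `python -c` line, so here is a minimal sketch of the same merge written out as a function. It assumes each per-split file holds one `word count` pair per line, which is what the one-liner's parsing implies. The name `merge_counts` and its parameters are hypothetical, and unlike the one-liner this version skips lines that do not have exactly two fields.

```python
from collections import Counter
from pathlib import Path


def merge_counts(count_dir, out_path):
    """Sum the per-split "word count" files in count_dir and write totals to out_path.

    Hypothetical helper: the name and signature are not part of the original script.
    """
    totals = Counter()
    for path in sorted(Path(count_dir).glob("*")):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                parts = line.split()
                if len(parts) != 2:
                    continue  # assumption: skip blank or malformed lines
                word, count = parts
                totals[word] += int(count)

    with open(out_path, "w", encoding="utf-8") as out:
        # most_common() yields entries in descending count order,
        # matching the sort used in the one-liner.
        for word, count in totals.most_common():
            out.write("{} {}\n".format(word, count))
    return totals
```

Calling `merge_counts("wfs", "wfs.txt")` before the `rm -rf wfs splits` line would produce the same kind of output file. Adding into one `Counter` in place also avoids building a new `Counter` for every `a + b` in the `reduce`.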