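Search the Google Books Ngrams data for the total occurrences of pairs of words, in this case CRC32 collisions found in dictionary files. The input is a CSV of word pairs, one pair per line. Run the shell script in a directory that contains the gzipped 1-gram files; it slices out only the lines for the needed words and writes them to collision_stats.tsv. The Perl script then totals the counts for each pair. Note that the shell script reads reddit_words.csv while the Perl script reads word_pairs.csv, so the word list needs to be available under both names (or one of the scripts edited to match).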

slice_ngrams.sh
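# Build one alternation regex from the word list (possessive 's removed), anchored to the word column.
pattern=$(perl -e 'while (<>) { s/\x27s//g; chomp; push @words, split /,/; } printf "^(%s)\\t", join "|", @words;' reddit_words.csv)
# Stream every 1-gram file and keep only the rows whose word matches exactly.
gunzip -c googlebooks-eng-all-1gram-*-[a-z].gz | pv | grep -P "$pattern" > collision_stats.tsv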
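To see what the slicing step does, here is a minimal sketch on a tiny hand-made data set. It swaps in sed, tr, paste and grep -E for the perl and grep -P used above, and every file name and word in it (sample_words.csv, sample_1grams.tsv, sample_stats.tsv, alpha, beta and so on) is a hypothetical placeholder, not part of the original gist.

```shell
# Hypothetical word list: one comma-separated pair per line.
printf '%s\n' "alpha,beta" "gamma,delta's" > sample_words.csv

# Hypothetical rows in the 1-gram layout: word, year, match count, volume count.
printf 'alpha\t1990\t12\t3\nalphas\t1990\t4\t2\ngamma\t1991\t7\t1\nzeta\t1991\t9\t4\n' > sample_1grams.tsv

# Drop possessive 's, put one word per line, then join the words with "|".
words=$(sed "s/'s//g" sample_words.csv | tr ',' '\n' | paste -sd '|' -)
tab=$(printf '\t')

# Anchor at line start and require a tab right after the word,
# so "alphas" is not picked up as a match for "alpha".
grep -E "^(${words})${tab}" sample_1grams.tsv > sample_stats.tsv
cat sample_stats.tsv
```

Only the alpha and gamma rows survive: alphas fails the tab anchor, and zeta is not in the word list.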
Perl script (totals the sliced counts)
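use strict;
use warnings;

# Load the sliced 1-gram rows: word, year, match count, volume count.
my %table;
open(my $cs, '<', 'collision_stats.tsv') or die "Cannot open collision_stats.tsv: $!";
while (<$cs>) {
    chomp;
    my ($word, $year, $total, $books) = split /\t/;
    $word =~ s/'s$//;
    $table{$word}{$year} = [$total, $books];
    $table{$word}{'total'}[0] += $total;
    $table{$word}{'total'}[1] += $books;
}
close($cs);

# For each line of comma-separated words, print the combined total and the per-word totals.
open(my $rw, '<', 'word_pairs.csv') or die "Cannot open word_pairs.csv: $!";
while (<$rw>) {
    chomp;
    my @totaltotals = (0, 0);
    my @texttots;
    foreach my $word (split /,/) {
        $word =~ s/'s$//;
        # Words with no matching 1-gram rows count as zero.
        my @totals = @{ $table{$word}{'total'} // [0, 0] };
        $totaltotals[0] += $totals[0];
        $totaltotals[1] += $totals[1];
        push @texttots, sprintf('%s (%d)', $word, $totals[0]);
    }
    printf "%d %s\n", $totaltotals[0], join(',', @texttots);
}
close($rw);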
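As a quick cross-check of the per-word totals, the match counts in the sliced file can also be summed with awk. This is only a sketch of the first half of the Perl script, run on hypothetical sample rows; sample_rows.tsv and sample_totals.tsv are made-up names, and it does not strip possessives or combine the words into pairs.

```shell
# Hypothetical sliced rows: word, year, match count, volume count.
printf 'alpha\t1990\t12\t3\nalpha\t1991\t7\t1\nbeta\t1990\t4\t2\n' > sample_rows.tsv

# Sum the match count (column 3) per word (column 1) across all years.
awk -F '\t' '{ total[$1] += $3 } END { for (w in total) printf "%s\t%d\n", w, total[w] }' \
    sample_rows.tsv | sort > sample_totals.tsv
cat sample_totals.tsv
```

With these rows, alpha totals 19 and beta totals 4, which should agree with the number the Perl script prints in parentheses after each word.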