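Searches the Google Books Ngram 1-gram files for the total occurrences of the words in a list of word pairs, in this case pairs of dictionary words whose CRC32 checksums collide. The input is a CSV of word pairs, one pair per line. Run the shell script in a directory that contains the gzipped 1-gram files; it slices out only the rows for the words that are needed. The Perl script then totals the counts for each pair. Note that slice_ngrams.sh reads the pairs from reddit_words.csv while total.pl reads word_pairs.csv, so presumably the same list has to be available under both names.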

slice_ngrams.sh
Shell
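# Build one anchored regex, ^(word1|word2|...)\t, from every word in reddit_words.csv
# (with 's stripped), then keep only the 1-gram rows whose first column matches exactly.
# pv only displays progress and can be dropped; grep -P needs a grep with PCRE support.
gunzip -c googlebooks-eng-all-1gram-*-[a-z].gz | pv | grep -P `cat reddit_words.csv | perl -e "while(<>) { s/\'s//g; chomp; push @words, split ','; } printf '^(%s)\\t', join '|', @words;"` > collision_stats.tsv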
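The pattern-building step is the densest part of the one-liner, so here is a minimal sketch of it on its own. It swaps the inline Perl for sed, tr and paste, and the function name build_pattern is hypothetical, not part of the gist. It assumes the CSV has no blank lines and uses Unix line endings.

```shell
# Hypothetical helper (not in the original gist): print the anchored alternation
# that slice_ngrams.sh passes to grep -P, e.g. ^(word1|word2|word3)\t
# $1: CSV file of comma-separated words, one group per line.
build_pattern() {
    # Strip 's, put one word per line, then join the lines with '|'.
    words=$(sed "s/'s//g" "$1" | tr ',' '\n' | paste -sd '|' -)
    # '\\t' in the printf format emits a literal backslash followed by t,
    # which grep -P then interprets as a tab.
    printf '^(%s)\\t' "$words"
}
```

Under those assumptions it could stand in for the backtick expression, for example grep -P "$(build_pattern reddit_words.csv)", which also makes the generated pattern easy to inspect before running it over the full data set.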
total.pl
Perl
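use strict;
use warnings;

# Per-word stats from the sliced 1-gram rows: $table{word}{year} = [matches, books],
# plus a running $table{word}{total} summed over all years.
my %table;
open(my $cs, '<', 'collision_stats.tsv') or die "Cannot open collision_stats.tsv: $!";
while (<$cs>) {
    my ($word, $year, $total, $books) = split;
    $word =~ s/'s$//;
    $table{$word}{$year} = [$total, $books];
    $table{$word}{total}[0] += $total;
    $table{$word}{total}[1] += $books;
}
close($cs);

# One comma-separated group of colliding words per line.
open(my $rw, '<', 'word_pairs.csv') or die "Cannot open word_pairs.csv: $!";
while (<$rw>) {
    chomp;
    my @totaltotals = (0, 0);
    my @texttots;
    foreach my $word (split ',') {
        $word =~ s/'s$//;
        # Words with no 1-gram rows count as zero.
        my @totals = @{ $table{$word}{total} // [0, 0] };
        $totaltotals[0] += $totals[0];
        $totaltotals[1] += $totals[1];
        push(@texttots, sprintf('%s (%d)', $word, $totals[0]));
    }
    # Combined match count for the group, then each word with its own count.
    printf "%d %s\n", $totaltotals[0], join(',', @texttots);
}
close($rw);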
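
As a quick cross-check of the per-word numbers that total.pl prints in parentheses, the match counts can also be summed directly from the sliced file. This is a sketch only: the function name sum_matches is hypothetical, and it assumes collision_stats.tsv holds tab-separated rows with the word in column 1 and the match count in column 3, which is the order total.pl reads them in.

```shell
# Hypothetical cross-check (not in the original gist): sum the match count
# (column 3) for each word (column 1) across all years.
# $1: tab-separated file such as collision_stats.tsv
sum_matches() {
    awk -F '\t' '
        { total[$1] += $3 }
        END { for (w in total) printf "%d\t%s\n", total[w], w }
    ' "$1" | sort -k2    # order by word so the output is stable
}
```

It prints one line per word rather than one per pair, so the per-pair totals still come from total.pl.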
