@cincodenada
Created February 14, 2013 07:30
Search Google Ngrams for the total occurrences of pairs of words, in this case CRC32 collisions found in dictionary files. The input is a CSV of word pairs, one pair per line. Run the shell script in a directory that contains the gzipped 1-gram files; it slices out only the rows for the needed words. The Perl script then totals the counts for each pair.
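To make the slicing step concrete: the whole word list is reduced to one anchored alternation, which grep then applies to every 1-gram row. The sketch below builds such a pattern with `sed` and `tr` rather than inline Perl. The file `sample_pairs.csv` and the words in it are hypothetical and exist only for illustration.

```shell
# Hypothetical pairs file (not from the gist): one comma-separated pair per line.
printf "alpha,beta\ngamma's,delta\n" > sample_pairs.csv

# Strip possessive 's, turn commas and newlines into |, and drop the trailing |.
pattern=$(sed "s/'s//g" sample_pairs.csv | tr ',\n' '||' | sed 's/|$//')

# Anchor at the start of the line and require a tab right after the word,
# so "alpha" does not also match "alphabet".
pattern="^(${pattern})\\t"

printf '%s\n' "$pattern"
# prints: ^(alpha|beta|gamma|delta)\t
```

The trailing `\t` is what limits matches to whole words, because each 1-gram row starts with the word followed by a tab.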
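# Keep only the 1-gram rows whose word appears in the pairs file (possessive 's stripped).
pattern=$(perl -e 'while(<>) { s/\x27s//g; chomp; push @words, split ","; } printf "^(%s)\\t", join "|", @words;' reddit_words.csv)
gunzip -c googlebooks-eng-all-1gram-*-[a-z].gz | pv | grep -P "$pattern" > collision_stats.tsv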
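use strict;
use warnings;

# Per-word counts from the sliced 1-gram rows: $table{word}{year} = [matches, books],
# plus a running $table{word}{total} across all years.
my %table;
open(my $cs, '<', 'collision_stats.tsv') or die "Cannot open collision_stats.tsv: $!";
while (<$cs>) {
    my ($word, $year, $total, $books) = split;
    $word =~ s/'s$//;
    $table{$word}{$year} = [$total, $books];
    $table{$word}{'total'}[0] += $total;
    $table{$word}{'total'}[1] += $books;
}
close($cs);

# One comma-separated pair per line. This is presumably the same pairs file
# that the shell step reads as reddit_words.csv.
open(my $rw, '<', 'word_pairs.csv') or die "Cannot open word_pairs.csv: $!";
while (<$rw>) {
    chomp;
    my @totaltotals = (0, 0);
    my @texttots;
    foreach my $word (split ',') {
        $word =~ s/'s$//;
        # Words with no 1-gram rows count as zero.
        my @totals = @{ $table{$word}{'total'} || [0, 0] };
        $totaltotals[0] += $totals[0];
        $totaltotals[1] += $totals[1];
        push(@texttots, sprintf('%s (%d)', $word, $totals[0]));
    }
    # Combined match count for the pair, then each word with its own count.
    printf "%d %s\n", $totaltotals[0], join(',', @texttots);
}
close($rw);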
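As a cross-check on the per-word totals, the same sum can be reproduced with `awk`. This is a minimal sketch, assuming the four-column layout the Perl script splits on (word, year, match count, book count). The file `sample_stats.tsv` and its rows are hypothetical, not real Ngram data.

```shell
# Hypothetical rows in the layout the Perl script reads: word, year, matches, books.
printf 'alpha\t1999\t10\t3\nalpha\t2000\t5\t2\nbeta\t2000\t7\t1\n' > sample_stats.tsv

# Sum the match counts (column 3) per word across all years.
# awk's "for (w in total)" order is unspecified, so sort the result.
awk -F'\t' '{ total[$1] += $3 } END { for (w in total) printf "%s\t%d\n", w, total[w] }' \
    sample_stats.tsv | sort > sample_totals.tsv

cat sample_totals.tsv
```

Running the same command on `collision_stats.tsv` should give numbers that match the per-word counts the Perl script prints in parentheses.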