Created August 23, 2015 08:54
English bigram and letter pair frequencies from the Google Corpus Data in JSON format

"English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU" by Peter Norvig is an analysis of English letter frequencies based on the Google Corpus Data. Among other things, it lists the frequency of every bigram.

This gist contains a program that extracts those bigram frequencies into an easily usable JSON format.

It also contains the result of running that program (bigrams.json), as well as a version of it where the order of the letters of a bigram is not taken into account (pairs.json). The two JSON files were generated from a copy of the above article retrieved 2015-08-23.
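
To make the difference between the two files concrete, here is a minimal, dependency-free sketch of an order-insensitive merge. The `toPairs` name and the sample counts are made up for illustration, and the descending sort is an assumption about ordering. The gist's own script relies on text-frequencies-analysis for sorting instead.

```javascript
// Sketch only: merges [bigram, count] tuples whose letters match in any order.
// `toPairs` and the sample counts below are illustrative, not real corpus data.
function toPairs(bigrams) {
  var totals = Object.create(null)
  bigrams.forEach(function(tuple) {
    // "th" and "ht" both normalize to the key "ht".
    var key = tuple[0].split("").sort().join("")
    totals[key] = (totals[key] || 0) + tuple[1]
  })
  // Assumed ordering: highest count first.
  return Object.keys(totals)
    .map(function(key) { return [key, totals[key]] })
    .sort(function(a, b) { return b[1] - a[1] })
}

var sample = [["th", 5], ["ht", 1], ["he", 4], ["ee", 2]]
console.log(toPairs(sample))
```

In this sample, "th" and "ht" collapse into a single "ht" entry whose count is the sum of both.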

To regenerate the JSON files:

$ curl >article.html
$ npm install
$ node extract <article.html >bigrams.json
$ node bigrams-to-pairs <bigrams.json >pairs.json
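
Once generated, the files can be read with `JSON.parse`. The sketch below assumes the data is an array of `[bigram, count]` tuples, which is what the pair conversion script expects as input. `toLookup` is a hypothetical helper, and the counts are placeholders rather than values from the real files.

```javascript
// Sketch only: assumes the data is an array of [bigram, count] tuples.
// `toLookup` is a hypothetical helper, and the counts are placeholders.
function toLookup(tuples) {
  var lookup = Object.create(null)
  var total = 0
  tuples.forEach(function(tuple) { total += tuple[1] })
  // Store each bigram's share of the total count.
  tuples.forEach(function(tuple) {
    lookup[tuple[0]] = tuple[1] / total
  })
  return lookup
}

// In practice, something like:
// var bigrams = JSON.parse(require("fs").readFileSync("bigrams.json", "utf8"))
var bigrams = [["th", 6], ["he", 3], ["in", 1]]
var lookup = toLookup(bigrams)
console.log(lookup.th) // relative frequency of "th" within the sample
```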

All of the files are in the public domain.

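// By Simon Lydell 2015.
// This file is in the public domain.

var stdin = require("get-stdin")
var tools = require("text-frequencies-analysis")
var helpers = require("text-frequencies-analysis/lib/helpers")

stdin(function(text) {
  // Assumed: read the bigram tuples as JSON from stdin and write the
  // pair tuples as JSON to stdout.
  process.stdout.write(JSON.stringify(convert(JSON.parse(text))))
})

// Merge bigrams made of the same two letters regardless of order
// (for example "th" and "ht") by summing their frequencies.
function convert(bigrams) {
  var pairMap = Object.create(null)
  bigrams.forEach(function(tuple) {
    var bigram = tuple[0]
    var frequency = tuple[1]
    var pair = bigram.split("").sort().join("")
    if (pair in pairMap) {
      pairMap[pair] += frequency
    } else {
      pairMap[pair] = frequency
    }
  })
  return tools.sortTuples(helpers.objectToArray(pairMap))
}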
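
// By Simon Lydell 2015.
// This file is in the public domain.

var cheerio = require("cheerio")
var stdin = require("get-stdin")
var tools = require("text-frequencies-analysis")

stdin(function(text) {
  // Assumed: read the article HTML from stdin and write the bigram
  // tuples as JSON to stdout.
  process.stdout.write(JSON.stringify(extract(text)))
})

// Collect a [bigram, frequency] tuple from every cell of the first table.
function extract(text) {
  var $ = cheerio.load(text)
  var bigrams = []
  $('table').first().find('td').each(function(index, element) {
    var $cell = $(element)
    bigrams.push([$cell.text().trim().toLowerCase(), parse($cell.attr('title'))])
  })
  return tools.sortTuples(bigrams)
}

// The count is the third whitespace-separated token of the title attribute.
function parse(title) {
  return Number(title.split(/\s+/)[2].replace(/,/g, ''))
}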
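
The `parse` helper takes the third whitespace-separated token of a cell's `title` attribute and strips the thousands separators before converting it to a number. Here is a small standalone sketch of that step. The title string is a made-up placeholder showing the shape the code implies, not a value copied from the article, and `parseCount` is a name chosen for this illustration.

```javascript
// Sketch only: same logic as parse() in the extraction script.
// The title string is a hypothetical placeholder, not data from the article.
function parseCount(title) {
  var tokens = title.split(/\s+/) // ["AB:", "1.00%;", "1,234,567"]
  return Number(tokens[2].replace(/,/g, ""))
}

console.log(parseCount("AB: 1.00%; 1,234,567")) // 1234567
```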

{
  "private": true,
  "dependencies": {
    "cheerio": "^0.19.0",
    "get-stdin": "^4.0.1",
    "text-frequencies-analysis": "2.0.0"
  }
}

Does anyone know where to find the frequencies of the bigrams involving non-alphanumeric characters?
