@lautarodragan
Last active October 30, 2022 02:08
Letter Frequency

This mini-project compares letter frequencies in general English text with those in a short extract from Frank Herbert's Dune.

The average letter frequencies can be found in letter-frequencies.json. This file is a JSON object mapping each letter to its frequency as a percentage, and looks like this:

{
  "E": 11.1607,
  "M": 3.0129,
  "A": 8.4966,
  "H": 3.0034,
  "R": 7.5809,
  "G": 2.4705,
  "I": 7.5448,
  "B": 2.072,
  "O": 7.1635,
  ...
}

This data was copied by hand from https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html and parsed with the function in fetchFrequencies.js.

The sample from the novel is the following:

I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.

This can be found in dune.txt.

Running this through textToCharFrequencies.js prints the following to the console:

[
  [ ' ', 60 ], [ 'e', 32 ], [ 't', 27 ],
  [ 'i', 25 ], [ 'a', 18 ], [ 'l', 17 ],
  [ 'n', 16 ], [ 'r', 16 ], [ 'h', 15 ],
  [ 'o', 12 ], [ 's', 11 ], [ '.', 8 ],
  [ 'm', 7 ],  [ 'w', 7 ],  [ 'f', 6 ],
  [ 'g', 5 ],  [ 'd', 4 ],  [ 'p', 4 ],
  [ 'u', 3 ],  [ 'b', 3 ],  [ 'y', 3 ],
  [ '-', 2 ],  [ 'k', 1 ],  [ 'c', 1 ],
  [ 'v', 1 ],  [ '\n', 1 ]
]

Ignoring spaces and punctuation, this roughly matches the frequencies of letters in average English text.
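To make "roughly matches" more concrete, the raw counts can be converted to percentages of letters only and set beside the averages. The following is a minimal sketch, assuming the counts printed above and a handful of values copied from letter-frequencies.json; the helper name `toLetterPercentages` is hypothetical and not part of the gist.

```javascript
// Counts copied from the console output above.
const duneCounts = [
  [' ', 60], ['e', 32], ['t', 27], ['i', 25], ['a', 18], ['l', 17],
  ['n', 16], ['r', 16], ['h', 15], ['o', 12], ['s', 11], ['.', 8],
  ['m', 7], ['w', 7], ['f', 6], ['g', 5], ['d', 4], ['p', 4],
  ['u', 3], ['b', 3], ['y', 3], ['-', 2], ['k', 1], ['c', 1],
  ['v', 1], ['\n', 1],
]

// A few averages copied from letter-frequencies.json (values are percentages).
const averages = { E: 11.1607, T: 6.9509, I: 7.5448, A: 8.4966 }

// Hypothetical helper: drop non-letters, then express each count as a
// percentage of the total number of letters.
function toLetterPercentages(counts) {
  const letters = counts.filter(([char]) => /^[a-z]$/.test(char))
  const total = letters.reduce((sum, [, count]) => sum + count, 0)
  return new Map(
    letters.map(([char, count]) => [char.toUpperCase(), (count / total) * 100])
  )
}

const dunePercentages = toLetterPercentages(duneCounts)
for (const [letter, average] of Object.entries(averages)) {
  console.log(letter, dunePercentages.get(letter).toFixed(2), 'vs', average)
}
```

With the counts above, E accounts for 32 of 234 letters, about 13.7%, against an average of 11.16%.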

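Ordered by frequency, the top nine letters in the extract and in the source are as follows:

Dune   : E T I A L N R H O
Average: E A R I O T N S L

The average row is sorted by value; the keys in letter-frequencies.json are stored in parse order (E, M, A, H, ...), which is not frequency order. E is an exact match, I and N are each off by one position, and eight of the nine letters appear in both lists: H replaces S in the extract. The extract is a very small sample, which probably explains most of the difference in order.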
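The ranking itself can be sketched as follows. Note that the keys in letter-frequencies.json are stored in parse order, so the averages need an explicit sort before their order means anything. The helper name `topLetters` is hypothetical; the data is copied from letter-frequencies.json and from the ten most frequent letters in the console output above.

```javascript
// Averages as stored in letter-frequencies.json (keys in parse order, not sorted).
const letterFrequencies = {
  E: 11.1607, M: 3.0129, A: 8.4966, H: 3.0034, R: 7.5809, G: 2.4705,
  I: 7.5448, B: 2.072, O: 7.1635, F: 1.8121, T: 6.9509, Y: 1.7779,
  N: 6.6544, W: 1.2899, S: 5.7351, K: 1.1016, L: 5.4893, V: 1.0074,
  C: 4.5388, X: 0.2902, U: 3.6308, Z: 0.2722, D: 3.3844, J: 0.1965,
  P: 3.1671, Q: 0.1962,
}

// The ten most frequent letters of the Dune extract with their counts.
const duneLetterCounts = [
  ['E', 32], ['T', 27], ['I', 25], ['A', 18], ['L', 17], ['N', 16],
  ['R', 16], ['H', 15], ['O', 12], ['S', 11],
]

// Hypothetical helper: the `count` most frequent letters of [letter, value] pairs.
function topLetters(pairs, count) {
  return [...pairs]
    .sort((a, b) => b[1] - a[1])
    .slice(0, count)
    .map(([letter]) => letter)
}

const averageTop = topLetters(Object.entries(letterFrequencies), 9)
const duneTop = topLetters(duneLetterCounts, 9)

console.log('Dune   :', duneTop.join(' '))
console.log('Average:', averageTop.join(' '))

// Letters that occupy the same rank in both lists.
const samePosition = duneTop.filter((letter, index) => averageTop[index] === letter)
console.log('Same position:', samePosition.join(' '))
```

N and R tie at 16 occurrences in the extract, so their relative order depends on the sort being stable, which it is in current JavaScript engines.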

/*
Source text was copy-pasted from https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html.
*/
const sourceText = `E 11.1607% 56.88 M 3.0129% 15.36
A 8.4966% 43.31 H 3.0034% 15.31
R 7.5809% 38.64 G 2.4705% 12.59
I 7.5448% 38.45 B 2.0720% 10.56
O 7.1635% 36.51 F 1.8121% 9.24
T 6.9509% 35.43 Y 1.7779% 9.06
N 6.6544% 33.92 W 1.2899% 6.57
S 5.7351% 29.23 K 1.1016% 5.61
L 5.4893% 27.98 V 1.0074% 5.13
C 4.5388% 23.13 X 0.2902% 1.48
U 3.6308% 18.51 Z 0.2722% 1.39
D 3.3844% 17.25 J 0.1965% 1.00
P 3.1671% 16.14 Q 0.1962% (1)`
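// Each source line holds two "letter percentage ratio" triples side by side.
// Splitting on any whitespace handles both tab- and space-separated pastes.
const parseText = (text) => text
  .split('\n')
  .map(line => line.trim().split(/\s+/))
  .map(fields => [fields.slice(0, 3), fields.slice(3)])
  .flat()
  // parseFloat stops at the trailing '%', so '11.1607%' becomes 11.1607.
  .map(([letter, frequency]) => [letter, parseFloat(frequency)])
  .map(([letter, frequency]) => ({ [letter]: frequency }))
  .reduce((acc, el) => ({ ...acc, ...el }), {})

const letterFrequencies = parseText(sourceText)

// Then, write letterFrequencies to ./letter-frequencies.json.
// This is how that file was created.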
{
  "E": 11.1607,
  "M": 3.0129,
  "A": 8.4966,
  "H": 3.0034,
  "R": 7.5809,
  "G": 2.4705,
  "I": 7.5448,
  "B": 2.072,
  "O": 7.1635,
  "F": 1.8121,
  "T": 6.9509,
  "Y": 1.7779,
  "N": 6.6544,
  "W": 1.2899,
  "S": 5.7351,
  "K": 1.1016,
  "L": 5.4893,
  "V": 1.0074,
  "C": 4.5388,
  "X": 0.2902,
  "U": 3.6308,
  "Z": 0.2722,
  "D": 3.3844,
  "J": 0.1965,
  "P": 3.1671,
  "Q": 0.1962
}
// Fetches a text file and counts how often each character occurs (case-insensitive).
// Bytes are mapped to characters one-to-one, which assumes ASCII text such as dune.txt.
async function getCharFrequencies(url) {
  const arrayBuffer = await fetch(url).then(response => response.arrayBuffer())
  const byteArray = new Uint8Array(arrayBuffer)
  const map = new Map()
  for (const byte of byteArray) {
    const char = String.fromCharCode(byte).toLowerCase()
    const count = map.get(char) || 0
    map.set(char, count + 1)
  }
  // Return [char, count] pairs, most frequent first.
  return [...map.entries()].sort((a, b) => b[1] - a[1])
}

getCharFrequencies('https://gist.githubusercontent.com/lautarodragan/5c91aa0fa40e2271df4479681d91c62d/raw/8bbed6e8164781a7e4d9bb2553076322014207c3/dune.txt').then(console.log)
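
The function above maps each byte to one character, which is fine for an ASCII sample like dune.txt but would split multi-byte UTF-8 characters. Below is a minimal sketch of the same counting over an already-decoded string; `countChars` is a hypothetical name and not part of the gist.

```javascript
// Hypothetical variant of getCharFrequencies that takes text instead of a URL.
function countChars(text) {
  const counts = new Map()
  // for...of iterates by code point, so non-ASCII characters stay intact.
  for (const char of text.toLowerCase()) {
    counts.set(char, (counts.get(char) || 0) + 1)
  }
  // Most frequent first, as [char, count] pairs.
  return [...counts.entries()].sort((a, b) => b[1] - a[1])
}

console.log(countChars('I must not fear. Fear is the mind-killer.'))
```

In a fetch-based script, the string could be obtained with `response.text()` instead of `response.arrayBuffer()`.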