Skip to content

Instantly share code, notes, and snippets.

@Kapott
Created January 11, 2021 13:05
Show Gist options
  • Save Kapott/bd1cbdcb9d61920fb843075fe5089d5a to your computer and use it in GitHub Desktop.
Save Kapott/bd1cbdcb9d61920fb843075fe5089d5a to your computer and use it in GitHub Desktop.
Scans text for bigram, trigram...ngram frequency
const Tuple = (k, v) => ({
key: k,
val: v
})
const Ngram = n =>
({
n,
analyze: txt => {
const cleanText = txt.toLowerCase().replace(/[^0-9a-z]/gi,"")
let ngrams = []
for (let i = 0; i <= txt.length - n; i++) {
let thisNgram = cleanText.substring(i, i + n)
const curNgram = ngrams.filter(n => n.key == thisNgram)
if (curNgram.length > 0) {
curNgram.map(n => n.val++)
} else {
if (thisNgram.length > 0) {
newNgram = Tuple(thisNgram, 1)
ngrams.push(newNgram)
}
}
}
return ngrams.sort((a,b) => {
if (a.val < b.val) {
return 1
} else if (a.val > b.val) {
return -1
}
else {
return 0
}
})
}
})
const txt = [
["First", "A Partridge in a Pear Tree."],
["Second", "Two Turtle Doves, and"],
["Third", "Three French Hens,"],
["Fourth","Four Calling Birds,"],
["Fifth","Five Gold Rings,"],
["Sixth","Six Geese-a-Laying,"],
["Seventh", "Seven Swans-a-Swimming,"],
["Eighth","Eight Maids-a-Milking,"],
["Ninth","Nine Ladies Dancing,"],
["Tenth","Ten Lords-a-Leaping,"],
["Eleventh","Eleven Pipers Piping,"],
["Twelfth", "Twelve Drummers Drumming,"],
].join("")
const ns = Ngram(3).analyze(txt).filter(occurrence => occurrence.val > 2)
ns.forEach(n => console.log(`You can save ${(n.val * n.key.length) - n.val} bytes by substituting ${n.key}`))
@Kapott
Copy link
Author

Kapott commented Jan 11, 2021

This is very dirty, and will choke on large texts.
Should rewrite this sometime so it uses generators and promises and can perform in O(log n) time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment