@mekarpeles
Last active February 16, 2020 08:08
Basic word frequency for book fulltext
from collections import defaultdict
import string

STOP_WORDS = {
    'would', 'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once',
    'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its',
    'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who',
    'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your',
    'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our',
    'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at',
    'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves',
    'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he',
    'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after',
    'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how',
    'further', 'was', 'here', 'than', 'new', 'his', 'one', 'two', 'three', 'also', 'like', 'could',
    'many', 'see', 'may', 'ever', 'became', 'far', 'well', 'among', 'things', 'seems', 'much',
    'almost', 'around', 'often',
}

def ngram(tokens, n=2):
    """Return every run of n consecutive tokens as a space-joined string."""
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(gram) for gram in ngrams]

def sanitize(fulltext):
    """Lowercase the text, flatten newlines to spaces, and strip ASCII punctuation."""
    # Python 2 only: str.translate(None, chars) deletes chars from a byte string.
    # Note that '\n-' (newline, then hyphen) is removed, not '-\n'.
    return fulltext.lower().replace('\n-', '').replace('\n', ' ').translate(None, string.punctuation).decode('utf-8')

def sequence(fulltext, n=1):
    """Sequence the genome of this book"""
    freqmap = defaultdict(int)
    # Drop single-character tokens and stop words before counting.
    words = [w.strip() for w in sanitize(fulltext).split(' ') if len(w) > 1 and w not in STOP_WORDS]
    corpus = words if n == 1 else ngram(words, n=n)
    for s in corpus:
        if s.isdigit():
            # All purely numeric tokens share a single bucket.
            freqmap[':number'] += 1
        else:
            freqmap[s] += 1
    # Terms only, most frequent first; the counts themselves are not returned.
    return sorted(freqmap, key=freqmap.get, reverse=True)

def fingerprint(fulltext_filename='glutmasteringinf00wrig_djvu.txt', n=1):
    """Read a full-text file and return its terms (or n-grams) ordered by frequency."""
    with open(fulltext_filename) as book:
        return sequence(book.read(), n=n)
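
The `sanitize` step above relies on Python 2 behaviour: `str.translate(None, ...)` on a byte string, followed by `.decode('utf-8')`. A minimal sketch of the same step for Python 3, assuming the input is already a decoded `str`. The name `sanitize_py3` is hypothetical and not part of the gist.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)

def sanitize_py3(fulltext):
    """Python 3 counterpart of sanitize(); hypothetical name, not from the gist."""
    # Same replacements as the original, in the same order.
    text = fulltext.lower().replace('\n-', '').replace('\n', ' ')
    return text.translate(_STRIP_PUNCTUATION)

print(sanitize_py3("The Web wasn't\nbuilt in a day."))
# the web wasnt built in a day
```

Because the apostrophe is stripped along with other punctuation, "wasn't" becomes "wasnt", which matches bigrams such as "web wasnt" in the author's example output.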
mekarpeles commented Feb 16, 2020

Example

In [1]: from sequencer import fingerprint; ', '.join(fingerprint()[:100])
Out[1]: u':number, library, information, system, books, social, world, book, human, web, libraries, great, systems, first, glut, work, old, classification, age, years, people, time, writing, knowledge, way, even, culture, memory, natural, began, computer, ing, history, press, form, art, written, hierarchies, technology, vision, ancient, texts, oral, kind, power, today, catalog, networks, nelson, hypertext, us, early, tion, popular, documents, wrote, op, bush, animals, science, species, life, cit, categories, op cit, later, literacy, printing, century, thought, de, word, taxonomies, folk, modern, works, personal, far, greek, might, text, public, still, individual, well, mind, process, among, cultural, quoted, ibid, hierarchical, political, things, church, seems, roman, structure, much, rules'
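
The list above is ordered by frequency, but `sequence` returns only the terms, so the counts are lost. If the counts are wanted, one option is `collections.Counter`, sketched below. The helper name `top_terms` and the sample tokens are illustrative assumptions, not part of the gist.

```python
from collections import Counter

def top_terms(tokens, limit=10):
    """Return (term, count) pairs, most frequent first (hypothetical helper)."""
    return Counter(tokens).most_common(limit)

# Toy token list standing in for the sanitized, stop-word-filtered words.
sample = ['library', 'book', 'library', 'web', 'book', 'library']
print(top_terms(sample, limit=2))
# [('library', 3), ('book', 2)]
```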

In [2]: from sequencer import fingerprint; ', '.join(fingerprint(n=2)[:200])
Out[2]: u'op cit, natural world, folk taxonomies, information systems, printing press, art memory, december 2006, web wasnt, oral traditions, epigenetic rules, accessed december, ice age, dark ages, human beings, plants animals, age alphabets, networks hierarchies, world wide, classification systems, wide web, social networks, steam engine, astral power, dark age, human culture, great deal, classification system, power station, written word, university press, information retrieval, ted nelson, social political, oral culture, computer industry, information technology, roman church, middle ages, decimal system, symbolic expression, library congress, illuminating dark, scientific method, tree life, engine mind, van dam, dewey decimal, united states, illuminated manuscripts, recent years, thousands years, industrial library, library catalogs, alex wright, information explosion, nyce kahn, social groups, trees tree, social relationships, family trees, personal correspondence, human cultures, oral literate, electronic media, fine arts, public library, irish scribes, human knowledge, quoted ibid, associative trails, printed book, human brain, personal computer, public libraries, disposition toward, age information, prototype theory, networked information, history york, roman empire, printed books, thousand years, gutenberg revolution, ancient world, human mind, informa tion, monastic scriptoria, yates op, take shape, monastic art, found way, great library, literary machines, nineteenth century, correspondence alex, literate cultures, library catalog, library alexandria, moose roared, allowing users, classifica tion, indi vidual, readers writers, sys tem, years ago, homo sapiens, century bc, george landow, even though, subject headings, larger social, family relationships, paul kahn, bushs vision, human thought, natural selection, history information, word god, kahn op, hu man, schol ars, otlets vision, information science, san francisco, celtic church, lit eracy, great libraries, human social, cave paintings, dream machines, linnaean system, sys tems, 59 61, digital revolution, memories future, works classical, matthew battles, linnaean taxonomy, animal kingdom, social bonds, computer science, encyclopedic revolution, tribal societies, worlds first, come surprise, stephen jay, tens thousands, spoken word, intellectual capital, stock exchanges, card catalog, walled garden, computer scientists, com mercial, social organization, old hierarchies, word image, ideal forms, british museum, vannevar bush, francis bacon, hierarchical systems, classical texts, book production, writing emerged, roman libraries, cultural change, took shape, graphical user, melvil dewey, collection books, com puter, web browser, ancient greece, infor mation, unquiet history, every book, popular books, clay tablets, near east, rhodri lewis, political power, todays web, ulti mately, life forms, words images, battles library, manuscripts 8086, read write, mythic thought, first library, make way, royal library, institutional hierarchies, literate culture, durkheim mauss, colon classification, eighteenth century, greek civilization, digital age'
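
Several bigrams above, such as "informa tion", "sys tem" and "com puter", look like words that were hyphenated across a line break in the source text. That reading is an inference from the output, not something the gist states. `sanitize` removes the sequence `'\n-'`, whereas a line-end hyphen would appear as `'-\n'`, so such words end up split by a space. A minimal sketch of a pre-processing step that rejoins them; `dehyphenate` is a hypothetical helper, not part of the gist.

```python
import re

def dehyphenate(fulltext):
    """Rejoin words split by a hyphen at the end of a line (hypothetical helper)."""
    # Remove the hyphen, the newline, and any leading whitespace on the next line.
    return re.sub(r'-\n\s*', '', fulltext)

print(dehyphenate("informa-\ntion systems"))
# information systems
```

A trade-off of this approach: a genuine hyphenated compound that happens to break at a line end (for example "well-\nknown") is also joined into one word.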
