Last active June 12, 2019 09:19
Given a text_block, return the n most frequent words excluding stop_words, punctuation, and optionally numbers.
import nltk
nltk.download('stopwords')
from nltk import ngrams
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
# Set membership is faster than scanning the entire list; the stop words are lowercase, so the input must be lowercased too.
stop_words = set(stopwords.words("english"))
# Faster than nltk's FreqDist.
from collections import Counter

def find_most_common_ngram(text_block, exclude_numeric=False, n=1):
    """
    Given a text_block, return a Counter of n-gram frequencies excluding stop_words.
    To find most frequent words, set n = 1 (default).
    Call .most_common(k) on the result to get the top k entries.
    """
    # Lowercase all characters in the input.
    text_block = text_block.lower()
    # Create a regex that keeps contractions intact and optionally drops numbers.
    if exclude_numeric:
        tokenizer = RegexpTokenizer(r"[A-Za-z']+")
    else:
        tokenizer = RegexpTokenizer(r"[\w']+")
    # Split the text into tokens.
    text_token = tokenizer.tokenize(text_block)
    # Remove tokens which are stop_words.
    text_token = [w for w in text_token if w not in stop_words]
    # Count how often each n-gram (a tuple of n tokens) occurs.
    text_count = Counter(ngrams(text_token, n))
    # Return the counter.
    return text_count
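
For anyone who wants to try the idea without installing NLTK, here is a minimal stdlib-only sketch of the same pipeline. The names count_ngrams and DEMO_STOP_WORDS are hypothetical, and the tiny stop-word set only stands in for NLTK's English list, so counts will differ from find_most_common_ngram on real data.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
DEMO_STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "for", "in"}

def count_ngrams(text_block, n=1, exclude_numeric=False, stop_words=DEMO_STOP_WORDS):
    """Stdlib-only approximation of find_most_common_ngram (an illustration only)."""
    # Same two patterns as the gist: letters and apostrophes, or any word character.
    pattern = r"[A-Za-z']+" if exclude_numeric else r"[\w']+"
    tokens = [t for t in re.findall(pattern, text_block.lower()) if t not in stop_words]
    # Zipping n staggered views of the token list yields the sliding n-grams.
    return Counter(zip(*(tokens[i:] for i in range(n))))

sample = "Show HN: A tool for the web. Show HN: The state of open source in 2019."
print(count_ngrams(sample).most_common(2))
print(count_ngrams(sample, n=2).most_common(1))
```

Keys are tuples even for n = 1, so a single word is looked up as ("show",) rather than "show".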

Running this script on Hacker News' most popular stories, we find:

[('hn', 10348),
 ('show', 5499),
 ('google', 5379),
 ('new', 5155),
 ('ask', 4489),
 ('1', 3346),
 ('open', 2939),
 ('data', 2862),
 ('web', 2856),
 ('2', 2443),
 ('0', 2387),
 ('code', 2312),
 ('facebook', 2278),
 ('video', 2277),
 ('programming', 2208),
 ('pdf', 2200),
 ('using', 2180),
 ('c', 2164),
 ('source', 2140),
 ('apple', 2136)]

The top two entries make sense because "Show HN: ..." is a common title prefix for posted stories. A quick browse of the titles containing numbers indicates that the numeric tokens are mostly noise, e.g. SHA-1, 2015, 14 year old, Machine Learning 101, $13.1B, so they are safe to ignore.
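
To see where the numeric tokens come from, here is a small sketch that applies the two tokenizer patterns with re.findall instead of RegexpTokenizer. The tokenize helper is hypothetical, and the sample titles are made up around the examples above.

```python
import re

def tokenize(title, exclude_numeric=False):
    """Apply the same character-class patterns the gist passes to RegexpTokenizer."""
    pattern = r"[A-Za-z']+" if exclude_numeric else r"[\w']+"
    return re.findall(pattern, title.lower())

# Made-up titles built around the noisy examples mentioned above.
titles = ["SHA-1 collision found", "Machine Learning 101", "$13.1B raised"]

for title in titles:
    print(title)
    print("  with numbers:", tokenize(title))
    print("  letters only:", tokenize(title, exclude_numeric=True))
```

Note that the letters-only pattern does not drop a mixed token such as "1b" entirely; it keeps the letter fragment "b". Fragments like this are one plausible reason the counts for very short tokens shift between the two runs.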

With the exclude_numeric flag set, we find:

[('hn', 10349),
 ('show', 5499),
 ('google', 5379),
 ('new', 5155),
 ('ask', 4489),
 ('open', 2942),
 ('web', 2862),
 ('data', 2862),
 ('c', 2351),
 ('code', 2319),
 ('facebook', 2278),
 ('video', 2277),
 ('programming', 2208),
 ('pdf', 2202),
 ('using', 2180),
 ('source', 2141),
 ('apple', 2137),
 ('time', 2078),
 ('python', 2018),
 ('app', 1991)]

If you want to replicate the results, download all the stories posted to HN with more than 50 upvotes. You can do this via BigQuery.
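
A rough sketch of how the download could look from Python. Everything specific here is an assumption, since the post does not say which dataset or client was used: the public table `bigquery-public-data.hacker_news.full` with columns title, score, and type, the google-cloud-bigquery package, and the hypothetical helper names build_query and fetch_titles.

```python
def build_query(min_score=50):
    """Build SQL selecting story titles with more than min_score points.

    Assumption: the public table `bigquery-public-data.hacker_news.full`
    with columns `title`, `score`, and `type`. The original post does not
    name the dataset it used.
    """
    return (
        "SELECT title, score "
        "FROM `bigquery-public-data.hacker_news.full` "
        "WHERE type = 'story' AND title IS NOT NULL "
        f"AND score > {int(min_score)}"
    )

def fetch_titles(min_score=50):
    """Run the query. Needs google-cloud-bigquery and GCP credentials."""
    # Imported lazily so build_query works without the optional dependency.
    from google.cloud import bigquery
    client = bigquery.Client()
    rows = client.query(build_query(min_score)).result()
    return [row["title"] for row in rows]

print(build_query())
```

The titles returned by fetch_titles can then be joined with spaces and passed to find_most_common_ngram as a single text_block.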


Updated the gist to support counting n-grams as well. Results for n=2 and n=3, respectively.

[(('show', 'hn'), 5235),
 (('ask', 'hn'), 4154),
 (('open', 'source'), 1687),
 (('machine', 'learning'), 612),
 (('silicon', 'valley'), 531),
 (('node', 'js'), 522),
 (('hacker', 'news'), 496),
 (('programming', 'language'), 494),
 (('yc', 'w'), 476),
 (('deep', 'learning'), 456),
 (('year', 'old'), 420),
 (('real', 'time'), 358),
 (('os', 'x'), 328),
 (('new', 'york'), 325),
 (('raspberry', 'pi'), 304),
 (('source', 'code'), 266),
 (('app', 'store'), 258),
 (('steve', 'jobs'), 252),
 (('san', 'francisco'), 247),
 (('tell', 'hn'), 240)]

[(('ask', 'hn', 'best'), 189),
 (('ask', 'hn', "what's"), 125),
 (('ask', 'hn', 'hiring'), 116),
 (('mac', 'os', 'x'), 95),
 (('new', 'york', 'times'), 88),
 (('ask', 'hn', 'anyone'), 88),
 (('self', 'driving', 'cars'), 84),
 (('pdf', 'show', 'hn'), 82),
 (('ask', 'hn', 'freelancer'), 79),
 (('hn', 'freelancer', 'seeking'), 76),
 (('freelancer', 'seeking', 'freelancer'), 70),
 (('ask', 'hn', 'good'), 69),
 (('self', 'driving', 'car'), 64),
 (('ask', 'hn', 'get'), 62),
 (('new', 'york', 'city'), 61),
 (('google', 'app', 'engine'), 59),
 (('open', 'source', 'software'), 58),
 (('show', 'hn', 'simple'), 57),
 (('show', 'hn', 'built'), 53),
 (('show', 'hn', 'made'), 51)]
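
Entries such as ('pdf', 'show', 'hn') suggest that some n-grams span two titles, which would happen if all titles are joined into one text_block before counting. That is an inference from the output, not something confirmed above. A minimal sketch of counting n-grams per title instead, with a hypothetical helper ngrams_per_title and made-up token lists:

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Yield consecutive n-token tuples from a single token list."""
    return zip(*(tokens[i:] for i in range(n)))

def ngrams_per_title(token_lists, n):
    """Count n-grams title by title so no n-gram crosses a title boundary."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(sliding_ngrams(tokens, n))
    return counts

# Made-up, already tokenized titles with stop words removed.
titles = [["deep", "learning", "paper", "pdf"],
          ["show", "hn", "simple", "parser"]]

# Joining every title into one token stream lets n-grams straddle titles.
joined = [tok for title in titles for tok in title]
across = Counter(sliding_ngrams(joined, 3))
within = ngrams_per_title(titles, 3)

print(("pdf", "show", "hn") in across)
print(("pdf", "show", "hn") in within)
```

Counting per title removes the boundary artifacts, at the cost of one tokenizer pass per title. Stop-word removal can still make non-adjacent words look adjacent within a title.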

An analysis of comments may be interesting as a follow-up.
