@BrianHung
Last active June 12, 2019 09:19
Given a text_block, return the most frequent n-grams, excluding stop_words, punctuation, and optionally numbers.
import nltk
nltk.download('stopwords')

from nltk import ngrams
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Set lookup is faster than scanning the entire list; input has to be lowercased to match.
stop_words = set(stopwords.words("english"))

# Faster than nltk's FreqDist.
from collections import Counter

def find_most_common_ngram(text_block, exclude_numeric=False, n=1):
    """
    Given a text_block, return a Counter of the most frequent n-grams excluding stop_words.
    To find the most frequent words, set n=1 (default).
    """
    # Lowercase all characters in the input.
    text_block = text_block.lower()
    # Build a regex tokenizer that keeps contractions and optionally excludes numbers.
    if exclude_numeric:
        tokenizer = RegexpTokenizer(r"[A-Za-z']+")
    else:
        tokenizer = RegexpTokenizer(r"[\w']+")
    # Split the text into tokens.
    text_token = tokenizer.tokenize(text_block)
    # Remove tokens which are stop words.
    text_token = [w for w in text_token if w not in stop_words]
    # Count occurrences of each n-gram.
    text_count = Counter(ngrams(text_token, n))
    # Return the counter.
    return text_count
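
For example, a minimal usage sketch (the sample string below is made up for illustration; note that ngrams yields tuples, so keys are 1-tuples even when n=1):

counts = find_most_common_ngram("Show HN: a new open source tool for open data", n=1)
print(counts.most_common(3))
# [(('open',), 2), (('show',), 1), (('hn',), 1)]
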
@BrianHung (Author)

Running this script on Hacker News' most popular stories, we find:

[('hn', 10348),
 ('show', 5499),
 ('google', 5379),
 ('new', 5155),
 ('ask', 4489),
 ('1', 3346),
 ('open', 2939),
 ('data', 2862),
 ('web', 2856),
 ('2', 2443),
 ('0', 2387),
 ('code', 2312),
 ('facebook', 2278),
 ('video', 2277),
 ('programming', 2208),
 ('pdf', 2200),
 ('using', 2180),
 ('c', 2164),
 ('source', 2140),
 ('apple', 2136)]

The top two entries make sense because Show HN: ... is a common title format for posted stories. A quick browse of the titles containing numbers indicates that they are mostly noise, e.g. SHA-1, 2015, 14 year old, Machine Learning 101, $13.1B, etc., which are safe to ignore.
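
To see the effect concretely, here is what the two tokenizer patterns from the script do to such a title (the title string is contrived from the examples above):

from nltk.tokenize import RegexpTokenizer

title = "machine learning 101 and the $13.1b sha-1 break"
RegexpTokenizer(r"[\w']+").tokenize(title)
# ['machine', 'learning', '101', 'and', 'the', '13', '1b', 'sha', '1', 'break']
RegexpTokenizer(r"[A-Za-z']+").tokenize(title)
# ['machine', 'learning', 'and', 'the', 'b', 'sha', 'break']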

Turning the exclude_numeric flag on, we find:

[('hn', 10349),
 ('show', 5499),
 ('google', 5379),
 ('new', 5155),
 ('ask', 4489),
 ('open', 2942),
 ('web', 2862),
 ('data', 2862),
 ('c', 2351),
 ('code', 2319),
 ('facebook', 2278),
 ('video', 2277),
 ('programming', 2208),
 ('pdf', 2202),
 ('using', 2180),
 ('source', 2141),
 ('apple', 2137),
 ('time', 2078),
 ('python', 2018),
 ('app', 1991)]

If you want to replicate the results, download all the stories posted to HN with more than 50 upvotes. You can do this via BigQuery.
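
For instance, a minimal sketch using the google-cloud-bigquery client (the table name bigquery-public-data.hacker_news.full and its type/score/title columns are assumptions about the public dataset; credentials must already be configured):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT title
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'story' AND score > 50 AND title IS NOT NULL
"""
# Collect all matching titles into one block of text for the script above.
titles = [row.title for row in client.query(query).result()]
text_block = "\n".join(titles)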

@BrianHung (Author)

Updated the gist to support counting n-grams as well. Results for n=2 and n=3, respectively; the calls used are sketched after the lists.

[(('show', 'hn'), 5235),
 (('ask', 'hn'), 4154),
 (('open', 'source'), 1687),
 (('machine', 'learning'), 612),
 (('silicon', 'valley'), 531),
 (('node', 'js'), 522),
 (('hacker', 'news'), 496),
 (('programming', 'language'), 494),
 (('yc', 'w'), 476),
 (('deep', 'learning'), 456),
 (('year', 'old'), 420),
 (('real', 'time'), 358),
 (('os', 'x'), 328),
 (('new', 'york'), 325),
 (('raspberry', 'pi'), 304),
 (('source', 'code'), 266),
 (('app', 'store'), 258),
 (('steve', 'jobs'), 252),
 (('san', 'francisco'), 247),
 (('tell', 'hn'), 240)]
[(('ask', 'hn', 'best'), 189),
 (('ask', 'hn', "what's"), 125),
 (('ask', 'hn', 'hiring'), 116),
 (('mac', 'os', 'x'), 95),
 (('new', 'york', 'times'), 88),
 (('ask', 'hn', 'anyone'), 88),
 (('self', 'driving', 'cars'), 84),
 (('pdf', 'show', 'hn'), 82),
 (('ask', 'hn', 'freelancer'), 79),
 (('hn', 'freelancer', 'seeking'), 76),
 (('freelancer', 'seeking', 'freelancer'), 70),
 (('ask', 'hn', 'good'), 69),
 (('self', 'driving', 'car'), 64),
 (('ask', 'hn', 'get'), 62),
 (('new', 'york', 'city'), 61),
 (('google', 'app', 'engine'), 59),
 (('open', 'source', 'software'), 58),
 (('show', 'hn', 'simple'), 57),
 (('show', 'hn', 'built'), 53),
 (('show', 'hn', 'made'), 51)]
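
The two lists above come from calls along these lines (a sketch, with text_block built from the titles as above):

find_most_common_ngram(text_block, n=2).most_common(20)
find_most_common_ngram(text_block, n=3).most_common(20)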

An analysis of comments may be interesting as a follow-up.
