BrianHung/93a23bbdfe3315228cb37a0412a1c0b6
Last active June 12, 2019 09:19
Given a text_block, return the n most frequent words excluding stop_words, punctuation, and optionally numbers.
import nltk
nltk.download('stopwords')

from nltk import ngrams
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Membership tests on a set are faster than on a list; input must be lowercased to match.
stop_words = set(stopwords.words("english"))

# collections.Counter is faster than nltk's FreqDist.
from collections import Counter

def find_most_common_ngram(text_block, exclude_numeric=False, n=1):
    """
    Given a text_block, return a Counter of the most frequent n-grams excluding stop_words.
    To find the most frequent words, set n = 1 (default).
    """
    # Lowercase all characters in the input.
    text_block = text_block.lower()
    # The regex keeps contractions intact and optionally drops numbers.
    if exclude_numeric:
        tokenizer = RegexpTokenizer(r"[A-Za-z']+")
    else:
        tokenizer = RegexpTokenizer(r"[\w']+")
    # Split the text into tokens.
    text_token = tokenizer.tokenize(text_block)
    # Remove tokens which are stop words.
    text_token = [w for w in text_token if w not in stop_words]
    # Count occurrences of each distinct n-gram and return the Counter.
    return Counter(ngrams(text_token, n))
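For reference, the same pipeline can be approximated without nltk using only the standard library. This is a hedged sketch, not the gist's implementation: the tiny STOP_WORDS set below is an illustrative stand-in for nltk's full English stop-word list, and `zip` over shifted slices replaces `nltk.ngrams`.

```python
import re
from collections import Counter

# Illustrative stand-in for nltk's English stop-word list (not the real list).
STOP_WORDS = {"a", "an", "the", "is", "i", "to", "of", "and"}

def most_common_ngrams(text, n=1, exclude_numeric=False):
    """Count n-grams after lowercasing, tokenizing, and dropping stop words."""
    # Same regexes as the gist: keep contractions; optionally drop digits.
    pattern = r"[A-Za-z']+" if exclude_numeric else r"[\w']+"
    tokens = [t for t in re.findall(pattern, text.lower()) if t not in STOP_WORDS]
    # Slide a window of size n over the token list to form n-gram tuples.
    return Counter(zip(*(tokens[i:] for i in range(n))))

counts = most_common_ngrams("Show HN: I made a thing. Show HN: the other thing.", n=2)
print(counts.most_common(1))  # → [(('show', 'hn'), 2)]
```

Like the gist, this returns tuples even for n=1, so the Counter's keys have a uniform shape regardless of n.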
Updated the gist to support counting n-grams as well. Results for n=2 and n=3, respectively.
[(('show', 'hn'), 5235),
(('ask', 'hn'), 4154),
(('open', 'source'), 1687),
(('machine', 'learning'), 612),
(('silicon', 'valley'), 531),
(('node', 'js'), 522),
(('hacker', 'news'), 496),
(('programming', 'language'), 494),
(('yc', 'w'), 476),
(('deep', 'learning'), 456),
(('year', 'old'), 420),
(('real', 'time'), 358),
(('os', 'x'), 328),
(('new', 'york'), 325),
(('raspberry', 'pi'), 304),
(('source', 'code'), 266),
(('app', 'store'), 258),
(('steve', 'jobs'), 252),
(('san', 'francisco'), 247),
(('tell', 'hn'), 240)]
[(('ask', 'hn', 'best'), 189),
(('ask', 'hn', "what's"), 125),
(('ask', 'hn', 'hiring'), 116),
(('mac', 'os', 'x'), 95),
(('new', 'york', 'times'), 88),
(('ask', 'hn', 'anyone'), 88),
(('self', 'driving', 'cars'), 84),
(('pdf', 'show', 'hn'), 82),
(('ask', 'hn', 'freelancer'), 79),
(('hn', 'freelancer', 'seeking'), 76),
(('freelancer', 'seeking', 'freelancer'), 70),
(('ask', 'hn', 'good'), 69),
(('self', 'driving', 'car'), 64),
(('ask', 'hn', 'get'), 62),
(('new', 'york', 'city'), 61),
(('google', 'app', 'engine'), 59),
(('open', 'source', 'software'), 58),
(('show', 'hn', 'simple'), 57),
(('show', 'hn', 'built'), 53),
(('show', 'hn', 'made'), 51)]
An analysis of comments may be interesting as a followup.
Running this script on Hacker News's most popular stories, the top two entries make sense: "Show HN: ..." and "Ask HN: ..." are common title formats for posted stories. A quick browse of the titles containing numbers indicates that they are mostly noise, e.g. SHA-1, 2015, 14 year old, Machine Learning 101, $13.1B, etc., which are safe to ignore; setting the exclude_numeric flag filters them out. If you want to replicate the results, download all the stories posted to HN with more than 50 upvotes. You can do this via BigQuery.
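A sketch of what that BigQuery pull might look like. This is an assumption, not the author's query: the table and column names (`bigquery-public-data.hacker_news.full`, `title`, `type`, `score`) come from Google's public Hacker News dataset and should be checked against the current schema.

```python
# Query string for the public Hacker News dataset on BigQuery:
# story titles with more than 50 upvotes.
QUERY = """
SELECT title
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story' AND score > 50
"""

# With the google-cloud-bigquery client installed and credentials configured,
# the titles could then be fetched and fed to find_most_common_ngram, e.g.:
#
#   from google.cloud import bigquery
#   titles = [row.title for row in bigquery.Client().query(QUERY)]
```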