Last active
August 31, 2020 12:18
Gist: lizadaly/7071e0de589883a197433951bc7314c5
Quick bigram example in Python/NLTK
import nltk
from nltk.corpus import stopwords
from collections import Counter
# Requires the NLTK 'stopwords' and 'gutenberg' data packages (see nltk.download)
word_list = []
# Set up a quick lookup table for common words like "the" and "an" so they can be excluded
stops = set(stopwords.words('english'))
# For all 18 texts in the public domain book corpus, extract all their words
for f in nltk.corpus.gutenberg.fileids():
    word_list.extend(nltk.corpus.gutenberg.words(f))
# Filter out words that have punctuation and make everything lower-case
cleaned_words = [w.lower() for w in word_list if w.isalnum()]
# Ask NLTK to generate a list of bigrams for the word "sun", excluding
# those words which are too common to be interesting
sun_bigrams = [b for b in nltk.bigrams(cleaned_words) if (b[0] == 'sun' or b[1] == 'sun')
               and b[0] not in stops and b[1] not in stops]
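The snippet imports Counter but stops after building the list of bigrams. A minimal sketch of how the pairs might then be tallied is shown here. It uses only the standard library so it runs without the NLTK data; `neighbor_counts`, `sample_words`, and `sample_stops` are hypothetical names and toy data, not part of the gist.

```python
from collections import Counter


def neighbor_counts(words, target, stops=frozenset()):
    """Count bigrams that contain `target`, skipping pairs with a stop word."""
    # zip(words, words[1:]) pairs each word with its successor,
    # the same adjacent pairs that nltk.bigrams would yield.
    bigrams = zip(words, words[1:])
    return Counter(
        b for b in bigrams
        if target in b and b[0] not in stops and b[1] not in stops
    )


# Hypothetical toy input standing in for cleaned_words and stops
sample_words = "the sun rose and the sun set as the morning sun rose".split()
sample_stops = {"the", "and", "as"}

sun_counts = neighbor_counts(sample_words, "sun", sample_stops)
for bigram, count in sun_counts.most_common():
    print(bigram, count)
```

With the gist's own data, `Counter(sun_bigrams).most_common(10)` should give the ten most frequent pairs in the same way.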
Hi,
I am quite new to language processing and am stuck on the bigram counting process.
I have non-financial disclosures from 110 companies over 6 years (660 reports in total).
I have already preprocessed my files and counted negative and positive words based on the LM dictionary (2011).
I want to calculate the frequency of bigrams as well, e.g. "2 years", "upcoming period", etc.
Is my process right?
Are there any available code examples for this kind of process?
Thank you