@lizadaly
Last active August 31, 2020 12:18
Quick bigram example in Python/NLTK
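# Requires the NLTK data packages 'stopwords' and 'gutenberg',
# e.g. nltk.download('stopwords') and nltk.download('gutenberg')
import nltk
from nltk.corpus import stopwords
from collections import Counter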
word_list = []
# Set up a quick lookup table for common words like "the" and "an" so they can be excluded
stops = set(stopwords.words('english'))
# For all 18 texts in NLTK's public domain Gutenberg corpus, extract all their words
for f in nltk.corpus.gutenberg.fileids():
    word_list.extend(nltk.corpus.gutenberg.words(f))
# Filter out words that have punctuation and make everything lower-case
cleaned_words = [w.lower() for w in word_list if w.isalnum()]
# Ask NLTK to generate a list of bigrams containing the word "sun", excluding
# those that include words too common to be interesting
sun_bigrams = [b for b in nltk.bigrams(cleaned_words)
               if (b[0] == 'sun' or b[1] == 'sun')
               and b[0] not in stops and b[1] not in stops]
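
The snippet imports Counter but stops once the list of bigrams is built. A plausible next step, which is not part of the original gist, is to tally how often each pair occurs. The sketch below uses a small hypothetical list (sample_bigrams) in place of sun_bigrams so that it runs without downloading any corpus.

```python
from collections import Counter

# Hypothetical stand-in for the sun_bigrams list built by the gist
sample_bigrams = [
    ("sun", "shone"),
    ("rising", "sun"),
    ("sun", "shone"),
    ("setting", "sun"),
]

# Counter tallies how often each (word, word) tuple occurs
bigram_counts = Counter(sample_bigrams)

# most_common(n) returns the n most frequent pairs with their counts
top_pairs = bigram_counts.most_common(2)
print(top_pairs)
```

With the real data, passing sun_bigrams to Counter in the same way would give the most frequent neighbours of "sun" across the corpus.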
@RajeshSharmaKU

Hi,

I am quite new to language processing and am stuck on the bigram counting process.

I have non-financial disclosures from 110 companies over 6 years (660 reports in total).

I have already preprocessed my files and counted negative and positive words using the LM dictionary (2011).

I also want to calculate the frequency of bigrams, e.g. "2 years", "upcoming period", etc.

Is my process right?

  1. I created bigrams from the original files (all 660 reports)
  2. I have a dictionary of around 35 bigrams
  3. I check how often each bigram in the dictionary occurs in the files (all reports)

Is there any code available for this kind of process?

Thank you
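
One way the three steps above could be sketched, using only the Python standard library: tokenize each report, pair adjacent tokens into bigrams, and count only the pairs that appear in the bigram dictionary. The function name count_target_bigrams, the two-entry dictionary, and the sample report text are all hypothetical, chosen for illustration. The tokenizer is also an assumption: it keeps digits so that "2 years" survives as a pair, which only works if earlier preprocessing has not already removed numbers or stop words.

```python
import re
from collections import Counter

def count_target_bigrams(text, target_bigrams):
    """Count how often each bigram in target_bigrams occurs in text."""
    # Lower-case and keep alphanumeric tokens, so "2 years" becomes ("2", "years")
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Pair each token with its successor to form bigrams
    all_bigrams = zip(tokens, tokens[1:])
    # Tally only the bigrams that appear in the target set
    return Counter(b for b in all_bigrams if b in target_bigrams)

# Hypothetical bigram dictionary and report text, for illustration only
targets = {("2", "years"), ("upcoming", "period")}
report = (
    "Over the next 2 years we expect growth. "
    "In the upcoming period, and for 2 years after, costs fall."
)

counts = count_target_bigrams(report, targets)
print(counts)
```

Running the function once per report and storing each Counter under the report's file name would give a per-report frequency table. Note that this sketch also pairs tokens across sentence boundaries; splitting into sentences first would avoid that if it matters.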
