(assumes a working Anaconda installation)
- Create an Anaconda environment for this workshop: `conda create -n scraping python=3.8`
- Activate the `scraping` environment: `conda activate scraping`
- Install `nltk` into the environment: `pip install nltk`
- Install `matplotlib` into the environment: `pip install matplotlib`
- Run `jupyter notebook` in the environment: `jupyter notebook`
- Open a new Python 3 Notebook
- In a notebook cell: `import nltk`
- Download the `nltk` book collection: run `nltk.download()`, find and select `book`, and hit `download`
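If you would rather skip the downloader GUI, the same collection can be fetched by its identifier (a small optional sketch; `book` is the identifier selected above):

```python
import nltk

# Fetch the "book" collection without opening the interactive downloader.
nltk.download('book')
```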
- Get a sample text! I like using Project Gutenberg for this.
- In the search field, search for an older book that you think might be in Project Gutenberg; for example, Herman Melville's Moby Dick or Lewis Carroll's Alice's Adventures in Wonderland. Take a moment to find a text to work with or, if you want, just follow along with me using Alice's Adventures in Wonderland
- We are looking for a `.txt`, or Plain Text, file. Once you have found it, click on it to open it in your browser. For example, the Plain Text version of Alice's Adventures in Wonderland is at https://www.gutenberg.org/files/11/11-0.txt
- Next we open our Terminal and use `curl` to download the `.txt` file from the URL, saving the output to a text file on our computer: `curl https://www.gutenberg.org/files/11/11-0.txt --output alice.txt`
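If you'd rather stay inside the notebook, the same download can be done in Python. A minimal sketch using only the standard library (same URL and filename as the `curl` example):

```python
from urllib.request import urlretrieve

# Same effect as the curl command above: fetch the Plain Text file
# and save it next to the notebook as alice.txt.
urlretrieve("https://www.gutenberg.org/files/11/11-0.txt", "alice.txt")
```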
- In Python, import `nltk`: `import nltk`
- Import your text file into Python:

```python
filename = "alice.txt"

# Read the whole file into one string. Project Gutenberg files are
# UTF-8, so we say so explicitly; the with-block closes the file for us.
with open(filename, 'r', encoding='utf-8') as f:
    corpus = f.read()
```
- Before moving forward, let's double-check that the text file looks right:

```python
print(corpus)
```
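If printing the whole book floods the notebook, a lighter check is to peek at a slice (optional; the `500` is an arbitrary cutoff):

```python
# Peek at the first 500 characters and the total length instead of
# dumping the whole book.
print(corpus[:500])
print(len(corpus), "characters")
```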
- A text is composed of tokens: words, sentences, punctuation, etc. `nltk` provides a variety of pre-built functions for tokenization, but the two that you will see used most frequently are word and sentence tokenization. We can split our corpus into words with

```python
from nltk.tokenize import sent_tokenize, word_tokenize
print(word_tokenize(corpus))
```

or sentences with

```python
from nltk.tokenize import sent_tokenize, word_tokenize
print(sent_tokenize(corpus))
```
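To see what the word tokenizer actually does, try it on a single sentence first (a tiny illustrative example, using the opening words of Alice):

```python
from nltk.tokenize import word_tokenize

# Punctuation comes back as tokens of its own, which will matter
# later when we start counting word frequencies.
print(word_tokenize("Alice was beginning to get very tired."))
# ['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', '.']
```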
- Run both of the tokenization processes again and store the results in variables for later use:

```python
from nltk.tokenize import sent_tokenize, word_tokenize
alice_sents = sent_tokenize(corpus)
alice_words = word_tokenize(corpus)
```
- Check the contents of `alice_sents` and `alice_words` with `print`:

```python
print(alice_sents)
print(alice_words)
```
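A lighter-weight check (optional) is to print the sizes instead of the full lists:

```python
# Quick sanity check: just the sizes, not the contents.
print(len(alice_sents), "sentences")
print(len(alice_words), "word tokens")
```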
- `nltk` has a built-in function for analyzing how frequently words occur in a corpus: `FreqDist()`. Run it on your list of words and store the result in a variable called `freq`:

```python
freq = nltk.FreqDist(alice_words)
```
- Check the output of `FreqDist()`:

```python
print(freq)
```
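`FreqDist` behaves like a dictionary keyed by token, so you can also look up individual counts (`'Alice'` and `'Jabberwock'` are just example tokens):

```python
# Dict-style lookups; tokens absent from the corpus return 0
# rather than raising an error.
print(freq['Alice'])
print(freq['Jabberwock'])   # 0 for a token that never appears
```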
- For a more specific look at the problem here:

```python
freq.most_common(5)
```

The code above results in the following (assuming you are using Alice's Adventures in Wonderland as input):

```
[(',', 2574), ('the', 1686), ('“', 1118), ('”', 1114), ('.', 911)]
```

Punctuation and extremely common function words dominate the counts, which tells us very little about the text itself.
- We can modify our code a bit to try to condition, or clean, our data:

```python
from nltk.corpus import stopwords

# Load the stopword list once, as a set, so lookups are fast.
sr = set(stopwords.words('english'))

# Work on a copy: removing items from the same list we are
# iterating over would silently skip tokens.
clean_tokens = alice_words[:]
for token in alice_words:
    if token in sr:
        msg = ' '.join(['removed', str(token), 'from corpus'])
        print(msg)
        clean_tokens.remove(token)

freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
freq.plot(20, cumulative=False)
```

Note that the stopword list is all lowercase, so capitalized forms like "The" slip through this filter. Even when we filter out stopwords, though, we still have a ton of symbols left over to clean out. Perhaps another approach would be better!
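One such alternative (a sketch, not part of the original walkthrough) is to keep only alphabetic tokens, which drops the punctuation and curly quotes in the same pass as the stopwords:

```python
from nltk.corpus import stopwords

sr = set(stopwords.words('english'))

# str.isalpha() is False for punctuation like ',' and '“', so this
# single pass removes symbols and (case-insensitively) stopwords.
clean_tokens = [t for t in alice_words if t.isalpha() and t.lower() not in sr]

freq = nltk.FreqDist(clean_tokens)
freq.plot(20, cumulative=False)
```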
- Run the parts-of-speech tagger in `nltk` and store the output in a variable called `alice_pos`:

```python
alice_pos = nltk.pos_tag(alice_words)
```
- Check to make sure that the output looks sane:

```python
print(alice_pos)
```
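Printing the full tagged list is a lot of output; a slice shows the shape of the data just as well (the exact tags you see will depend on the tagger model):

```python
# Each element is a (token, tag) pair, e.g. something like
# ('Alice', 'NNP') for a proper noun.
print(alice_pos[:10])
```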
- The parts-of-speech tagger outputs a list of `tuple`s, an immutable (un-changeable) datatype that resembles a list. It's a lot easier to deal with lists in Python than tuples (just trust me on this), so the first step is to convert each tuple into a list, giving us a 2-dimensional `list`:

```python
alice_pos = list(map(list, alice_pos))
```
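If the `map` call reads as opaque, an equivalent list comprehension does the same job (purely a style preference):

```python
# Convert each (token, tag) tuple into a mutable [token, tag] list.
alice_pos = [[token, tag] for token, tag in alice_pos]
```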
- Next, we repurpose our previous example to identify every instance of a period, which we eventually want to remove, but will test first with the following:

```python
clean_pos = alice_pos[:]   # copy, so later removals don't touch alice_pos
for token in alice_pos:
    if token[1] == ".":
        msg = ' '.join([str(token[0]), "is a ."])
        print(msg)
```
- The above code reliably identifies ends of sentences (the `.` tag apparently also covers `!` and `?`), so let's go ahead and remove each token that meets this criterion from `clean_pos`:

```python
# Iterate over alice_pos while removing from the copy, clean_pos;
# removing from the list being iterated over would skip tokens.
clean_pos = alice_pos[:]
for token in alice_pos:
    if token[1] == ".":
        clean_pos.remove(token)
```
- Check to make sure our output looks sane (there shouldn't be any `.`):

```python
clean_pos
```
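For a programmatic version of that eyeball check (optional), count the `.`-tagged tokens left over; it should print `0`:

```python
# Count how many '.'-tagged tokens remain in clean_pos.
print(sum(1 for token in clean_pos if token[1] == "."))
```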
- So right now we could iteratively update the symbol to look for, or we could rewrite our code slightly. In the following we have a list of symbols and check every token against that list. If a token's tag matches anything on the symbol list, we remove it from `clean_pos`:

```python
clean_pos = alice_pos[:]   # start again from a fresh copy
symbols = ['$', "''", '(', ')', ',', '--', '.', ':', 'SYM', "``", '#']
for token in alice_pos:
    for symbol in symbols:
        if symbol == token[1]:
            clean_pos.remove(token)
```
- Check to make sure our output looks sane (there should be substantially fewer symbols):

```python
clean_pos
```
- And now let's remove all stopwords:

```python
from nltk.corpus import stopwords

sr = set(stopwords.words('english'))   # load the list once, as a set

cleaner_pos = clean_pos[:]   # copy again before removing
for token in clean_pos:
    if token[0] in sr:
        msg = ' '.join([str(token[0]), "is a stopword"])
        print(msg)
        cleaner_pos.remove(token)
```
- And now let's unzip our list, run `FreqDist()` again, and graph the results:

```python
a, b = zip(*cleaner_pos)   # a holds the words, b holds the tags
freq = nltk.FreqDist(a)
freq.plot(20, cumulative=False)
```
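If `zip(*...)` looks mysterious: it transposes a list of pairs into parallel sequences. A toy example (made-up data, not from the corpus):

```python
pairs = [['Alice', 'NNP'], ['fell', 'VBD']]
words, tags = zip(*pairs)
print(words)   # ('Alice', 'fell')
print(tags)    # ('NNP', 'VBD')
```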
- We're still getting some information in the output that we don't really want, so let's try throwing out everything but nouns next:

```python
only_nouns = []
nouns = ['NN', 'NNP', 'NNPS', 'NNS']   # the Penn Treebank noun tags
for token in cleaner_pos:
    for tag in nouns:
        if tag == token[1]:
            only_nouns.append(token[0])
```
- Finally, let's graph our noun results:

```python
freq = nltk.FreqDist(only_nouns)
freq.plot(20, cumulative=False)
```
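If you also want the numbers behind the plot (optional), `most_common()` prints them directly:

```python
# Print the ten most frequent nouns alongside the plot.
print(freq.most_common(10))
```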