shawngraham/intro-text-analysis-topic-modelling.md


      
  title: A Gentle Introduction to Text Analysis and Topic Modeling
  author: Shawn Graham
  date: February 3rd, 2016
  
  
Introduction

So much data has been made available online for historians - everything from court trials (The Old Bailey Online) to newspaper articles (Scissors and Paste), to all 71 volumes of the Jesuit Relations, to over 3.9 million tweets sent during the last election (Ian Milligan & Nick Ruest).
How do we begin to deal with this data? We do it the same way we do with all of our historical information: we consider its context and the patterns we find within it. Happily, we don't have to do this alone: we can 'not read' this information and see what patterns stand out. In a way, it's a bit like those 'magic-eye' cartoons the newspapers used to print. If you squinted the right way, patterns would suddenly jump out and you'd be able to see the sense in all the noise.
The problem for us as historians is figuring out how to ask the right questions of all this information, and how to get our computers to do the 'squinting' for us. Happily, it does not require too much effort on our part to get the data into various programs that can do that for us. What does require effort on our part is to understand and contextualize the patterns so discovered.
Just because you're going to use the computer to do some of the heavy lifting in this exercise does not absolve you of having to use your historian's craft!
Techniques from text analysis and 'natural language processing' allow us to ask our computers to find and highlight interesting patterns. In this exercise, you will search for interesting patterns in the 71 volumes of the Jesuit Relations from the 17th and 18th centuries.
If you do not have a computer

If you do not have a computer, you can use the History Department's computers in the Underhill Research Room (beside the kitchen). These exercises can also be done on the library computers since you are not installing any software. Otherwise, please consult Dr. Nelles on suitable alternatives.
A small bit of background

It is astonishing how far you can get just by counting words. When we count up words used in historical documents, and consider not just how many different words are used, but also their position relative to other words, 'latent' patterns in the texts can be discovered. English and most European languages communicate meaning through word order: Elvis left the building means something quite different than the building left Elvis. So if we consider not just counts of words, but the words' positioning with regard to the words closest to them, and indeed, every other word in the entire corpus, the computer is able to detect patterns in word use that you or I would not spot. We can call this 'distant reading'. Once we find interesting patterns, we dive back in for our more accustomed 'close' reading. Then we re-adjust our expectations, zoom back out, run the computer again, and so on: a kind of virtuous cycle.
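The Elvis example can be tried out in a few lines of Python. This is only a minimal sketch: the function names `tokenize` and `bigrams` are made up for illustration, and splitting on spaces is a much cruder way of finding words than the tools below use. It shows that the two sentences contain exactly the same words, and differ only in which words sit next to each other.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def bigrams(tokens):
    """Pair each word with the word that follows it."""
    return list(zip(tokens, tokens[1:]))

a = tokenize("Elvis left the building")
b = tokenize("the building left Elvis")

# Counting words alone cannot tell the two sentences apart...
same_counts = Counter(a) == Counter(b)
# ...but looking at each word's neighbour can.
same_neighbours = set(bigrams(a)) == set(bigrams(b))

print("same word counts:", same_counts)
print("same neighbouring pairs:", same_neighbours)
```

Counts alone treat the two sentences as identical; it is only when neighbours are taken into account that the difference shows up.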
When your computer is running, trying to detect 'topics', it cycles through a procedure of guessing and checking again and again, until it finds that the probabilities can't be further refined. In the end, you have 'bags' of words that we humans can interpret as topics. Then we can look at each document in your corpus and see to what degree the words within it (and their positioning) suggest that different topics were drawn upon to write it.
Tools we'll be using

In exercise 1, for dealing with patterns in word use, and local positioning of words (i.e., looking at the context of a word's use), we will shortly be exploring the online resource Voyant Tools.
In exercise 2, for understanding the kinds of patterns of thought within a corpus, we do something a bit more complicated. I refer you to the sections of Graham, Milligan and Weingart's book on 'topic modeling' and on the tool we're going to be using. The tool we'll use is called the 'Topic Modeling Tool'. It is written in Java, and can be run on any computer without having to be 'installed'. You simply unzip it, then double click on the icon to get started.
But first: Voyant Tools.
Exercise 1: Counting Words to get a Distant View

Voyant Tools was built by Stefan Sinclair and Geoffrey Rockwell. The main screen interface lets you cut and paste text into it, or the address of a website, which it will then read.
You can also upload a zip folder containing your documents by clicking the 'upload' button. As it happens, I have already uploaded the Jesuit Relations for you at http://voyant-tools.org/?corpus=1454606740871.6583. Click on that link to get started.
The top left window contains a 'word cloud' of the most frequently occurring words. You've seen this kind of thing before, no doubt. A word cloud is just a histogram where, instead of bars on a graph to represent the frequency, the words themselves are arranged so that the most frequent words are biggest. At the moment, words like 'the', 'and', 'if', 'of', 'but' etc. are the biggest. These are what are called 'stopwords'. We want to filter these out because they occur so often in English that they occlude patterns.

click on the cogwheel icon in the word cloud (also known as 'cirrus') tool panel.
in the dropdown menu, select 'English (Taporware)'. This is a dictionary of the most common English words.
tick the box that says 'apply stopwords globally'.
click 'ok'

The word cloud will reload, and words like 'god', 'father', 'france', 'iroquois' now stand out. But there are also words like 'nbsp', '#', 'vol', 'page'. These words have crept in because the source text had underlying html (the markup language which displays webpages) still in it! So let's filter those words out.

click on the cogwheel icon in the Cirrus tool window.
in the dropdown menu, select 'English (Taporware)'.
click on 'edit stop words'.
in the dialogue box that opens, put your cursor before the ! and hit enter.
in the empty space type in nbsp. hit enter. Type in #,  vol, and page also, each word on its own line.
at the bottom of the dialogue box, type in a name for this stopword list you've just created: mylist
Your list now appears as the selected stopword list in the dialogue box. Click 'apply stopwords globally'. Hit 'ok'
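If you are curious what the stopword filter is doing behind the scenes, the idea can be sketched in Python. This is an illustration only, not Voyant's own code: the stopword list here is a tiny made-up one, the sample sentence is invented, and `word_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# A tiny, made-up stopword list; the list you selected in Voyant is far longer.
STOPWORDS = {"the", "and", "if", "of", "but", "to", "a"}
# Markup residue added by hand, as in the steps above.
CUSTOM = {"nbsp", "#", "vol", "page"}

def word_counts(text, stopwords):
    """Count the words in a text, skipping anything in the stopword set."""
    words = re.findall(r"[a-z#]+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# An invented sentence standing in for a page of the corpus.
sample = "The father of the Iroquois nbsp and the father of France vol page"
counts = word_counts(sample, STOPWORDS | CUSTOM)

# The biggest words in a word cloud are simply the top of this list.
print(counts.most_common(3))
```

Adding your own words to the list works the same way as adding them in the dialogue box: anything in the set is simply never counted.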

Q1.1 What does this word cloud tell you about the contents of the Jesuit Relations?
Look at the Summary box. This box tells you how many words are in the corpus, and how many words appear only once. It will tell you which documents are longest, and which are shortest. It will tell you which document has the highest vocabulary density (the number of unique words compared to the total length of the document) and which has the lowest. If you click on any of the document names (e.g. relations_45) the text will be brought up in the central panel. The blue 'sparklines' are mini graphs representing the length of the entire document (left hand side is the beginning, right hand side the end), showing how the pattern changes. If you click the 'maximize' icon in the top right of the 'Summary' box, you will get this panel opened in a new window.
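The density measure mentioned in the Summary box is easy to work out by hand. Here is a minimal sketch, assuming density means unique words divided by total words; the function name `vocabulary_density` and the two sample phrases are invented for illustration, and Voyant may tidy up the text differently before counting.

```python
def vocabulary_density(text):
    """Unique words divided by total words (0.0 for an empty text)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Two invented five-word 'documents'.
repetitive = "god god god father god"
varied = "the mission reached the river"

print(vocabulary_density(repetitive))
print(vocabulary_density(varied))
```

A text that keeps returning to the same few words scores low; one that rarely repeats itself scores close to 1. Note that long documents tend to score lower simply because they have more room to repeat words.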
Q1.2 What do you glean about the contents of the Jesuit Relations based on the distinctive words in the corpus? Remember also that Vol 1 comes earlier in time than Vol 71.
Look at the Words in the Entire Corpus box. You might have to click on the arrow button to open it up. If you're seeing 'the', 'of', 'to', 'and' etc., you'll have to click on the cogwheel icon for this box to select the 'stopwords' list to filter these out. This box shows the most frequently occurring words, with a little sparkline to show how the word appears across the entire corpus. Find the word iroquois. Tick the radio box beside it.
Whoa! A new box called Word Trends opens in the top right of your browser. Find the 'maximize' button at the top of this tool's title bar. Click it. The graph you're now looking at shows the relative frequencies of 'iroquois' per volume of the Jesuit Relations. Sometimes, though, it's not the frequency of one word that is interesting, but rather its frequency in relation to some other word. Go back to the Words in the Entire Corpus box in your original Voyant browser tab. Tick the huron box, the savages box, and the poor box. In the Word Trends box click maximize so that you can see the patterns clearly.
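A 'relative frequency' is just a word's count divided by the length of the text it appears in, which lets you compare long and short volumes fairly. The sketch below assumes that simple definition (Voyant may scale its numbers differently); the three 'volumes' are invented one-line stand-ins and `relative_frequency` is a hypothetical helper.

```python
def relative_frequency(word, text):
    """Share of a text's words that are the given word."""
    words = text.lower().split()
    return words.count(word) / len(words) if words else 0.0

# Invented stand-ins for three volumes, in chronological order.
volumes = [
    "the iroquois came and the huron fled",
    "the huron and the huron mission",
    "the iroquois and the iroquois war",
]

# One series of values per word, one value per volume.
trends = {
    word: [relative_frequency(word, text) for text in volumes]
    for word in ("iroquois", "huron")
}

for word, series in trends.items():
    print(word, [round(value, 2) for value in series])
```

Each list printed here corresponds to one line on the Word Trends graph: reading it left to right is reading the corpus from the earliest volume to the latest.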
Q1.3 What pattern do you see? From this distance, how are the Huron being discussed? Make a note of which volumes seem significant to you.
Finally, if in your main Voyant interface, you click on any of the points in the Word Trends window, a new window will open beneath called Keywords in Context. This is also sometimes known as a concordance. This will show your keyword ('hurons', perhaps) and the words to the immediate left and right of it. You can use this tool to drill deeper into the corpus to see if the patterns that seemed evident before are substantiated.
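A concordance is simple enough to sketch yourself. The version below is an illustration under a few assumptions: words are split on spaces, punctuation is only stripped when matching, and the sample sentence is invented rather than quoted from the Jesuit Relations. The name `keyword_in_context` is made up for this example.

```python
def keyword_in_context(text, keyword, width=3):
    """Return (left, hit, right) for every occurrence of the keyword."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        # Ignore case and trailing punctuation when matching.
        if word.lower().strip(".,;:!?") == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, word, right))
    return hits

# An invented sentence, for illustration only.
sample = "The Fathers wintered among the Hurons, and the Hurons received them kindly."

hits = keyword_in_context(sample, "hurons")
for left, hit, right in hits:
    print(f"{left:>20} | {hit} | {right}")
```

Changing `width` widens or narrows the window of neighbouring words, which is a quick way to move between a glance and a closer look.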
One last thing - every window in Voyant has a 'save' icon. If you click on this, you will be presented with a variety of options for preserving your data - from a unique URL (website address) that you can share with your reader to image files (.png format) or data tables (for spreadsheets; that is, .csv files).
Q1.4 Now that you are familiar with the Voyant interface, explore the Jesuit Relations. Identify patterns that seem curious to you. Explain what you are seeing and why it is of note. Explain how you could use this observation to begin a research project. Max. 250 words; include screenshots and URLs.
Exercise 2: Looking for Topics

Let's now take a global view of the patterns in the Jesuit Relations.

Download a local copy of the Jesuit Relations. Unzip that folder.
Download the Topic Modeling Tool from this link
Double click the file you just downloaded to run it.
In the dialogue box, set the number of 'topics' you think might be in the corpus. 20 is a good number to start with.
Click 'select input File or Dir'. Select the folder containing the Jesuit Relations, and hit 'choose' when you've got the folder highlighted.
Click on 'select output dir' to select where the results of the analysis will be put.
Click 'Learn Topics.'  The 'console' window will fill up with status messages as the program runs. The program starts off by making a best guess as to how the words should be sorted into different topics, and then it compares the guess to the document to see if the guess could work. It uses the results of that comparison to make another guess, and so progressively works out a likely distribution of words in topics. (See 'The Macroscope' for a fuller explanation).
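The guess-compare-guess-again cycle can be sketched in Python. What follows is a toy version of one common approach (collapsed Gibbs sampling for a topic model); it is an assumption that this resembles what the tool does internally, and the function name `toy_topic_model`, its settings (`alpha`, `beta`, a fixed number of `iterations`) and the four tiny 'documents' are all invented for illustration. A real tool is far more careful about when to stop.

```python
import random

def toy_topic_model(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Sort words into topics by repeated guessing (collapsed Gibbs sampling)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # First guess: give every word in every document a random topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Tally the guess: topics per document, words per topic, topic sizes.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this word out of the tallies...
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # ...score each topic by how well it suits the document and the word...
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                # ...and make a new, better-informed guess.
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] = topic_word[t].get(w, 0) + 1
                topic_total[t] += 1
    return doc_topic, topic_word

# Four invented mini-documents, each a list of words.
docs = [
    "faith god christians death faith god".split(),
    "canoe river portage canoe river lake".split(),
    "god faith death christians".split(),
    "river canoe lake portage".split(),
]

doc_topic, topic_word = toy_topic_model(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    present = [w for w in counts if counts[w] > 0]
    top = sorted(present, key=counts.get, reverse=True)[:3]
    print(k, " ".join(top))
```

The two things returned mirror the tool's output: `topic_word` holds the words gathered under each topic, and `doc_topic` records how many of each document's words were assigned to each topic. Because the first guess is random, a different `seed` can give a somewhat different result, which is also true of the real tool.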

When the console reports 'PROCESS COMPLETE', the program has written its output to two folders in the location you selected in step 6 ('select output dir'). One of these folders contains 'csv' files (tables of data) that give the percentage composition of each topic for each volume of the Jesuit Relations.
Look at the console window. There will be a numbered list (0 - 19 if you asked for 20 topics), with strings of words beside the numbers. These are the key words for each topic. One topic that appears when I run this on my computer is faith god christians death, which seems to be a topic relating to the Jesuits' spiritual concerns. If you open the 'output_csv' folder, and the 'TopicsInDocs.csv' file, you'll see a table where the rows are the individual volumes of the Jesuit Relations, and the columns are the percentage that each topic contributes to that volume. With this information, I could graph the rise and fall of faith god christians death through the corpus - which might tell a very interesting story indeed!
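A spreadsheet is the easiest way to make that graph, but a few lines of Python will do it too. The sketch below does not read the real file: it uses a small invented table in the layout described above (one row per volume, one column per topic), and the column names `volume`, `topic_0` and `topic_1` are assumptions. Check the header row of your own 'TopicsInDocs.csv' before adapting it, since the actual layout may differ.

```python
import csv
import io

# Invented numbers in the layout described above; not real results.
SAMPLE = """volume,topic_0,topic_1
relations_01.txt,0.10,0.60
relations_02.txt,0.25,0.40
relations_03.txt,0.55,0.15
"""

def topic_series(csv_text, topic_column):
    """Return (volume, share) pairs for one topic, in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["volume"], float(row[topic_column])) for row in reader]

series = topic_series(SAMPLE, "topic_0")

# A rough text 'graph': one bar of # signs per volume.
for volume, share in series:
    print(f"{volume:18} {'#' * round(share * 40)}")
```

To use a real file, you would replace `io.StringIO(csv_text)` with an opened file; that path is left out here because it depends on where you chose to save the output.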
In the 'output_html' folder, click on the file 'all_topics.html'. This will open up a webpage with the twenty topics and their keywords listed. If you click on a topic, you'll then be presented with a ranked list of the volumes of the Jesuit Relations where that topic was the most important (biggest contribution).
If you then click on a document in this list (remember, each .txt file is a complete volume of the Jesuit Relations) you'll be presented with a little preview window of the text, and then the full break down for that document of all the contributing topics.
In this way, you can cycle between a distant and close reading.
Q2.1. What topic strikes you as most 'important' in the Jesuit Relations? Why is it important? What does reading these documents 'distantly' tell you? Is there a compelling story you've discovered in the volumes? What is it?
Another set of questions to reflect on: what would happen if you split the documents up by page, by year, or by some criterion other than volume? What happens if you ask the computer to find 40 topics? 80 topics? 10 topics? What number captures the thematic variety in this corpus of material?
Exercise 3: Writing History

Observing interesting patterns leads to asking interesting questions. The most interesting question of all is 'why?'
Q3. In max 750 words, what interesting patterns do you see in these materials when you use both Voyant Tools and the Topic Modeling Tool? Why are they interesting? How do these methods transform your historian's craft? What do you think you'd need to know more about in order to incorporate these tools into your research more fully - can you trust them?