@jlooper
Last active December 3, 2020 01:45

Chapter-1-Emily
Emily Dickinson and the Meter of Mood

I tie my Hat—I crease my Shawl—
Life's little duties do—precisely—
As the very least  
Were infinite—to me—
   
I put new Blossoms in the Glass—
And throw the old—away—
I push a petal from my gown  
That anchored there—I weigh  
The time 'twill be till six o'clock  
I have so much to do—
And yet—Existence—some way back—
Stopped—struck—my ticking—through— 

(Image: Daguerreotype of Emily Dickinson, from Mount Holyoke)

The enigmatic young lady staring directly into our eyes in the famous daguerreotype of Emily Dickinson challenges us. What is she thinking, with her slightly pursed lips, small nosegay, and plain dress? Perhaps she is composing another of the nearly 1,800 poems she wrote in her lifetime. Perhaps she is thinking of her garden, for which she was more famous than for her poetry during her life. Perhaps she is pondering the many things she has to do before six o'clock.

Dickinson is, of course, drawing on the "carpe diem" trope in the poem above. As Robert Pinsky noted, the poet is well known for her somber, "steely perception" that time runs on. Her style and vocabulary have led readers to portray her as a negative, austere spinster poet, and critics have dismissed her as a hysterical or depressive recluse, the "lady in white". It is easy to read through some of her poems and find darkness:

A train went through a burial gate,
A bird broke forth and sang,
And trilled, and quivered, and shook his throat
Till all the churchyard rang;

And then adjusted his little notes,
And bowed and sang again.
Doubtless, he thought it meet of him
To say good-by to men.

As a writer who prized her privacy over publicity, Dickinson became known only when those who outlived her recognized her genius. Her relatives and friends began publishing her poetry shortly after her death, in the 1890s. She is now recognized as one of America's great poets, on a par with Walt Whitman. Picking through a large corpus of poetry to derive meaning and mindset can, however, lead to faulty first impressions. Coupled with bias against the type of Victorian women poets parodied by Mark Twain, much of what we know or remember of Dickinson is ripe for re-evaluation.

About Emily Dickinson's Poetics

According to the Emily Dickinson Lexicon, the poet wrote "over 1,789 poems from 1850-1886. She wrote over 1,046 letters from 1842-1886. The collected poems contain over 9,275 unique words and nearly 100,000 word occurrences." Her most prolific period as a poet was from 1858-65. 

While some of her poems appeared during her life, most were published after her death. Her younger sister, Lavinia, discovered the bulk of her poetry in a box of papers. Her poems are written on scraps of paper, bound into booklets ("fascicles"), or might even have wrapped a bouquet from her celebrated garden. The digitized collection kept by Amherst College shows the varied formats of the writing. Early partial editions, heavily edited, were published in the 1890s and again in the 1920s. The first scholarly edition of her poetry is the 1955 edition of The Complete Poems of Emily Dickinson, edited by Thomas H. Johnson. 

From a stylistic standpoint, Dickinson's poetics were revolutionary. She was prone to writing poetry without titles and with nonstandard punctuation. Her poetry is probably best known for its fascinating use of 'slant rhyme', pairing line endings by either shared consonants or vowels. In the sample below, "rides" pairs with "is" by the shared 's' consonant. The effect is often dissonant and startling, but not without charm:

A narrow fellow in the grass
Occasionally rides;
You may have met him,--did you not?
His notice sudden is.

Scholarly reception of the poet has evolved over the years just as the editions of the poetry have evolved. Early critics dismissed the work after its initial publication. Dickinson's rejection of 19th century form, however, led more recent literary critics to label her a modernist. Feminist readings such as that of Adrienne Rich have raised her as an iconic woman writer. Currently, a large body of articles continues to uncover new facets of this fascinating author. The Emily Dickinson Journal is dedicated to her scholarship and is sponsored by the Emily Dickinson International Society (EDIS).

Questions to Reconsider

What questions can we ask of this historical poetic corpus? How can data mining and machine learning techniques help us unlock new aspects of an author whose work defied categorization? 

Let's start with the data. This poet's data is comprised of thousands of words in nineteenth-century American English. If we could discover, scrape, or otherwise gather a good dataset of the nearly 1800 poems, it would produce an interesting exercise in data mining. But what thorny question about Dickinson can data mining answer?

Can data mining help answer controversial questions about Dickinson's state of mind as reflected in her poetry? An interesting line of inquiry is suggested by John McDermott, MD. In an article in the American Journal of Psychiatry, the author suggests that Dickinson may have suffered from bipolar and seasonal affective disorders. He proposes that she was strongly affected by changes in the seasons and that her moods are reflected in her output. Setting aside the risk of doing the artist a disservice by assessing her mental health purely through text analysis, let's see whether data mining and machine learning can predict, from the sentiment of a poem, the season of the year in which it might have been written.

Acquiring the Dataset

The problem with data science is...it requires data! To work with this poetry in a digital setting, you will need to find a way to scrape an adequate dataset from a reliable resource. Not all of the large amount of Emily Dickinson poetry available on the internet is appropriate for our purpose.

Ideally, there would exist a high-quality web site containing an authoritative version of each poem, from which the researcher could use a web scraping technology to download the poems in a consistent format. In practice, various web sites offer content of varied quality.

The PoetryDB API is a handy tool that allows a user to gather poems via an endpoint provided by the API (Application Programming Interface). APIs are a useful connection between a database and a web browser. To get a listing of several poems by Emily Dickinson, [visit the poetrydb](https://poetrydb.org/author/Emily%20Dickinson) in a web browser with an appropriately-formatted URL. This database, however, appears to be crowdsourced; it is unclear from where the poems were acquired. The purpose of this database is to inspire today's poets, not necessarily to provide scholars with datasets.

The 1891 Loomis edition of poetry is available via Project Gutenberg as an HTML page. However, since it is well documented that the poems were heavily edited in this edition, it is not as useful for the data scientist bent on analyzing vocabulary.

The 1924 Bianchi edition available online contains 593 poems and is thus incomplete; however, it is available at Bartleby in a scrapable format.

The 1955 Johnson edition was the first modern edition of the full corpus of poems. Importantly, Johnson attempted to assign a chronology by year to the poetry. A scrapable dataset is available but it would need considerable cleaning to render it usable. It is presented in one flat file and contains typographical errors.

The first truly scholarly edition of the poems is the three-volume 1998 Variorum edition by R. W. Franklin. It is available to the would-be data scientist by means of the brilliant edickinson.org project. This online database contains a wealth of data obtained from the several editions and, most importantly, the Variorum. It includes metadata by Franklin, who attempted to assign seasons and years to many of the poems. This is a treasure trove!

Unfortunately, the dynamic panel layout of the edickinson web site does not lend itself to being scraped easily. In addition, the Variorum edition includes all the variants of the poems and many notes, so the data would need to be cleaned to reflect an authoritative version for linguistic analysis. Still, given the quality and completeness of this edition and the web site, it is a critical tool for our project.

R. W. Franklin, most importantly for our purposes, has attempted to assemble and tentatively date the poems. His edition is as accurate as possible, given Dickinson's habits of rewriting poems and destroying the originals. Franklin, who dedicated years to untangling this poetic corpus, attempted to assign seasons or parts of a year to the poems based on observation of minutiae such as stamps on papers. McDermott relied on this edition to help him plot the rise and fall in output, season over season, of Dickinson as a writer. 

Our method, then, to determine whether Dickinson's vocabulary reflects seasonal mood changes, will have to rely on the scrapable yet incomplete non-scholarly digital edition of her poetry on Bartleby.com. We can then cross-check these poems against the Variorum edition via edickinson.org. In doing so we can test whether perceived 'negative' or 'positive' vocabulary, as determined by a machine learning algorithm, can be plotted against Franklin's estimation of the season in which a poem was written.

Now that we have decided which dataset we will use, and how we will cross-check it for periodicity, we can start the process of data mining.

Scraping for data

Note: web scraping can cause problems for servers, as it amounts to a burst of traffic from one source. Some site maintainers use a robots.txt file placed at the root of a web site to state which paths crawlers may visit; if it disallows your target pages, ask permission before scraping. Be a responsible scraper: try to get your script right the first time so as not to inconvenience other web viewers. Also, make sure that the data you are scraping can be used for your intended purpose. Read through a web site's terms of service before scraping.
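Before pointing a spider at a site, you can check its robots.txt programmatically. A minimal sketch using Python's standard-library robotparser; the rules below are invented for the example, and in practice you would load the target site's real robots.txt with set_url() and read():

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse sample rules in place of a fetched robots.txt file
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch reports whether a given user agent may crawl a URL
print(rp.can_fetch("*", "https://example.com/113/indexlines.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/page.html"))    # False
```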

To scrape a web site, there is some open source software that comes in handy. It may be tempting to try a browser extension or other service, but they can become costly and often do not work well. Scrapy is relatively easy to use as long as you learn how to manage the Python scripts underlying it. 

Scrapy is free and well-documented, and can crawl through a web site if permitted. It can output a file in .json format that you can then convert to a .csv for data mining. 

First, make sure that the tool is installed on your computer by following the official installation instructions. Once it is installed, you can initialize a new Scrapy project. In your computer's terminal or command line, type: scrapy startproject emily. Several files are created, including a .cfg config file and a folder called emily within the base emily folder.

Open this project in the code editor of your choice. You can use the free Visual Studio Code, which includes a Python editor and can be used to run Python scripts from its built-in terminal.

Once Scrapy generates the project, go to the /spiders folder and create a new file: emily-spider.py. Inside this new file, add the following code (you can copy from GitHub):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# command to run: scrapy runspider emily/spiders/emily-spider.py -o emily.json

class PoemItem(scrapy.Item):
    poem_title = scrapy.Field()
    poem_text = scrapy.Field()

class PoemSpider(CrawlSpider):
    name = "poems"
    allowed_domains = ["www.bartleby.com"]
    start_urls = ["https://www.bartleby.com/113/indexlines.html"]

    rules = (Rule(LinkExtractor(allow=("113")), callback="parse_item"),)

    def parse_item(self, response):
        item = PoemItem()
        item["poem_title"] = response.xpath("//title/text()").get()
        item["poem_text"] = response.xpath(
            "//table/tbody/tr/td/table/tbody/tr/td/text()"
        ).getall()
        return item

Running the command scrapy runspider emily/spiders/emily-spider.py -o emily.json from within the outer emily folder will generate a .json file with all the Dickinson poems on the Bartleby site. You will need to clean the data, as there are stray carriage returns (\n) and other unnecessary text scraped from the HTML.

It is useful to understand how Scrapy works so that you can generate datasets with this tool. In your browser, right-click on a web page and choose 'Inspect' to open Developer Tools, where you can examine the HTML markup used to build the site. If you right-click on an HTML element that you want to scrape, a popup menu opens; choose 'Copy > Copy XPath'. The XPath, the path to the selected HTML element, is copied to your clipboard for use in Scrapy. It can serve as a starting point for your Python code, but will probably need to be edited. The line item["poem_text"] = response.xpath("//table/tbody/tr/td/table/tbody/tr/td/text()").getall() gets all the text elements in the table as the scraper automatically follows links and retrieves each poem.
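XPath expressions like the one above can be tried out before running the full spider. Scrapy uses its own selector library on real pages, but as a minimal, self-contained illustration of XPath-style addressing, Python's standard library can query a small HTML fragment; the fragment below is invented for the example:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a scraped poem page
html = ("<html><head><title>Poem Title</title></head>"
        "<body><table><tr><td>A narrow fellow in the grass</td></tr>"
        "</table></body></html>")

root = ET.fromstring(html)
# ElementTree supports a limited XPath subset via find()
title = root.find(".//title").text
first_line = root.find(".//table/tr/td").text
print(title)
print(first_line)
```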

Clean the data

Use the Find and Replace tool in Visual Studio Code to remove scraped artifacts such as stray "\n " sequences, the strings "Emily Dickinson\u00a0" and "\u00a0\u00a0Complete Poems.\u00a0\u00a0", and boilerplate such as "Part Two: Nature. Dickinson, Emily. 1924. Complete Poems".

Remove Unicode escape codes by converting them back to their human-readable counterparts. \u00a0 is a non-breaking space and can be removed entirely. Convert \u2014 to an em-dash. Convert \u2019 and \u2018 to single quotes, and \u201c and \u201d to double quotes. Then search the data for any remaining stray \u codes and convert them.

Remove the numbers in the titles using Regex, or Regular Expressions. In VS Code's Replace box, enable the Regex option (the .* button), search for [0-9], and replace with nothing. This will clear all numbers from the dataset. Then check that the JSON still has quotes and commas delimiting the beginning and end of each poem.
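The same cleanup can be scripted in Python rather than done interactively; a sketch that operates on the raw JSON text, where the \uXXXX escapes appear as literal character sequences (the sample line is invented):

```python
import re

# A stand-in line from the scraped JSON, with literal \uXXXX escapes
raw = r'"1. I tie my Hat\u2014I crease my Shawl\u2014 \u201cLife\u2019s little duties\u201d"'

# Map escape sequences back to readable characters, per the cleanup above
replacements = {
    r"\u2014": "\u2014",  # em-dash
    r"\u2019": "'",       # right single quote
    r"\u2018": "'",       # left single quote
    r"\u201c": '"',       # left double quote
    r"\u201d": '"',       # right double quote
    r"\u00a0": "",        # non-breaking space: drop entirely
}
for code, plain in replacements.items():
    raw = raw.replace(code, plain)

# Strip the stray digits the scraper picked up from page titles
cleaned = re.sub(r"[0-9]", "", raw)
print(cleaned)
```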

To finish cleaning the dataset, you can convert the .json file to .csv so you can open it in Excel for easier cleaning, for example by uploading the file to this useful tool: json-csv.com. If your data has been cleaned well up to now, you should have a file with two columns.
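If you prefer not to upload your data to a third-party site, pandas can do the same conversion locally; a sketch using stand-in records in place of the real emily.json:

```python
import io
import json
import pandas as pd

# Tiny stand-in for the scraper's emily.json output (hypothetical titles)
records = [
    {"poem_title": "1. Poem One", "poem_text": "line a line b"},
    {"poem_title": "2. Poem Two", "poem_text": "line c line d"},
]

# In practice: df = pd.read_json("emily.json")
df = pd.read_json(io.StringIO(json.dumps(records)))

# In practice: df.to_csv("emily.csv", index=False)
csv_text = df.to_csv(index=False)
print(csv_text)
```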

The scraper has taken the title of the page and appended it to the dataset. Dickinson did not give her poems titles, so you can use this line as the first line of the poem itself, since the scraper cannot parse the embedded HTML in the first line of the text. The next step in cleaning this dataset is to collapse the first column into the second.

To do this in Excel, click the column where you want the combined data to go, type =, click column 1, type &, click column 2, and press Enter; the combined data appears in a third column. You can then replace the first two columns by copying the concatenated column and choosing Paste > Values. You now have a spreadsheet of clean poems.

The last step in cleaning this dataset is rolling up all the poems, which are currently spread across multiple rows, into one row per poem. This is best done with a Visual Basic for Applications (VBA) macro in Excel. Following this tutorial, create a button in Excel to select each poem and roll it up into one row. By the end, you will have 593 rows in your spreadsheet, neatly rolled up.
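The roll-up can also be done in pandas instead of VBA; a sketch that assumes a blank cell marks each poem boundary, which you would adapt to however your own rows are delimited:

```python
import pandas as pd

# Lines of two hypothetical poems, separated by a blank row
rows = pd.Series(["First line", "Second line", "", "Another poem", "Its second line"])

poem_id = (rows == "").cumsum()   # running counter that bumps at each blank row
rolled = (
    rows[rows != ""]
    .groupby(poem_id)
    .agg(" ".join)                # join each poem's lines into a single string
    .reset_index(drop=True)
)
print(rolled.tolist())
```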

Add the seasons

Once you have your Excel file ready with one poem per row, add a column called id and number the poems. To do this quickly, add a few sequential numbers in the column at the top, starting with 1. Select the column's data and drag the data to the bottom of the dataset. Excel will add sequential numbers as you drag.
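In pandas, the equivalent of Excel's drag-to-fill is one line; a sketch with placeholder rows:

```python
import pandas as pd

df = pd.DataFrame({"poem": ["Poem one text", "Poem two text", "Poem three text"]})

# Insert a 1-based sequential id as the first column
df.insert(0, "id", range(1, len(df) + 1))
print(df["id"].tolist())  # [1, 2, 3]
```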

Next is a somewhat tedious and manual task: adding a column called 'seasons' and populating it with data from the Variorum edition. You will need to search for each poem in this edition as listed on the edickinson.org project. Once you find the poem, drill down into its metadata and add Franklin's best guess as to the season in which it was written. The terms he uses are:

  • Early in the year (we surmise Jan-March)
  • Second half of the year (we surmise Oct-Dec)
  • Late in the year (we surmise Nov-Dec)
  • Spring
  • Summer
  • Autumn
  • Winter

You can use the terms 'early/second/late/spring/summer/autumn/winter' for consistency, and 'none' where there is no guess.
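If you script this step, a small lookup table keeps the labels consistent; the mapping below reflects this chapter's short-label convention, not the edition's own wording:

```python
# Map the editor's season descriptions to the dataset's short labels
SEASON_LABELS = {
    "Early in the year": "early",
    "Second half of the year": "second",
    "Late in the year": "late",
    "Spring": "spring",
    "Summer": "summer",
    "Autumn": "autumn",
    "Winter": "winter",
}

def normalize_season(description):
    # Fall back to 'none' when the edition offers no guess
    return SEASON_LABELS.get(description, "none")

print(normalize_season("Late in the year"))  # late
print(normalize_season(""))                  # none
```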

A clean dataset is available for you at github.com/jlooper/humanists-guide/emily

Now that you have a neatly cleaned dataset comprising almost 600 Dickinson poems, each with a guess as to its season, we can start asking questions of it. Let's analyze the poet's language using some common data science tools for evaluating and quantifying linguistic patterns.

Working with the Data in a Notebook

Data scientists make use of "notebooks" to analyze data, rather than writing HTML, CSS, or JavaScript and building web apps as a web developer does. You can create and run notebooks in your local Python environment or in an online environment such as a Google Colab notebook. You can also host notebooks on Kaggle, an excellent repository for data science activities. Kaggle gives you access not only to community-generated code, but also to the datasets the community generates, which can be both fun and useful. Working in an online environment also gives you access to heftier compute power than is likely available to you locally; on Kaggle, for example, you can create a GPU-powered environment where needed to speed up training.

For now, and to get used to working in notebooks and the Python code that powers them, you can continue to work in Visual Studio Code. Make sure that you download the free Jupyter extension for VS Code by Microsoft. This extension makes it easy to run a notebook locally. 

On your local computer, perhaps in the same folder where you did the data scraping, create a file called emily.ipynb. Create a folder called input and add your Excel spreadsheet, saved as a .csv file. Your local Python environment should start, and the notebook will show a block with an arrow to run a chunk of code.

At the top of the notebook, you can import useful libraries for your work. You will use this notebook to analyze word frequency in a dataset, a numeric process, so you need the numpy library. You will also work with a .csv file, so import pandas which helps in data processing. Import your .csv file, and save the poems and seasons columns in separate variables.

import numpy as np
import pandas as pd

data = pd.read_csv('emily-flattened.csv')

# Keep only the necessary columns
poems = data['poem']
seasons = data['season']

Next, visualize the data to get an idea of its nature. Use MatPlotLib to show the most common words in your two columns in a bar chart:

from matplotlib import pyplot as plt
%matplotlib inline

def plotWordFrequency(words):
    # Count each unique word and keep the 40 most frequent
    data = sorted([(w, words.count(w)) for w in set(words)], key=lambda x: x[1], reverse=True)[:40]
    most_words = [x[0] for x in data]
    times_used = [int(x[1]) for x in data]
    plt.figure(figsize=(20, 10))
    plt.bar(x=most_words, height=times_used, color='pink', edgecolor='red', width=.5)
    plt.xticks(rotation=45, fontsize=18)
    plt.yticks(rotation=0, fontsize=18)
    plt.xlabel('Most Common Words:', fontsize=18)
    plt.ylabel('Number of Occurrences:', fontsize=18)
    plt.title('Most Commonly Used Words', fontsize=24)
    plt.show()

In this code, the data is sorted by how often words are found. Then, the graphs are drawn to screen with colors, labels and fonts specified. Experiment with changing fonts and colors to make the chart more readable.

To show the graph, you need to invoke the methods that you just set up. But if you do that, you will show words such as 'a', 'and', and 'the', as they will indeed be the most common. To avoid this, import one more package: nltk. This library is a great resource for natural language processing.

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))
wordsFiltered = []
seasonsFiltered = []

def createPoemString(data):
    words = ' '.join(data)
    # Tokenize on word characters only, dropping punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    cleaned_words = tokenizer.tokenize(words.lower())
    for word in cleaned_words:
        if word not in stopWords and word != '':
            wordsFiltered.append(word)

def createSeasonString(data):
    words = ' '.join(data)
    tokens = word_tokenize(words.lower())
    for word in tokens:
        seasonsFiltered.append(word)

createPoemString(poems)
createSeasonString(seasons)

plotWordFrequency(wordsFiltered)
plotWordFrequency(seasonsFiltered)

Here, you have imported libraries that help you filter out any 'stop words' as defined by these packages. Words are also lower-cased and the language is specified. The poetry is 'tokenized' into an array of individual words. Punctuation is filtered out. The filtered words are then fed to the plotWordFrequency method so the graph can be drawn.

(Chart: word frequency)

(Chart: season frequency)

This is a great exercise when trying to discover the 'flavor' or 'feel' of a corpus, as long as the language is not too archaic. Stopwords exist for many languages or you can create your own; try the technique on literature in other languages. 

The fascinating outcome of this exercise is two-fold. 

Keep in mind that this dataset is just a third of the entire corpus in size. 

First, you might notice which words are most common in the dataset. You also might note the 'flatness' of the graph, which results from the richness of Dickinson's vocabulary. A good comparison is an analysis of roughly 250 Beatles song lyrics, where the graph is quite steep: the Beatles tended to repeat words like 'love' and drew on a smaller vocabulary than Dickinson's rich, varied stock of words.

The other interesting outcome of this exercise is the actual words most common to Dickinson: day, sun, time, life, and heaven all precede night, death, god, and soul. In this corpus, life is invoked more often than death. The most common word here is 'like', probably because of Dickinson's extensive and rich use of simile and metaphor. Does this vocabulary equate to darkness and depression?

Another interesting chart is the seasons chart, which shows how much we do NOT know about the season in which each poem was written. The seasonality that McDermott notes in his article is visible here: as the year wears on, the number of poems diminishes. But does the nature of the poetry itself change? Determining this is our next task.
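Since the seasons column already lives in a dataframe, pandas can count the per-season totals directly, without tokenizing; the values below are illustrative, not the real counts:

```python
import pandas as pd

# Stand-in season labels; in the notebook this would be data['season']
seasons = pd.Series(["summer", "none", "spring", "summer", "none", "none"])

# value_counts tallies each label, sorted by frequency
counts = seasons.value_counts()
print(counts.to_dict())
```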

Using Python scripts to skim through literary datasets is a great way to introduce yourself to the vocabulary employed by an author. 

Sentiment Analysis using Cognitive Services

The next step as we analyze this poetry is to assign each poem a 'sentiment': an idea of the positive, negative, or neutral tone of the poem. This can be done by hand using natural language processing techniques or, more easily, with a Cognitive Service such as Microsoft's Text Analytics.
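To see what such a service does conceptually, here is a toy lexicon-based scorer. The word lists are invented for the example and are no substitute for a trained model, but they illustrate the idea of counting positive and negative vocabulary:

```python
# Hypothetical mini-lexicons for the sketch
POSITIVE = {"sun", "bloom", "heaven", "light", "sing"}
NEGATIVE = {"death", "grief", "dark", "tomb", "cold"}

def score_sentiment(text):
    words = text.lower().split()
    pos = sum(w.strip(",.;:-") in POSITIVE for w in words)
    neg = sum(w.strip(",.;:-") in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    # Equal counts: neutral if neither appeared, mixed otherwise
    return "neutral" if pos == 0 else "mixed"

print(score_sentiment("The sun will sing of heaven"))  # positive
print(score_sentiment("A dark tomb, a cold grief"))    # negative
```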

Set up an instance of a Text Analytics service by following this tutorial. Since your data is already in a spreadsheet, you can use Power Automate to skim through the poems in their rows and assign an integer to each poem based on an analysis of its sentiment. While this type of service is generally used to gather product feedback, it's interesting, and sometimes enlightening, to try it on other types of literature.

Power Automate is a low-code tool that allows you to set up 'flows' to perform automated tasks on data. To use it, you will need to add your spreadsheet to OneDrive, a cloud storage provider, so that Power Automate can find it. Convert your spreadsheet to a table in Excel by selecting the data and choosing Insert > Table. A table with column headers will be created.

Then, open Power Automate and create a 'flow' to append sentiment from Text Analytics. The flow will go through the spreadsheet, line by line, and assign a perceived sentiment (positive, negative, or neutral) for each poem. 

(Screenshot: the Power Automate flow builder)

Using the flow builder, create a three-part flow. First, use the 'Manually Trigger a Flow' block. Attach that to the 'List Rows Present in a Table' block. In this block, specify the location of your spreadsheet and the table to analyze in the spreadsheet.

Add one more block: 'Apply to each'. In this block, use an AI Builder block to add 'Analyze Positive or Negative Sentiment in a Text'. Specify the language as English and the text as 'poem', referring to the poem column header.

Attach an 'Update a row' block to that AI Builder block. Specify the key column as 'id' and the key value as the 'id' column from your spreadsheet. In the 'sentiment' area of this block, specify 'overall text sentiment' as the value you want the flow to append to your spreadsheet's sentiment column.

Save and run the flow using the Test panel. For this dataset, it will take several minutes for the flow to run. When it is complete, your 'sentiment' column should be populated with the words positive, negative, mixed, or neutral.

Tip: you may need to run this flow in batches to ensure that all data is processed.

With your updated spreadsheet, you can now do some more data mining in your notebook to determine if any patterns can be detected to correlate sentiment with seasonality.

Seasonality and sentiment patterns

Because the spreadsheet contains text rather than integers, plotting its data in a chart other than the bar charts you created earlier is not feasible without first counting values. To compare patterns, a line chart is preferable, since it can superimpose several groups of data. We want to see if we can find a pattern of seasonality in groups of poems batched by season, sorted by the sentiment determined by the Text Analytics cognitive service.

You can use the pandas package to create a dataframe of poems, grouped by season. In your notebook, sort the data by season and by sentiment:

# build a dataframe
import pandas as pd

data = pd.read_csv('../input/597-poems-by-emily-dickinson/emily-flattened.csv')

def count_poems(season, sentiment):
    # number of poems with the given season label and assigned sentiment
    return data[(data['season'] == season) & (data['sentiment'] == sentiment)].id.count()

# spring
dsnegative = count_poems('spring', 'negative')
dspositive = count_poems('spring', 'positive')
dsmixed = count_poems('spring', 'mixed')
dsneutral = count_poems('spring', 'neutral')

# summer
dsumnegative = count_poems('summer', 'negative')
dsumpositive = count_poems('summer', 'positive')
dsummixed = count_poems('summer', 'mixed')
dsumneutral = count_poems('summer', 'neutral')

# autumn
danegative = count_poems('autumn', 'negative')
dapositive = count_poems('autumn', 'positive')
damixed = count_poems('autumn', 'mixed')
daneutral = count_poems('autumn', 'neutral')

# winter
dwnegative = count_poems('winter', 'negative')
dwpositive = count_poems('winter', 'positive')
dwmixed = count_poems('winter', 'mixed')
dwneutral = count_poems('winter', 'neutral')

# early
denegative = count_poems('early', 'negative')
depositive = count_poems('early', 'positive')
demixed = count_poems('early', 'mixed')
deneutral = count_poems('early', 'neutral')

# second
dsecnegative = count_poems('second', 'negative')
dsecpositive = count_poems('second', 'positive')
dsecmixed = count_poems('second', 'mixed')
dsecneutral = count_poems('second', 'neutral')

# late
dlatenegative = count_poems('late', 'negative')
dlatepositive = count_poems('late', 'positive')
dlatemixed = count_poems('late', 'mixed')
dlateneutral = count_poems('late', 'neutral')

# none
dnnegative = count_poems('none', 'negative')
dnpositive = count_poems('none', 'positive')
dnmixed = count_poems('none', 'mixed')
dnneutral = count_poems('none', 'neutral')

This process gathers together all the poems for each season, where the season is known, and sorts them by their assigned sentiment. Next, create a dataframe that looks like a two-dimensional array:

dfinal = {'spring' : pd.Series([dsnegative,dspositive,dsmixed,dsneutral], index=['negative', 'positive', 'mixed', 'neutral']),
          'summer' : pd.Series([dsumnegative,dsumpositive,dsummixed,dsumneutral], index=['negative', 'positive', 'mixed', 'neutral']),
          'autumn' : pd.Series([danegative,dapositive,damixed,daneutral], index=['negative', 'positive', 'mixed', 'neutral']),
          'winter' : pd.Series([dwnegative,dwpositive,dwmixed,dwneutral], index=['negative', 'positive', 'mixed', 'neutral']),
          'early' : pd.Series([denegative,depositive,demixed,deneutral], index=['negative', 'positive', 'mixed', 'neutral']),
          'second' : pd.Series([dsecnegative,dsecpositive,dsecmixed,dsecneutral], index=['negative', 'positive', 'mixed', 'neutral']),
          'late' : pd.Series([dlatenegative,dlatepositive,dlatemixed,dlateneutral], index=['negative', 'positive', 'mixed', 'neutral']),
          'none' : pd.Series([dnnegative,dnpositive,dnmixed,dnneutral], index=['negative', 'positive', 'mixed', 'neutral'])
         }

df = pd.DataFrame(dfinal)
print(df)

It prints as a table:

spring  summer  autumn  winter  early  second  late  none  
negative      18      32      18       1     33      16    16   129  
positive       3       9       9       0     11       6     7    54  
mixed         20      29      12       0     19      27    10    55  
neutral       20       9       1       0      7       2     4    31
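The same sentiment-by-season table can be produced in one step with pandas' crosstab, instead of filtering season by season; shown here with a few illustrative rows in place of the full dataset:

```python
import pandas as pd

# A handful of illustrative rows standing in for the real dataset
data = pd.DataFrame({
    "season":    ["spring", "spring", "summer", "none", "none"],
    "sentiment": ["negative", "mixed", "negative", "positive", "negative"],
})

# Rows are sentiments, columns are seasons, cells are poem counts
table = pd.crosstab(data["sentiment"], data["season"])
print(table)
```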

Finally, you can create a line plot with one line per season, to show the ebb and flow of sentiment through a year:

```python
from matplotlib import pyplot

# x-axis values
numpoems = ['Negative', 'Positive', 'Mixed', 'Neutral']

# y-axis values
spring = [dsnegative,dspositive,dsmixed,dsneutral]
summer = [dsumnegative,dsumpositive,dsummixed,dsumneutral]
autumn = [danegative,dapositive,damixed,daneutral]
winter = [dwnegative,dwpositive,dwmixed,dwneutral]
early = [denegative,depositive,demixed,deneutral]
second = [dsecnegative,dsecpositive,dsecmixed,dsecneutral]
late = [dlatenegative,dlatepositive,dlatemixed,dlateneutral]
none = [dnnegative,dnpositive,dnmixed,dnneutral]

pyplot.plot(numpoems, spring, color = 'green', label = 'Spring')
pyplot.plot(numpoems, summer, color = 'blue', label = 'Summer')
pyplot.plot(numpoems, autumn, color = 'red', label = 'Autumn')
pyplot.plot(numpoems, winter, color = 'orange', label = 'Winter')
pyplot.plot(numpoems, early, color = 'yellow', label = 'Early')
pyplot.plot(numpoems, second, color = 'black', label = 'Second')
pyplot.plot(numpoems, late, color = 'purple', label = 'Late')
# pyplot.plot(numpoems, none, color = 'pink', label = 'None')

pyplot.legend(loc='upper left', frameon=True)
pyplot.show()
```

![Line chart showing counts of negative, positive, mixed, and neutral poems per season](emily-images/line-chart.png)

The result is consistent: no matter the season, the sentiment of the poems follows basically the same pattern. We find similar proportions of negative, positive, mixed, and neutral poems regardless of the time frame. The outlier is 'winter', for which there is not enough data: only one poem is assigned 'winter' by the editor. Many poems are not assigned a season at all (the 'none' dataset), but even this larger group follows the same general curve as the corpus's overall sentiment track.
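One way to make the "similar proportions" claim concrete is to normalize each season's counts before plotting, so that the much larger 'none' group becomes directly comparable to the smaller seasonal groups. The sketch below reuses two columns from the table above (the other seasons would be handled identically):

```python
import pandas as pd

# Two columns from the season-by-sentiment table built earlier
df = pd.DataFrame({
    'spring': [18, 3, 20, 20],
    'none':   [129, 54, 55, 31],
}, index=['negative', 'positive', 'mixed', 'neutral'])

# Divide each column by its total so each season sums to 1.0,
# letting us compare the shape of the curves rather than raw counts
proportions = df / df.sum()
print(proportions.round(2))
```

Plotting `proportions` instead of the raw counts would let the lines for sparsely populated seasons sit on the same scale as the rest.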

Artificial intelligence is not needed to see that, by and large, the predominant sentiment in Dickinson's corpus, no matter the season, was indeed negative, although in spring specifically she produced a broader mix of 'mixed' and 'neutral' poems. For a more exact, detailed analysis against which the dataset of poems without a season might be compared, we could turn again to the Sentiment Analysis cognitive service for a sentence-by-sentence reading of the poems, which would resolve some of the 'mixed' determinations.

There are many questions that this exercise leaves unanswered. Was Thompson influenced by the content of the poems when determining their seasons? Did Bianchi, in her edition, cherry-pick a meaningful subset of poetry based on a particular aesthetic? Is a Cognitive Service designed to analyze business-oriented text the proper tool for a poetic corpus? Does Dickinson's experimentation with sentence form influence or confuse the Cognitive Service? Would a more custom solution for stop words and sentiment analysis work better for this dataset?

When applying data science and machine learning techniques to a historical dataset, all these questions should be kept in mind. The curious humanist is well served by researching the nature of the dataset and accounting for its shape before trying these techniques. Dickinson famously asked whether her poetry was 'alive', seeking proof of her own existence through her pen:

I am alive—because
I do not own a House—
Entitled to myself—precise—
And fitting no one else—

If we can be excused for trying to fit the literary endeavors of this enigmatic and extraordinary woman into just such a house, we still must allow that certain patterns can be discerned. Fascinatingly, her vocabulary points to less dark imagery than one would expect, given the negative overtones. The patterns that emerge when charting sentiment by season show a remarkable consistency, at least as calculated by a pretrained text analytics tool.

(Partial) Bibliography

Hallen, Cynthia L. "At Home in Language: Emily Dickinson's Rhetorical Figures." Emily Dickinson at Home: Proceedings of the Third International Conference of the EDIS. Eds. Gudrun M. Grabher and Martina Antretter. Innsbruck, Austria: Wissenschaftlicher Verlag Trier, 2001, 201-222.

McDermott, James. "Emily Dickinson Revisited: A Study of Periodicity in Her Work." https://ajp.psychiatryonline.org/doi/full/10.1176/appi.ajp.158.5.686

Novy, Marianne. Women's Re-visions of Shakespeare: On the Responses of Dickinson, Woolf, Rich, H.D., George Eliot, and Others. University of Illinois Press, 1990, p. 117.

Rich, Adrienne. "Vesuvius at Home: The Power of Emily Dickinson." 1975.

Christianson, Lena. Editing Emily Dickinson.

Franklin, R. W. The Editing of Emily Dickinson: A Reconsideration.
