nealcaren/Times Gender

## Times Gender
{
 "metadata": {
  "name": "Word Counts"
 },
 "name": "Word Counts",
 "nbformat": 2,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "source": "Using Python to see how the *Times* writes about men and women\n----------\nNeal Caren - University of North Carolina, Chapel Hill\n[mail](mailto:neal.caren@unc.edu)\n[web](http://nealcaren.web.unc.edu)\n[twitter](http://twitter.com/HaphazardSoc)\n[scholar](http://scholar.google.com/citations?user=cy0u16kAAAAJ&hl=en)\n\nDo men and women come up in different contexts in the newspaper? One quick way to answer that \nquestion\nis to compare the words in sentences that discuss women with the words in sentences that\ndiscuss men. Here's an example of how to do this sort of analysis using Python. \n\nThe data comes from last week's (February 27, 2013-March 6, 2013) *New York Times*. I downloaded\nall the articles available through LexisNexis excluding only the corrections and paid\nobituaries. This totals 1,379 articles, or about 200 per day. Using a modified version of an old Python \n[script](http://nealcaren.web.unc.edu/cleaning-up-lexisnexis-files/), I removed all the\nmetadata. put the text\nof each article in its own file, and placed all of the text files in a folder called `articles`. \nIt is not the most efficient way to go about it, but sometimes the text data comes that way\nso I thought I would be useful to set it up that way for didactic purposes. "
    },
    {
     "cell_type": "markdown",
     "source": "We begin by loading a few modules. \nThe only modules that you might need to install is [`nltk`](http://nltk.org),\nwhich is a powerful suite for text processing and analysis. \nFor this analysis, I'm only using the `NLTK` function that splits text into sentences.\n`glob` is a useful module for \nretrieving the contents of a directory, and `string.punctuation` is just a string with all the \nASCII punctuation marks, that is `!\"#$%&'()*+,-/:;<=>?@[\\]^_`{|}~. "
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "from __future__ import division\n\nimport glob\nimport nltk\nfrom string import punctuation\n\ntokenizer = nltk.data.load('tokenizers/punkt/english.pickle')",
     "language": "python",
     "outputs": [],
     "prompt_number": 35
    },
    {
     "cell_type": "markdown",
     "source": "The heart of the analysis will be figuring out whether a sentence is talking about\na man, woman, both or neither. As a first pass, I'm going to assume that the sentence\nis  talking about man if it uses\nwords like, \"he\", \"dad\" or \"Mr.\", and is probably talking about a woman if it uses\nwords like, \"she\", \"mother\", or \"Ms.\". It isn't perfect, but depending on the text, it can be quite useful.\nRather than start from scratch, I build off of  [Danielle Sucher](http://twitter.com/DanielleSucher)'s list from her \n[Jailbreak the Patriarchy](https://github.com/DanielleSucher/Jailbreak-the-Patriarchy)\nbrowser plugin."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "#Two lists  of words that are used when a man or woman is present, based on Danielle Sucher's https://github.com/DanielleSucher/Jailbreak-the-Patriarchy\nmale_words=set(['guy','spokesman','chairman',\"men's\",'men','him',\"he's\",'his','boy','boyfriend','boyfriends','boys','brother','brothers','dad','dads','dude','father','fathers','fiance','gentleman','gentlemen','god','grandfather','grandpa','grandson','groom','he','himself','husband','husbands','king','male','man','mr','nephew','nephews','priest','prince','son','sons','uncle','uncles','waiter','widower','widowers'])\nfemale_words=set(['heroine','spokeswoman','chairwoman',\"women's\",'actress','women',\"she's\",'her','aunt','aunts','bride','daughter','daughters','female','fiancee','girl','girlfriend','girlfriends','girls','goddess','granddaughter','grandma','grandmother','herself','ladies','lady','lady','mom','moms','mother','mothers','mrs','ms','niece','nieces','priestess','princess','queens','she','sister','sisters','waitress','widow','widows','wife','wives','woman'])",
     "language": "python",
     "outputs": [],
     "prompt_number": 36
    },
    {
     "cell_type": "markdown",
     "source": "I'm storing them as sets rather than lists because later on I want to look at whether or not words in\na sentence overlap with these words, and Python will return the intersection of sets, but not lists.\n\nThe function below \ntakes a work list and returns the gender of the person\nbeing talked about, if any, based on the number of words a sentence has in common\nwith either the male or female word lists."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "def gender_the_sentence(sentence_words):\n    mw_length=len(male_words.intersection(sentence_words))\n    fw_length=len(female_words.intersection(sentence_words))\n\n    if mw_length>0 and fw_length==0:\n        gender='male'\n    elif mw_length==0 and fw_length>0: \n        gender='female'\n    elif mw_length>0 and fw_length>0: \n        gender='both'\n    else:\n        gender='none'\n    return gender",
     "language": "python",
     "outputs": [],
     "prompt_number": 37
    },
    {
     "cell_type": "markdown",
     "source": "I don't really care about proper nouns, especially people's names (e.g. it is boring that 'Boehner' is always male), so I need a way to identify them. To do that, \nI'm going to count how many times a word's first letter is capitalized and how many times\nit isn't. With a large enough text and if you ignore the first words of sentences, this is \npretty robust way to identify proper nouns."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "def is_it_proper(word):\n        if word[0]==word[0].upper():\n            case='upper'\n        else:\n            case='lower'\n        \n        word_lower=word.lower()\n        try:\n            proper_nouns[word_lower][case] = proper_nouns[word_lower].get(case,0)+1\n        except Exception,e:\n            #This is triggered when the word hasn't been seen yet\n            proper_nouns[word_lower]= {case:1}",
     "language": "python",
     "outputs": [],
     "prompt_number": 38
    },
    {
     "cell_type": "markdown",
     "source": "Note that here I'm using `.get()` to retrieve the values stored the proper noun dictionary.\nThis is one way to avoid error messages when the key isn't in the dictionary. Here, `proper_nouns[word_lower].get(case,0)`\n returns the value of `word_lower` if that combination of word and capitalization has been seen before and \n0 if has not. The `except` is only triggered when the word hasn't been seen yet.\n\nI'm going to keep track of each the words in each sentence with a couple of counters. This\nfunction \ndoesn't return anything but it does increment the `word_freq`, `word_counter`, \nand `sentence_counter` dictionaries."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "def increment_gender(sentence_words,gender):\n    sentence_counter[gender]+=1\n    word_counter[gender]+=len(sentence_words)\n    for word in sentence_words:\n        word_freq[gender][word]=word_freq[gender].get(word,0)+1",
     "language": "python",
     "outputs": [],
     "prompt_number": 39
    },
    {
     "cell_type": "markdown",
     "source": "And so we begin. I set up the counters to store the various quantities of interest. These \nare the ones that modified in the `increment_gender` function. Some\nof the values probably don't need to be entered now, particularly  for the word and sentence\ncounters, but starting with zeroes helps remind me what they are for. "
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "sexes=['male','female','none','both']\nsentence_counter={sex:0 for sex in sexes}\nword_counter={sex:0 for sex in sexes}\nword_freq={sex:{} for sex in sexes}\nproper_nouns={}",
     "language": "python",
     "outputs": [],
     "prompt_number": 40
    },
    {
     "cell_type": "markdown",
     "source": "I've stored all the files at text files in a directory called articles and I wanted to grab all their names."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "file_list=glob.glob('articles/*.txt')",
     "language": "python",
     "outputs": [],
     "prompt_number": 41
    },
    {
     "cell_type": "markdown",
     "source": "The basic idea is to read each file, split it into sentences, and then process each sentence.\nThe processing begins by splitting the sentence into words and removing punctuation. Then for each word\nthat doesn't begin the sentence, I figure out if it is capitalized or not as part of the \nhunt for proper nouns. Then, I estimate whether the sentence is likely talking about a man or a woman,\nbased on the occurrences of the various gender lists. Finally, I add word that is used to the \nappropriate gender word frequencies counter. So the sentence, \"She is lovely.\" would add 'she','is', and 'lovely'\nto our count of words used when talking about a female. It would also increment the lower case counters for\n'is' and 'lovely'."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "\nfor file_name in file_list:\n    #Open the file\n    text=open(file_name,'rb').read()\n    \n    #Split into sentences\n    sentences=tokenizer.tokenize(text)\n    \n    for sentence in sentences:\n        #word tokenize and strip punctuation\n            sentence_words=sentence.split()\n            sentence_words=[w.strip(punctuation) for w in sentence_words \n                            if len(w.strip(punctuation))>0]\n            \n            #figure out how often each word is capitalized\n            [is_it_proper(word) for word in sentence_words[1:]]\n\n            #lower case it\n            sentence_words=set([w.lower() for w in sentence_words])\n            \n            #Figure out if there are gendered words in the sentence by computing the length of the intersection of the sets\n            gender=gender_the_sentence(sentence_words)\n\n            #Increment some counters\n            increment_gender(sentence_words,gender)",
     "language": "python",
     "outputs": [],
     "prompt_number": 42
    },
    {
     "cell_type": "markdown",
     "source": "After all the articles are parsed, it is time to start analyzing the word frequencies.\n\nFirst, I create a set consisting of all words which were capitalized more often than not."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "proper_nouns=set([word for word in proper_nouns if  \n                  proper_nouns[word].get('upper',0) / \n                  (proper_nouns[word].get('upper',0) + \n                   proper_nouns[word].get('lower',0))>.50])",
     "language": "python",
     "outputs": [],
     "prompt_number": 43
    },
    {
     "cell_type": "markdown",
     "source": "I don't really care about rare words, so I select the top 1,000 words, \nbased on frequencies, from both the male and female word dictionaries. From that list, \nI subtract the words used to identify the sentence as either male or female along with the proper nouns."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "common_words=set([w for w in sorted (word_freq['female'],\n                                     key=word_freq['female'].get,reverse=True)[:1000]]+[w for w in sorted (word_freq['male'],key=word_freq['male'].get,reverse=True)[:1000]])\n\ncommon_words=list(common_words-male_words-female_words-proper_nouns)",
     "language": "python",
     "outputs": [],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "source": "I compute how likely the word appears in a male subject sentence versus a \nfemale subject sentence. (My first instinct was to create ratios, but they are undefined when\na word is not used to talk about the sex used in the denominator.) I also need to control for the fact\nthat there is likely an imbalance in how many words are written about men and women. If 'hair' is \nmentioned in 10 male-subjected sentences and 10 female-subject sentences, that could be taken as a\nsign of parity, but not if there a total of 20 female-subject (50%) sentences and 100 male-subject\nsentences (10%). I'll score 'hair' as a 16.6% male, which is (10%)/(50%+10%). Later on, if we want,\nwe can recover the ratios by computing `(100-16.6)/16.6`, which is 5x, the same as `50%/10%`."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "male_percent={word:(word_freq['male'].get(word,0) / word_counter['male']) \n              / (word_freq['female'].get(word,0) / word_counter['female']+word_freq['male'].get(word,0)/word_counter['male']) for word in common_words}",
     "language": "python",
     "outputs": [],
     "prompt_number": 45
    },
    {
     "cell_type": "markdown",
     "source": "We can print out some basic statistics based on our counters about overall rates of coverage."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "print '%.1f%% gendered' % (100*(sentence_counter['male']+sentence_counter['female'])/\n                           (sentence_counter['male']+sentence_counter['female']+sentence_counter['both']+sentence_counter['none']))\nprint '%s sentences about men.' % sentence_counter['male']\nprint '%s sentences about women.' % sentence_counter['female']\nprint '%.1f sentences about men for each sentence about women.' % (sentence_counter['male']/sentence_counter['female'])",
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "25.9% gendered\n19681 sentences about men.\n6242 sentences about women.\n3.2 sentences about men for each sentence about women."
      }
     ],
     "prompt_number": 46
    },
    {
     "cell_type": "markdown",
     "source": "Finally, I print out the words that are disproporately found in the male and female subject\nsentences. For the 50 distincitve female and male words, I print the ratio of gendered %s \nalong with the\ncount of the number of male-subject and female-subject sentences that had the word. This script isn't \nparticularly pretty, but it gets the job done."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "header ='Ratio\\tMale\\tFemale\\tWord'\nprint 'Male words'\nprint header\nfor word in sorted (male_percent,key=male_percent.get,reverse=True)[:50]:\n    try:\n        ratio=male_percent[word]/(1-male_percent[word])\n    except:\n        ratio=100\n    print '%.1f\\t%02d\\t%02d\\t%s' % (ratio,word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)\n\nprint '\\n'*2\nprint 'Female words'\nprint header\nfor word in sorted (male_percent,key=male_percent.get,reverse=False)[:50]:\n    try:\n        ratio=(1-male_percent[word])/male_percent[word]\n    except:\n        ratio=100\n    print '%.1f\\t%01d\\t%01d\\t%s' % (ratio,word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)",
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "Male words\nRatio\tMale\tFemale\tWord\n11.2\t72\t02\tprime\n10.8\t70\t02\tbaseball\n9.5\t92\t03\tofficial\n9.5\t61\t02\tcapital\n9.5\t61\t02\tgovernor\n5.8\t75\t04\tfans\n5.3\t120\t07\tminister\n5.3\t51\t03\tsequester\n5.2\t118\t07\tleague\n4.5\t58\t04\tfailed\n4.4\t57\t04\tcardinals\n4.2\t54\t04\tfinance\n4.0\t78\t06\treporters\n3.9\t50\t04\twinning\n3.8\t73\t06\tfinally\n3.6\t116\t10\tplayers\n3.5\t56\t05\tacknowledged\n3.5\t67\t06\taddress\n3.4\t66\t06\tattack\n3.3\t108\t10\topposition\n3.3\t54\t05\trest\n3.3\t53\t05\tcamp\n3.2\t52\t05\tcosts\n3.1\t91\t09\tgoal\n3.1\t50\t05\tcrowd\n3.0\t118\t12\tbank\n2.9\t57\t06\treferring\n2.9\t66\t07\tsports\n2.9\t56\t06\tsurgery\n2.9\t56\t06\tmissed\n2.8\t55\t06\tpressure\n2.8\t64\t07\tteammates\n2.8\t91\t10\teconomy\n2.8\t54\t06\trelease\n2.7\t123\t14\tpope\n2.7\t130\t15\tmeeting\n2.6\t84\t10\tvictory\n2.6\t58\t07\tveteran\n2.5\t226\t28\tpolitical\n2.5\t104\t13\tspending\n2.5\t64\t08\teffect\n2.5\t56\t07\tspend\n2.5\t72\t09\tcontinue\n2.5\t95\t12\tforeign\n2.4\t71\t09\tinjury\n2.4\t94\t12\telection\n2.4\t78\t10\trunning\n2.4\t116\t15\tmanager\n2.4\t54\t07\telected\n2.4\t99\t13\ttax\n\n\n\nFemale 
words\nRatio\tMale\tFemale\tWord\n100.0\t0\t29\tpregnant\n100.0\t0\t17\thusband's\n51.6\t1\t16\tsuffrage\n40.3\t2\t25\tbreast\n12.9\t4\t16\tgender\n11.8\t6\t22\tpregnancy\n6.8\t10\t21\tdresses\n5.7\t13\t23\tbirth\n5.5\t13\t22\tmemoir\n4.8\t25\t37\tbaby\n4.7\t17\t25\tdisease\n4.6\t14\t20\tinterviewed\n4.6\t12\t17\tabortion\n4.6\t24\t34\tdress\n4.5\t23\t32\tmarried\n4.3\t12\t16\tactivist\n4.3\t25\t33\tauthor\n4.1\t14\t18\tdrama\n3.9\t30\t36\thair\n3.8\t18\t21\trape\n3.6\t24\t27\tdog\n3.6\t19\t21\tnovel\n3.5\t99\t108\tchildren\n3.4\t16\t17\tstatue\n3.4\t17\t18\tvictim\n3.4\t51\t53\tcancer\n3.3\t41\t42\tviolence\n3.2\t32\t32\tyounger\n3.2\t20\t20\tfestival\n3.1\t34\t33\tstudy\n3.1\t30\t29\tteacher\n3.1\t27\t26\tsex\n3.1\t43\t41\tfashion\n3.1\t20\t19\topera\n3.0\t18\t17\tsinging\n3.0\t62\t57\tchild\n2.8\t23\t20\twear\n2.8\t30\t26\tnative\n2.6\t34\t27\tdance\n2.6\t29\t23\tgraduated\n2.5\t33\t26\twriter\n2.5\t23\t18\tfavor\n2.5\t41\t32\teyes\n2.5\t22\t17\tbecomes\n2.5\t47\t36\tkids\n2.5\t21\t16\teat\n2.4\t29\t22\tdomestic\n2.4\t29\t22\ttraditional\n2.4\t77\t58\tparents\n2.4\t32\t24\tdrug"
      }
     ],
     "prompt_number": 47
    },
    {
     "cell_type": "markdown",
     "source": "My quick interepretation: If your knowledge of men's and women's roles in society came just from\nreading last week's *New York Times*, you would think that men play sports and run the government. Women do\nfeminine and domestic things. To be honest, I was a little shocked at how stereotypical\nthe words used in the women subject sentences were. \n\nNow this is only data from one week, and certainly some of the findings\nare driven by that. Coverage of suffrage, for example, was presumably driven by the\n100th anniversary of the \n[Woman Suffrage Procession](http://en.wikipedia.org/wiki/Woman_Suffrage_Parade_of_1913). Similarly,\nthe male list is also tied to recent news events, as one one would expect from data from a newspaper. \nThese lists also just reported the extreme words, many of which were only used in a handful of articles. A \nmore rigorous analysis would probably look at the complete distribution of words.\n\nI should also add that after I ran this analysis for the first time, I noticed a \nfew words, like 'spokesman' and 'actress' that should have been included on the original lists. \n\nIf you wanted to output the full table, you could easily write it to a tab delimited file."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "outfile_name='gender.tsv'\ntsv_outfile=open(outfile_name,'wb')\nheader='percent_male\\tmale_count\\tfemalecount\\tword\\n'\ntsv_outfile.write(header)\nfor word in common_words:\n    row = '%.2f\\t%01d\\t%01d\\t%s\\n' % (100*male_percent[word],word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)\n    tsv_outfile.write(row)\ntsv_outfile.close()",
     "language": "python",
     "outputs": [],
     "prompt_number": 48
    },
    {
     "cell_type": "markdown",
     "source": "As an addendum, we can look at the most popular words. In this case, \nwe will look at the 100 most frequently used words, and then compare what proportion of male subject sentences had those words and what proportion of female subject sentences had those words."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "all_words=[w for w in word_freq['none']]+[w for w in word_freq['both']]+[w for w in word_freq['male']]+[w for w in word_freq['female']]\nall_words={w:(word_freq['male'].get(w,0)+word_freq['female'].get(w,0)+word_freq['both'].get(w,0)+word_freq['none'].get(w,0)) for w in set(all_words)}\n\nprint 'word\\tMale\\tFemale'\nfor word in sorted (all_words,key=all_words.get,reverse=True)[:100]:\n    print '%s\\t%.1f%%\\t%.1f%%' % (word,100*word_freq['male'].get(word,0)/sentence_counter['male'],100*word_freq['female'].get(word,0)/sentence_counter['female'])",
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "word\tMale\tFemale\nthe\t66.4%\t63.1%"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nand\t41.3%\t43.0%\nto\t42.7%\t40.3%\na\t45.2%\t44.8%\nof\t40.0%\t39.0%\nin\t38.7%\t37.7%\nthat\t23.8%\t21.7%\nfor\t18.6%\t19.4%\nis\t14.1%\t16.0%\non\t17.2%\t14.7%\nwith\t16.0%\t15.5%\nsaid\t24.6%\t20.6%\nwas\t18.9%\t16.4%\nat\t12.4%\t13.2%\nhe\t48.3%\t0.0%\nit\t10.3%\t10.5%\nas\t12.5%\t12.2%\nby\t8.8%\t9.4%\nbut\t10.7%\t9.3%\nfrom\t9.8%\t9.3%\nhis\t32.5%\t0.0%\nan\t9.2%\t9.7%\nbe\t7.5%\t7.5%\nhave\t6.4%\t6.9%\nare\t4.6%\t5.5%\nnot\t8.2%\t7.7%\nhas\t8.8%\t7.0%\nthis\t5.7%\t5.7%\nwho\t9.4%\t9.8%\ni\t6.6%\t7.6%\nthey\t3.9%\t4.5%\nmr\t21.9%\t0.0%\nor\t3.8%\t4.4%\nhad\t8.4%\t7.6%\nmore\t4.8%\t4.3%\nabout\t5.8%\t6.2%\none\t5.5%\t5.6%\nwill\t4.3%\t3.8%\ntheir\t3.1%\t4.6%\nwhich\t4.7%\t4.5%\nwould\t5.5%\t4.3%\nnew\t4.3%\t4.3%\nwere\t3.7%\t4.6%\nwhen\t6.2%\t5.8%\nwe\t3.6%\t3.4%\nits\t2.6%\t2.4%\nyou\t2.7%\t3.4%\nbeen\t4.6%\t4.2%\nshe\t0.0%\t41.6%\nthan\t3.1%\t3.0%\nif\t3.6%\t3.4%\nup\t3.7%\t3.6%\nafter\t5.3%\t3.8%\nout\t4.1%\t3.7%\nher\t0.0%\t33.9%\nall\t3.0%\t3.5%\nlike\t3.4%\t4.0%\nthere\t2.9%\t3.1%\nalso\t3.3%\t3.4%\nother\t2.8%\t2.8%\nwhat\t3.3%\t3.3%\ntwo\t3.4%\t3.2%\nno\t2.9%\t2.6%\nsome\t2.8%\t2.8%\nso\t3.0%\t3.3%\ncan\t2.1%\t2.4%\nlast\t3.6%\t2.4%\ninto\t3.3%\t3.1%\nfirst\t3.6%\t4.1%\nit's\t1.9%\t2.7%\ntime\t3.2%\t2.9%\nover\t2.9%\t2.2%\nyears\t3.2%\t3.1%\npeople\t2.5%\t2.5%\njust\t2.5%\t2.6%\nthrough\t2.0%\t1.9%\ncould\t2.8%\t2.6%\np.m\t0.5%\t0.7%\nyear\t2.3%\t2.0%\nthem\t2.1%\t2.3%\nmost\t2.2%\t1.9%\ndo\t1.9%\t1.9%\nnow\t2.5%\t2.2%\nbecause\t2.6%\t2.6%\neven\t2.2%\t1.9%\nmy\t2.2%\t3.7%\nmany\t1.9%\t2.0%\nonly\t2.2%\t2.1%\nhim\t8.1%\t0.0%\nhow\t1.9%\t2.0%\nwhere\t2.3%\t2.4%\nthose\t1.5%\t1.4%\nbefore\t2.6%\t2.0%\nget\t1.7%\t1.6%\npercent\t0.9%\t0.8%\nwork\t1.7%\t3.1%\nmake\t1.7%\t1.8%\nthen\t1.9%\t1.9%\nmade\t2.2%\t2.1%\nway\t1.7%\t1.8%"
      }
     ],
     "prompt_number": 49
    },
    {
     "cell_type": "markdown",
     "source": "While there's a couple of interesting findings here, for the most part, the basic building blocks\nof sentences are fairly similarly in the male and female subject sentences. Now, this is just based on\nword frequencies, and a more nuanced examination would probably discover additionally findings of interest.\nFor example, my guess is that that 'work', near the bottom of this list, is used not only more\nfrequently in the female subject sentences, but also in a different context and as a different\npart of speech (e.g. 'men work', 'women juggle home and work responsibilities.'). Comparing\nword frequencies only gets you so far, but it is pretty quick and easy way to \nconduct some preliminary data analysis."
    }
   ]
  }
 ]
}
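
The sentence-labelling and scoring steps walked through in the notebook above can be sketched on toy data as follows. This is a minimal Python 3 illustration, not the notebook's own code (which targets Python 2): the names `MALE_WORDS`, `FEMALE_WORDS`, `tokenize`, `sentence_gender`, and `male_share` are assumptions made up for this example, and the word sets are deliberately tiny stand-ins for the full lists.

```python
# Minimal Python 3 sketch of the labelling and scoring idea.
# All names and data here are illustrative assumptions, not the notebook's.
from string import punctuation

MALE_WORDS = {"he", "his", "him", "mr"}
FEMALE_WORDS = {"she", "her", "ms", "mrs"}


def tokenize(sentence):
    """Split on whitespace, strip ASCII punctuation, lowercase, deduplicate."""
    stripped = (w.strip(punctuation) for w in sentence.split())
    return {w.lower() for w in stripped if w}


def sentence_gender(words):
    """Label a set of lowercased words as 'male', 'female', 'both' or 'none'."""
    has_male = bool(MALE_WORDS & words)
    has_female = bool(FEMALE_WORDS & words)
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "none"


def male_share(male_rate, female_rate):
    """Share of the combined usage rate that comes from male-subject text."""
    return male_rate / (male_rate + female_rate)


sentences = [
    "She cut her hair.",
    "He runs the bank.",
    "Mr. Smith said she left.",
    "It rained.",
]
labels = [sentence_gender(tokenize(s)) for s in sentences]
print(labels)

# The 'hair' example from the text: 10 of 100 male-subject sentences (10%)
# versus 10 of 20 female-subject sentences (50%).
share = male_share(10 / 100, 10 / 20)
print(round(100 * share, 1))          # percent male
print(round((1 - share) / share, 1))  # female-to-male ratio recovered from the share
```

Working with a bounded share rather than a raw ratio avoids a division by zero when a word never appears on one side. The ratio can still be recovered afterwards whenever the share is strictly between 0 and 1.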