@rjweiss
Last active December 26, 2015 13:29
{
"metadata": {
"name": "Python_sentiment.ipynb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Sentence-level text analysis \n",
"\n",
"## Sentiment analysis in Python\n",
"\n",
"We're going to take advantage of a library that comes with a pretrained model, [SASA](https://code.google.com/p/sasa-tool/). This is produced by the University of Southern California. It's a perfectly suitable Naive Bayes classifier, but it is rather slow. I don't recommend it for large datasets (read: millions of sentences) unless you are prepared to do some distributed processing.\n",
"\n",
"Sentence-level annotations presumably translates well to other short forms of texts, such as microblogging (Twitter, Facebook statuses, etc).\n",
"\n",
"We're going to use a very small random sample of the very famous `sentiment polarity dataset v1.0` by Pang and Lee, available [here](http://www.cs.cornell.edu/people/pabo/movie-review-data/). This consists of thousands of processed sentences/snippets of movie reviews from the Rotten Tomatoes website."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os\n",
"import csv\n",
"#os.chdir('/path/to/wherever/you/downloaded/data/from/textcleaning')\n",
"os.chdir('/Users/rweiss/Dropbox/presentations/MozFest2013/data/')\n",
"\n",
"review_data = []\n",
"labels = set()\n",
"tsvfile = open('../sentiment_examples/reviews_sample.csv', 'r') # take a look at what this .csv looks like and check the file structure\n",
"csv_reader = csv.DictReader(tsvfile, delimiter='\\t')\n",
"for line in csv_reader:\n",
" temp = {line['label'] : line['content']}\n",
" review_data.append(temp)\n",
" labels.add(line['label'])\n",
"tsvfile.close()\n",
"\n",
"print 'There are ' + str(len(review_data)) + ' total reviews.'\n",
"\n",
"print 'The labels are '+ ', '.join(labels) + '.'"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"There are 100 total reviews.\n",
"The labels are positive, negative.\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from random import shuffle\n",
"x = [review_data[i] for i in range(len(review_data))]\n",
"shuffle(x)\n",
"\n",
"#shuffling just to show you how to\n",
"\n",
"review_data = [''.join(el.values()) for el in x] \n",
"target_labels = [''.join(el.keys()) for el in x]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!/usr/bin/python\n",
"# code adapted from the SASA script classifyFromCmdLine.py\n",
"import sys\n",
"\n",
"try:\n",
" from sasa.classifier import Classifier\n",
"except ImportError:\n",
" raise ImportError(\"Did you try to run '. setup.env'?\\n\" + \n",
" \"(or add the sasa-tool directory to PYTHONPATH, ie export PYTHONPATH=<this director>)?\")\n",
"\n",
"classifier = Classifier() \n",
"classified_reviews = []\n",
"\n",
"for line in review_data:\n",
" print \"classifying %s\" % line.strip()\n",
" sentiment, valence, posterior = classifier.classifyFromText(line)\n",
" if valence >= 0:\n",
" score = \"positive\"\n",
" elif valence <0:\n",
" score = \"negative\"\n",
" classified_reviews.append({score: line}) \n",
"\n",
"outfile = open('../sentiment_examples/reviews_sample_3.csv', 'w')\n",
"\n",
"for line in classified_reviews:\n",
" body = ''.join(line.values())\n",
" label = ''.join(line.keys())\n",
" body = body.decode('iso-8859-1') # this might not be the encoding of your data!\n",
" body = body.encode('utf-8') # utf-8 is usually the best encoding to use\n",
" label = label.encode('utf-8')\n",
" try:\n",
" outfile.write(label + '\\t' + body + '\\n')\n",
" except UnicodeDecodeError:\n",
" print \"Unicode error\" + line\n",
"outfile.close()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"classifying the verdict : two bodies and hardly a laugh between them .\n",
"classifying . . . a guiltless film for nice evening out .\n",
"classifying it's a beautifully accomplished lyrical meditation on a bunch of despondent and vulnerable characters living in the renown chelsea hotel . . .\n",
"classifying at just over an hour , home movie will leave you wanting more , not to mention leaving you with some laughs and a smile on your face .\n",
"classifying given the fact that virtually no one is bound to show up at theatres for it , the project should have been made for the tube .\n",
"classifying those outside show business will enjoy a close look at people they don't really want to know .\n",
"classifying parts seem like they were lifted from terry gilliam's subconscious , pressed through kafka's meat grinder and into bu\ufffduel's casings\n",
"classifying in the long , dishonorable history of quickie teen-pop exploitation , like mike stands out for its only partly synthetic decency .\n",
"classifying aspires for the piquant but only really achieves a sort of ridiculous sourness .\n",
"classifying crammed with incident , and bristles with passion and energy .\n",
"classifying the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .\n",
"classifying maid in manhattan proves that it's easier to change the sheets than to change hackneyed concepts when it comes to dreaming up romantic comedies .\n",
"classifying bouquet gives a performance that is masterly .\n",
"classifying everything -- even life on an aircraft carrier -- is sentimentalized .\n",
"classifying solondz is so intent on hammering home his message that he forgets to make it entertaining .\n",
"classifying despite all the closed-door hanky-panky , the film is essentially juiceless .\n",
"classifying an oddity , to be sure , but one that you might wind up remembering with a degree of affection rather than revulsion .\n",
"classifying enough may pander to our basest desires for payback , but unlike many revenge fantasies , it ultimately delivers .\n",
"classifying jir\ufffd hubac's script is a gem . his characters are engaging , intimate and the dialogue is realistic and greatly moving . the scope of the silberstein family is large and we grow attached to their lives , full of strength , warmth and vitality . .\n",
"classifying i loved the look of this film .\n",
"classifying about as cutting-edge as pet rock : the movie .\n",
"classifying strangely comes off as a kingdom more mild than wild .\n",
"classifying the film is old-fashioned , occasionally charming and as subtle as boldface .\n",
"classifying it tells more than it shows .\n",
"classifying to work , love stories require the full emotional involvement and support of a viewer . that is made almost impossible by events that set the plot in motion .\n",
"classifying \" the time machine \" is a movie that has no interest in itself . it doesn't believe in itself , it has no sense of humor\ufffdit's just plain bored .\n",
"classifying a pleasant enough movie , held together by skilled ensemble actors .\n",
"classifying it's a bizarre curiosity memorable mainly for the way it fritters away its potentially interesting subject matter via a banal script , unimpressive acting and indifferent direction .\n",
"classifying this may not have the dramatic gut-wrenching impact of other holocaust films , but it's a compelling story , mainly because of the way it's told by the people who were there .\n",
"classifying nakata's technique is to imply terror by suggestion , rather than the overuse of special effects .\n",
"classifying the pitch must have read like a discarded house beautiful spread .\n",
"classifying a film that presents an interesting , even sexy premise then ruins itself with too many contrivances and goofy situations .\n",
"classifying seeing seinfeld at home as he watches his own appearance on letterman with a clinical eye reminds you that the key to stand-up is to always make it look easy , even though the reality is anything but .\n",
"classifying as underwater ghost stories go , below casts its spooky net out into the atlantic ocean and spits it back , grizzled and charred , somewhere northwest of the bermuda triangle .\n",
"classifying happily , some things are immune to the folly of changing taste and attitude . for proof of that on the cinematic front , look no further than this 20th anniversary edition of the film that spielberg calls , retrospectively , his most personal work yet .\n",
"classifying cute , funny , heartwarming digitally animated feature film with plenty of slapstick humor for the kids , lots of in-jokes for the adults and heart enough for everyone .\n",
"classifying between the drama of cube ? s personal revelations regarding what the shop means in the big picture , iconic characters gambol fluidly through the story , with charming results .\n",
"classifying cineasts will revel in those visual in-jokes , as in the film's verbal pokes at everything from the likes of miramax chief harvey weinstein's bluff personal style to the stylistic rigors of denmark's dogma movement .\n",
"classifying feels like nothing quite so much as a middle-aged moviemaker's attempt to surround himself with beautiful , half-naked women .\n",
"classifying the latest adam sandler assault and possibly the worst film of the year .\n",
"classifying a didactic and dull documentary glorifying software anarchy .\n",
"classifying i was impressed by how many tit-for-tat retaliatory responses the filmmakers allow before pulling the plug on the conspirators and averting an american-russian armageddon .\n",
"classifying this follow-up seems so similar to the 1953 disney classic that it makes one long for a geriatric peter ."
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"classifying an intriguing look at the french film industry during the german occupation ; its most delightful moments come when various characters express their quirky inner selves .\n",
"classifying the cartoon that isn't really good enough to be on afternoon tv is now a movie that isn't really good enough to be in theaters .\n",
"classifying the comedy death to smoochy is a rancorous curiosity : a movie without an apparent audience .\n",
"classifying watching beanie and his gang put together his slasher video from spare parts and borrowed materials is as much fun as it must have been for them to make it .\n",
"classifying all the pieces fall together without much surprise , but little moments give it a boost .\n",
"classifying one of those films that seems tailor made to air on pay cable to offer some modest amusements when one has nothing else to watch .\n",
"classifying it doesn't do the original any particular dishonor , but neither does it exude any charm or personality .\n",
"classifying it may not be history \ufffd but then again , what if it is ? \ufffd but it makes for one of the most purely enjoyable and satisfying evenings at the movies i've had in a while .\n",
"classifying rifkin no doubt fancies himself something of a hubert selby jr . , but there isn't an ounce of honest poetry in his entire script ; it's simply crude and unrelentingly exploitative .\n",
"classifying the film proves unrelentingly grim -- and equally engrossing .\n",
"classifying \" auto focus \" works as an unusual biopic and document of male swingers in the playboy era\n",
"classifying this is a very fine movie -- go see it .\n",
"classifying a standard police-oriented drama that , were it not for de niro's participation , would have likely wound up a tnt original .\n",
"classifying a film that plays things so nice 'n safe as to often play like a milquetoast movie of the week blown up for the big screen .\n",
"classifying the salton sea has moments of inspired humour , though every scrap is of the darkest variety .\n",
"classifying instead of a hyperbolic beat-charged urban western , it's an unpretentious , sociologically pointed slice of life .\n",
"classifying diane lane's sophisticated performance can't rescue adrian lyne's unfaithful from its sleazy moralizing .\n",
"classifying it's both sitcomishly predictable and cloying in its attempts to be poignant .\n",
"classifying there are some wonderfully fresh moments that smooth the moral stiffness with human kindness and hopefulness .\n",
"classifying a spunky , original take on a theme that will resonate with singles of many ages .\n",
"classifying in the spirit of the season , i assign one bright shining star to roberto benigni's pinocchio -- but i guarantee that no wise men will be following after it .\n",
"classifying a masterful film from a master filmmaker , unique in its deceptive grimness , compelling in its fatalist worldview .\n",
"classifying an improvement on the feeble examples of big-screen poke-mania that have preceded it .\n",
"classifying anyone who's ever suffered under a martinet music instructor has no doubt fantasized about what an unhappy , repressed and twisted personal life their tormentor deserved . these people are really going to love the piano teacher .\n",
"classifying an utterly compelling 'who wrote it' in which the reputation of the most famous author who ever lived comes into question .\n",
"classifying while tattoo borrows heavily from both seven and the silence of the lambs , it manages to maintain both a level of sophisticated intrigue and human-scale characters that suck the audience in .\n",
"classifying the best that can be said about the work here of scottish director ritchie . . . is that he obviously doesn't have his heart in it .\n",
"classifying this is the sort of burly action flick where one coincidence pummels another , narrative necessity is a drunken roundhouse , and whatever passes for logic is a factor of the last plot device left standing .\n",
"classifying a perfect example of rancid , well-intentioned , but shamelessly manipulative movie making .\n",
"classifying a preposterously melodramatic paean to gang-member teens in brooklyn circa 1958 .\n",
"classifying a negligible british comedy .\n",
"classifying hugh grant , who has a good line in charm , has never been more charming than in about a boy .\n",
"classifying simone is not a bad film . it just doesn't have anything really interesting to say .\n",
"classifying this formulaic chiller will do little to boost stallone's career .\n",
"classifying rich in detail , gorgeously shot and beautifully acted , les destinees is , in its quiet , epic way , daring , inventive and refreshingly unusual .\n",
"classifying many of benjamins' elements feel like they've been patched in from an episode of miami vice .\n",
"classifying . . . creates a visceral sense of its characters' lives and conflicted emotions that carries it far above . . . what could have been a melodramatic , lifetime channel-style anthology .\n",
"classifying try as i may , i can't think of a single good reason to see this movie , even though everyone in my group extemporaneously shouted , 'thank you ! ' when leguizamo finally plugged an irritating character late in the movie .\n",
"classifying austin powers in goldmember is a cinematic car wreck , a catastrophic collision of tastelessness and gall that nevertheless will leave fans clamoring for another ride .\n",
"classifying as an actor's showcase , hart's war has much to recommend it , even if the top-billed willis is not the most impressive player . as a story of dramatic enlightenment , the screenplay by billy ray and terry george leaves something to be desired .\n",
"classifying for a film about action , ultimate x is the gabbiest giant-screen movie ever , bogging down in a barrage of hype .\n",
"classifying serving sara should be served an eviction notice at every theater stuck with it .\n",
"classifying imagine a really bad community theater production of west side story without the songs .\n",
"classifying [washington's] strong hand , keen eye , sweet spirit and good taste are reflected in almost every scene ."
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"classifying sex is one of those films that aims to confuse .\n",
"classifying suffers from a flat script and a low budget .\n",
"classifying schmaltzy and unfunny , adam sandler's cartoon about hanukkah is numbingly bad , little nicky bad , 10 worst list bad .\n",
"classifying as an entertainment , the movie keeps you diverted and best of all , it lightens your wallet without leaving a sting .\n",
"classifying one of those unassuming films that sneaks up on you and stays with you long after you have left the theatre .\n",
"classifying has far more energy , wit and warmth than should be expected from any movie with a \" 2 \" at the end of its title .\n",
"classifying . . . comes alive only when it switches gears to the sentimental .\n",
"classifying the film is really not so much bad as bland .\n",
"classifying you see robin williams and psycho killer , and you think , hmmmmm . you see the movie and you think , zzzzzzzzz .\n",
"classifying as a good old-fashioned adventure for kids , spirit : stallion of the cimarron is a winner .\n",
"classifying it's refreshing to see a girl-power movie that doesn't feel it has to prove anything .\n",
"classifying somehow ms . griffiths and mr . pryce bring off this wild welsh whimsy .\n",
"classifying i admire the closing scenes of the film , which seem to ask whether our civilization offers a cure for vincent's complaint .\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Big picture questions:\n",
"1. These models all have baseline accuracies measured against very famous, annotated datasets. Therefore we have an estimate of how \"accurate\" a model should be. \n",
"2. Go through the resulting predicted sentiment labels and examine whether you agree or disagree with them. Count the proportion of values you agree with and then compare your agreement ratio agains the measured baseline accuracy. How similar is it?\n",
"3. How appropriate is this model for this kind of data? \n",
"4. What was this model trained on? \n",
"5. How similar is the language of that training data against this movie review data?\n",
"6. How do these results compare against the other model's results?\n",
"7. We created a very simple bipolar classification. By default, SASA will do positive, negative, neutral, and unsure. Other models will do 5pt classification (very positive-very negative). Still others will do discrete, categorical sentiment (see Wiebe's subjectivity lexicon). There are many, many ways to label sentiment. Which do you prefer?"
]
}
],
"metadata": {}
}
]
}
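
Question 2 in the notebook asks for an agreement ratio between the predicted labels and a reference set of labels. A minimal sketch of that calculation follows. The function names `agreement_ratio` and `confusion_counts` are hypothetical helpers, not part of SASA, and the toy labels are made up for illustration; they are not results from the notebook. The sketch is written so that it also runs on Python 3, whereas the notebook itself uses Python 2.

```python
from collections import Counter


def agreement_ratio(true_labels, predicted_labels):
    """Return the fraction of positions where the two label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / float(len(true_labels))


def confusion_counts(true_labels, predicted_labels):
    """Count each (true, predicted) label pair, e.g. ('negative', 'positive')."""
    return Counter(zip(true_labels, predicted_labels))


# Toy labels for illustration only (assumption: NOT output from the notebook).
toy_true = ["positive", "negative", "positive", "negative"]
toy_pred = ["positive", "positive", "positive", "negative"]

ratio = agreement_ratio(toy_true, toy_pred)
counts = confusion_counts(toy_true, toy_pred)
print("agreement: %.2f" % ratio)
for (true_label, pred_label), n in sorted(counts.items()):
    print("true=%s predicted=%s count=%d" % (true_label, pred_label, n))
```

In the notebook, the reference labels would presumably be `target_labels` from the second cell, and the predicted labels could be read back out of `classified_reviews` with something like `[''.join(d.keys()) for d in classified_reviews]`, since both lists follow the same shuffled order. The pair counts show which direction the disagreements go, which a single ratio hides.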