rajat404/gist:34cdbd541c84781a9525

## gistfile1.txt
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<center><h1><u><b>Textual Analysis for Detection & Removal of Duplicates</b></u></h1></center>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "-"
      }
     },
     "source": [
      "<h2><center>About</center></h2>\n",
      "<li>The aim of this project is to find and remove duplicate or near-duplicate from text\n",
      "<li>Here we are taking the specific case of tweets (from Twitter)\n",
      "<li>This project is aimed to reduce the amount of redundant data we see across the internet, primarily to converse time\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Authentication</h2>\n",
      "We shall use the access token and API secrets in the file keys.txt"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "import twitter\n",
      "#import urllib2\n",
      "#import requests\n",
      "import itertools\n",
      "import re\n",
      "from time import time\n",
      "from datetime import datetime\n",
      "from pprint import pprint\n",
      "from hr import hr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 112
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pymongo\n",
      "from pymongo import MongoClient\n",
      "client = MongoClient()\n",
      "db = client.dedup\n",
      "collection = db.refined"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 113
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "authval = json.load(open(\"keys.txt\"))\n",
      "CONSUMER_KEY = authval['CONSUMER_KEY']\n",
      "CONSUMER_SECRET = authval['CONSUMER_SECRET']\n",
      "OAUTH_TOKEN = authval['OAUTH_TOKEN'] \n",
      "OAUTH_TOKEN_SECRET = authval['OAUTH_TOKEN_SECRET']\n",
      "auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,CONSUMER_KEY, CONSUMER_SECRET)\n",
      "t = twitter.Twitter(auth=auth)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 114
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Function to check the credentials of the User\n",
      "def verify():\n",
      "    verificationDetails = t.account.verify_credentials()\n",
      "    print \"Name: \", verificationDetails['name']\n",
      "    print \"Screen Name: \", verificationDetails['screen_name']\n",
      "    \n",
      "verify()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Name:  Rajat Goyal\n",
        "Screen Name:  rajat404\n"
       ]
      }
     ],
     "prompt_number": 115
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#testTweet = t.statuses.home_timeline()[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 116
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "getLast = db.last.find({}).sort([('_id', -1)]).limit(1)\n",
      "#sinceCounter = None\n",
      "for item in getLast:\n",
      "    sinceCounter = item['lastTweet']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 117
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print sinceCounter"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "572492397210411008\n"
       ]
      }
     ],
     "prompt_number": 118
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "t1 = time()\n",
      "completeTimeline = t.statuses.home_timeline(count=200, since_id=sinceCounter)\n",
      "t2 = time()\n",
      "print \"Time taken to load tweets: \", t2-t1\n",
      "print \"Number of tweets fetched: \", len(completeTimeline)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Time taken to load tweets:  6.10851597786\n",
        "Number of tweets fetched:  197\n"
       ]
      }
     ],
     "prompt_number": 119
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "lastTweet = completeTimeline[-1]['id']\n",
      "endTweet = {'lastTweet':lastTweet, 'created_on':datetime.now()}\n",
      "db.last.insert(endTweet)\n",
      "#db.last.ensure_index([(\"id\" , pymongo.ASCENDING), (\"unique\" , True), (\"dropDups\" , True)])\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 120,
       "text": [
        "ObjectId('54f5174b44356233cba2319e')"
       ]
      }
     ],
     "prompt_number": 120
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Sanitization</h2>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Tweets contains huge amount of metadeta. We need to extract the useful components"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from string import punctuation\n",
      "set_punct = set(punctuation)\n",
      "set_punct = set_punct - {\"@\"}\n",
      "#set_punct = set_punct - {\"_\", \"@\"}\n",
      "\n",
      "def sanitize(text, set_excludes):\n",
      "    \"\"\"\n",
      "    Return a `sanitized` version of the string `text`.\n",
      "    \"\"\"\n",
      "    text = text.lower()\n",
      "    text = \" \".join([ w for w in text.split() if not (\"http://\" in w) ])\n",
      "    letters_noPunct = [ (\" \" if c in set_excludes else c) for c in text ]\n",
      "    text = \"\".join(letters_noPunct)\n",
      "    words = text.split()\n",
      "    long_enuf_words = [w.strip() for w in words if len(w)>1]\n",
      "    return \" \".join(long_enuf_words)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 122
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"Characters that will be removed from the tweets:\\n\", set_punct "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Characters that will be removed from the tweets:\n",
        "set(['!', '#', '\"', '%', '$', \"'\", '&', ')', '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', '<', '?', '>', '[', ']', '\\\\', '_', '^', '`', '{', '}', '|', '~'])\n"
       ]
      }
     ],
     "prompt_number": 123
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#List of stop words\n",
      "stop = \"about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount and another any anyhow anyone anything anyway anywhere are around as at back became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but call can cannot cant computer con could couldnt cry describe detail do done down due during each eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fify fill find fire first five for former formerly forty found four from front full further get give had has hasnt have hence her here hereafter hereby herein hereupon hers him his how however hundred indeed interest into its keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my name namely neither never nevertheless next nine nobody none noone nor not nothing now nowhere off often once one only onto other others otherwise our ours ourselves out over own part per perhaps please put rather same see seem seemed seeming seems serious several she should show side since sincere six sixty some somehow someone something sometime sometimes somewhere still such system take ten than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin third this those though three through throughout thru thus together too top toward towards twelve twenty two under until upon very via was well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you your yours yourself yourselves\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 124
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def Refine(raw_tweet):\n",
      "    simple = {}\n",
      "    simple['text'] = raw_tweet['text']\n",
      "    simple['cleanText'] = sanitize(raw_tweet['text'], set_punct)\n",
      "    words = simple['cleanText']\n",
      "    set_words = set(words.split())\n",
      "    cleanWords = list(set_words)\n",
      "    for term in cleanWords:\n",
      "        if term in stop:\n",
      "            cleanWords.remove(term)\n",
      "    simple['cleanWords'] = set(cleanWords)\n",
      "    simple['id'] = raw_tweet['id']\n",
      "    simple['user_screen_name'] = raw_tweet['user']['screen_name']\n",
      "    #LATER\n",
      "    simple['created_at'] = raw_tweet['created_at']\n",
      "    simple['timestamp'] = datetime.now()\n",
      "    simple['is_active'] = True\n",
      "    simple['is_tested'] = False\n",
      "    # try:\n",
      "    #     temp = (requests.get(str(raw_tweet['entities']['urls'][0]['expanded_url'])).url)\n",
      "    #     simple['cleanUrl'] = temp\n",
      "    # except:\n",
      "    #     #print raw_tweet['entities']['urls']\n",
      "    #     simple['cleanUrl'] = None\n",
      "    try:\n",
      "        simple['urls'] = raw_tweet['entities']['urls']\n",
      "    \tsimple['cleanUrl'] = raw_tweet['entities']['urls'][0]['expanded_url']\n",
      "    except:\n",
      "    \tsimple['cleanUrl'] = None\n",
      "    return simple"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 125
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "refinedTweet = []\n",
      "t1 = time()\n",
      "for tweet in completeTimeline:\n",
      "    refinedTweet.append(Refine(tweet))\n",
      "t2 = time()\n",
      "# for tweet in completeTimeline:\n",
      "#     y = tweet['entities']['urls']\n",
      "#     if y != []:\n",
      "#         refinedTweet.append(Refine(tweet))\n",
      "\n",
      "#print json.dumps(refinedTweet, sort_keys=True, indent=2)\n",
      "#data = json.dumps(refinedTweet)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 126
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Caching of Tweets</h3>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "All the tweets after sanitization, are cached in MongoDB, so that they can be used at a later time"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#refining for inserting in MongoDB -- converting the set to list!\n",
      "import copy\n",
      "mongoRefined = copy.deepcopy(refinedTweet)\n",
      "for item in mongoRefined:\n",
      "    item['cleanWords'] = list(item['cleanWords'])\n",
      "\n",
      "#Refined Tweets are Cached in MongoDB\n",
      "for item in mongoRefined:\n",
      "    db.refined.insert(item)\n",
      "\n",
      "#In order to avoid duplicates\n",
      "db.refined.ensure_index([(\"id\" , pymongo.ASCENDING), (\"unique\" , True), (\"dropDups\" , True)])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 127,
       "text": [
        "u'id_1_unique_True_dropDups_True'"
       ]
      }
     ],
     "prompt_number": 127
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Now we shall fetch ALL the tweets gathered so far\n",
      "allTweets = db.refined.find()\n",
      "data = []\n",
      "for item in allTweets:\n",
      "    data.append(item)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 128
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(data)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 129,
       "text": [
        "197"
       ]
      }
     ],
     "prompt_number": 129
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 130,
       "text": [
        "{u'_id': ObjectId('54f5176844356233cba2322c'),\n",
        " u'cleanText': u'rt @jonoyeong @sarajchipps first experience with hardware is fantastic am god listening to the new codenewbie podcast',\n",
        " u'cleanUrl': u'http://marchisformakers.com/?utm_content=buffera6071&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer',\n",
        " u'cleanWords': [u'fantastic',\n",
        "  u'listening',\n",
        "  u'god',\n",
        "  u'am',\n",
        "  u'experience',\n",
        "  u'codenewbie',\n",
        "  u'hardware',\n",
        "  u'@sarajchipps',\n",
        "  u'@jonoyeong',\n",
        "  u'new',\n",
        "  u'podcast',\n",
        "  u'first'],\n",
        " u'created_at': u'Mon Mar 02 23:05:28 +0000 2015',\n",
        " u'id': 572533209621114880L,\n",
        " u'is_active': True,\n",
        " u'is_tested': False,\n",
        " u'text': u'RT @JonoYeong: @SaraJChipps first experience with hardware is fantastic \"I am a god\". Listening to the new #codenewbie podcast! http://t.co\\u2026',\n",
        " u'timestamp': datetime.datetime(2015, 3, 3, 7, 37, 30, 186000),\n",
        " u'urls': [{u'display_url': u'marchisformakers.com/?utm_content=b\\u2026',\n",
        "   u'expanded_url': u'http://marchisformakers.com/?utm_content=buffera6071&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer',\n",
        "   u'indices': [139, 140],\n",
        "   u'url': u'http://t.co/PjMVhln9ZJ'}],\n",
        " u'user_screen_name': u'shanselman'}"
       ]
      }
     ],
     "prompt_number": 130
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "refData = copy.deepcopy(data)\n",
      "for item in refData:\n",
      "    item['cleanWords'] = set(item['cleanWords'])\n",
      "\n",
      "print refData[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{u'cleanWords': set([u'fantastic', u'@jonoyeong', u'god', u'am', u'experience', u'codenewbie', u'hardware', u'@sarajchipps', u'listening', u'new', u'podcast', u'first']), u'user_screen_name': u'shanselman', u'cleanUrl': u'http://marchisformakers.com/?utm_content=buffera6071&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer', u'cleanText': u'rt @jonoyeong @sarajchipps first experience with hardware is fantastic am god listening to the new codenewbie podcast', u'text': u'RT @JonoYeong: @SaraJChipps first experience with hardware is fantastic \"I am a god\". Listening to the new #codenewbie podcast! http://t.co\\u2026', u'created_at': u'Mon Mar 02 23:05:28 +0000 2015', u'is_active': True, u'is_tested': False, u'urls': [{u'url': u'http://t.co/PjMVhln9ZJ', u'indices': [139, 140], u'expanded_url': u'http://marchisformakers.com/?utm_content=buffera6071&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer', u'display_url': u'marchisformakers.com/?utm_content=b\\u2026'}], u'timestamp': datetime.datetime(2015, 3, 3, 7, 37, 30, 186000), u'_id': ObjectId('54f5176844356233cba2322c'), u'id': 572533209621114880L}\n"
       ]
      }
     ],
     "prompt_number": 131
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#collections of only the 'clean' words of the tweets\n",
      "documents_straight = []\n",
      "for item in refData:\n",
      "    documents_straight.append(item['cleanWords'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 132
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Jaccard Similarity</h3>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To find the nearly duplicate tweets"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def jaccard_set(s1, s2):\n",
      "    u = s1.union(s2)\n",
      "    i = s1.intersection(s2)\n",
      "    if len(u) != 0:\n",
      "        return float(len(i))/float(len(u))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 135
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "combinations = list( itertools.combinations([x for x in range(len(documents_straight))], 2) )\n",
      "# print(\"combinations=%s\") %(combinations)\n",
      "# compare each pair in combinations tuple of the sets of their words\n",
      "t3 = time()\n",
      "dupList1 = []\n",
      "dupList2 = []\n",
      "#dupJson = []\n",
      "for c in combinations:\n",
      "    i1 = c[0]\n",
      "    i2 = c[1]\n",
      "    jac = jaccard_set(documents_straight[i1], documents_straight[i2])\n",
      "    if jac == 1:\n",
      "        #print(\"%s : %s,%s : jaccard=%s\") %(c, shingles[i1],shingles[i2],jac)\n",
      "        dupList2.append(c)\n",
      "        #later\n",
      "        #dupJson.append({'c':c,'jac':jac,\n",
      "        #print(\"%s : jaccard=%s\") %(c,jac)\n",
      "    elif jac < 1 and jac >= 0.5:\n",
      "        dupList1.append(c)\n",
      "t4 = time()\n",
      "print \"time taken:\", t4-t3\n",
      "print \"number of exact duplicate pairs:\", len(dupList2)\n",
      "print \"number of near duplicate pairs:\", len(dupList1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "time taken: 0.0809350013733\n",
        "number of exact duplicate pairs: 1\n",
        "number of near duplicate pairs: 3\n"
       ]
      }
     ],
     "prompt_number": 136
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dupList2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 142,
       "text": [
        "[(88, 123)]"
       ]
      }
     ],
     "prompt_number": 142
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import networkx as nx\n",
      "\n",
      "g1 = nx.Graph(dupList1)\n",
      "o1= nx.connected_components(g1)\n",
      "duplicates1 = []\n",
      "for item in o1:\n",
      "    duplicates1.append(item)\n",
      "\n",
      "g2 = nx.Graph(dupList2)\n",
      "o2= nx.connected_components(g2)\n",
      "duplicates2 = []\n",
      "for item in o2:\n",
      "    duplicates2.append(item)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 137
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(duplicates2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 138,
       "text": [
        "1"
       ]
      }
     ],
     "prompt_number": 138
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 139,
       "text": [
        "[[2, 7], [195, 155], [168, 190]]"
       ]
      }
     ],
     "prompt_number": 139
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#view all near-duplicates!\n",
      "testing = duplicates1[:2]\n",
      "for item in testing:\n",
      "    for i in range(len(item)):\n",
      "        print item[i], '\\n--------'\n",
      "        print \"ID: \", refData[item[i]]['id'], \"\\nOriginal Tweet: \", refData[item[i]]['text'] ,'\\n\\nURL:', refData[item[i]]['cleanUrl'] ,'\\n\\nPosted By:',\\\n",
      "        refData[item[i]]['user_screen_name'] ,'\\n', refData[item[i]]['cleanWords']\n",
      "        hr('-')\n",
      "    hr()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2 \n",
        "--------\n",
        "ID:  572578700962623488 \n",
        "Original Tweet:  RT @IndianGuru: Rocking #golang - Cross compilation just got a whole lot better in Go 1.5 http://t.co/bL2aGoVzpQ @davecheney \n",
        "\n",
        "URL: http://dave.cheney.net/2015/03/03/cross-compilation-just-got-a-whole-lot-better-in-go-1-5 \n",
        "\n",
        "Posted By: GopherConIndia \n",
        "set([u'golang', u'just', u'compilation', u'better', u'@indianguru', u'lot', u'go', u'got', u'rocking', u'@davecheney'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "7 \n",
        "--------\n",
        "ID:  572578011423379456 \n",
        "Original Tweet:  Cross compilation just got a whole lot better in Go 1.5 http://t.co/v8pRstw57R \n",
        "\n",
        "URL: http://dave.cheney.net/2015/03/03/cross-compilation-just-got-a-whole-lot-better-in-go-1-5 \n",
        "\n",
        "Posted By: newsycombinator \n",
        "set([u'just', u'compilation', u'better', u'lot', u'go', u'got'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "####################################################################################################################################################################\n",
        "195 \n",
        "--------\n",
        "ID:  572515270104899585 \n",
        "Original Tweet:  February in Africa: All the tech news you shouldn\u2019t miss from the past month http://t.co/rWh0HYtDws http://t.co/pF44KZfB9f \n",
        "\n",
        "URL: http://tnw.me/jjS5v3H \n",
        "\n",
        "Posted By: TheNextWeb \n",
        "set([u'february', u'africa', u'shouldn\\u2019t', u'past', u'tech', u'news', u'month', u'miss'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "155 \n",
        "--------\n",
        "ID:  572528661754195968 \n",
        "Original Tweet:  February in Latin America: All the tech news you shouldn\u2019t miss from the past month http://t.co/7X6sxQDLpN http://t.co/i5ixEAdbXe \n",
        "\n",
        "URL: http://tnw.me/ePFGryk \n",
        "\n",
        "Posted By: TheNextWeb \n",
        "set([u'february', u'latin', u'america', u'shouldn\\u2019t', u'past', u'tech', u'news', u'month', u'miss'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "####################################################################################################################################################################\n"
       ]
      }
     ],
     "prompt_number": 140
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 141,
       "text": [
        "[[88, 123]]"
       ]
      }
     ],
     "prompt_number": 141
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 149,
       "text": [
        "[[2, 7], [195, 155], [168, 190]]"
       ]
      }
     ],
     "prompt_number": 149
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#view all duplicates!\n",
      "testing = duplicates1\n",
      "for item in testing:\n",
      "    for i in range(len(item)):\n",
      "        print item[i], '\\n--------'\n",
      "        print \"Original Tweet: \", refData[item[i]]['text'] ,'\\n\\nURL:', refData[item[i]]['cleanUrl'] ,'\\n\\nPosted By:',\\\n",
      "        refData[item[i]]['user_screen_name'] ,'\\n', refData[item[i]]['cleanWords']\n",
      "        hr('-')\n",
      "    hr()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2 \n",
        "--------\n",
        "Original Tweet:  RT @IndianGuru: Rocking #golang - Cross compilation just got a whole lot better in Go 1.5 http://t.co/bL2aGoVzpQ @davecheney \n",
        "\n",
        "URL: http://dave.cheney.net/2015/03/03/cross-compilation-just-got-a-whole-lot-better-in-go-1-5 \n",
        "\n",
        "Posted By: GopherConIndia \n",
        "set([u'golang', u'just', u'compilation', u'better', u'@indianguru', u'lot', u'go', u'got', u'rocking', u'@davecheney'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "7 \n",
        "--------\n",
        "Original Tweet:  Cross compilation just got a whole lot better in Go 1.5 http://t.co/v8pRstw57R \n",
        "\n",
        "URL: http://dave.cheney.net/2015/03/03/cross-compilation-just-got-a-whole-lot-better-in-go-1-5 \n",
        "\n",
        "Posted By: newsycombinator \n",
        "set([u'just', u'compilation', u'better', u'lot', u'go', u'got'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "####################################################################################################################################################################\n",
        "195 \n",
        "--------\n",
        "Original Tweet:  February in Africa: All the tech news you shouldn\u2019t miss from the past month http://t.co/rWh0HYtDws http://t.co/pF44KZfB9f \n",
        "\n",
        "URL: http://tnw.me/jjS5v3H \n",
        "\n",
        "Posted By: TheNextWeb \n",
        "set([u'february', u'africa', u'shouldn\\u2019t', u'past', u'tech', u'news', u'month', u'miss'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "155 \n",
        "--------\n",
        "Original Tweet:  February in Latin America: All the tech news you shouldn\u2019t miss from the past month http://t.co/7X6sxQDLpN http://t.co/i5ixEAdbXe \n",
        "\n",
        "URL: http://tnw.me/ePFGryk \n",
        "\n",
        "Posted By: TheNextWeb \n",
        "set([u'february', u'latin', u'america', u'shouldn\\u2019t', u'past', u'tech', u'news', u'month', u'miss'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "####################################################################################################################################################################\n",
        "168 \n",
        "--------\n",
        "Original Tweet:  What it's like to need hardly any sleep (via @NYMag) http://t.co/KWirPlqium \n",
        "\n",
        "URL: http://f-st.co/VRwogs2 \n",
        "\n",
        "Posted By: FastCompany \n",
        "set([u'via', u'like', u'to', u'sleep', u'@nymag', u'need', u'hardly'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "190 \n",
        "--------\n",
        "Original Tweet:  What It\u2019s Like to Need Hardly Any Sleep http://t.co/clgBUaT5X5 \n",
        "\n",
        "URL: http://nymag.com/scienceofus/2015/02/what-its-like-to-need-hardly-any-sleep.html \n",
        "\n",
        "Posted By: newsycombinator \n",
        "set([u'need', u'hardly', u'it\\u2019s', u'sleep', u'like'])\n",
        "--------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "####################################################################################################################################################################\n"
       ]
      }
     ],
     "prompt_number": 150
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}