psychemedia/tw_esp.ipynb

## tw_esp.ipynb
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Twitter - Emergent Social Positioning\n",
      "\n",
      "A notebook for generate emergent social positioning networks for Twitter users, as described in [How to map your social network](http://www.bbc.co.uk/blogs/blogcollegeofjournalism/posts/How-to-map-your-social-network) and discussed in [Communities and Connections: Social Interest Mapping](http://www.open.edu/openlearn/science-maths-technology/engineering-and-technology/technology/communities-and-connections-social-interest-mapping). In brief - map out who is commonly followed by the followers of a particular Twitter user.\n",
      "\n",
      "\n",
      "*Significant portions of the code for this script are adapted from the iPython notebook files produced to support [Mining the Social Web, 2nd Edition](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition) by Matthew A. Russell. The original files can be found here: [Chapter 9: Twitter Cookbook](http://nbviewer.ipython.org/github/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/ipynb/Chapter%209%20-%20Twitter%20Cookbook.ipynb)*.\n",
      "\n",
      "**This notebook has been tested using the virtual machine defined to support the *Mining the Social Web, 2nd Edition]* book.**\n",
      "\n",
      "\n",
      "### Copyright and Licensing\n",
      "\n",
      "You are free to use or adapt this notebook for any purpose you'd like.\n",
      "\n",
      "Elements of this notebook are *Copyright (c) 2013, Matthew A. Russell*\n",
      "All rights reserved. License: [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt).\n",
      "\n",
      "Tony Hirst - Jan. 2014 *[ouseful.info](http://blog.ouseful.info)*"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Accessing Twitter's API for development purposes"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import twitter\n",
      "\n",
      "def oauth_login():\n",
      "    # XXX: Go to http://twitter.com/apps/new to create an app and get values\n",
      "    # for these credentials that you'll need to provide in place of these\n",
      "    # empty string values that are defined as placeholders.\n",
      "    # See https://dev.twitter.com/docs/auth/oauth for more information \n",
      "    # on Twitter's OAuth implementation.\n",
      "    \n",
      "    CONSUMER_KEY = ''\n",
      "    CONSUMER_SECRET = ''\n",
      "    OAUTH_TOKEN = ''\n",
      "    OAUTH_TOKEN_SECRET = ''\n",
      "    \n",
      "    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,\n",
      "                               CONSUMER_KEY, CONSUMER_SECRET)\n",
      "    \n",
      "    twitter_api = twitter.Twitter(auth=auth)\n",
      "    return twitter_api\n",
      "\n",
      "# Sample usage\n",
      "twitter_api = oauth_login()    \n",
      "\n",
      "# Nothing to see by displaying twitter_api except that it's now a\n",
      "# defined variable\n",
      "\n",
      "print twitter_api"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Saving and accessing JSON data with MongoDB"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "import pymongo # pip install pymongo\n",
      "\n",
      "def insert_into_mongo(data, mongo_db, mongo_db_coll, **mongo_conn_kw):\n",
      "    \n",
      "    # Connects to the MongoDB server running on \n",
      "    # localhost:27017 by default\n",
      "    \n",
      "    client = pymongo.MongoClient(**mongo_conn_kw)\n",
      "    \n",
      "    # Get a reference to a particular database\n",
      "    \n",
      "    db = client[mongo_db]\n",
      "    \n",
      "    # Reference a particular collection in the database\n",
      "    \n",
      "    coll = db[mongo_db_coll]\n",
      "    \n",
      "    # Perform a bulk insert and  return the IDs\n",
      "    \n",
      "    return coll.insert(data)\n",
      "\n",
      "#If we have an _id pre-exists, insert_into_mongo raises an error\n",
      "#save_to_mongo will create a new document if the _id does not exist, or replace the old doc with the new one if it does\n",
      "def save_to_mongo(data, mongo_db, mongo_db_coll, **mongo_conn_kw):\n",
      "    \n",
      "    # Connects to the MongoDB server running on \n",
      "    # localhost:27017 by default\n",
      "    \n",
      "    client = pymongo.MongoClient(**mongo_conn_kw)\n",
      "    \n",
      "    # Get a reference to a particular database\n",
      "    \n",
      "    db = client[mongo_db]\n",
      "    \n",
      "    # Reference a particular collection in the database\n",
      "    \n",
      "    coll = db[mongo_db_coll]\n",
      "    \n",
      "    return coll.save(data)\n",
      "    \n",
      "    \n",
      "def load_from_mongo(mongo_db, mongo_db_coll, return_cursor=False,\n",
      "                    criteria=None, projection=None, **mongo_conn_kw):\n",
      "    \n",
      "    # Optionally, use criteria and projection to limit the data that is \n",
      "    # returned as documented in \n",
      "    # http://docs.mongodb.org/manual/reference/method/db.collection.find/\n",
      "    \n",
      "    # Consider leveraging MongoDB's aggregations framework for more \n",
      "    # sophisticated queries.\n",
      "    \n",
      "    client = pymongo.MongoClient(**mongo_conn_kw)\n",
      "    db = client[mongo_db]\n",
      "    coll = db[mongo_db_coll]\n",
      "    \n",
      "    if criteria is None:\n",
      "        criteria = {}\n",
      "    \n",
      "    if projection is None:\n",
      "        cursor = coll.find(criteria)\n",
      "    else:\n",
      "        cursor = coll.find(criteria, projection)\n",
      "\n",
      "    # Returning a cursor is recommended for large amounts of data\n",
      "    \n",
      "    if return_cursor:\n",
      "        return cursor\n",
      "    else:\n",
      "        return [ item for item in cursor ]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Some mongo utility functions, useful howtos, etc.\n",
      "\n",
      "def mongo_dbs(**mongo_conn_kw):\n",
      "    mc= pymongo.MongoClient(**mongo_conn_kw)\n",
      "    #c = Connection()\n",
      "    print mc.database_names()\n",
      "#mongo_dbs()\n",
      "\n",
      "def getCollections_in_mongo(mongo_db, **mongo_conn_kw):\n",
      "    client = pymongo.MongoClient(**mongo_conn_kw)\n",
      "    db = client[mongo_db]\n",
      "    return db.collection_names()\n",
      "\n",
      "#getCollections_in_mongo('twitter')[:10]\n",
      "## Drop a database\n",
      "#from pymongo import Connection\n",
      "#c = Connection()\n",
      "#c.drop_database('twitter')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Sample usage:\n",
      "#getCollections_in_mongo('twitter')[:10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**NEED A CODE REVIEW HERE ON DOWN - MAKE SURE SCREEN_NAME SAVED TO DB IS CLEAN; _SCREEN_NAME FOR CLEAN**"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Making robust Twitter requests"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import sys\n",
      "import time\n",
      "from urllib2 import URLError\n",
      "from httplib import BadStatusLine\n",
      "import json\n",
      "import twitter\n",
      "\n",
      "def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw): \n",
      "    \n",
      "    # A nested helper function that handles common HTTPErrors. Return an updated\n",
      "    # value for wait_period if the problem is a 500 level error. Block until the\n",
      "    # rate limit is reset if it's a rate limiting issue (429 error). Returns None\n",
      "    # for 401 and 404 errors, which requires special handling by the caller.\n",
      "    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):\n",
      "    \n",
      "        if wait_period > 3600: # Seconds\n",
      "            print >> sys.stderr, 'Too many retries. Quitting.'\n",
      "            raise e\n",
      "    \n",
      "        # See https://dev.twitter.com/docs/error-codes-responses for common codes\n",
      "    \n",
      "        if e.e.code == 401:\n",
      "            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'\n",
      "            return None\n",
      "        elif e.e.code == 404:\n",
      "            print >> sys.stderr, 'Encountered 404 Error (Not Found)'\n",
      "            return None\n",
      "        elif e.e.code == 429: \n",
      "            print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'\n",
      "            if sleep_when_rate_limited:\n",
      "                print >> sys.stderr, \"Retrying in 15 minutes...ZzZ...\"\n",
      "                sys.stderr.flush()\n",
      "                time.sleep(60*15 + 5)\n",
      "                print >> sys.stderr, '...ZzZ...Awake now and trying again.'\n",
      "                return 2\n",
      "            else:\n",
      "                raise e # Caller must handle the rate limiting issue\n",
      "        elif e.e.code in (500, 502, 503, 504):\n",
      "            print >> sys.stderr, 'Encountered %i Error. Retrying in %i seconds' % \\\n",
      "                (e.e.code, wait_period)\n",
      "            time.sleep(wait_period)\n",
      "            wait_period *= 1.5\n",
      "            return wait_period\n",
      "        else:\n",
      "            raise e\n",
      "\n",
      "    # End of nested helper function\n",
      "    \n",
      "    wait_period = 2 \n",
      "    error_count = 0 \n",
      "\n",
      "    while True:\n",
      "        try:\n",
      "            return twitter_api_func(*args, **kw)\n",
      "        except twitter.api.TwitterHTTPError, e:\n",
      "            error_count = 0 \n",
      "            wait_period = handle_twitter_http_error(e, wait_period)\n",
      "            if wait_period is None:\n",
      "                return\n",
      "        except URLError, e:\n",
      "            error_count += 1\n",
      "            time.sleep(wait_period)\n",
      "            wait_period *= 1.5\n",
      "            print >> sys.stderr, \"URLError encountered. Continuing.\"\n",
      "            if error_count > max_errors:\n",
      "                print >> sys.stderr, \"Too many consecutive errors...bailing out.\"\n",
      "                raise\n",
      "        except BadStatusLine, e:\n",
      "            error_count += 1\n",
      "            time.sleep(wait_period)\n",
      "            wait_period *= 1.5\n",
      "            print >> sys.stderr, \"BadStatusLine encountered. Continuing.\"\n",
      "            if error_count > max_errors:\n",
      "                print >> sys.stderr, \"Too many consecutive errors...bailing out.\"\n",
      "                raise\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sample usage\n",
      "#twitter_api = oauth_login()\n",
      "\n",
      "# See https://dev.twitter.com/docs/api/1.1/get/users/lookup for \n",
      "# twitter_api.users.lookup\n",
      "\n",
      "#response = make_twitter_request(twitter_api.users.lookup, \n",
      "                                screen_name=\"SocialWebMining\")\n",
      "\n",
      "#print json.dumps(response, indent=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Resolving user profile information"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def get_user_profile(twitter_api, screen_names=None, user_ids=None):\n",
      "   \n",
      "    # Must have either screen_name or user_id (logical xor)\n",
      "    assert (screen_names != None) != (user_ids != None), \\\n",
      "    \"Must have screen_names or user_ids, but not both\"\n",
      "        \n",
      "    items_to_info = {}\n",
      "\n",
      "    items = screen_names or user_ids\n",
      "    print >> sys.stderr, 'Grabbing {0} user data records, up to 100 at a time...'.format(len(items))\n",
      "\n",
      "    while len(items) > 0:\n",
      "\n",
      "        # Process 100 items at a time per the API specifications for /users/lookup.\n",
      "        # See https://dev.twitter.com/docs/api/1.1/get/users/lookup for details.\n",
      "        \n",
      "        items_str = ','.join([str(item) for item in items[:100]])\n",
      "        items = items[100:]\n",
      "\n",
      "        if screen_names:\n",
      "            response = make_twitter_request(twitter_api.users.lookup, \n",
      "                                            screen_name=items_str)\n",
      "        else: # user_ids\n",
      "            response = make_twitter_request(twitter_api.users.lookup, \n",
      "                                            user_id=items_str)\n",
      "    \n",
      "        for user_info in response:\n",
      "            if screen_names:\n",
      "                items_to_info[user_info['screen_name']] = user_info\n",
      "            else: # user_ids\n",
      "                items_to_info[user_info['id']] = user_info\n",
      "\n",
      "    return items_to_info\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sample usage\n",
      "#twitter_api = oauth_login()\n",
      "\n",
      "#print get_user_profile(twitter_api, screen_names=[\"SocialWebMining\", \"ptwobrussell\"]) get_user_profile(twitter_api, user_ids=[132373965])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Get List Members"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def get_list_members(twitter_api, owner_screen_name=None, slug=None):\n",
      "    assert (owner_screen_name != None) & (slug != None), \\\n",
      "    \"Must have screen_names and list name\"\n",
      "    \n",
      "    print >> sys.stderr, 'Grabbing members of list {0}/{1}'.format(owner_screen_name,slug)\n",
      "    \n",
      "    items_to_info = {}\n",
      "    \n",
      "    response = make_twitter_request(twitter_api.lists.members, \n",
      "                                            owner_screen_name=owner_screen_name,slug=slug)\n",
      "    for user_info in response['users']:\n",
      "        items_to_info[user_info['screen_name']] = user_info\n",
      "\n",
      "    return items_to_info"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Sample usage\n",
      "#twitter_api = oauth_login()\n",
      "\n",
      "#print get_list_members(twitter_api, \"sidepodcast\", \"f1-drivers\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Getting all friends or followers for a user"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from functools import partial\n",
      "from sys import maxint\n",
      "\n",
      "def get_friends_followers_ids(twitter_api, screen_name=None, user_id=None,\n",
      "                              friends_limit=maxint, followers_limit=maxint):\n",
      "    \n",
      "    # Must have either screen_name or user_id (logical xor)\n",
      "    assert (screen_name != None) != (user_id != None), \\\n",
      "    \"Must have screen_name or user_id, but not both\"\n",
      "    \n",
      "    # See https://dev.twitter.com/docs/api/1.1/get/friends/ids and\n",
      "    # https://dev.twitter.com/docs/api/1.1/get/followers/ids for details\n",
      "    # on API parameters\n",
      "    \n",
      "    get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids, \n",
      "                              count=5000)\n",
      "    get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids, \n",
      "                                count=5000)\n",
      "\n",
      "    friends_ids, followers_ids = [], []\n",
      "    \n",
      "    for twitter_api_func, limit, ids, label in [\n",
      "                    [get_friends_ids, friends_limit, friends_ids, \"friends\"], \n",
      "                    [get_followers_ids, followers_limit, followers_ids, \"followers\"]\n",
      "                ]:\n",
      "        \n",
      "        if limit == 0: continue\n",
      "        \n",
      "        cursor = -1\n",
      "        while cursor != 0:\n",
      "        \n",
      "            # Use make_twitter_request via the partially bound callable...\n",
      "            if screen_name: \n",
      "                response = twitter_api_func(screen_name=screen_name, cursor=cursor)\n",
      "            else: # user_id\n",
      "                response = twitter_api_func(user_id=user_id, cursor=cursor)\n",
      "\n",
      "            if response is not None:\n",
      "                ids += response['ids']\n",
      "                cursor = response['next_cursor']\n",
      "        \n",
      "            print >> sys.stderr, 'Fetched {0} total {1} ids for {2}'.format(len(ids), \n",
      "                                                    label, (user_id or screen_name))\n",
      "        \n",
      "            # XXX: You may want to store data during each iteration to provide an \n",
      "            # an additional layer of protection from exceptional circumstances\n",
      "        \n",
      "            if len(ids) >= limit or response is None:\n",
      "                break\n",
      "\n",
      "    # Do something useful with the IDs, like store them to disk...\n",
      "    return friends_ids[:friends_limit], followers_ids[:followers_limit]\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sample usage\n",
      "#twitter_api = oauth_login()\n",
      "\n",
      "#friends_ids, followers_ids = get_friends_followers_ids(twitter_api, \n",
      "                                                       screen_name=\"SocialWebMining\", \n",
      "                                                       friends_limit=10, \n",
      "                                                       followers_limit=10)\n",
      "\n",
      "#print friends_ids\n",
      "#print followers_ids"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Crawling a friendship graph"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import random\n",
      "\n",
      "#Rather than crawl all followers, crawl a sample...\n",
      "def crawl_followers_sample(twitter_api, screen_name, limit=5000, depth=2, sample=50):\n",
      "    \n",
      "    print >> sys.stderr, 'Crawling depth {0} with sample size {1}'.format(depth, sample)\n",
      "    \n",
      "    # Resolve the ID for screen_name and start working with IDs for consistency \n",
      "    # in storage\n",
      "\n",
      "    ##THIS SECTION CAN BE REPLACED BY ESTABLISH USER?\n",
      "    #THOUGH TO PASS ON APPROPRIATE VALS TO next_queue\n",
      "    seed_id_ = twitter_api.users.show(screen_name=screen_name)\n",
      "    #tmp friend_ids next_queue follower_ids\n",
      "    tmp, next_queue = get_friends_followers_ids(twitter_api, user_id=seed_id_['id_str'], \n",
      "                                              friends_limit=limit, followers_limit=limit)\n",
      "    \n",
      "    # Store a seed_id => _follower_ids mapping in MongoDB\n",
      "    # Use a Twitter user id as the mongo document _id (native indexing, prevent multiple records for one individual)\n",
      "    \n",
      "    save_to_mongo({'_id': seed_id_['id'], 'screen_name':screen_name, 'id_str':seed_id_['id_str'],\n",
      "                   'follower_ids' : [ _id for _id in next_queue ]},\n",
      "                  'twitter', 'followers')\n",
      "    save_to_mongo({'_id': seed_id_['id'] , 'screen_name':screen_name, 'id_str':seed_id_['id_str'],\n",
      "                   'friend_ids' : [ _id for _id in tmp ]},\n",
      "                  'twitter', 'friends')\n",
      "    \n",
      "    udata=get_user_profile(twitter_api, user_ids=[ seed_id_['id_str'] ])\n",
      "    for u in udata:\n",
      "        save_to_mongo({'_id':udata[u]['id'],'screen_name':udata[u]['screen_name'],'id_str':udata[u]['id_str'],\n",
      "                       'name':udata[u]['name'],'description':udata[u]['description'],\n",
      "                       'location':udata[u]['location'],'followers_count':udata[u]['followers_count'],\n",
      "                       'followers_count':udata[u]['friends_count'],'created_at':udata[u]['created_at']},'twitter', 'userdata')\n",
      "\n",
      "    #We're going to try to mimimise the amount of calls we make to the Twitter API\n",
      "    #HEURISTIC: if we already have follower data for a user, don't get friend/follower data again\n",
      "    sspool=set()\n",
      "    mgd=load_from_mongo('twitter','userdata', projection={'_id':1})\n",
      "    namesdone=set([ i['_id'] for i in mgd ])\n",
      "    d = 1\n",
      "    while d < depth:\n",
      "        d += 1\n",
      "        (queue, next_queue) = (next_queue, [])\n",
      "        \n",
      "        #TH: only interested in grabbing data we haven't grabbed before\n",
      "        diff = set(queue) - set( [ i['_id'] for i in load_from_mongo('twitter','followers', projection={'_id':1})] )\n",
      "        \n",
      "        #TH: propagate the sampling measure\n",
      "        queue = random.sample(list(diff), sample) if len(diff) > sample else list(diff)\n",
      "        \n",
      "        for fid in queue:\n",
      "            \n",
      "            friend_ids, follower_ids = get_friends_followers_ids(twitter_api, user_id=fid, \n",
      "                                                     friends_limit=limit, \n",
      "                                                     followers_limit=limit)\n",
      " \n",
      "            #Get some user info while we're here...\n",
      "            sspoolt= set(follower_ids).union(set(friend_ids)) - namesdone\n",
      "            sspoolt = sspoolt.union(sspool) if len(sspoolt)<100 else sspoolt\n",
      "            ssize = 99 if len(sspoolt) > 99 else len(sspoolt)\n",
      "            uids=[fid]+random.sample(list(sspoolt), ssize)\n",
      "            namesdone=namesdone.union(set(uids))\n",
      "            sspool=sspoolt.union(sspool)-namesdone\n",
      "            \n",
      "            udata=get_user_profile(twitter_api, user_ids=uids)\n",
      "            for u in udata:\n",
      "                save_to_mongo( {'_id':udata[u]['id'],'screen_name':udata[u]['screen_name'], 'id_str':udata[u]['id_str'], \n",
      "                                'name':udata[u]['name'],'description':udata[u]['description'],\n",
      "                                'location':udata[u]['location'],'followers_count':udata[u]['followers_count'],\n",
      "                                'followers_count':udata[u]['friends_count'],'created_at':udata[u]['created_at']},\n",
      "                              'twitter', 'userdata')\n",
      "\n",
      "            \n",
      "            tmp=load_from_mongo('twitter','userdata',criteria={'_id':fid},projection={'screen_name':1,'_id':1})\n",
      "            s_name=tmp[0]['screen_name']\n",
      "            \n",
      "            # Store a fid => follower_ids mapping in MongoDB\n",
      "            save_to_mongo({'_id': fid, 'id_str': str(fid) , 'screen_name':s_name, 'follower_ids' : [ _id for _id in follower_ids ]},\n",
      "                  'twitter', 'followers')\n",
      "            save_to_mongo({'_id': fid, 'id_str': str(fid) , 'screen_name':s_name, 'friend_ids' : [ _id for _id in friend_ids ]},\n",
      "                  'twitter', 'friends')\n",
      "                   \n",
      "            next_queue += follower_ids\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sample usage\n",
      "#twitter_api = oauth_login()\n",
      "\n",
      "#screen_name = \"bbcinternetblog\"\n",
      "#crawl_followers_sample(twitter_api, screen_name, depth=2, limit=5000, sample=10)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Generating a Map of Common Friends of Followers of an Individual"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import networkx as nx\n",
      "\n",
      "def get_common_friends_of_followers_grapher(twitter_api, screen_name, foid_list, toget, minsupport=5):\n",
      "    #We're going to use networkx to construct the graph\n",
      "    DG=nx.DiGraph()\n",
      "    \n",
      "    print >> sys.stderr, 'Getting friends of followers of {0}'.format(screen_name)\n",
      "    \n",
      "    #The toget folk should already have friends/followers in the db    \n",
      "    for fo in toget:\n",
      "        tmp=load_from_mongo('twitter','friends',criteria={'_id':fo},\n",
      "                        projection={'screen_name':1, 'friend_ids':1,'_id':1})\n",
      "        members2=tmp[0]['friend_ids']\n",
      "        if len(members2)>0:\n",
      "            for foid in foid_list:\n",
      "                DG.add_edge(fo,foid)\n",
      "            fedges=[(fo,u) for u in members2]\n",
      "            DG.add_edges_from(fedges)\n",
      "\n",
      "    print >> sys.stderr, 'Filtering network...'\n",
      "    #Now we can filter the network\n",
      "    filterNodes=[]\n",
      "    for n in DG:\n",
      "        if DG.degree(n)>=minsupport:\n",
      "            filterNodes.append(n)\n",
      "    H=DG.subgraph(set(filterNodes))\n",
      "\n",
      "    #Label the filtered graph, getting in additional labels if we need them\n",
      "    mgd=load_from_mongo('twitter','userdata', projection={'_id':1})\n",
      "    got= [ i['_id'] for i in mgd ]\n",
      "    tofetch=[ _id for _id in H.nodes() if _id not in got]\n",
      "\n",
      "    for n in set(H.nodes()).intersection(got):\n",
      "        mgd=load_from_mongo('twitter','userdata', criteria={'_id':n}, projection={'screen_name':1,'id_str':1,'_id':1})\n",
      "        H.node[n]['label']=mgd[0]['screen_name']\n",
      "\n",
      "    udata=get_user_profile(twitter_api, user_ids=tofetch)\n",
      "    for u in udata:\n",
      "        save_to_mongo( {'_id':udata[u]['id'],'screen_name':udata[u]['screen_name'], 'id_str':udata[u]['id_str'], \n",
      "                                'name':udata[u]['name'],'description':udata[u]['description'],\n",
      "                                'location':udata[u]['location'],'followers_count':udata[u]['followers_count'],\n",
      "                                'followers_count':udata[u]['friends_count'],'created_at':udata[u]['created_at']},\n",
      "                              'twitter', 'userdata')\n",
      "        H.node[udata[u]['id']]['label']=udata[u]['screen_name']\n",
      "\n",
      "    print >> sys.stderr, 'Writing network to {0}_{1}.gexf'.format(screen_name,minsupport)\n",
      "    #Write the resulting network to a gexf file\n",
      "    nx.write_gexf(H, '{0}_{1}.gexf'.format(screen_name,minsupport) )\n",
      "    #print tofetch\n",
      "    print >> sys.stderr, 'Done...'\n",
      "    \n",
      "def get_common_friends_of_followers(twitter_api, screen_name, minsupport=5):\n",
      "    \n",
      "    print >> sys.stderr, 'Getting followers of {0}'.format(screen_name)\n",
      "    \n",
      "    ff=load_from_mongo('twitter','followers',criteria={'screen_name':screen_name},\n",
      "                   projection={'screen_name':1,'follower_ids':1,'_id':1})\n",
      "\n",
      "\n",
      "    #Get the follower ids of the target individual\n",
      "    members=ff[0]['follower_ids']\n",
      "    \n",
      "    #For now, find which followers we have friend data for and use that\n",
      "    tmp=load_from_mongo('twitter', 'friends', projection={'_id':1})\n",
      "    fr =[ i['_id'] for i in tmp ]\n",
      "    toget = [ i for i in members if i in fr ]\n",
      "    #What we really need to do is:\n",
      "    ## - set a sample size of followers\n",
      "    ## - get the set of ids we have friend data for and see if size of intersect with user's followers is greater than sample\n",
      "    ## - if it is, we can get the sample out of the database. If it isn't, we need to crawl some more.\n",
      "\n",
      "    get_common_friends_of_followers_grapher(twitter_api, screen_name, [ff[0]['_id']], toget, minsupport=5)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Sample usage:\n",
      "#screen_name = \"schoolofdata\"\n",
      "#get_common_friends_of_followers(twitter_api, screen_name)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def quickExpt(screen_name, sample=119, minsupport=5):\n",
      "    save_to_mongo( {'_id':screen_name, 'screen_name':screen_name},'twitter', 'quickexpt_source')\n",
      "    crawl_followers_sample(twitter_api, screen_name, depth=2, limit=5000, sample=sample)\n",
      "    get_common_friends_of_followers(twitter_api, screen_name, minsupport=minsupport)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example Reports"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can use the quickExpt() function to generate the a graph file that can be visualised using a network visualisation tool such as Gephi.\n",
      "\n",
      "Double click the following cell to edit it and ent the name of the Twitter user that you would like to obtain the ESP map data for, then run the cell.\n",
      "\n",
      "Note that with a sample size of 119, the data collection exercise will tak just over two hours."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "##THIS IS WHERE YOU NEED TO ADD THE USERNAME OF THE ACCOUNT YOU WANT TO GRAB THE DATA FOR\n",
      "twitter_username='schoolofdata'\n",
      "\n",
      "\n",
      "quickExpt(twitter_username,sample=119, minsupport=5)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}