cwarny/socialgraph.ipynb

## socialgraph.ipynb
{
 "metadata": {
  "name": "socialgraph"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Building social graph from list of names"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 1. Import necessary libraries"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import csv,json\n",
      "import facebook # Facebook SDK for Python\n",
      "import nltk\n",
      "import networkx as nx\n",
      "from networkx.readwrite import json_graph\n",
      "from itertools import combinations"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 2. Setup Facebook API credentials\n",
      "\n",
      "The access token is what enables us to communicate with the Facebook API. With it, we instantiate a GraphAPI object from the Facebook SDK for Python. This GraphAPI object will be our point of entry into the Facebook API."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ACCESS_TOKEN = \"<ENTER YOUR FACEBOOK ACCESS TOKEN HERE>\"\n",
      "graph = facebook.GraphAPI(ACCESS_TOKEN)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 3. Upload the dataset of names + cohort"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f = open('IAA_Alumni.csv', 'rb')\n",
      "reader = csv.reader(f)\n",
      "next(reader, None) # Skip CSV headers\n",
      "people = [{\"name\":' '.join(row[0:2]),\"class\":row[3]} for row in reader]\n",
      "f.close()\n",
      "\n",
      "print people[:5] # Show the five first people"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[{'name': 'Arundhati Balachandran', 'class': '2008'}, {'name': 'Angela Chen', 'class': '2008'}, {'name': 'Dragos Coles', 'class': '2008'}, {'name': 'Theresa Coles', 'class': '2008'}, {'name': 'David Dobson', 'class': '2008'}]\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 4. Retrieve Facebook user IDs\n",
      "\n",
      "For each person in the dataset, do a graph search based on the name. This will return all the Facebook users with matching names. Since that could represent many people, we limit our search to 1 page and 50 results per page (increasing that limit will give better results but will take more computational time). The offset variable is used to paginate through the results that Facebook sends back. This step is really the bottleneck of this approach, because for each name in the dataset, we have to make a Facebook search query to retrieve the associated id."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "limit = 50\n",
      "pages = 1\n",
      "response = []\n",
      "for person in people:\n",
      "    offset = 0\n",
      "    for i in range(pages):\n",
      "        response.append(graph.request(\"search\",{\"q\":person[\"name\"],\"type\":\"user\",\"limit\":limit,\"fields\":\"id,name\",\"offset\":offset})['data'])\n",
      "        offset += limit\n",
      "\n",
      "print response[:2][:5] # Show the five first results for the two first people in the dataset"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[[], [{u'id': u'100000159807010', u'name': u'Angela Chen'}, {u'id': u'100003664783528', u'name': u'Angela Chen'}, {u'id': u'100005885976000', u'name': u'\\u9673\\u96e8\\u5982'}, {u'id': u'100003497726720', u'name': u'Angela Chen'}, {u'id': u'636141225', u'name': u'Angela Chen'}, {u'id': u'100006133423679', u'name': u'Chen Angela'}, {u'id': u'100005922433214', u'name': u'Angela Chen'}, {u'id': u'100002747370693', u'name': u'Angela  Chen'}, {u'id': u'1340882650', u'name': u'Angela Chen'}, {u'id': u'1032492720', u'name': u'Angela Chen'}, {u'id': u'100000269410168', u'name': u'Angela Chen'}, {u'id': u'100001346035260', u'name': u'\\u9673\\u6021\\u4f36'}, {u'id': u'100003733032843', u'name': u'Angela  Chen'}, {u'id': u'100000843515068', u'name': u'Angela Chen'}, {u'id': u'100002408155546', u'name': u'Angela Chen'}, {u'id': u'100002183553430', u'name': u'Angela Chen'}, {u'id': u'590404993', u'name': u'Angela Chen'}, {u'id': u'100004086163212', u'name': u'Angela  Chen'}, {u'id': u'100003279771095', u'name': u'Angela Chen'}, {u'id': u'100002889247417', u'name': u'Angela Chen'}, {u'id': u'100003046017066', u'name': u'Angela  Chen'}, {u'id': u'100001322906050', u'name': u'Angela Chen'}, {u'id': u'100003359818544', u'name': u'Angela Chen'}, {u'id': u'504551082', u'name': u'Angela Chen'}, {u'id': u'100005689470110', u'name': u'Angela Chen'}, {u'id': u'100004365852390', u'name': u'Angela Chen'}, {u'id': u'1851195159', u'name': u'Angela Chen'}, {u'id': u'100004015998903', u'name': u'Angela Chen'}, {u'id': u'1441834003', u'name': u'Angela Chen'}, {u'id': u'1687073840', u'name': u'Angela Chen'}, {u'id': u'100001631220315', u'name': u'Angela Chen'}, {u'id': u'100005456223301', u'name': u'Angela  Chen'}, {u'id': u'100004561024994', u'name': u'Angela  Chen'}, {u'id': u'100003663761721', u'name': u'Angela Chen'}, {u'id': u'100003795566062', u'name': u'Angela Chen'}, {u'id': u'100004695331730', u'name': u'Angela Chen'}, {u'id': u'100005846561585', u'name': u'Angela Chen'}, {u'id': u'509407570', u'name': u'Angela Chen'}, {u'id': u'100004417451906', u'name': u'Angela  Chen'}, {u'id': u'100005039173908', u'name': u'Angela Chen'}, {u'id': u'730057332', u'name': u'Angela Chen'}, {u'id': u'100004323250554', u'name': u'Angela  Chen'}, {u'id': u'100005784097484', u'name': u'Angela Chen'}, {u'id': u'1504505739', u'name': u'Angela Chen'}, {u'id': u'100000901335208', u'name': u'Angela Chen'}]]\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 5. Filter results\n",
      "\n",
      "Filter out results returned by the Facebook API if the names are too dissimilar to the original query. We use the \"edit distance\" to determine how dissimilar the name of the user returned and the name of the original query are."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "users = []\n",
      "for person,results in zip(people,response):\n",
      "    for r in results:\n",
      "        if nltk.metrics.edit_distance(r[\"name\"].lower(),person[\"name\"].lower()) < 3:\n",
      "            r.update(person)\n",
      "            users.append(r)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 6. Test for connections\n",
      "\n",
      "The only way the Facebook API allows us to check who's connected to whom among a set of people that you don't necessarily know (i.e., you are not friends with them) is by carrying out *pairwise checks*. It is important to realize that this method allows one to check whether *any* two people on Facebook are connected. Using the Facebook Query Language (FQL), we can make a request to return all the connections between two sets of Facebook users. See [this API documentation](https://developers.facebook.com/docs/reference/fql/friend/) to learn more about the FQL friend table. \n",
      "\n",
      "However, limits on the length of a FQL query limits the size of the sets for which can check for connections. So a preliminary step is to chunk the set of all users into smaller sets. This then requires to check for the connections of all pairwise combinations of subsets of users, as well as the connections within each subset of users."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "def chunk(l, n):\n",
      "    return [l[i:i+n] for i in xrange(0, len(l), n)]\n",
      "\n",
      "def getFriendships(chunk1,chunk2):\n",
      "    uids1 = ','.join(map(lambda x : x['id'],chunk1))\n",
      "    uids2 = ','.join(map(lambda x : x['id'],chunk2))\n",
      "    return graph.fql(\"select uid1,uid2 from friend where uid1 in (%s) and uid2 in (%s)\" % (uids1,uids2))\n",
      "\n",
      "friendships = []\n",
      "chunks = chunk(users,2000) # Chunk the sets of users into chunks of 2000 users\n",
      "# Check connections for all pairwise combination of subset of users\n",
      "for chunk1,chunk2 in combinations(chunks,2):\n",
      "    friendships.extend(getFriendships(chunk1,chunk2))\n",
      "# Check connections within each subset of users\n",
      "for chunk in chunks:\n",
      "    friendships.extend(getFriendships(chunk,chunk))\n",
      "\n",
      "print friendships[:5] # Show first five friendships"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[{u'uid2': u'100000672315060', u'uid1': u'11834639'}, {u'uid2': u'712428853', u'uid1': u'11834639'}, {u'uid2': u'11809255', u'uid1': u'11834639'}, {u'uid2': u'100001780090515', u'uid1': u'100000676453917'}, {u'uid2': u'100000676453917', u'uid1': u'100001780090515'}]\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 7. Build graph using NetworkX library"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "d = {user.pop('id'):user for user in users} # Build id-user dict\n",
      "edges = map(lambda friendship: tuple(friendship.values()),filter(lambda friendship: friendship.values()[0] != friendship.values()[1], friendships)) # Don't link profiles with same name\n",
      "g = nx.Graph()\n",
      "g.add_edges_from(edges)\n",
      "for uid in g.nodes():\n",
      "    g.node[uid] = d[uid]\n",
      "    \n",
      "nld = json_graph.node_link_data(g)\n",
      "json.dump(nld, open('graph.json','w'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "See [here](http://bl.ocks.org/cwarny/6344975) the resulting graph visualized with d3.JS. The color of the nodes reflect the cohort. As expected, people from the same cohort are more likely to be connected."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}
	{
	"metadata": {
	"name": "socialgraph"
	},
	"nbformat": 3,
	"nbformat_minor": 0,
	"worksheets": [
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Building social graph from list of names"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 1. Import necessary libraries"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"import csv,json\n",
	"import facebook # Facebook SDK for Python\n",
	"import nltk\n",
	"import networkx as nx\n",
	"from networkx.readwrite import json_graph\n",
	"from itertools import combinations"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 1
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 2. Setup Facebook API credentials\n",
	"\n",
	"The access token is what enables us to communicate with the Facebook API. With it, we instantiate a GraphAPI object from the Facebook SDK for Python. This GraphAPI object will be our point of entry into the Facebook API."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"ACCESS_TOKEN = \"<ENTER YOUR FACEBOOK ACCESS TOKEN HERE>\"\n",
	"graph = facebook.GraphAPI(ACCESS_TOKEN)"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 2
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 3. Upload the dataset of names + cohort"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"f = open('IAA_Alumni.csv', 'rb')\n",
	"reader = csv.reader(f)\n",
	"next(reader, None) # Skip CSV headers\n",
	"people = [{\"name\":' '.join(row[0:2]),\"class\":row[3]} for row in reader]\n",
	"f.close()\n",
	"\n",
	"print people[:5] # Show the five first people"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"[{'name': 'Arundhati Balachandran', 'class': '2008'}, {'name': 'Angela Chen', 'class': '2008'}, {'name': 'Dragos Coles', 'class': '2008'}, {'name': 'Theresa Coles', 'class': '2008'}, {'name': 'David Dobson', 'class': '2008'}]\n"
	]
	}
	],
	"prompt_number": 3
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 4. Retrieve Facebook user IDs\n",
	"\n",
	"For each person in the dataset, do a graph search based on the name. This will return all the Facebook users with matching names. Since that could represent many people, we limit our search to 1 page and 50 results per page (increasing that limit will give better results but will take more computational time). The offset variable is used to paginate through the results that Facebook sends back. This step is really the bottleneck of this approach, because for each name in the dataset, we have to make a Facebook search query to retrieve the associated id."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"limit = 50\n",
	"pages = 1\n",
	"response = []\n",
	"for person in people:\n",
	" offset = 0\n",
	" for i in range(pages):\n",
	" response.append(graph.request(\"search\",{\"q\":person[\"name\"],\"type\":\"user\",\"limit\":limit,\"fields\":\"id,name\",\"offset\":offset})['data'])\n",
	" offset += limit\n",
	"\n",
	"print response[:2][:5] # Show the five first results for the two first people in the dataset"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"[[], [{u'id': u'100000159807010', u'name': u'Angela Chen'}, {u'id': u'100003664783528', u'name': u'Angela Chen'}, {u'id': u'100005885976000', u'name': u'\\u9673\\u96e8\\u5982'}, {u'id': u'100003497726720', u'name': u'Angela Chen'}, {u'id': u'636141225', u'name': u'Angela Chen'}, {u'id': u'100006133423679', u'name': u'Chen Angela'}, {u'id': u'100005922433214', u'name': u'Angela Chen'}, {u'id': u'100002747370693', u'name': u'Angela Chen'}, {u'id': u'1340882650', u'name': u'Angela Chen'}, {u'id': u'1032492720', u'name': u'Angela Chen'}, {u'id': u'100000269410168', u'name': u'Angela Chen'}, {u'id': u'100001346035260', u'name': u'\\u9673\\u6021\\u4f36'}, {u'id': u'100003733032843', u'name': u'Angela Chen'}, {u'id': u'100000843515068', u'name': u'Angela Chen'}, {u'id': u'100002408155546', u'name': u'Angela Chen'}, {u'id': u'100002183553430', u'name': u'Angela Chen'}, {u'id': u'590404993', u'name': u'Angela Chen'}, {u'id': u'100004086163212', u'name': u'Angela Chen'}, {u'id': u'100003279771095', u'name': u'Angela Chen'}, {u'id': u'100002889247417', u'name': u'Angela Chen'}, {u'id': u'100003046017066', u'name': u'Angela Chen'}, {u'id': u'100001322906050', u'name': u'Angela Chen'}, {u'id': u'100003359818544', u'name': u'Angela Chen'}, {u'id': u'504551082', u'name': u'Angela Chen'}, {u'id': u'100005689470110', u'name': u'Angela Chen'}, {u'id': u'100004365852390', u'name': u'Angela Chen'}, {u'id': u'1851195159', u'name': u'Angela Chen'}, {u'id': u'100004015998903', u'name': u'Angela Chen'}, {u'id': u'1441834003', u'name': u'Angela Chen'}, {u'id': u'1687073840', u'name': u'Angela Chen'}, {u'id': u'100001631220315', u'name': u'Angela Chen'}, {u'id': u'100005456223301', u'name': u'Angela Chen'}, {u'id': u'100004561024994', u'name': u'Angela Chen'}, {u'id': u'100003663761721', u'name': u'Angela Chen'}, {u'id': u'100003795566062', u'name': u'Angela Chen'}, {u'id': u'100004695331730', u'name': u'Angela Chen'}, {u'id': u'100005846561585', u'name': u'Angela Chen'}, {u'id': u'509407570', u'name': u'Angela Chen'}, {u'id': u'100004417451906', u'name': u'Angela Chen'}, {u'id': u'100005039173908', u'name': u'Angela Chen'}, {u'id': u'730057332', u'name': u'Angela Chen'}, {u'id': u'100004323250554', u'name': u'Angela Chen'}, {u'id': u'100005784097484', u'name': u'Angela Chen'}, {u'id': u'1504505739', u'name': u'Angela Chen'}, {u'id': u'100000901335208', u'name': u'Angela Chen'}]]\n"
	]
	}
	],
	"prompt_number": 4
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 5. Filter results\n",
	"\n",
	"Filter out results returned by the Facebook API if the names are too dissimilar to the original query. We use the \"edit distance\" to determine how dissimilar the name of the user returned and the name of the original query are."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"users = []\n",
	"for person,results in zip(people,response):\n",
	" for r in results:\n",
	" if nltk.metrics.edit_distance(r[\"name\"].lower(),person[\"name\"].lower()) < 3:\n",
	" r.update(person)\n",
	" users.append(r)"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 5
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 6. Test for connections\n",
	"\n",
	"The only way the Facebook API allows us to check who's connected to whom among a set of people that you don't necessarily know (i.e., you are not friends with them) is by carrying out pairwise checks. It is important to realize that this method allows one to check whether any two people on Facebook are connected. Using the Facebook Query Language (FQL), we can make a request to return all the connections between two sets of Facebook users. See [this API documentation](https://developers.facebook.com/docs/reference/fql/friend/) to learn more about the FQL friend table. \n",
	"\n",
	"However, limits on the length of a FQL query limits the size of the sets for which can check for connections. So a preliminary step is to chunk the set of all users into smaller sets. This then requires to check for the connections of all pairwise combinations of subsets of users, as well as the connections within each subset of users."
	]
	},
	{
	"cell_type": "code",
	"collapsed": true,
	"input": [
	"def chunk(l, n):\n",
	" return [l[i:i+n] for i in xrange(0, len(l), n)]\n",
	"\n",
	"def getFriendships(chunk1,chunk2):\n",
	" uids1 = ','.join(map(lambda x : x['id'],chunk1))\n",
	" uids2 = ','.join(map(lambda x : x['id'],chunk2))\n",
	" return graph.fql(\"select uid1,uid2 from friend where uid1 in (%s) and uid2 in (%s)\" % (uids1,uids2))\n",
	"\n",
	"friendships = []\n",
	"chunks = chunk(users,2000) # Chunk the sets of users into chunks of 2000 users\n",
	"# Check connections for all pairwise combination of subset of users\n",
	"for chunk1,chunk2 in combinations(chunks,2):\n",
	" friendships.extend(getFriendships(chunk1,chunk2))\n",
	"# Check connections within each subset of users\n",
	"for chunk in chunks:\n",
	" friendships.extend(getFriendships(chunk,chunk))\n",
	"\n",
	"print friendships[:5] # Show first five friendships"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"[{u'uid2': u'100000672315060', u'uid1': u'11834639'}, {u'uid2': u'712428853', u'uid1': u'11834639'}, {u'uid2': u'11809255', u'uid1': u'11834639'}, {u'uid2': u'100001780090515', u'uid1': u'100000676453917'}, {u'uid2': u'100000676453917', u'uid1': u'100001780090515'}]\n"
	]
	}
	],
	"prompt_number": 6
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 7. Build graph using NetworkX library"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"d = {user.pop('id'):user for user in users} # Build id-user dict\n",
	"edges = map(lambda friendship: tuple(friendship.values()),filter(lambda friendship: friendship.values()[0] != friendship.values()[1], friendships)) # Don't link profiles with same name\n",
	"g = nx.Graph()\n",
	"g.add_edges_from(edges)\n",
	"for uid in g.nodes():\n",
	" g.node[uid] = d[uid]\n",
	" \n",
	"nld = json_graph.node_link_data(g)\n",
	"json.dump(nld, open('graph.json','w'))"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 7
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"See [here](http://bl.ocks.org/cwarny/6344975) the resulting graph visualized with d3.JS. The color of the nodes reflect the cohort. As expected, people from the same cohort are more likely to be connected."
	]
	}
	],
	"metadata": {}
	}
	]
	}