{
"metadata": {
"name": "SharingPatterns"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sharing patterns - Analyzing virality on Twitter\n",
"\n",
"by C\u00e9dric Warny\n",
"\n",
"## URL shortening\n",
"\n",
"<p>When people share a URL on Twitter, it is automatically shortened to a 20-character URL (this process is known as \"encoding\") with the domain name space \"t.co\" in order to leave enough space for the text of the tweet. Twitter launched this URL-shortening service around 2011. Before that, Twitter was relying on third-party URL shorteners, bit.ly being among the most popular.</p>\n",
"\n",
"<p>There's more to URL shorteners than meets the eye. Whenever someone encodes a shortened URL, the resulting URL is <em>unique</em>. This has the all-important consequence that, from a shortened URL, we can find the exact tweet in which that URL was shared. Moreover, providers of URL shortening services also provide analytics about the <em>clicks</em> that each shortened URL receives.</p>\n",
"\n",
"<p>Some URL shortening services such as bit.ly make this precious data publicly available through their <a href=http://dev.bitly.com/>online API</a>; others, like the in-house Twitter URL shortening service (\"t.co\"), do not make this data publicly available yet.</p>\n",
" \n",
"## Using bit.ly\n",
"\n",
"<p>In bitly, a long URL (e.g., <strong>http://www.nytimes.com/2013/09/04/us/politics/obama-administration-presses-case-on-syria.html</strong>) can have multiple encoders, each with a unique URL and associated metrics. But all these unique URLs that are related to one specific long URL are also all wrapped up into an <em>aggregate</em> bitly link for the aggregate metrics of that long URL.</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import bitly_api\n",
"\n",
"access_token = \"<INSERT ACCESS TOKEN HERE>\"\n",
"\n",
"bitly = bitly_api.Connection(access_token=access_token)\n",
"\n",
"# Lookup the aggregate bitly link for a specific long URL\n",
"aggregate_link = bitly.link_lookup(\"http://www.nytimes.com/2013/09/04/us/politics/obama-administration-presses-case-on-syria.html\")[0][\"aggregate_link\"]\n",
"\n",
"# Find all the bitly encoders of that aggregate bitly link\n",
"encoders = bitly.link_encoders(aggregate_link)[\"entries\"]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Once we have the unique bitly links of each encoder, the click metrics are just one API call away. The bitly API can return the data in different time units (clicks per minute, hour, day, month or year). We can also specify how many units of time before a certain timestamp we want to look in the past. By default, the reference timestamp is now. However, bitly does not store per-minute click data beyond an hour. After that, it is stored per hour.</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import time\n",
"\n",
"for encoder in encoders:\n",
" encoder[\"clicks_generated\"] = []\n",
"while len(encoders[0][\"clicks_generated\"]) < 60 * 5: # Gather per-minute data for five hours\n",
" for encoder in encoders:\n",
" encoder[\"clicks_generated\"].extend(bitly.link_clicks(link=encoder[\"link\"],unit=\"minute\",units=60,rollup=False)) # We request clicks per minute for the last 60 minutes\n",
" time.sleep(60*60) # Sleep for one hour"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example output is as follows:\n",
"<pre><code>\n",
"[{'clicks_generated': [{u'clicks': 0, u'dt': 1378234560},\n",
" {u'clicks': 3, u'dt': 1378234500},\n",
" {u'clicks': 2, u'dt': 1378234440},\n",
" {u'clicks': 4, u'dt': 1378234380},\n",
" {u'clicks': 1, u'dt': 1378234320},\n",
" {u'clicks': 0, u'dt': 1378234260},\n",
" {u'clicks': 0, u'dt': 1378234200}],\n",
" u'link': u'http://j.mp/17BGtIh',\n",
" u'ts': 1378218502,\n",
" u'user': u'ircnewsly'},\n",
" {'clicks_generated': [{u'clicks': 0, u'dt': 1378234560},\n",
" {u'clicks': 0, u'dt': 1378234500},\n",
" {u'clicks': 5, u'dt': 1378234440},\n",
" {u'clicks': 4, u'dt': 1378234380},\n",
" {u'clicks': 10, u'dt': 1378234320},\n",
" {u'clicks': 8, u'dt': 1378234260}],\n",
" u'link': u'http://nyti.ms/175wJXL',\n",
" u'ts': 1378220606,\n",
" u'user': u'donohoe'}]\n",
"</code></pre>\n",
"<br/>\n",
"\"dt\" in the \"clicks_generated\" array indicates the UNIX time of the minute on which the clicks were counted. \"ts\" is the reference timestamp from which we started looking in the past.\n",
"<p>After this, it's just a matter of making a search for the unique bitly links on the Twitter API to find the tweets embedding these links. This step will be described in a later section.</p>\n",
"\n",
"## t.co and Google Analytics\n",
"\n",
"<p>The problem with bit.ly is that it requires some effort from the Twitter user to actually go on the bit.ly website and use their service to shorten a given URL and then embed the URL in a tweet. But now that Twitter automatically shortens URLs, why bother using bit.ly at all? Many people continue using bit.ly mainly because it offers fantastic click data on the URLs one encodes.</p>\n",
"\n",
"<p>But what about people who don't really care about monitoring their social influence? They just use the default t.co Twitter URL shortening service, whose main disadvantage is that it does not yet have a public API to access the associated click data. But most people will share URLs that way. How then do we analyze click data?</p>\n",
"\n",
"<p>The solution is to use Google Analytics. Indeed, if you install Google Analytics on your website, it can track how people landed on your web page, i.e. from which URL a certain visitor landed on your page. Therefore, when people click on a URL embedded in a tweet of one of the people they follow, the shortened URL of the tweet will appear in your Google Analytics dashboard as the \"referral path\".</p>\n",
"![caption](http://a1.distilledcdn.com/wp-content/uploads/2011/08/tco.png)\n",
"<p>It is then just a matter of making a request to the Google Analytics API in order to fetch the number of visits per referral path per hour, as far in the past as you want! But first you need to fetch your Google Analytics credentials (this is a one-time step).</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from oauth2client.client import AccessTokenRefreshError\n",
"from oauth2client.client import OAuth2WebServerFlow\n",
"from oauth2client.file import Storage\n",
"from oauth2client.tools import run\n",
"\n",
"FLOW = OAuth2WebServerFlow(client_id='<INSERT CLIENT ID HERE>',client_secret='<INSERT CLIENT SECRET HERE>',scope='https://www.googleapis.com/auth/analytics.readonly',user_agent='analytics-api-v3-awesomeness')\n",
"\n",
"TOKEN_FILE_NAME = 'analytics.dat'\n",
"\n",
"storage = Storage(TOKEN_FILE_NAME)\n",
"credentials = storage.get()\n",
"if not credentials or credentials.invalid:\n",
" # Get a new token.\n",
" credentials = run(FLOW, storage)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you have your credentials safely saved in the analytics.dat file, define a method (here, \"get_referrers\") that grabs the data you want. Here I slice the data along the dimensions \"dateHour\" and \"fullReferrer\" while requesting the \"visitors\" and \"visits\" metrics. I also impose a filter on the source path such that I only get the referrers whose domain is \"t.co\"."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from oauth2client.file import Storage\n",
"from apiclient.discovery import build\n",
"import httplib2\n",
"from apiclient.errors import HttpError\n",
"from oauth2client.client import AccessTokenRefreshError\n",
"import sys\n",
"\n",
"TOKEN_FILE_NAME = 'analytics.dat'\n",
"\n",
"credentials = Storage(TOKEN_FILE_NAME).get()\n",
"\n",
"http = httplib2.Http()\n",
"http = credentials.authorize(http)\n",
"service = build('analytics', 'v3', http=http)\n",
"\n",
"def get_referrers(table_id,start_date,end_date,start_index='1',max_results='25'):\n",
" try:\n",
" # Attempt making the request.\n",
" results = service.data().ga().get(ids=table_id,start_date=start_date,end_date=end_date,metrics='ga:visitors,ga:visits',dimensions='ga:fullReferrer,ga:dateHour',filters='ga:source==t.co',start_index=start_index,max_results=max_results).execute()\n",
" \n",
" except AccessTokenRefreshError:\n",
" print >> sys.stderror, 'The credentials have been revoked or expired, please re-run the application to re-authorize'\n",
" \n",
" except HttpError, error:\n",
" print >> sys.stderror, 'Arg, there was an API error : %s %s : %s' % (error.resp.status, error.resp.reason, error._get_reason())\n",
" \n",
" return results"
],
"language": "python",
"metadata": {},
"outputs": []
},
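{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do not know your Google Analytics table ID, a quick way to find it is to list the accounts, web properties and profiles visible to the authorized credentials through the Management API; the table ID is simply \"ga:\" followed by the profile ID. The following is only a sketch, assuming the same \"service\" object built above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# List every profile (view) the authorized credentials can access,\n",
"# together with the corresponding table ID to pass to get_referrers\n",
"accounts = service.management().accounts().list().execute()\n",
"for account in accounts.get('items', []):\n",
"    webproperties = service.management().webproperties().list(accountId=account['id']).execute()\n",
"    for webproperty in webproperties.get('items', []):\n",
"        profiles = service.management().profiles().list(accountId=account['id'], webPropertyId=webproperty['id']).execute()\n",
"        for profile in profiles.get('items', []):\n",
"            print account['name'], webproperty['name'], profile['name'], 'ga:' + profile['id']"
],
"language": "python",
"metadata": {},
"outputs": []
},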
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then we call the function just created:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import defaultdict\n",
"\n",
"results = get_referrers(table_id=\"<INSERT YOUR TABLE ID HERE>\",start_date='2013-08-28',end_date='2013-08-30')\n",
"\n",
"referrers = defaultdict(list)\n",
"for row in results[\"rows\"]:\n",
" referrers[row[0]].append({\"time\":row[1],\"visitors\":row[2],\"visits\":row[3]})\n",
"\n",
"referrers = [{\"url\":k,\"metrics\":v} for k,v in referrers.items()] # This line is to format the data in JSON style rather than tabular style"
],
"language": "python",
"metadata": {},
"outputs": []
},
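{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"url\" field of each referrer is the referral path as Google Analytics reports it, typically of the form \"t.co/<hash>\" without a scheme. Before searching Twitter for these links, it helps to turn them into full URLs; a minimal sketch, assuming that format:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Turn the Google Analytics referrer paths into full t.co URLs\n",
"# that can be fed to a Twitter search\n",
"urls_to_search = []\n",
"for referrer in referrers:\n",
"    url = referrer[\"url\"]\n",
"    if not url.startswith(\"http\"):\n",
"        url = \"http://\" + url # Assumes referrer paths like \"t.co/AbCd123\"\n",
"    urls_to_search.append(url)"
],
"language": "python",
"metadata": {},
"outputs": []
},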
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Search Twitter for a URL\n",
"\n",
"Now that we have the URLs we are looking for, let's search Twitter for these URLs. We use the [twitter library](https://github.com/bear/python-twitter). First, let's authenticate our Twitter app."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import twitter\n",
"\n",
"def oauth_login():\n",
" CONSUMER_KEY = '<INSERT CONSUMER KEY HERE>'\n",
" CONSUMER_SECRET = '<INSERT CONSUMER SECRET HERE>'\n",
" OAUTH_TOKEN = '<INSERT OAUTH TOKEN HERE>'\n",
" OAUTH_TOKEN_SECRET = '<INSERT OAUTH TOKEN SECRET HERE>'\n",
" \n",
" auth = twitter.oauth.OAuth(OAUTH_TOKEN,OAUTH_TOKEN_SECRET,CONSUMER_KEY,CONSUMER_SECRET)\n",
" return twitter.Twitter(auth=auth)\n",
"\n",
"twitter_api = oauth_login()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is to define our search function. As you may know, Twitter imposes severe rate limits, so an important design feature of the function is the ability to make robust API calls, i.e. when our queries (search queries or any other query) to the Twitter API hit the limit, we want our code to pause for 15 mins (that's the time window set by Twitter before being able to query its API again). I borrow the function \"make_twitter_request\" written by Matthew A. Russel. See [his book](http://shop.oreilly.com/product/0636920030195.do) and the associated [GitHub account](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition) for more information."
]
},
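{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer not to pull in that module, a minimal stand-in along the following lines should work (this is only a sketch, not Russell's actual implementation): retry the call, and sleep for 15 minutes whenever Twitter answers with a rate-limit error. The next cell imports the original instead."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sys\n",
"import time\n",
"from twitter.api import TwitterHTTPError\n",
"\n",
"def make_twitter_request(twitter_api_func, max_errors=3, *args, **kw):\n",
"    # Minimal stand-in for Russell's more robust helper: sleep through rate limits,\n",
"    # and give up after a few other HTTP errors\n",
"    errors = 0\n",
"    while errors < max_errors:\n",
"        try:\n",
"            return twitter_api_func(*args, **kw)\n",
"        except TwitterHTTPError, e:\n",
"            if e.e.code == 429:\n",
"                print >> sys.stderr, 'Rate limit hit. Sleeping for 15 minutes.'\n",
"                time.sleep(60 * 15 + 5)\n",
"            else:\n",
"                errors += 1\n",
"                print >> sys.stderr, 'HTTP error %s. Retrying.' % e.e.code\n",
"                time.sleep(3)\n",
"    return None"
],
"language": "python",
"metadata": {},
"outputs": []
},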
{
"cell_type": "code",
"collapsed": false,
"input": [
"from robust import make_twitter_request # See https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/tree/master/ipynb for the make_twitter_request function\n",
"\n",
"def twitter_search(twitter_api,q,max_statuses=200,**kw):\n",
" search_results = make_twitter_request(twitter_api.search.tweets,q=q,count=100) # Twitter enforces a maximum of 100 results per page\n",
" statuses = filter(lambda x: not x.has_key('retweeted_status') and not x.get('in_reply_to_status_id'), search_results['statuses']) if not kw.get(\"allow_children\") else search_results['statuses']\n",
" max_statuses = min(1000,max_statuses)\n",
" while len(statuses) < max_statuses and search_results[\"statuses\"]: # Paginate through results as long as there are results or until we hit max_statuses\n",
" # To get the next page of results, extract the \"next results\" page ID from the current results page, and use it in a new search\n",
" try:\n",
" next_results = search_results['search_metadata']['next_results']\n",
" except KeyError, e:\n",
" break\n",
" kwargs = dict([kv.split('=') for kv in next_results[1:].split(\"&\")])\n",
" search_results = make_twitter_request(twitter_api.search.tweets,**kwargs)\n",
" statuses += filter(lambda x: not x.has_key('retweeted_status') and not x.get('in_reply_to_status_id'), search_results['statuses']) if not kw.get(\"allow_children\") else search_results['statuses']\n",
" \n",
" return statuses\n",
"\n",
"q = \"http://nyti.ms/17oFDCy\" # Example query of shortened URL\n",
"tweets = twitter_search(twitter_api,q,max_statuses=1000,allow_children=False)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyze virality\n",
"\n",
"<p>Now that we fetched all the tweets that shared a specific URL, we want to know who retweeted and replied to these tweets. Also, we want to analyze virality both in time and space. All tweets are marked with a timestamp, but only a few provide geographic coordinates. To generate more spatial data, we want to geocode the free-text information on the location of the twerk. That's the purpose of the \"geo_code\" function, with the help of the generous Bing geocoding service.</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from geopy import geocoders\n",
"\n",
"def get_retweets(tweet):\n",
" if tweet.has_key('retweet_count'):\n",
" return make_twitter_request(twitter_api.statuses.retweets, _id=tweet['id'])\n",
" else:\n",
" return []\n",
"\n",
"def get_replies(tweet):\n",
" q = \"@%s\" % tweet['user']['screen_name']\n",
" replies = twitter_search(twitter_api,q,max_statuses=1000,allow_children=True)\n",
" return filter(lambda t: t.get('in_reply_to_status_id') == tweet['id'], replies)\n",
"\n",
"GEO_APP_KEY = '<INSERT YOUR BING GEO API KEY>'\n",
"g = geocoders.Bing(GEO_APP_KEY)\n",
"\n",
"def geo_code(tweet):\n",
" coordinates = None\n",
" if tweet.get(\"place\") and tweet[\"place\"].get(\"full_name\"):\n",
" try:\n",
" coordinates = g.geocode(tweet[\"place\"][\"full_name\"], exactly_one=True)\n",
" coordinates = list(coordinates[1][::-1]) # This array inversion is to match the JSON way of formatting coordinates: longitude first, then latitude\n",
" except:\n",
" coordinates = None\n",
" elif tweet[\"user\"].get(\"location\"):\n",
" try:\n",
" coordinates = g.geocode(tweet[\"user\"][\"location\"], exactly_one=True)\n",
" coordinates = list(coordinates[1][::-1]) \n",
" except:\n",
" coordinates = None\n",
" return coordinates"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, analyzing virality requires to *recursively* analyze retweets and replies: we want to see who replies to a reply or retweets a reply, etc. Let's define a function that would recursively apply the get_retweets and get_replies function on each \"layer\" of tweets up to a limit (here we limit the recursive depth to 10). The retweets and replies to a tweet are stored in the attribute \"children\" of the tweet."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def build_tweet_tree(tweet_array,depth=0):\n",
" depth += 1\n",
" for tweet in tweet_array:\n",
" tweet['children'] = get_retweets(tweet)\n",
" tweet['children'].extend(get_replies(tweet))\n",
" if not tweet.get(\"coordinates\"):\n",
" tweet[\"coordinates\"] = geo_code(tweet)\n",
" if depth < 10:\n",
" build_tweet_tree(tweet['children'],depth)\n",
"\n",
"# Now let's apply the function on the initial tweets fetched.\n",
"build_tweet_tree(tweets)"
],
"language": "python",
"metadata": {},
"outputs": []
},
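{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before visualizing anything, it can be useful to get a feel for the size of the resulting tree. Below is a small sketch (the summarize_tree helper is only illustrative) that walks the \"children\" attributes and counts how many tweets were collected and how many of them could be geocoded."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Quick sanity check on the tree we just built: walk the \"children\"\n",
"# attributes and count how many tweets we collected and how many\n",
"# of them could be geocoded\n",
"def summarize_tree(tweet_array):\n",
"    total, geocoded = 0, 0\n",
"    for tweet in tweet_array:\n",
"        total += 1\n",
"        if tweet.get(\"coordinates\"):\n",
"            geocoded += 1\n",
"        child_total, child_geocoded = summarize_tree(tweet.get(\"children\", []))\n",
"        total += child_total\n",
"        geocoded += child_geocoded\n",
"    return total, geocoded\n",
"\n",
"total, geocoded = summarize_tree(tweets)\n",
"print \"%d tweets in the tree, %d of them geocoded\" % (total, geocoded)"
],
"language": "python",
"metadata": {},
"outputs": []
},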
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the data\n",
"\n",
"We end up with a tree of tweets, the new branches coming out of the \"children\" attributes of the tweets who generated virality. Each tweet is tagged with a timestamp and many also have been geocoded. We also know, for each original tweet, how many clicks they generated and *when* these clicks occured. The next step is then to visualize how a URL gets shared in time and space, what conversations got spawned as a result, and how influential each tweet has been (in terms of retweets, replies and clicks). For such a visualization, see [here](http://bl.ocks.org/cwarny/6441347)."
]
}
],
"metadata": {}
}
]
}