{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Twitter user networks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required packages\n",
"import tweepy\n",
"from tweepy import OAuthHandler\n",
"import time\n",
"import networkx as nx\n",
"import pandas as pd\n",
"\n",
"# Enter your own Twitter credentials\n",
"consumer_key = \"\"\n",
"consumer_secret = \"\"\n",
"access_token = \"\"\n",
"access_secret = \"\"\n",
"\n",
" \n",
"# Set up authorisation towards the API\n",
"auth = OAuthHandler(consumer_key, consumer_secret)\n",
"auth.set_access_token(access_token, access_secret)\n",
"api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start with a list of Twitter usernames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A list of Twitter user screen names\n",
"users = ['simonlindgren','uniturku']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or read it from a file\n",
"users = [name.strip() for name in open('screennames.txt').readlines()]\n",
"print(len(users))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download followers and followees\n",
"\n",
"We want to gather data on the followers and followees of these users. In doing this, we must deal with Twitter's [rate limits](https://developer.twitter.com/en/docs/basics/rate-limiting) which allow for 15 calls per 15 minutes. This means 15 calls in any time less than 15 minutes, and 1 call per 1 minute if we keep going for more than 15 minutes.\n",
"\n",
"With the code below, we get one `page` (which is 5000 items) per call. After every call, we pause for 60 seconds. We thus make one call per minute (15 per 15, 30 per 30 ...).\n",
"\n",
"- As an example, getting one million followers/followees with this method would take 3 hours and 20 minutes (1000000/5000 = 200 minutes).\n",
"\n",
"- If you plan to collect less than 75000 followers + followees (15 x 5000) in total, the pause can be shortened or removed. \n",
"\n",
"This code will save the IDs of the followers and followees in two separate files for each user in the initial list."
]
},
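{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before starting a long run, we can estimate how long collection will take. The helper below is a minimal sketch (the function `estimate_runtime` is hypothetical; the 5,000-IDs-per-page and 60-second-pause figures mirror the settings used in this notebook)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rough runtime estimate for the collection loop below (hypothetical helper)\n",
"def estimate_runtime(total_ids, ids_per_page=5000, pause=60):\n",
"    calls = -(-total_ids // ids_per_page) # ceiling division: one call per page\n",
"    hours = calls * pause / 3600 # one 60-second pause per call\n",
"    return calls, hours\n",
"\n",
"calls, hours = estimate_runtime(1000000)\n",
"print(str(calls) + \" calls, ~\" + str(round(hours, 1)) + \" hours\")"
]
},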
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for count,user in enumerate(users):\n",
" \n",
" # Set up counter and data files\n",
" counter = []\n",
" print(\"Processing user \" + user + \" \" + str(count+1) + \"/\" + str(len(users)))\n",
" filename1 = user + '_followers.txt'\n",
" filename2 = user + '_followees.txt'\n",
" open(filename1, 'w').close()\n",
" open(filename2, 'w').close()\n",
" \n",
" # Get followers and write to a file\n",
" print(\"Getting followers ...\")\n",
" for page in tweepy.Cursor(api.followers_ids, screen_name = user).pages():\n",
" with open(filename1, \"a\") as f:\n",
" for item in page:\n",
" counter.append(1)\n",
" print(len(counter), end='\\r')\n",
" f.write(str(item) + '\\n')\n",
" time.sleep(60)\n",
" \n",
" # Reset the counter \n",
" counter = []\n",
" \n",
" # Get followees and write to another file\n",
" print(\"Getting followees ...\") \n",
" for page in tweepy.Cursor(api.friends_ids, screen_name = user).pages():\n",
" with open(filename2, \"a\") as f:\n",
" for item in page:\n",
" counter.append(1)\n",
" print(len(counter), end='\\r')\n",
" f.write(str(item) + '\\n')\n",
" time.sleep(60)\n",
"\n",
"print(\"Done!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making an edgelist\n",
"\n",
"### Method A\n",
"\n",
"With follower and followee IDs saved to files for all users, we can read all these files and ask the Twitter API which screen names correspond to the ID numbers. We can then write all pairwise follower/followee relationships to an edgelist. We pause for 1 second between each ID with `time.sleep(1)`, this seems to avoid hitting the rate limit. With this method it will take 28 hours to get 100,000 screen names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert collected follower and followee IDs to screen names, and create an edgelist\n",
"\n",
"edgelist = ['source;target']\n",
"\n",
"for count, user in enumerate(users):\n",
" print(\"Processing user \" + user + \" (\" + str(count+1) + \"/\" + str(len(users)) + \") followers\")\n",
" file = open(user + '_followers.txt', 'r').readlines()\n",
" for count,id in enumerate(file):\n",
" try: \n",
" follower = api.get_user(id)\n",
" followername = follower.screen_name\n",
" print(str(count+1) + \"/\" + str(len(file)), end ='\\r')\n",
" edgelist.append(followername + ',' + user) # user is the edge target\n",
" time.sleep(1)\n",
" except tweepy.TweepError: # Skip errors such as 'user not found'\n",
" print(\"Skipping error...\")\n",
" pass\n",
" \n",
" print(\"Processing user \" + user + \" (\" + str(count+1) + \"/\" + str(len(users)) + \") followees\")\n",
" file = open(user + '_followees.txt', 'r').readlines()\n",
" for count,id in enumerate(file):\n",
" try:\n",
" followee = api.get_user(id)\n",
" followeename = followee.screen_name\n",
" print(str(count+1) + \"/\" + str(len(file)), end ='\\r')\n",
" edgelist.append(user + ',' + followeename) # user is the edge source\n",
" time.sleep(1)\n",
" except tweepy.TweepError: # Skip errors such as 'user not found'\n",
" print(\"Skipping error...\")\n",
" pass\n",
"print(\"Done!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"edgelist[1:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Method B\n",
"As it takes such a long time to perform the process above, it may be a better strategy for most projects to keep the downloaded followers/followees in ID-number format, and to convert the (much shorter) initial list of users (which was in screen name format) to ID-numbers as well. This is done with the alternative code below. It would not be advisable to mix screen names and IDs in any further analysis, as we then may be unknowingly dealing with duplicates of the same user (e.g. a screen name version from the initial `users` list, as well as the same user in ID format as a follower/followee collected in relation to some other `user`).\n",
"\n",
"If we want to do network analysis, it may save lots of time to do the network analysis (blindly) based on ID-numbers, and to manually (or programmatically) look up the screen names of a smaller number of key nodes that we may want to highlight in the final and filtered network visualisation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"edgelist = []\n",
"\n",
"for count,user in enumerate(users):\n",
" print(\"Getting ID number for \" + user + \" \" + str(count) + \"/\" + str(len(users))))\n",
" user_id = api.get_user(user)\n",
" user_id = user_id.id_str\n",
" file = open(user + '_followers.txt', 'r').readlines()\n",
" for id in file:\n",
" edgelist.append(id.strip() + ',' + user_id) # user is the edge target\n",
" file = open(user + '_followees.txt', 'r').readlines()\n",
" for id in file:\n",
" edgelist.append(user_id + ',' + id.strip()) # user is the edge source\n",
" time.sleep(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"edgelist[1:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Converting the edgelist to *.gexf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert the edgelist to a dataframe\n",
"edges = pd.DataFrame([sub.split(\",\") for sub in edgelist], columns=['source','target'])\n",
"edges.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we are likely to have duplicate edges in the list, we create a `MultiGraph`, which allows for this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"M = nx.from_pandas_dataframe(edges, 'source', 'target', create_using = nx.MultiGraph())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(nx.info(M))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then create a weighted, and directed, graph based on the MultiGraph."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G = nx.DiGraph()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create weighted graph from M\n",
"for u,v,data in M.edges(data=True):\n",
" w = data['weight'] if 'weight' in data else 1.0\n",
" if G.has_edge(u,v):\n",
" G[u][v]['weight'] += w\n",
" else:\n",
" G.add_edge(u, v, weight=w)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"print(nx.info(G))"
]
},
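{
"cell_type": "markdown",
"metadata": {},
"source": [
"If Method B was used, the nodes of `G` are ID numbers. As noted above, we can programmatically look up the screen names of a small number of key nodes before visualising. The cell below is a minimal sketch: it takes the highest-degree nodes of `G` and asks the API for their screen names (the variable names and the cutoff of ten nodes are just illustrative)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Look up screen names for the highest-degree nodes\n",
"# (sketch; assumes Method B was used, i.e. node labels are ID strings)\n",
"degrees = dict(G.degree())\n",
"key_nodes = sorted(degrees, key=degrees.get, reverse=True)[:10]\n",
"\n",
"for node_id in key_nodes:\n",
"    try:\n",
"        print(node_id, api.get_user(node_id).screen_name)\n",
"    except tweepy.TweepError: # e.g. suspended or deleted accounts\n",
"        print(node_id, \"(lookup failed)\")\n",
"    time.sleep(1)"
]
},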
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save it in Gephi format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nx.write_gexf(G, \"followers_followees.gexf\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}