j6k4m8/Gather-Data-Twitter-Prestige.ipynb

## Gather-Data-Twitter-Prestige.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Twitter Followership Prestige\n",
    "\n",
    "In this notebook, we will explore a user's followership prestige preferences: Do you mostly follow accounts that have many followers of their own (e.g., politicians or celebrities), or accounts that have few followers (e.g., personal acquaintances or friends)?\n",
    "\n",
    "## A note on nomenclature\n",
    "\n",
    "Twitter refers to people that you follow as \"friends,\" and people that follow you as \"followers.\" You can think of \"friends\" as outgoing edges in a social network, and \"followers\" as incoming edges. The routines in this notebook use both friend and follower edges; it can be tricky to remember which is which at times!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip3 install tweepy matplotlib pandas tqdm\n",
    "\n",
    "import tweepy\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "\n",
    "import time\n",
    "from functools import cache\n",
    "\n",
    "# You will need to create your own Twitter API credentials for this. It is an\n",
    "# awful process, and I hate it. Unfortunately, before you get clever and go try\n",
    "# the same thing with twint or snscrape, take note that follower/friend pages\n",
    "# won't render for those unauthenticated tools.\n",
    "\n",
    "# I would recommend leaving your config in a JSON file and importing it with:\n",
    "# \n",
    "#    cfg = json.load(open('config.json'))\n",
    "# \n",
    "# But to keep things tidy in one notebook, I inline strings below. You can use\n",
    "# this pattern, but remember to remove the credentials if you push this to an\n",
    "# online repository or website!\n",
    "\n",
    "cfg = {\n",
    "    \"api_key\": \"\",\n",
    "    \"api_secret\": \"\",\n",
    "    \"bearer_token\": \"\",\n",
    "    \"access_token\": \"\",\n",
    "    \"access_secret\": \"\"\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your settings here!\n",
    "\n",
    "# The users that you want to study. You can also do something cleverer like a \n",
    "# graph-traversal to query the ego-network of a user.\n",
    "USERS_TO_STUDY = [\n",
    "    \"brembs\", \"KordingLab\", \"bradpwyble\", \"neuralreckoning\", \"j6m8\",\n",
    "    \"Raamana_\", \"atypical_me\", \"NeuroPolarbear\", \"tyrell_turing\", \"andpru\", \n",
    "    \"R3RTO\", \"katjaQheuer\", \"OHBM\", \"IpNeuro\"\n",
    "]\n",
    "\n",
    "# How many friends (accounts that the user follows) do you want to sample per user?\n",
    "SAMPLE_SIZE = 500"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "auth = tweepy.OAuthHandler(cfg[\"api_key\"], cfg[\"api_secret\"])\n",
    "auth.set_access_token(cfg[\"access_token\"], cfg[\"access_secret\"])\n",
    "t = tweepy.API(auth)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some helper functions. We decorate all of these with a `@cache` decorator so that we don't wind up using our Twitter API quota to ask the same questions.\n",
    "\n",
    "If you plan on running this on a RAM-light machine, or for many users, you might consider switching this to `@lru_cache(maxsize=2048)` or something to avoid cluttering up your kernel memory too much.\n",
    "\n",
    "If you change the constants above (`USERS_TO_STUDY` or `SAMPLE_SIZE`), you will need to decide whether to rerun this cell. If you DO rerun it, you will clear the cache and have to start downloading results from scratch. If you _don't_ rerun it, you will get the cached results, but they may be out of date: for example, a cached follower count will not reflect changes since it was first fetched. One caveat: `@cache` on a generator function stores the generator object itself, which is exhausted after a single pass, so the retrieval cell below calls the friends depagination function through `.__wrapped__` to bypass the cache and get a fresh generator each time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cache\n",
    "def get_user_by_screen_name(username: str):\n",
    "    \"\"\"\n",
    "    Retrieve a User object by screen name (@twitterusername).\n",
    "\n",
    "    Arguments:\n",
    "        username (str): The screen name of the user to retrieve.\n",
    "\n",
    "    Returns:\n",
    "        tweepy.User: The user object.\n",
    "\n",
    "    \"\"\"\n",
    "    return t.get_user(screen_name=username)\n",
    "\n",
    "\n",
    "@cache\n",
    "def get_follower_count(username: str):\n",
    "    \"\"\"\n",
    "    Retrieve the follower count of a user.\n",
    "\n",
    "    Arguments:\n",
    "        username (str): The screen name of the user to retrieve.\n",
    "\n",
    "    Returns:\n",
    "        int: The follower count.\n",
    "\n",
    "    \"\"\"\n",
    "    user = get_user_by_screen_name(username)\n",
    "    return user.followers_count\n",
    "\n",
    "\n",
    "@cache\n",
    "def get_friend_count(username: str):\n",
    "    \"\"\"\n",
    "    Retrieve the friend count of a user.\n",
    "\n",
    "    Arguments:\n",
    "        username (str): The screen name of the user to retrieve.\n",
    "\n",
    "    Returns:\n",
    "        int: The friend count.\n",
    "\n",
    "    \"\"\"\n",
    "    user = get_user_by_screen_name(username)\n",
    "    return user.friends_count\n",
    "\n",
    "\n",
    "# The following functions use a tweepy v1 depaginator pattern to retrieve\n",
    "# a Generator of tweepy User objects.\n",
    "\n",
    "\n",
    "@cache\n",
    "def get_followers_of_user_depaginated(username: str):\n",
    "    \"\"\"\n",
    "    Retrieve a Generator of the followers of a user.\n",
    "\n",
    "    Arguments:\n",
    "        username (str): The screen name of the user to retrieve.\n",
    "\n",
    "    Returns:\n",
    "        Generator[tweepy.User]: A generator of User objects, one per follower.\n",
    "\n",
    "    \"\"\"\n",
    "    cursor = tweepy.Cursor(\n",
    "        t.get_followers,\n",
    "        screen_name=username,\n",
    "        count=200,\n",
    "    ).items()\n",
    "    while True:\n",
    "        try:\n",
    "            yield cursor.next()\n",
    "        except tweepy.TooManyRequests:\n",
    "            print(\"•\", end=\"\")\n",
    "            time.sleep(2 * 60)\n",
    "        except StopIteration:\n",
    "            break\n",
    "\n",
    "\n",
    "@cache\n",
    "def get_friends_of_user_depaginated(username: str, limit: int = None):\n",
    "    \"\"\"\n",
    "    Retrieve a Generator of the friends of a user.\n",
    "\n",
    "    Arguments:\n",
    "        username (str): The screen name of the user to retrieve.\n",
    "        limit (int): The maximum number of friends to yield. If None, yield all.\n",
    "\n",
    "    Returns:\n",
    "        Generator[tweepy.User]: A generator of User objects, one per friend.\n",
    "\n",
    "    \"\"\"\n",
    "    cursor = tweepy.Cursor(\n",
    "        t.get_friends,\n",
    "        screen_name=username,\n",
    "        count=200,\n",
    "    ).items()\n",
    "    count = 0\n",
    "    while True:\n",
    "        count += 1\n",
    "        if limit is not None and count > limit:\n",
    "            break\n",
    "        try:\n",
    "            yield cursor.next()\n",
    "        except tweepy.TooManyRequests:\n",
    "            print(\"•\", end=\"\")\n",
    "            time.sleep(2 * 60)\n",
    "        except StopIteration:\n",
    "            break\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ☕️ Run the data retrieval\n",
    "\n",
    "This next cell actually performs the API data retrieval; it may take a while! (The major rate limiter will be the API quota, not your internet connection or CPU speed...) To retrieve fourteen users with a decent amount of follower overlap in the n=500 sample (i.e., heavy reuse of the cache), this takes about an hour. Yes, I agree that this is absurd. If you have a clever speed-up in mind, I'd love to see it. This is one of the slowest, clunkiest APIs I've ever dealt with (and I've spent months debugging Airtable integrations!).\n",
    "\n",
    "If you want to learn more about the rate limits on the Twitter API, [read the official documentation](https://developer.twitter.com/en/docs/rate-limits)."
   ]
  },
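  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm.auto import tqdm\n",
    "\n",
    "user_data = {}\n",
    "for user in USERS_TO_STUDY:\n",
    "    if user not in user_data:\n",
    "        while True:\n",
    "            try:\n",
    "                # NOTE: Despite the variable name, these are the accounts that `user`\n",
    "                # follows (\"friends\"). We call `.__wrapped__` to bypass `@cache`, since\n",
    "                # a cached generator would already be exhausted on a second call.\n",
    "                followers_of_user = [\n",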
    "                    follower\n",
    "                    for i, follower in tqdm(\n",
    "                        enumerate(get_friends_of_user_depaginated.__wrapped__(user, limit=SAMPLE_SIZE)),\n",
    "                    )\n",
    "                    if i < SAMPLE_SIZE\n",
    "                ]\n",
    "                user_data[user] = {\n",
    "                    \"user\": user,\n",
    "                    \"follower_count\": get_follower_count(user),\n",
    "                    \"friend_count\": get_friend_count(user),\n",
    "                    \"followers_counts\": [\n",
    "                        get_follower_count(follower.screen_name)\n",
    "                        for follower in followers_of_user\n",
    "                    ],\n",
    "                    \"followers\": followers_of_user,\n",
    "                }\n",
    "                break\n",
    "            except tweepy.TooManyRequests:\n",
    "                print(\"•\", end=\"\")\n",
    "                time.sleep(2 * 60)\n",
    "                continue\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis\n",
    "\n",
    "Now that we have the data, we can start to perform our analyses. In the cell below, I save the data to a CSV so that you don't have to rerun the retrieval cell every time you want to play with the data. You should uncomment this line if you are alright with this notebook writing a CSV to your hard drive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Save the data:\n",
    "# pd.DataFrame(user_data).T.to_csv(\"users_data.csv\")"
   ]
  },
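  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Matplotlib 3.6 renamed the bundled seaborn styles, so fall back to the old\n",
    "# name when the new one is unavailable.\n",
    "style = \"seaborn-v0_8-darkgrid\"\n",
    "if style not in plt.style.available:\n",
    "    style = \"seaborn-darkgrid\"\n",
    "\n",
    "with plt.style.context(style):\n",
    "    plt.figure(figsize=(12, 8), dpi=150)\n",
    "    for user, u in sorted(user_data.items(), key=lambda x: x[1][\"user\"]):\n",
    "        # Rank each user's sampled followees from most- to least-followed.\n",
    "        plt.plot(sorted(u[\"followers_counts\"], reverse=True)[:SAMPLE_SIZE], label=user)\n",
    "    plt.yscale('log')\n",
    "    plt.legend()\n",
    "    plt.ylabel(\"Number of followers\")\n",
    "    plt.xlabel(f\"Followee rank (up to {SAMPLE_SIZE} sampled per user)\")"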
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_follower_counts = [\n",
    "    u for user in user_data.values()\n",
    "    for u in user['followers_counts']\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "410f6db90cc89b666adbd1b755ae7555dd227a2d7c11822f3d377845b87672a4"
  },
  "kernelspec": {
   "display_name": "Python 3.9.7 64-bit ('scripting')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}