@RodolfoFerro
Created August 3, 2022 17:33
Twitter Data Extraction
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Twitter Data Extraction",
"private_outputs": true,
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyO11oVs5PZuJV9a9HjPNEjn",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/RodolfoFerro/8b6958efba0fe4c3b1ed66cc6f892e40/twitter-data-extraction.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Extracción de datos en Twitter"
],
"metadata": {
"id": "xspLObyvpqQ7"
}
},
{
"cell_type": "markdown",
"source": [
"## Instalación de dependencias \n",
"\n",
"Dado que Networkx ya se encuentra instalado, basta instalar solamente Tweepy. Esto se efectúa como se muestra continuación:"
],
"metadata": {
"id": "3DbtofhdqIie"
}
},
{
"cell_type": "code",
"source": [
"!pip install tweepy -q"
],
"metadata": {
"id": "EP3btNxTXLRr"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Ingresar accesos\n",
"\n",
"Estos accesos los puedes obtener desde tu Dashboard en el Developer's Portal de Twitter."
],
"metadata": {
"id": "RjsX5WQUqSKP"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kM3XE6Z3yV8i"
},
"outputs": [],
"source": [
"API_KEY = ''\n",
"API_SECRET = ''\n",
"ACCESS_TOKEN = ''\n",
"TOKEN_SECRET = ''"
]
},
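{
"cell_type": "markdown",
"source": [
"Hardcoding keys in a shared notebook makes them easy to leak. As a minimal sketch of one alternative, the credentials could be read from environment variables. The helper `load_credentials` and the `TWITTER_*` variable names are assumptions made for this illustration, not part of Tweepy or of the original notebook.\n",
"\n",
"```python\n",
"import os\n",
"\n",
"\n",
"def load_credentials(env=os.environ):\n",
"    # Map the notebook's variable names to hypothetical environment\n",
"    # variable names; rename them to match your own setup.\n",
"    names = {\n",
"        'API_KEY': 'TWITTER_API_KEY',\n",
"        'API_SECRET': 'TWITTER_API_SECRET',\n",
"        'ACCESS_TOKEN': 'TWITTER_ACCESS_TOKEN',\n",
"        'TOKEN_SECRET': 'TWITTER_TOKEN_SECRET',\n",
"    }\n",
"\n",
"    # Fail early with a clear message if any variable is unset or empty.\n",
"    missing = [var for var in names.values() if not env.get(var)]\n",
"    if missing:\n",
"        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))\n",
"\n",
"    return {key: env[var] for key, var in names.items()}\n",
"```\n",
"\n",
"The returned dictionary could then be unpacked into `API_KEY`, `API_SECRET`, `ACCESS_TOKEN` and `TOKEN_SECRET` instead of typing the values into the cell."
],
"metadata": {}
},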
{
"cell_type": "markdown",
"source": [
"## Autenticación\n",
"\n",
"\n",
"Será necesario autenticar el acceso a Twitter, lo podemos realizar con la siguiente función."
],
"metadata": {
"id": "qFK_wX6Bqd80"
}
},
{
"cell_type": "code",
"source": [
"import tweepy\n",
"\n",
"def twitter_setup():\n",
" \"\"\"\n",
" Utility function to setup the Twitter's API\n",
" with our access keys provided.\n",
" \"\"\"\n",
" # Authentication and access using keys:\n",
" auth = tweepy.OAuthHandler(API_KEY, API_SECRET)\n",
" auth.set_access_token(ACCESS_TOKEN, TOKEN_SECRET)\n",
"\n",
" # Return API with authentication:\n",
" api = tweepy.API(auth)\n",
" return api"
],
"metadata": {
"id": "4koZ0_qDXYgo"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### API Testing\n",
"\n",
"Podemos probar la conexión con la API de Twitter para verificar que tenemos acceso a la información."
],
"metadata": {
"id": "7o6M5Zpcqls4"
}
},
{
"cell_type": "code",
"source": [
"from pprint import pprint\n",
"\n",
"# We create an extractor object\n",
"extractor = twitter_setup()\n",
"\n",
"# We create a tweet list as follows\n",
"tweets = extractor.user_timeline(screen_name='rodo_ferro', count=50)\n",
"print(f'Number of tweets extracted: {len(tweets)}.\\n')\n",
"\n",
"# We print the most recent 5 tweets\n",
"print('5 most recent tweets:')\n",
"for tweet in tweets[:5]:\n",
" print(tweet._json)"
],
"metadata": {
"id": "V_afqLtWX5wg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Una vez hayamos extraído algunos Tweets, podemos explorar sus contenidos. Para esto, podemos apoyarnos con el siguiente sitio: https://codebeautify.org/jsonviewer\n",
"\n",
"\n",
"> ¿Cómo podríamos extraer información de cada no de estos elementos? Por ejemplo, si me interesara conocer el nombre de usuario en Twitter que ha twiteado."
],
"metadata": {
"id": "RlgsC-RkqxB8"
}
},
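{
"cell_type": "markdown",
"source": [
"One way to approach the question above is sketched here on a hand-written dictionary that stands in for `tweet._json`. The sample values and the helper name `screen_name_of` are made up for illustration; the `user` and `screen_name` keys are the ones that appear in the tweet payload.\n",
"\n",
"```python\n",
"# A hand-written stand-in for `tweet._json`; real payloads have many more fields.\n",
"sample_tweet = {\n",
"    'id': 1,\n",
"    'full_text': 'Hello, graph!',\n",
"    'user': {'id': 42, 'screen_name': 'example_user'},\n",
"}\n",
"\n",
"\n",
"def screen_name_of(tweet):\n",
"    # Return the author's screen name, or None if the user block is missing.\n",
"    user = tweet.get('user') or {}\n",
"    return user.get('screen_name')\n",
"\n",
"\n",
"print(screen_name_of(sample_tweet))\n",
"```\n",
"\n",
"Using `.get` instead of direct indexing keeps the lookup from raising `KeyError` when a field is absent."
],
"metadata": {}
},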
{
"cell_type": "markdown",
"source": [
"## Extracción de información para construir un grafo\n",
"\n",
"Vamos a explorar a detalle las funciones que utilizaremos para extraer información."
],
"metadata": {
"id": "tnRWM_JEsMit"
}
},
{
"cell_type": "code",
"source": [
"def get_user_id(tweet):\n",
" \"\"\"Returns user and id.\"\"\"\n",
"\n",
" user_id = None\n",
" user_name = None\n",
" user = tweet['user']\n",
"\n",
" if user is not None:\n",
" user_id = user['id']\n",
" user_name = user['screen_name']\n",
" \n",
" return (user_id, user_name)\n",
"\n",
"\n",
"def get_retweeted_info(tweet):\n",
" \"\"\"Returns retweet source info.\"\"\"\n",
"\n",
" retweet = None\n",
" retweet_count = tweet['retweet_count']\n",
"\n",
" if retweet_count > 0 and 'retweeted_status' in tweet.keys():\n",
" retweet = tweet['retweeted_status']\n",
"\n",
" if retweet is not None:\n",
" return get_user_id(retweet)\n",
" else:\n",
" return (None, None)\n",
"\n",
"def get_reply_info(tweet):\n",
" \"\"\"Returns reply info.\"\"\"\n",
"\n",
" reply_id = tweet['in_reply_to_user_id']\n",
" reply_screen_name = tweet['in_reply_to_screen_name']\n",
"\n",
" return (reply_id, reply_screen_name)\n",
"\n",
" \n",
"def get_mentions_info(tweet):\n",
" \"\"\"Returns a list of all user mentions.\"\"\"\n",
"\n",
" mentions = []\n",
" entities = tweet['entities']\n",
"\n",
" if entities is not None:\n",
" user_mentions = entities['user_mentions']\n",
" for mention in user_mentions:\n",
" mention_id = mention['id']\n",
" screen_name = mention['screen_name']\n",
" mentions.append((mention_id, screen_name))\n",
" \n",
" return mentions\n",
"\n",
"\n",
"def get_quoted_info(tweet):\n",
" \"\"\"Returns id of user quoting the tweet.\"\"\"\n",
" \n",
" if 'quoted_status' in tweet.keys():\n",
" quoted_status = tweet['quoted_status']\n",
" else:\n",
" quoted_status = None\n",
" \n",
" if quoted_status is not None:\n",
" return get_user_id(quoted_status)\n",
" else:\n",
" return (None, None)\n",
"\n",
"\n",
"def get_all_interactions(tweet):\n",
" \"\"\"Returns all interactions from this tweet.\"\"\"\n",
" \n",
" # Get the tweeter\n",
" tweeter = get_user_id(tweet)\n",
" \n",
" # Nothing to do if we couldn't get the tweeter\n",
" if tweeter[0] is None:\n",
" return (None, None), []\n",
" \n",
" # a python set is a collection of unique items\n",
" # we use a set to avoid duplicated ids\n",
" interacting_users = set()\n",
" \n",
" # Add person they're replying to\n",
" interacting_users.add(get_reply_info(tweet))\n",
" \n",
" # Add person they retweeted\n",
" interacting_users.add(get_retweeted_info(tweet))\n",
" \n",
" # Add person they quoted\n",
" interacting_users.add(get_quoted_info(tweet))\n",
" \n",
" # Add mentions\n",
" interacting_users.update(get_mentions_info(tweet))\n",
" \n",
" # remove the tweeter if he is in the set\n",
" interacting_users.discard(tweeter)\n",
" # remove the None case\n",
" interacting_users.discard((None,None))\n",
" \n",
" # Return our tweeter and their influencers\n",
" return tweeter, list(interacting_users)"
],
"metadata": {
"id": "cItwpXsQcZb0"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Búsqueda por palabras\n",
"\n",
"Procedemos a crear un Cursor en Tweepy para poder realizar una búsqueda de información. Este cursor nos devolverá un cursor que contenga todos los tweets extraídos."
],
"metadata": {
"id": "m01IdFrRsRjZ"
}
},
{
"cell_type": "code",
"source": [
"words = ['covid19']\n",
"since = '2020-01-01'\n",
"n_tweets = 100\n",
"\n",
"\n",
"tweets = tweepy.Cursor(extractor.search,\n",
" words, lang='en',\n",
" since_id=since,\n",
" tweet_mode='extended').items(n_tweets)"
],
"metadata": {
"id": "Rj4OhXoPYXMa"
},
"execution_count": null,
"outputs": []
},
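{
"cell_type": "markdown",
"source": [
"The search query is a plain string, so it can be assembled before it is handed to the Cursor. The sketch below is one possible helper, not part of Tweepy: the name `build_query` and its parameters are hypothetical, and it assumes the standard search operators `since:` and `-filter:retweets` are available for your access level.\n",
"\n",
"```python\n",
"def build_query(words, since=None, exclude_retweets=False):\n",
"    # Space-separated keywords: a tweet must contain all of them.\n",
"    query = ' '.join(words)\n",
"\n",
"    # Optional lower bound on the tweet date, formatted as YYYY-MM-DD.\n",
"    if since is not None:\n",
"        query += f' since:{since}'\n",
"\n",
"    # Optionally drop retweets so each original tweet appears once.\n",
"    if exclude_retweets:\n",
"        query += ' -filter:retweets'\n",
"\n",
"    return query\n",
"\n",
"\n",
"print(build_query(['covid19', 'vaccine'], since='2020-01-01', exclude_retweets=True))\n",
"```\n",
"\n",
"Keeping the query construction in one function makes it easier to try different keyword sets without touching the Cursor call."
],
"metadata": {}
},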
{
"cell_type": "markdown",
"source": [
"Podemos explorar un poco la información que extrajimos."
],
"metadata": {
"id": "7l6HDXNBtO5C"
}
},
{
"cell_type": "code",
"source": [
"# TODO\n",
"# Tweet exploration"
],
"metadata": {
"id": "LzuCdXFJtSgU"
},
"execution_count": null,
"outputs": []
},
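{
"cell_type": "markdown",
"source": [
"As a starting point for the exploration, here is a minimal sketch that summarizes a few tweets by author, language and retweet count. It runs on hand-written dictionaries that stand in for `tweet._json`; the sample values and the helper name `summarize` are made up for illustration.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Hand-written stand-ins for `tweet._json`; the values are made up.\n",
"sample_tweets = [\n",
"    {'user': {'screen_name': 'alice'}, 'lang': 'en', 'retweet_count': 3},\n",
"    {'user': {'screen_name': 'bob'}, 'lang': 'en', 'retweet_count': 0},\n",
"    {'user': {'screen_name': 'alice'}, 'lang': 'es', 'retweet_count': 7},\n",
"]\n",
"\n",
"\n",
"def summarize(tweets):\n",
"    # Count tweets per author and per language, and find the most retweeted one.\n",
"    by_author = Counter(t['user']['screen_name'] for t in tweets)\n",
"    by_lang = Counter(t['lang'] for t in tweets)\n",
"    top = max(tweets, key=lambda t: t['retweet_count'])\n",
"    return by_author, by_lang, top\n",
"\n",
"\n",
"by_author, by_lang, top = summarize(sample_tweets)\n",
"print(by_author.most_common(1))\n",
"print(top['user']['screen_name'], top['retweet_count'])\n",
"```\n",
"\n",
"Keep in mind that the object returned by `Cursor(...).items()` is an iterator and is consumed as you loop over it. If you want to explore the real tweets and still use them afterwards, one option is to materialize them first, for example with `tweets = list(tweets)`."
],
"metadata": {}
},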
{
"cell_type": "markdown",
"source": [
"## Construcción de un grafo\n",
"\n",
"Podemos utilizar lo aprendido sobre Networkx para construir un grafo con los datos que extrajimos.\n"
],
"metadata": {
"id": "l9drNgcLtS3M"
}
},
{
"cell_type": "code",
"source": [
"import networkx as nx\n",
"\n",
"\n",
"# Let's create an empty Directed Graph\n",
"G = nx.DiGraph()\n",
"\n",
"# Let's iterate all the tweets and add edges if the tweet include some interactions\n",
"for tweet_raw in tweets:\n",
" tweet = tweet_raw._json\n",
"\n",
" # Find all influencers in the tweet\n",
" tweeter, interactions = get_all_interactions(tweet)\n",
" tweeter_id, tweeter_name = tweeter\n",
" tweet_id = get_user_id(tweet)[0]\n",
" \n",
" # Add an edge to the Graph for each influencer\n",
" for interaction in interactions:\n",
" interact_id, interact_name = interaction\n",
" \n",
" # Add edges between the two user ids\n",
" # This will create new nodes if the nodes are not already in the\n",
" # network we also add an attribute the to edge equal to the id of\n",
" # the tweet\n",
" G.add_edge(tweeter_id, interact_id, tweet_id=tweet_id)\n",
" \n",
" # Add name as a property to each node with networkX each node\n",
" # is a dictionary\n",
" G.nodes[tweeter_id]['name'] = tweeter_name\n",
" G.nodes[interact_id]['name'] = interact_name"
],
"metadata": {
"id": "e2M3Rhx0eBBF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"nx.draw(G)"
],
"metadata": {
"id": "iQggslj9nhJ1"
},
"execution_count": null,
"outputs": []
},
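{
"cell_type": "markdown",
"source": [
"A drawing alone does not say much about who is central in the conversation. As a rough sketch, users can be ranked by how many distinct users point at them (their in-degree). The example below uses a plain set of made-up edges so that it runs without NetworkX or Twitter access; the helper name `rank_by_in_degree` is hypothetical.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Made-up (source, target) pairs standing in for the edges of the graph:\n",
"# an edge a -> b means user a retweeted, replied to, quoted or mentioned b.\n",
"sample_edges = {('alice', 'carol'), ('bob', 'carol'), ('carol', 'alice')}\n",
"\n",
"\n",
"def rank_by_in_degree(edges):\n",
"    # In-degree of a node = number of distinct users pointing at it.\n",
"    in_degree = Counter(target for _, target in set(edges))\n",
"    return in_degree.most_common()\n",
"\n",
"\n",
"print(rank_by_in_degree(sample_edges))\n",
"```\n",
"\n",
"With the directed graph `G`, the same counts should be available through `G.in_degree()`, since a `DiGraph` keeps at most one edge per ordered pair of nodes."
],
"metadata": {}
},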
{
"cell_type": "markdown",
"source": [
"**Este no será el grafo final, pero es un grafo inicial que hemos construido con información extraída directamente desde Twitter.**"
],
"metadata": {
"id": "wZP81-Hbtu8D"
}
}
]
}