Created
August 3, 2022 17:33
Twitter Data Extraction
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "Twitter Data Extraction", | |
"private_outputs": true, | |
"provenance": [], | |
"collapsed_sections": [], | |
"authorship_tag": "ABX9TyO11oVs5PZuJV9a9HjPNEjn", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/RodolfoFerro/8b6958efba0fe4c3b1ed66cc6f892e40/twitter-data-extraction.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Twitter Data Extraction"
], | |
"metadata": { | |
"id": "xspLObyvpqQ7" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Installing dependencies\n",
"\n",
"Since NetworkX is already installed, we only need to install Tweepy, as shown below:"
], | |
"metadata": { | |
"id": "3DbtofhdqIie" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!pip install tweepy -q" | |
], | |
"metadata": { | |
"id": "EP3btNxTXLRr" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Entering credentials\n",
"\n",
"You can get these credentials from your Dashboard in the Twitter Developer Portal."
], | |
"metadata": { | |
"id": "RjsX5WQUqSKP" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "kM3XE6Z3yV8i" | |
}, | |
"outputs": [], | |
"source": [ | |
"API_KEY = ''\n", | |
"API_SECRET = ''\n", | |
"ACCESS_TOKEN = ''\n", | |
"TOKEN_SECRET = ''" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Authentication\n",
"\n",
"\n",
"We need to authenticate our access to Twitter, which we can do with the following function."
], | |
"metadata": { | |
"id": "qFK_wX6Bqd80" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"import tweepy\n",
"\n",
"def twitter_setup():\n",
"    \"\"\"\n",
"    Utility function to set up the Twitter API\n",
"    with the access keys provided above.\n",
"    \"\"\"\n",
"    # Authentication and access using keys:\n",
"    auth = tweepy.OAuthHandler(API_KEY, API_SECRET)\n",
"    auth.set_access_token(ACCESS_TOKEN, TOKEN_SECRET)\n",
"\n",
"    # Return API with authentication:\n",
"    api = tweepy.API(auth)\n",
"    return api"
], | |
"metadata": { | |
"id": "4koZ0_qDXYgo" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### API Testing\n",
"\n",
"We can test the connection to the Twitter API to verify that we have access to the data."
], | |
"metadata": { | |
"id": "7o6M5Zpcqls4" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from pprint import pprint\n",
"\n",
"# We create an extractor object\n",
"extractor = twitter_setup()\n",
"\n",
"# We create a tweet list as follows\n",
"tweets = extractor.user_timeline(screen_name='rodo_ferro', count=50)\n",
"print(f'Number of tweets extracted: {len(tweets)}.\\n')\n",
"\n",
"# We pretty-print the 5 most recent tweets\n",
"print('5 most recent tweets:')\n",
"for tweet in tweets[:5]:\n",
"    pprint(tweet._json)"
], | |
"metadata": { | |
"id": "V_afqLtWX5wg" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"Once we have extracted some tweets, we can explore their contents. The following site can help with this: https://codebeautify.org/jsonviewer\n",
"\n",
"\n",
"> How could we extract information from each of these elements? For example, what if we wanted to know the Twitter username of the person who tweeted?"
], | |
"metadata": { | |
"id": "RlgsC-RkqxB8" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Extracting information to build a graph\n",
"\n",
"Let's look in detail at the functions we will use to extract information."
], | |
"metadata": { | |
"id": "tnRWM_JEsMit" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"def get_user_id(tweet):\n",
"    \"\"\"Returns the id and screen name of the tweet's author.\"\"\"\n",
"\n",
"    user_id = None\n",
"    user_name = None\n",
"    user = tweet['user']\n",
"\n",
"    if user is not None:\n",
"        user_id = user['id']\n",
"        user_name = user['screen_name']\n",
"\n",
"    return (user_id, user_name)\n",
"\n",
"\n",
"def get_retweeted_info(tweet):\n",
"    \"\"\"Returns the id and screen name of the retweeted user.\"\"\"\n",
"\n",
"    retweet = None\n",
"    retweet_count = tweet['retweet_count']\n",
"\n",
"    if retweet_count > 0 and 'retweeted_status' in tweet.keys():\n",
"        retweet = tweet['retweeted_status']\n",
"\n",
"    if retweet is not None:\n",
"        return get_user_id(retweet)\n",
"    else:\n",
"        return (None, None)\n",
"\n",
"\n",
"def get_reply_info(tweet):\n",
"    \"\"\"Returns the id and screen name of the user being replied to.\"\"\"\n",
"\n",
"    reply_id = tweet['in_reply_to_user_id']\n",
"    reply_screen_name = tweet['in_reply_to_screen_name']\n",
"\n",
"    return (reply_id, reply_screen_name)\n",
"\n",
"\n",
"def get_mentions_info(tweet):\n",
"    \"\"\"Returns a list of (id, screen name) for all user mentions.\"\"\"\n",
"\n",
"    mentions = []\n",
"    entities = tweet['entities']\n",
"\n",
"    if entities is not None:\n",
"        user_mentions = entities['user_mentions']\n",
"        for mention in user_mentions:\n",
"            mention_id = mention['id']\n",
"            screen_name = mention['screen_name']\n",
"            mentions.append((mention_id, screen_name))\n",
"\n",
"    return mentions\n",
"\n",
"\n",
"def get_quoted_info(tweet):\n",
"    \"\"\"Returns the id and screen name of the quoted tweet's author.\"\"\"\n",
"\n",
"    if 'quoted_status' in tweet.keys():\n",
"        quoted_status = tweet['quoted_status']\n",
"    else:\n",
"        quoted_status = None\n",
"\n",
"    if quoted_status is not None:\n",
"        return get_user_id(quoted_status)\n",
"    else:\n",
"        return (None, None)\n",
"\n",
"\n",
"def get_all_interactions(tweet):\n",
"    \"\"\"Returns the tweeter and all users they interact with in this tweet.\"\"\"\n",
"\n",
"    # Get the tweeter\n",
"    tweeter = get_user_id(tweet)\n",
"\n",
"    # Nothing to do if we couldn't get the tweeter\n",
"    if tweeter[0] is None:\n",
"        return (None, None), []\n",
"\n",
"    # A Python set is a collection of unique items;\n",
"    # we use a set to avoid duplicated ids\n",
"    interacting_users = set()\n",
"\n",
"    # Add the person they're replying to\n",
"    interacting_users.add(get_reply_info(tweet))\n",
"\n",
"    # Add the person they retweeted\n",
"    interacting_users.add(get_retweeted_info(tweet))\n",
"\n",
"    # Add the person they quoted\n",
"    interacting_users.add(get_quoted_info(tweet))\n",
"\n",
"    # Add mentions\n",
"    interacting_users.update(get_mentions_info(tweet))\n",
"\n",
"    # Remove the tweeter if they are in the set\n",
"    interacting_users.discard(tweeter)\n",
"    # Remove the None case\n",
"    interacting_users.discard((None, None))\n",
"\n",
"    # Return our tweeter and their influencers\n",
"    return tweeter, list(interacting_users)"
], | |
"metadata": { | |
"id": "cItwpXsQcZb0" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Keyword search\n",
"\n",
"Next we create a Tweepy Cursor to run a search. Iterating over the cursor yields the extracted tweets."
], | |
"metadata": { | |
"id": "m01IdFrRsRjZ" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
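"words = ['covid19']\n",
"since = '2020-01-01'\n",
"n_tweets = 100\n",
"\n",
"# Build a single query string. `since:` is a standard search operator that\n",
"# takes a date, whereas the `since_id` parameter expects a tweet ID.\n",
"query = ' '.join(words) + f' since:{since}'\n",
"\n",
"# Tweepy v4 renamed `API.search` to `API.search_tweets`.\n",
"tweets = tweepy.Cursor(extractor.search_tweets,\n",
"                       q=query, lang='en',\n",
"                       tweet_mode='extended').items(n_tweets)"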
], | |
"metadata": { | |
"id": "Rj4OhXoPYXMa" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"We can explore the extracted data a bit."
], | |
"metadata": { | |
"id": "7l6HDXNBtO5C" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# TODO\n", | |
"# Tweet exploration" | |
], | |
"metadata": { | |
"id": "LzuCdXFJtSgU" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Building a graph\n",
"\n",
"We can use what we learned about NetworkX to build a graph from the data we extracted.\n"
], | |
"metadata": { | |
"id": "l9drNgcLtS3M" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"import networkx as nx\n",
"\n",
"\n",
"# Let's create an empty directed graph\n",
"G = nx.DiGraph()\n",
"\n",
"# Iterate over all the tweets and add edges if the tweet includes interactions\n",
"for tweet_raw in tweets:\n",
"    tweet = tweet_raw._json\n",
"\n",
"    # Find all influencers in the tweet\n",
"    tweeter, interactions = get_all_interactions(tweet)\n",
"    tweeter_id, tweeter_name = tweeter\n",
"    # Use the tweet's own id (not the author's user id) as the edge attribute\n",
"    tweet_id = tweet['id']\n",
"\n",
"    # Add an edge to the graph for each influencer\n",
"    for interaction in interactions:\n",
"        interact_id, interact_name = interaction\n",
"\n",
"        # Add an edge between the two user ids. This creates new nodes\n",
"        # if they are not already in the network. We also add an\n",
"        # attribute to the edge equal to the id of the tweet\n",
"        G.add_edge(tweeter_id, interact_id, tweet_id=tweet_id)\n",
"\n",
"        # Add the name as a property of each node; in NetworkX each\n",
"        # node's attributes are stored in a dictionary\n",
"        G.nodes[tweeter_id]['name'] = tweeter_name\n",
"        G.nodes[interact_id]['name'] = interact_name"
], | |
"metadata": { | |
"id": "e2M3Rhx0eBBF" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"nx.draw(G)" | |
], | |
"metadata": { | |
"id": "iQggslj9nhJ1" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"**This is not the final graph, but it is an initial graph built from data extracted directly from Twitter.**"
], | |
"metadata": { | |
"id": "wZP81-Hbtu8D" | |
} | |
} | |
] | |
} |