{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Twitter Analysis with Python (I)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this post we will collect tweets using the python-twitter library, process them, load them into a pandas DataFrame and then run sentiment analysis with one of the many documented models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will split the work in two: one part devoted to collecting the tweets and saving them to a file, and another devoted to the subsequent analysis.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Collecting tweets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many options for this task. On other occasions I have used `tweepy`, but this time I preferred python-twitter, for comparison."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plan for this module is:\n",
"* Configure the collector\n",
"* Collect the tweets, first exploring what kind of information and searches the `api` offers\n",
"* Store them in a file for later processing\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Configuring the collector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by importing the libraries we will need for this task:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import json"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import twitter"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start with the personal credentials needed to configure the collector correctly. Remember that these keys are obtained directly from http://twitter.com. I am making a note to write a future post explaining how (if anyone needs help in the meantime, please do not hesitate to ask). And do not forget that these keys are personal and must be guarded carefully: in the wrong hands they allow unrestricted use of your Twitter account, with everything that implies."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"consumer_key = 'your_consumer_key'\n",
"consumer_secret = 'your_consumer_secret'\n",
"access_token = 'your_access_token'\n",
"access_secret = 'your_access_secret'"
]
},
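{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than hard-coding the keys in the notebook, a common practice is to read them from environment variables, so they never end up in a file you might share. A minimal sketch (the variable names `TW_CONSUMER_KEY` and so on are my own choice, not anything required by the library):\n",
"\n",
"```python\n",
"import os\n",
"\n",
"# Read each credential from the environment; a KeyError means one is missing\n",
"consumer_key = os.environ['TW_CONSUMER_KEY']\n",
"consumer_secret = os.environ['TW_CONSUMER_SECRET']\n",
"access_token = os.environ['TW_ACCESS_TOKEN']\n",
"access_secret = os.environ['TW_ACCESS_SECRET']\n",
"```"
]
},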
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We initialize the API with the keys we obtained."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"api = twitter.Api(consumer_key=consumer_key,\n",
"                  consumer_secret=consumer_secret,\n",
"                  access_token_key=access_token,\n",
"                  access_token_secret=access_secret)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 Checks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check that the `twitter.Api` class was initialized correctly by calling the `VerifyCredentials()` function:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"User(ID=13641472, ScreenName=walyt)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"api.VerifyCredentials()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Mon Feb 18 20:57:28 +0000 2008'"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"api.VerifyCredentials().created_at"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Un poco de TICs, Telecom Market, Media, Data Science y más..Opinions are my own.'"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"api.VerifyCredentials().description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this function we can retrieve the latest tweets sent from my account:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"status = api.GetUserTimeline(screen_name='walyt')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...accessing the most recent tweet"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Status(ID=1036372948662276103, ScreenName=walyt, Created=Sun Sep 02 21:59:00 +0000 2018, Text='Interactive: The Top Programming Languages 2018 https://t.co/OsZqAPPPHR')"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"status[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and to the text of that particular tweet:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Interactive: The Top Programming Languages 2018 https://t.co/OsZqAPPPHR'"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"status[0].text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 Collecting tweets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at a first tweet-collection method, `api.GetSearch`. With `count=10` we indicate that we want to retrieve ten tweets."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"search = api.GetSearch('#Almendralejo', count=10)"
]
},
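{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, `GetSearch` accepts more filters than just the term and the count; the sketch below narrows the search to recent Spanish-language tweets. Treat the parameter names as assumptions to verify against the python-twitter documentation for your installed version:\n",
"\n",
"```python\n",
"# Hypothetical refinement of the search above; parameter names assumed from python-twitter docs\n",
"search_es = api.GetSearch('#Almendralejo', count=10, lang='es', result_type='recent')\n",
"```"
]
},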
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...and let's print three fields, as a sample, from each of those ten tweets:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RV Ribera Guadiana No te lo p Wed Sep 05 18:35:34 +0000 2018\n",
"RV Ribera Guadiana !No te lo Wed Sep 05 18:32:19 +0000 2018\n",
"Antonio Márquez Lay Acaba de p Wed Sep 05 16:51:55 +0000 2018\n",
"Bowlerhat Necesitas Wed Sep 05 13:16:38 +0000 2018\n",
"Víctor. ¡Muchísima Tue Sep 04 13:52:07 +0000 2018\n",
"RV Ribera Guadiana Por fin ll Tue Sep 04 07:46:11 +0000 2018\n",
"Estanco 3 Avenida Tenemos to Mon Sep 03 18:29:41 +0000 2018\n",
"★☆Laura ∞ Fitness☆★ Un poco de Mon Sep 03 14:43:32 +0000 2018\n",
"paco bernal https://t. Sun Sep 02 22:44:21 +0000 2018\n",
"Aprendiz A llenarte Sun Sep 02 12:26:41 +0000 2018\n"
]
}
],
"source": [
"for s in search:\n",
"    print(s.user.name, s.text[:10], s.created_at)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The search can also be made by location coordinates. In this case we will use the coordinates of [Almendralejo](https://es.wikipedia.org/wiki/Almendralejo)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"search = api.GetSearch(geocode=[38.6831, -6.4075, '1km'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check that these tweets are indeed geolocated in that town:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'coordinates': [38.6833, -6.4], 'type': 'Point'}"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"search[0].geo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And, just as in the previous case, check what information they contain:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RV Ribera Guadiana No te lo p Wed Sep 05 18:35:34 +0000 2018\n",
"RV Ribera Guadiana !No te lo Wed Sep 05 18:32:19 +0000 2018\n",
"Antonio Márquez Lay Acaba de p Wed Sep 05 16:51:55 +0000 2018\n",
"Bowlerhat Necesitas Wed Sep 05 13:16:38 +0000 2018\n",
"Víctor. ¡Muchísima Tue Sep 04 13:52:07 +0000 2018\n",
"RV Ribera Guadiana Por fin ll Tue Sep 04 07:46:11 +0000 2018\n",
"Estanco 3 Avenida Tenemos to Mon Sep 03 18:29:41 +0000 2018\n",
"★☆Laura ∞ Fitness☆★ Un poco de Mon Sep 03 14:43:32 +0000 2018\n",
"paco bernal https://t. Sun Sep 02 22:44:21 +0000 2018\n",
"Aprendiz A llenarte Sun Sep 02 12:26:41 +0000 2018\n"
]
}
],
"source": [
"for s in search:\n",
"    print(s.user.name, s.text[:10], s.created_at)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Now let's look at the `GetStreamFilter` method, which returns a python [*generator*](https://wiki.python.org/moin/Generators). For the work we will explain in the second part we are going to collect tweets with the hashtag *#HalaMadrid*. I started this script at 13:00 on Saturday, September 2 and let it run for 11 hours (hence `minutes=660`). That day Real Madrid played Leganés at 20:45:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import datetime\n",
"\n",
"stream = api.GetStreamFilter(None, ['#HalaMadrid'])\n",
"#stream = api.GetStreamFilter(locations=['-6.44,38.65','-6.36,38.72'])\n",
"cont = 0\n",
"lista_tweet = []\n",
"limite = datetime.datetime.now() + datetime.timedelta(minutes=660)\n",
"for tweet in stream:\n",
"    lista_tweet.append(tweet)\n",
"    # This print acts as a heartbeat: it shows the code is still running and tweets are being collected\n",
"    #print(cont)\n",
"    cont += 1\n",
"    if datetime.datetime.now() > limite:\n",
"        stream.close()\n",
"        break"
]
},
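{
"cell_type": "markdown",
"metadata": {},
"source": [
"A side note: since the collector runs for eleven hours, a crash near the end would lose everything. A defensive variant (my own addition, not part of the original run) dumps the list to disk every so many tweets:\n",
"\n",
"```python\n",
"import json\n",
"\n",
"CHECKPOINT_EVERY = 1000  # hypothetical value; tune to your tweet volume\n",
"\n",
"def checkpoint(tweets, path='tweets_checkpoint.json'):\n",
"    # Overwrite the checkpoint file with everything collected so far\n",
"    with open(path, 'w') as f:\n",
"        json.dump(tweets, f)\n",
"\n",
"# Inside the collection loop, right after lista_tweet.append(tweet):\n",
"# if cont % CHECKPOINT_EVERY == 0:\n",
"#     checkpoint(lista_tweet)\n",
"```"
]
},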
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Eleven hours later we take that list, `lista_tweet`, convert it into a pandas DataFrame (I feel very comfortable with this data structure, and I already have some experience working with it) and finally save it as a csv file on disk, so that we can load it back later and work with it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To do this we create a function, `from_list_2_pandas`, which basically iterates over `tweet_list`; each element of the list is a `dict`, and using the pandas `from_dict` method it builds a new row for each tweet. I had to call `from_dict` with `orient='index'` and then transpose, because with the row orientation I could not get a correct import. The function returns a DataFrame, `pt`, which we can keep working with more comfortably."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def from_list_2_pandas(tweet_list):\n",
"    pt = pd.DataFrame()\n",
"    for i in range(len(tweet_list)):\n",
"        pt_aux = pd.DataFrame.from_dict(tweet_list[i], orient='index')\n",
"        pt_aux = pt_aux.T\n",
"        pt = pd.concat([pt, pt_aux], ignore_index=True)\n",
"    pt.set_index('id', inplace=True)\n",
"\n",
"    return pt"
]
},
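{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, since each element of the list is a `dict`, pandas can usually build the whole frame in one step. This sketch (my own alternative, not the function used in the original run) should be equivalent for simple cases, although nested fields may come out differently from the loop above:\n",
"\n",
"```python\n",
"def from_list_2_pandas_alt(tweet_list):\n",
"    # pandas treats a list of dicts as one row per dict\n",
"    pt = pd.DataFrame(tweet_list)\n",
"    pt.set_index('id', inplace=True)\n",
"    return pt\n",
"```"
]
},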
{
"cell_type": "code",
"execution_count": 171,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pt = from_list_2_pandas(lista_tweet)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pt.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4 Saving the DataFrame to a file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally we save the DataFrame to a csv file using the `to_csv` method"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pt.to_csv('twitter_csv_halamadrid_sinsent')"
]
},
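{
"cell_type": "markdown",
"metadata": {},
"source": [
"To load it back later (as we will do in the second part), `read_csv` with `index_col='id'` should restore the same structure; note that columns that held dicts or lists come back as plain strings and would need re-parsing:\n",
"\n",
"```python\n",
"pt2 = pd.read_csv('twitter_csv_halamadrid_sinsent', index_col='id')\n",
"```"
]
},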
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are done with this part. We imported the library, obtained the private keys to activate the *api* correctly, downloaded tweets, converted them into a pandas DataFrame and finally saved it to a file so we can work on it later."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
} |