A script to download all of a user's tweets into a csv
#!/usr/bin/env python
# encoding: utf-8

import tweepy #https://github.com/tweepy/tweepy
import csv

#Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

def get_all_tweets(screen_name):
    #Twitter only allows access to a users most recent 3240 tweets with this method

    #authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    #initialize a list to hold all the tweepy Tweets
    alltweets = []

    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name = screen_name,count=200)

    #save most recent tweets
    alltweets.extend(new_tweets)

    #save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    #keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print "getting tweets before %s" % (oldest)

        #all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)

        #save most recent tweets
        alltweets.extend(new_tweets)

        #update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        print "...%s tweets downloaded so far" % (len(alltweets))

    #transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8")] for tweet in alltweets]

    #write the csv
    with open('%s_tweets.csv' % screen_name, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(["id","created_at","text"])
        writer.writerows(outtweets)

    pass

if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("J_tsar")

Thanks for posting this script! Just a heads-up on a minor typo in line 36: "gefore" instead of "before"

https://gist.github.com/yanofsky/5436496#file-tweet_dumper-py-L36

markwk commented Sep 24, 2013

Works great. I'm wondering how I'd do this to get the next 3200 after the initial pull.

danriz commented Oct 17, 2013

I am getting error on windows:

C:>C:\Python26\python.exe C:\Python26\tweet_dumper.py
File "C:\Python26\tweet_dumper.py", line 17
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
^
IndentationError: expected an indented block

C:>C:\Python275\python.exe C:\Python26\tweet_dumper.py
File "C:\Python26\tweet_dumper.py", line 17
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
^
IndentationError: expected an indented block

Owner

yanofsky commented Nov 1, 2013

@greglinch thanks, fixed!
@markwk to my understanding there is no way to get these without using a 3rd party or asking the user to download their history
@riznad hard to say what's going on there, is it possible an extra space got inserted on that line? There should only be one tab on that line.

Kaorw commented Dec 28, 2013

Thanks for great code!
I've modified it a bit to grab the timeline and save it in Excel format ("xls") using xlswriter.

https://gist.github.com/Kaorw/7594044

Thanks again

jdkram commented Dec 28, 2013

Thanks for the code.

I switched up the final line (after importing sys) to feed in usernames from shell:

get_all_tweets(sys.argv[1])
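Spelled out, that change is just this (a sketch of the end of the script; run it as python tweet_dumper.py some_username):

import sys

if __name__ == '__main__':
    #pass the username in as the first shell argument
    get_all_tweets(sys.argv[1])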

hub2git commented Apr 2, 2014

Dear all, I downloaded the py file. I'm running Linux Mint. In terminal, I did:
python tweet_dumper.py

but I got this:
Traceback (most recent call last):
File "tweet_dumper.py", line 4, in
import tweepy #https://github.com/tweepy/tweepy
ImportError: No module named tweepy

What am I doing wrong? What must I do?

By the way, I've created a twitter API for myself. In the tweet_dumper.py file, I've entered my 4 Twitter API credentials. And in the last line of the .py file, I've put in the username whose tweets I want to download.

Should I download the zip file from https://github.com/tweepy/tweepy? I'm so lost, but I want to learn.


UPDATE:
I did
sudo apt-get install python-pip
then
sudo pip install tweepy
.

Then I ran python tweet_dumper.py again. Now I see a csv file! Thanks!!!

Fantastic! Thanks!

This worked great! Thanks for this! Had to get pip and tweepy installed, but it worked out great. Also, note that if the targeted user's twitter account is protected, the account used to authorize the api calls must be following the targeted user.

I tried executing the program. There is no error reported.

But no .csv file is created. Please help me out.

UPDATE : 1

Later it worked.

UPDATE : 2

But now all of a sudden my program shows me the error below, so I repeated all the steps stated by hub2git. Still it's not working. Please do help me trace it out.

lifna@lifna-Inspiron-N5050:~$ python
Python 2.7.3 (default, Feb 27 2014, 20:00:17)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import tweepy
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named tweepy
>>> exit()

i tried executing it using editrocket[http://editrocket.com/download_win.html]
got following error
File "tweet_dumper.py", line 35
print "getting tweets before %s" % (oldest)
^
SyntaxError: invalid syntax

hub2git commented Nov 11, 2014

Thanks to this script, I successfully downloaded a user's most recent 3240 tweets.

Line 15 of the script says:
#Twitter only allows access to a users most recent 3240 tweets with this method

Does anybody know how to download tweets that are older than the 3240th tweet?

I am getting the below, what am I doing wrong? Thanks

File "tweet_dumper.py", line 27, in get_all_tweets
new_tweets = api.user_timeline(screen_name = screen_name,count=200)
File "C:\Python27\lib\site-packages\tweepy-2.3.0-py2.7.egg\tweepy\binder.py", line 230, in _call
return method.execute()
File "C:\Python27\lib\site-packages\tweepy-2.3.0-py2.7.egg\tweepy\binder.py", line 203, in execute
raise TweepError(error_msg, resp)
TweepError: [{u'message': u'Bad Authentication data', u'code': 215}]

yosun commented Feb 1, 2015

This seems to only work for tweets from the past year? (For users with more than 3200 tweets)

Is there any way we can get more than 3200 tweets? I want all the tweets of a particular user.

Sweet!

I have modified it to get tweets with images and store them to csv:
id, tweet text, image url

Just in case anyone else needs it as well:
https://gist.github.com/freimanas/39f3ad9a5f0249c0dc64

Works great, but I have a question. How do I get only the statuses, and not replies or retweets, from a user? Is there any way?
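For what it's worth, the underlying user_timeline endpoint takes exclude_replies and include_rts parameters, and tweepy passes them through. A sketch of the modified request (you may need to fetch more pages, since the filtering happens before the count is applied):

#drop replies and native retweets at the API level
new_tweets = api.user_timeline(screen_name=screen_name, count=200,
                               exclude_replies=True, include_rts=False)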

Hi, I'm using python3.4 and tweepy 3.3.0
I'm getting the following error:

File "dump_tweets.py", line 56, in get_all_tweets
writer.writerows(outtweets)
TypeError: 'str' does not support the buffer interface

This error is also thrown for line 55, but I commented it out in an attempt to debug.

I've tried to just include the text of the tweet which is encoded to utf-8 on line 50, but this still throws the same error.

Does anyone have any hints/suggestions?

EDIT: This appears to only occur on Windows. When running the script from an Ubuntu install it works.

Thanks for posting the script in the first place - a good way to start tweaking with this library. After playing around with it a bit, it seems like the updated versions of the library solve both the "cope with # of requests/window" and the "don't get busted by the error" problems:

  • a wait_on_rate_limit parameter for the api to have it deal with the server
  • use of Cursor to avoid all the # of requests/window reckoning

Just in case somebody needs as well, I did a small implementation following the new features here: https://gist.github.com/MihaiTabara/631ecb98f93046a9a454
(mention: I store the tweets in a MongoDB database instead of csv files)
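For anyone who wants to see those two features in one place, here is a minimal sketch of the same download loop using them (assuming tweepy 3.x and the credential variables from the script above; the screen name is just an example):

import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)

#wait_on_rate_limit tells tweepy to sleep through rate-limit windows itself
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

#Cursor does the max_id bookkeeping that the while loop does by hand
alltweets = []
for tweet in tweepy.Cursor(api.user_timeline, screen_name="J_tsar", count=200).items():
    alltweets.append(tweet)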

Din1993 commented Sep 11, 2015

I am trying to do this for multiple users by including a for loop. Do you know how to have it also print either their name or their screenname? Thanks!

@Purptart

Just change this line:
with open('%s_tweets.csv' % screen_name, 'wb', encoding='utf-8') as f:
to
with open('%s_tweets.csv' % screen_name, 'w', encoding='utf-8') as f:

The b stands for binary, and Python 3.x versions have changed many things. It works fine for me.
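One more Python 3 detail that can bite on Windows: the csv module's docs recommend opening the file with newline='' so the writer doesn't emit blank rows between records. A sketch of the whole write step in Python 3:

with open('%s_tweets.csv' % screen_name, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "text"])
    writer.writerows(outtweets)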

@Din1993

We can get the screen_name, user name, and other information as well. Show me how you are trying to do it for multiple users (code snippet).

Thanks

Hi guys, I'm using Python 2.7 and the script works fine. I just have a problem with the CSV. Is there a way to ignore \n in the tweets retrieved? A newline causes the text to span into a new column, so in Excel or OpenRefine it's almost impossible to manually edit all the cells in the "id" column.
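A minimal fix, assuming the Python 2 script above, is to strip newlines and carriage returns while building outtweets, so each tweet stays on one CSV row:

#replace line breaks with spaces before writing the csv
outtweets = [[tweet.id_str, tweet.created_at,
              tweet.text.replace("\n", " ").replace("\r", " ").encode("utf-8")]
             for tweet in alltweets]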

Din1993 commented Oct 6, 2015

@Sourabh87 thanks for the offer! i ended up figuring it out by just using tweet.user.screen_name. Super easy. Now, I am working on migrating the code from python 2 to python 3.4. Has anyone else done this yet on windows?

Hey guys!

I'm using this to pull tweets for list of users. But I'm running into an error every so often. I think it might have to do with the amount of queries you can make to the Twitter API but I'm not sure. Here's the error below, please help.

File "twitterAPI.py", line 118, in
get_all_tweets(user)
File "twitterAPI.py", line 73, in get_all_tweets
new_tweets = api.user_timeline(screen_name = screen_name,count=200)
File "//anaconda/lib/python2.7/site-packages/tweepy/binder.py", line 239, in _call
return method.execute()
File "//anaconda/lib/python2.7/site-packages/tweepy/binder.py", line 223, in execute
raise TweepError(error_msg, resp)
tweepy.error.TweepError: Not authorized.

thanks for the code @yanofsky
I have modified your code: I am using pandas to store the downloaded tweets in a CSV, along with some additional information. I also have another script that uses the CSV created by your code to download the latest tweets from a user's timeline.
Here is my github link:
https://github.com/suraj-deshmukh/get_tweets

I am also working on Cassandra-Python integration to store all tweets in a Cassandra database instead of a CSV file.

Thanks for this @yanofsky - it's awesome code. I'm trying to rework it so I can drop the data into a MySQL table. I'm running into some issues and wondering if you can take a look at the snippet of my code to see if I'm doing anything obviously wrong? Much appreciated.

def get_all_tweets(screen_name):
    #Twitter only allows access to a users most recent 3240 tweets with this method

    #authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    #initialize a list to hold all the tweepy Tweets
    alltweets = []

    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name = screen_name,count=1)

    #save most recent tweets
    alltweets.extend(new_tweets)

    #save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    #keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
            print "getting tweets before %s" % (oldest)

            #all subsequent requests use the max_id param to prevent duplicates
            new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)

            #save most recent tweets
            alltweets.extend(new_tweets)

            #update the id of the oldest tweet less one
            oldest = alltweets[-1].id - 1

            print "...%s tweets downloaded so far" % (len(alltweets))

    return alltweets

def store_tweets(alltweets)

    #MySQL initialization
    connection = (host = "",
        user="",
        passwd="",
        db="")
    cursor = connection.cursor()

    for tweet in alltweets:
        cursor.execute("INSERT INTO twittest (venue, tweet_id, text time, retweet, liked) VALUES 
        (user['screen_name'],tweet['id_str'], tweet['created_at'], tweet['text'], tweet['retweet_count'], tweet['favorite_count'])")
    cursor.close()
    connection.commit()

if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KidsAquarium")

Thanks!
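Not a full answer, but a sketch of what that storage step could look like with the MySQLdb driver and a parameterized INSERT (the table name and columns here just mirror the snippet above; note that tweepy statuses expose attributes like tweet.id_str rather than dict keys):

import MySQLdb

def store_tweets(alltweets, screen_name):
    #fill in your own connection details
    connection = MySQLdb.connect(host="", user="", passwd="", db="", charset="utf8")
    cursor = connection.cursor()

    for tweet in alltweets:
        #parameterized query, so quotes and newlines in the tweet can't break the SQL
        cursor.execute(
            "INSERT INTO twittest (venue, tweet_id, text, time, retweet, liked) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (screen_name, tweet.id_str, tweet.text, tweet.created_at,
             tweet.retweet_count, tweet.favorite_count))

    connection.commit()
    cursor.close()
    connection.close()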

Hi..

Is there a way to download public tweets by 'key word search' older than a week or month? I'm able to download public tweets from the current week only and not beyond that. Any suggestions appreciated.. Thanks

alce370 commented Feb 10, 2016

This one still works with Python 3.5. I just added the () to each print call, applied the fix from Sourabh87's comment (18 Sep 2015), and it works fine.

Is there no way we can crawl ALL the tweets of a particular user, and not just the 3200 most recent ones?

Any news on whether we can download all tweets?

DavidNester commented May 9, 2016

I am somewhat inexperienced with this. Do the 4 lines with the API credentials need to be filled in? If so, where do we get the credentials?


UPDATE:
I figured out my first issue but then ran into this issue when running it:

tweepy.error.TweepError: [{u'message': u'Bad Authentication data.', u'code': 215}]

I only changed the username that was being looked up from the original code

jdchatham commented May 10, 2016

Am also having the same problem as @DavidNester. Any updates?

UPDATE:
This actually worked for me/showed me how to get the credentials if you're still looking @DavidNester
http://www.getlaura.com/how-to-download-tweets-from-the-twitter-api/

I have the credentials now. I tried running with the other script and I still got the same error.

Just an FYI for people trying to utilize this in Sublime (and you happen to be using Anaconda on a windows machine), you need to run python -m pip install tweepy while in the proper directory that Sublime expects it to be installed in; pip install tweepy alone may not work. Some people who run the code and think they installed tweepy may get an error saying otherwise.

This truly is a glorious script yanofsky! I plan on playing around with it for the next few days for a stylometry project, and thanks to you getting the raw data desired is no longer an issue!

gerardtoko commented Jun 15, 2016

Another trick: a recursive function

#!/usr/bin/env python
# encoding: utf-8

import tweepy #https://github.com/tweepy/tweepy
from time import sleep

#Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

def get_all_tweets(screen_name, alltweets=[], max_id=0):
    #Twitter only allows access to a users most recent 3240 tweets with this method
    #authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    #make initial request for most recent tweets (200 is the maximum allowed count)
    if max_id == 0:
        new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    else:
        # new new_tweets
        new_tweets = api.user_timeline(screen_name=screen_name, count= 200, max_id=max_id)

    if len(new_tweets) > 0:
        #save most recent tweets
        alltweets.extend(new_tweets)
        # security
        sleep(2)
        #update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
        return get_all_tweets(screen_name=screen_name, alltweets=alltweets, max_id=oldest)

    #final tweets
    return alltweets
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("J_tsar", [], 0)

gowrik73 commented Jul 5, 2016

is it possible to extract protected tweets?

Hi, I just tried the code and it is working well. P.S. I am using Python 3.5.

I am thinking of getting multiple users' tweets at the same time; in other words, passing multiple usernames at once. Any thoughts on how I should set that up?

Thanks.

Thanks. It returns the shortened URLs. If you need the original URLs, you can use this snippet (Python 3):

from urllib.request import urlopen

def redirect(url):
    page = urlopen(url)
    return page.geturl()

In Python 2: from urllib import urlopen.

I'm getting this error when I enter someone's username. It works fine for others, but not this username.

tweepy.error.TweepError: [{'code': 34, 'message': 'Sorry, that page does not exist.'}]

Hunterjet commented Aug 9, 2016

For some reason, calling user_timeline directly is much more inefficient than doing it with a cursor, like so:

cursor = Cursor(self.client.user_timeline,
                id=user_id,
                count=200,
                max_id=max_id).pages(MAX_TIMELINE_PAGES)
for page in cursor:
    logging.info('Obtained ' + str(i) + ' tweet pages for user ' + str(user_id) + '.')
    i += 1
    for tweet in page:
        if not hasattr(tweet, 'retweeted_status'):
            tweets.append(tweet)
    max_id = page[-1].id - 1

I've seen speed gains of over 150 seconds for users with a greater amount of posted tweets than the maximum retrievable. The error handling is a bit trickier but doable thanks to the max ID parameter (just stick the stuff I posted into a try/except and put that into a while (1) and the cursor will refresh with each error). Try it out!

BTW, MAX_TIMELINE_PAGES theoretically goes up to 16 but I've seen it go to 17.

I am getting a syntax error at: print "getting tweets before %s" % (oldest)
Not sure what is wrong. Request your help.

What if I wanted to save it into a database? Would I then need to extract the data from the CSV?

adixxov commented Aug 22, 2016

Thank you for this code. It worked as expected to pull a given user's tweets.

However, I have a side problem with retrieving the tweets after saving them to a json file. I saved the list of "alltweets" in a json file using the following. Note that without "repr", I wasn't able to dump the alltweets list into the json file.

with open('file.json', 'a') as f: json.dump(repr(alltweets), f)

Attached is a sample json file containing the dump. Now, I need to access the text in each tweet, but I'm not sure how to deal with "Status".

I tried to iterate over the lines in the file, but the file is being seen as a single line.

with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)

I also tried to iterate over statuses after reading the json file as a string, but iteration rather takes place on the individual characters in the json file.

with open(fname, 'r') as f:
    x = f.read()
    for status in x:
        code

Appreciate any help...
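In case it helps: tweepy keeps the raw API payload for each status on a ._json attribute, so writing one line of JSON per tweet round-trips cleanly without repr. A sketch:

import json

#write one raw tweet dict per line
with open('file.json', 'w') as f:
    for tweet in alltweets:
        f.write(json.dumps(tweet._json) + '\n')

#read them back and pull out the text
with open('file.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        print(tweet['text'])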

Troasi commented Aug 22, 2016

I get an error in Python 3.x, as the buffer does not support string. Help me to encode it.

dev-luis commented Sep 10, 2016

@santoshbs That's because the script was written for an older version of Python. The new syntax is: print(Your statements).

For the people that have problems running this script, I posted an alternate way to download the tweets using the new syntax on my website: http://luis-programming.com/blog/download_tweets/

I also added an example on how to analyze tweets that are not written using "Latin characters." If you're interested, you can also download the script on my website: http://luis-programming.com/blog/kanji_prj_twitter/

owlcatz commented Sep 24, 2016

I read all the comments, but have not tried it yet... So... Assuming I had a user (not me or anyone I know personally) that has roughly 15.5k tweets, is there any way to get just the FIRST few thousand and not the last? Thanks! 👍

cbaysan commented Sep 24, 2016

Has anyone figured out how to grab the "retweeted_status.text" if the retweeted_status is "True"? It seems that one needs to specify: "api.user_timeline(screen_name = screen_name,count=200,include_rts=True)"
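One approach that seems to work, since retweets carry the original status on them: check for the attribute per tweet (Hunterjet's comment above uses the same hasattr test):

for tweet in alltweets:
    if hasattr(tweet, 'retweeted_status'):
        #full text of the tweet that was retweeted
        text = tweet.retweeted_status.text
    else:
        text = tweet.text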

dhaikney commented Nov 2, 2016

@yanofsky Found this very useful, thank you!

I found this article, which says a request rate of more than 2.5 times the access-token rate can be achieved. I haven't personally tested this.
Hope it is found useful.

http://www.karambelkar.info/2015/01/how-to-use-twitters-search-rest-api-most-effectively./

ShupingZhang commented Nov 6, 2016

I run the code but it only downloaded 6 tweets (sometimes 206) instead of 3240. Does anyone know the reason? Thanks a lot!

get_all_tweets("City of Toronto")
getting tweets before 616320501871452159
...6 tweets downloaded so far

I'm using Python 2.7.12 Shell.

def get_all_tweets(screen_name):
    alltweets = []
    new_tweets = api.user_timeline(screen_name = screen_name,count=200)
    alltweets.extend(new_tweets)
    oldest = alltweets[-1].id - 1
    while len(new_tweets) > 0:
        print "getting tweets before %s" % (oldest)
        new_tweets = api.user_timeline(screen_namem = screen_name,count=200,max_id=oldest)
        alltweets.extend(new_tweets)
        oldest = alltweets[-1].id - 1
        print "...%s tweets downloaded so far" % (len(alltweets))
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8")] for tweet in alltweets]
    with open('%s_tweets.csv' % screen_name, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(["id","created_at","text"])
        writer.writerows(outtweets)
    pass

I'm trying to run the code, however I keep getting the following error:

Traceback (most recent call last):
File "/Users/Brian/Desktop/get_tweets.py", line 60, in
get_all_tweets("realDonaldTrump")
File "/Users/Brian/Desktop/get_tweets.py", line 52, in get_all_tweets
writer.writerow(["id","created_at","text"])
TypeError: a bytes-like object is required, not 'str'

Anyone know what this could be?

@brianhalperin

I received the same error. Try changing line 53.

Change line 53 from this:
with open('%s_tweets.csv' % screen_name, 'wb') as f:

to this:
with open('%s_tweets.csv' % screen_name, 'w') as f:

Pretty much just drop the 'b'. Let me know if it works for you.

Thank you for posting this! May I ask how you found out that each "tweet" has information like "id_str", "location", etc.? I used dir() to look at it, but "location" is not included, so I was a bit confused.
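If it helps: dir() on a Status object won't show the underlying API fields; they live in the raw dict tweepy stores on each tweet, and "location" in particular sits on the nested user object. A sketch, assuming alltweets from the script above:

#every field the API returned for the first tweet
print(alltweets[0]._json.keys())

#location is a property of the user, not the tweet
print(alltweets[0]._json['user']['location'])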

Hello, I get this error, what can it be? http://i.imgur.com/lDRA7uX.png

Siddhant08 commented Jan 25, 2017

@yanofsky The code runs without errors but I can't seem to find the csv file. Where is it created?

+1 Thanks for this script

Thanks for sharing!

Can we download more than 3240 tweets?

Deepak- commented Feb 27, 2017

Thanks for the script! I do wish there was a way to circumvent the 3240 limit.

buddiex commented Feb 27, 2017

@deepak same here... having that issue now ... trying to collect historical tweets for a data warehouse project.

adam-fg commented Feb 28, 2017

Hi everyone,

I'm after some help - I'm trying to complete some research on Twitter Data for my MSc and this might work, but I have no idea how to use Python.

Would anybody be willing to run this code for me for 3 companies and if this works for hashtags, 3 more hashtags?

Fingers crossed!
Adam

Thanks! I was just looking to grab a single tweet from one user's timeline and this was the best example of how to do that.

xdslx commented Mar 29, 2017

Is there a way to grab tweets in other languages, which use different language codes? This code only gets proper tweets in English. How do I change the lang code, in short?

How can I collect tweets in Roman Urdu? Using Python as well as Java I am able to get standard English tweets, but I want to collect Roman Urdu tweets for sentiment analysis. Please, anyone?

lnvrl commented Apr 15, 2017

I am having the same problem as @ilemtheme: on line 36 it says syntax error: invalid syntax

print "getting tweets before %s" % (oldest)

Hello, can anyone help me out with getting tweets for multiple users? I tried forming a list of users and passing it at the end like this: for item in list: get_all_tweets("list").
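If I'm reading that right, the loop should pass each element rather than the literal string "list". A sketch with made-up usernames:

usernames = ["user_one", "user_two", "user_three"]

#call the downloader once per username
for name in usernames:
    get_all_tweets(name)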

The tweets I need to download are in a non-English language, and when I open the output file it shows funny stuff!!
Any clues?

thanks

@lnvrl are you using Python 3.x? There is a chance that this could be the issue. The syntax for print changed with 3.x; now if you want to print something you have to call print as a function:
print ("getting tweets before %s" % (oldest))

It's giving the most recent 3200 tweets. So what is the way to get tweets older than that? Please post or let me know at my email: kumarkondi@gmail.com

Great code!

I edited it for Python 3.x. Also, I removed the URLs and the RTs from the user.

def get_all_tweets(screen_name):
    """Download the last 3240 tweets from a user. Do text processing to remove URLs and the retweets from a user.
    Adapted from https://gist.github.com/yanofsky/5436496"""
    #Twitter only allows access to a users most recent 3240 tweets with this method

    #authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(credentials['twitter']['consumer_key'], credentials['twitter']['consumer_secret'],)
    auth.set_access_token(credentials['twitter']['token'], credentials['twitter']['token_secret'])
    api = tweepy.API(auth)

    #initialize a list to hold all the tweepy Tweets
    alltweets = []

    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name = screen_name,count=200)

    #save most recent tweets
    alltweets.extend(new_tweets)

    #save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    #keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print ("getting tweets before %s" % (oldest))

        #all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)

        #save most recent tweets
        alltweets.extend(new_tweets)

        #update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        print ("...%s tweets downloaded so far" % (len(alltweets)))

    cleaned_text = [re.sub(r'http[s]?:\/\/.*[\W]*', '', i.text, flags=re.MULTILINE) for i in alltweets] # remove urls
    cleaned_text = [re.sub(r'@[\w]*', '', i, flags=re.MULTILINE) for i in cleaned_text] # remove the @twitter mentions
    cleaned_text = [re.sub(r'RT.*','', i, flags=re.MULTILINE) for i in cleaned_text] # delete the retweets

    #transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, cleaned_text[idx].encode("utf-8")] for idx,tweet in enumerate(alltweets)]

    #write the csv
    with open('../data/raw/svb_founders/%s_tweets.csv' % screen_name, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(["id","created_at","text"])
        writer.writerows(outtweets)

    pass

If I run this 16 times in less than 15 minutes, will the API stop answering? Thanks

rs2283 commented Jun 23, 2017

I need to extract tweets from twitter for a specific hashtag for the last ten years. Can anyone please help me by providing the code in R for the same?

santiag080 commented Jun 28, 2017

I work with a similar code; with the code I use I can input the username and download the timeline directly, without having to edit the code itself... but the output format is unreadable... so, is there any way of making this code into a macro? Like, with an Excel table, put in a bunch of users and download every timeline???

Oh! This is the code I used before, but it doesn't work :/ As I said before, the output format is unreadable... any ideas??

import sys
import csv
import json
from datetime import datetime, date, timedelta
import time
import os
import twitter
import smtplib
import collections
from random import shuffle
from urllib2 import URLError
import signal
import atexit
import logging
import re
import argparse
import StringIO, traceback,string
import csv, codecs, cStringIO
from ConfigParser import SafeConfigParser

def t():
    configParser = SafeConfigParser()
    configFilePath = 'C:\config.txt'
    configParser.read(configFilePath)
    with codecs.open(configFilePath, 'r', encoding='utf-8') as f:
        configParser.readfp(f)

    CONSUMER_KEY = configParser.get('file', 'CONSUMER_KEY')
    CONSUMER_SECRET = configParser.get('file', 'CONSUMER_SECRET')
    APP_NAME = configParser.get('file', 'APP_NAME')

    TOKEN_FILE = 'out/twitter.oauth'
    try:
        (oauth_token, oauth_token_secret) = read_token_file(TOKEN_FILE)
    except IOError, e:
        (oauth_token, oauth_token_secret) = oauth_dance(APP_NAME, CONSUMER_KEY,
                CONSUMER_SECRET)
        if not os.path.isdir('out'):
            os.mkdir('out')
        write_token_file(TOKEN_FILE, oauth_token, oauth_token_secret)
    return twitter.Twitter(domain='api.twitter.com', api_version='1.1',
                           auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                           CONSUMER_KEY, CONSUMER_SECRET))

def makeTwitterRequest(t, twitterFunction, max_errors=3, *args, **kwArgs):
    wait_period = 2
    error_count = 0
    while True:
        try:
            return twitterFunction(*args, **kwArgs)
        except twitter.api.TwitterHTTPError, e:
            error_count = 0
            wait_period = handleTwitterHTTPError(e, t, wait_period)
            if wait_period is None:
                return
        except URLError, e:
            error_count += 1
            print >> sys.stderr, "URLError encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                errorEmail ()
                raise

def _getRemainingHits(t, resource_family):
    remaining_hits = t.application.rate_limit_status()[u'resources'][u'search'][resource_family]
    return remaining_hits

def handleTwitterHTTPError(e, t, wait_period=2):
    if wait_period > 3600: # Seconds
        print >> sys.stderr, 'Too many retries. Quitting.'
        return None
    wait_variable = int(datetime.now().strftime("%Y")[:2])
    if e.e.code == 401:
        print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
        return None
    if e.e.code == 401:
        print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
        return None
    elif e.e.code in (404, 34):
        print >> sys.stderr, 'Encountered 404 Error (pagina no encontrada)'
        return None
    elif e.e.code in (502, 503):
        print >> sys.stderr, 'Encountered %i Error. Will retry in %i seconds' % (e.e.code,
            wait_period)
        time.sleep(wait_period)
        wait_period *= 1.5
        return wait_period
    elif _getRemainingHits(t, u'/search/tweets')['remaining'] == 0:
        status = _getRemainingHits(t, u'/search/tweets')['reset']
        now = time.time()
        rate_limit = status+wait_variable-now
        sleep_time = max(900, rate_limit, 5) # Prevent negative numbers
        print >> sys.stderr, 'Rate limit reached: sleeping for %i secs' % (rate_limit, )
        time.sleep(sleep_time)
        return 2
    else:
        raise e

def makeTwitterSearch (t, sts, salida,maximo):
    cant_total = 0
    #print "call inicial"
    response = makeTwitterRequest(t, t.statuses.user_timeline, screen_name = sts, count=200)
    #print response
    if response is not None and len(response) > 0:
        ##lista tempral para almacenar los ids de la respuesta
        temp_id_list = []
        rta = response
        for tweet in rta:
            salida.write(str(tweet).replace('\r\n', '').replace('\n','').replace('\r','') + '\n')
            temp_id_list.append(tweet['id'])
        max_id = min(temp_id_list)
    cantidad = len(response)
    cant_total+= cantidad
    #print "cant = %s" % cantidad
    cont = 1
    while cantidad:
        temp_id_list = []
        print "Call %s " % (cont)
        response = makeTwitterRequest(t, t.statuses.user_timeline, screen_name = sts, max_id = max_id, count=200)
        rta = response
        for tweet in rta:
            salida.write(str(tweet) + '\n')
            temp_id_list.append(tweet['id'])
        if max_id == min(temp_id_list):
            print "Finished! Thanks for searching with us today!"
            break
        max_id = min(temp_id_list)
        cantidad = len(response)
        cant_total+= cantidad
        #print cantidad * cont
        print cantidad * cont
        if maximo <> '':
            if int(cantidad * cont)>= int(maximo):
                break
        print "cantidad encontrada = %s" % cantidad
        cont += 1
    print "Finalmente devolvemos %s tweets" % cant_total
    return None
def normalize(archivo):
    normalizations = {
        'norm_search': collections.OrderedDict([
            ('Tweet ID',('xpath_get','id')),
            ('Tipo',('get_tweet_type', )),
            ('Retweet ID',('xpath_get','retweeted_status/id')),
            ('Retweet username',('xpath_get','retweeted_status/user/screen_name')),
            ('Retweet Count',('get_count','rts')), # sobre el rt si el tipo es RT
            ('Favorite Count',('get_count','favs')), # sobre el rt si el tipo es RT
            ('Text',('xpath_get','text')),
            ('Tweet_Lang',('xpath_get','lang')),
            ('Fecha',('format_date','created_at')),
            ('Source',('xpath_get','source')),
            ('User_username',('xpath_get','user/screen_name')),
            ('User_ID',('xpath_get','user/id')),
            ('User_tweet count',('xpath_get','user/statuses_count')),
            ('User_followers',('xpath_get','user/followers_count')),
            ('User_followings',('xpath_get','user/friends_count')),
            ('User_time zone',('xpath_get','user/time_zone')),
            ('User_language',('xpath_get','user/lang')),
            ('Location',('xpath_get','user/location')),
            ('User_create date',('format_date','user/created_at')),
            ('Mention1',('get_entities','mention',1)),
            ('Mention2',('get_entities','mention',2)),
            ('Mention3',('get_entities','mention',3)),
            ('Link1',('get_entities','link',1)),
            ('Link2',('get_entities','link',2)),
            ('Hashtag1',('get_entities','hashtag',1)),
            ('Hashtag2',('get_entities','hashtag',2)),
            ('Hashtag3',('get_entities','hashtag',3)),
            ('Fecha Timezone',('format_date','created_at',"%Y-%m-%d")),
            ('Dia Timezone',('format_date','created_at',"%a")),
            ('Hora Timezone',('format_date','created_at',"%H:00")),
            ('Corte Hora',('format_date','created_at',"%Y-%m-%d %H")),
            ('place_country',('xpath_get','place/country')),
            ('user_favourites_count',('xpath_get','user/favourites_count')),
            ('user_description',('xpath_get','user/description')),
            ('retweeted_status_user_favourites_count',('xpath_get', 'retweeted_status/user/favourites_count')),
            ('retweeted_status_user_listed_count',('xpath_get', 'retweeted_status/user/listed_count')),
            ('retweeted_status_user_profile_image_url',('xpath_get', 'retweeted_status/user/profile_image_url')),
            ('retweeted_status_created_at',('format_date','retweeted_status/created_at',"%Y-%m-%d %H")),
        ])
    }
    file = open(archivo,'r')
    with open("/tmp/%s" %archivo+"_normalizado",'wb') as f_csv:
        # write data
        for row in file:
            print row
            row_2 = normalize_row(row, normalizations['norm_search'], None)
            for e in row_2.iteritems():
                print e
def normalize_row(row,format,timezone):
    #pprint.pprint(row)

    f = row_formatter(row, timezone)
    f_rows = []
    for (name, action) in format.iteritems():
        # call the appropiate method of row_formatter
        value = getattr(f, action[0])(*action[1:])
        if (not value): value = ""
        if (type(value) != str and type(value) != unicode):
            value = str(value)
        f_rows.append((name, value))
    return collections.OrderedDict(f_rows)

class row_formatter:
    def __init__(self, row, timezone):
        self.row = row
        self.timezone = timezone

    def xpath_get(self, path):
        elem = self.row
        try:
            for x in path.strip("/").split("/"):
                elem = elem.get(x)
        except:
            pass

        return elem

    def get_tweet_type(self):
        if 'retweeted_status' in self.row and self.row['retweeted_status']:
            return "RT"
        #elif 'in_reply_to_user_id' in self.row and self.row['in_reply_to_user_id']:
           # return "REPLY"
        else:
            return "TWEET"

    def get_count(self, count_type):
        query = ''
        if self.get_tweet_type() == 'RT':
            query+= 'retweeted_status/'
        if (count_type == 'favs'):
            query+= 'favorite_count'
        elif (count_type == 'rts'):
            query+= 'retweet_count'
        else:
            return None
        return self.xpath_get(query)

    def get_text(self):
        if self.get_tweet_type() == 'RT':
            query+= ''

    def format_date(self, query, output_format = "%Y-%m-%d %H:%M", timezone = None):
        if (not timezone): timezone = self.timezone
        date = self.xpath_get(query)
        if (not date): return None
        utc = datetime.strptime(date, '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=tz.gettz('UTC'))
        local =  utc.astimezone(tz.gettz(timezone))
        return local.strftime(output_format)

    def get_entities(self, e_type, index):
        matches = []
        if (e_type == 'link'):
            tmp = self.xpath_get('/entities/urls')
            if (tmp):
                matches = [e['expanded_url'] for e in tmp]
        if (e_type == 'mention'):
            tmp = self.xpath_get('/entities/user_mentions')
            if (tmp):
                matches = [e['screen_name'] for e in tmp]
        if (e_type == 'hashtag'):
            tmp = self.xpath_get('/entities/hashtags')
            if (tmp):
                matches = [e['text'] for e in tmp]

        if (len(matches) >= index):
            return matches[index - 1]

        return None

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8").replace("\n"," ").replace("\r"," ").replace("\t",'') for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

if __name__ == '__main__':
    t = t()
    sts = raw_input("Ingrese usuario:")
    maximo = raw_input("Ingrese el maximo de registros:")
    ht = raw_input("Nombre de archivo?: ")
    f = open(ht, 'w')
    #sts = "from:%s OR @%s" % (sts,sts)
    print "Buscando %s para %s." % (sts,ht)
    makeTwitterSearch(t, sts, f,maximo)
    f.close()
    #normalize(ht)

I'm sure I'm doing something obviously wrong, but I'm getting this error when I try to run the code:

Traceback (most recent call last):
File "tweet_dumper.py", line 64, in
get_all_tweets("J_tsar")
File "tweet_dumper.py", line 18, in get_all_tweets
from tweepy.auth import OAuthHandler
ImportError: No module named auth

Any thoughts on this?

@colbybair

did you put the Twitter API's keys?

jasserkh commented Jul 4, 2017

I want to extract tweets for a specific period of time, anyone have an idea?? Thanks

dev-luis commented Aug 3, 2017

@jasserkh You can do it like this:

import time
import tweepy
from datetime import datetime, date

#get current date
currentDate = time.strftime("%x")

year = currentDate [6:8]
month = currentDate [0:2]
day = currentDate [3:5]

#reformat the date values
current_dateStr = "20" + year + "-" + month + "-" + day

#convert string to date
currentDate = datetime.strptime(current_dateStr, "%Y-%m-%d").date()
...
...

for tweet in allTweetsList:
    try:
        #make sure the tweet is recent
        createdAt_str = str(tweet.created_at)
        ind = createdAt_str.find(" ")
        new_createdAt = createdAt_str[:ind]

        #convert string to date
        createdAt = datetime.strptime(new_createdAt, "%Y-%m-%d").date()

        #compare the dates
        if createdAt == currentDate:
            #do something

    except tweepy.TweepError as e:
        print(e.response)

If you have questions, please reply to me: http://luis-programming.com/blog/download_tweets/
It's hard to track the replies here.

Hi, thanks for your code. When I used your code to collect data with Python 3,
why do the tweet texts include characters like "b" and \xe2\x80\x99s?

"b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s ROH Broadcast https://t.co/uIV7TKHs9K'"

Actually the original tweet is (https://twitter.com/sheezy0): Adam Cole Praises Kevin Owens + A Preview For Next Week’s ROH Broadcast

\xe2\x80\x99s represents "'s". I don't know how to solve this issue; I mean, I want to get the "'s" in the text. Thanks!
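That happens because the script calls .encode("utf-8") on the text, so Python 3's csv writer stores the repr of a bytes object: the b'...' wrapper, with the curly apostrophe showing up as its UTF-8 bytes \xe2\x80\x99. A sketch of the Python 3 fix, assuming the variables from the script above: keep the text as str and let open() do the encoding.

#no .encode() - tweet.text is already a str in Python 3
outtweets = [[tweet.id_str, tweet.created_at, tweet.text] for tweet in alltweets]

with open('%s_tweets.csv' % screen_name, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "text"])
    writer.writerows(outtweets)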

states-of-fragility commented Sep 29, 2017

Hi! The code works just fine, thanks for sharing.
Yet, I would like to extend the code to retrieve non-English tweets, as with this method the Arabic letters are translated into funny combinations of Roman letters and numbers. I have seen other people asking the same question but so far no answer. Maybe this time it attracts more attention.
Has someone found a solution? I'm a bit desperate.
Merci bien!

Edit: I posted the answer in stack overflow and was able to overcome this issue. In case someone else got stuck with this: https://stackoverflow.com/questions/46510879/saving-arabic-tweets-from-tweepy-in-cvs/46523781?noredirect=1#comment80010395_46523781

hub2git commented Oct 18, 2017

Hi all. Is there a similar script for downloading all of a CuriousCat.me user's Q&As? For example, https://curiouscat.me/curiouscat.

The posted code works for a given handle. I'm trying to introduce filters for the tweets; any help would be appreciated.
