#!/usr/bin/env python3

"""
Twitter's API doesn't allow you to get replies to a particular tweet. Strange
but true. But you can use Twitter's Search API to search for tweets that are
directed at a particular user, and then search through the results to see if
any are replies to a given tweet. You probably are also interested in the
replies to any replies as well, so the process is recursive. The big caveat
here is that the search API only returns results for the last 7 days. So
you'll want to run this sooner rather than later.

replies.py will read a line oriented JSON file of tweets and look for replies
using the above heuristic. Any replies that are discovered will be written as
line oriented JSON to stdout:

    ./replies.py tweets.json > replies.json

It also writes a log to replies.log if you are curious what it is doing...which
can be handy since it will sleep for periods of time to work within the
Twitter API quotas.

PS. you'll need to:

    pip install python-twitter

and then set the following environment variables for it to work:

  - CONSUMER_KEY
  - CONSUMER_SECRET
  - ACCESS_TOKEN
  - ACCESS_TOKEN_SECRET
"""

import sys
import json
import time
import logging
import twitter
import urllib.parse

from os import environ as e

t = twitter.Api(
    consumer_key=e["CONSUMER_KEY"],
    consumer_secret=e["CONSUMER_SECRET"],
    access_token_key=e["ACCESS_TOKEN"],
    access_token_secret=e["ACCESS_TOKEN_SECRET"],
    sleep_on_rate_limit=True
)


def tweet_url(t):
    return "https://twitter.com/%s/status/%s" % (t.user.screen_name, t.id)


def get_tweets(filename):
    # one JSON encoded tweet per line
    with open(filename) as fh:
        for line in fh:
            yield twitter.Status.NewFromJsonDict(json.loads(line))


def get_replies(tweet):
    user = tweet.user.screen_name
    tweet_id = tweet.id
    max_id = None
    logging.info("looking for replies to: %s" % tweet_url(tweet))
    while True:
        q = urllib.parse.urlencode({"q": "to:%s" % user})
        try:
            replies = t.GetSearch(raw_query=q, since_id=tweet_id, max_id=max_id, count=100)
        except twitter.error.TwitterError as err:
            logging.error("caught twitter api error: %s", err)
            time.sleep(60)
            continue
        for reply in replies:
            logging.info("examining: %s" % tweet_url(reply))
            if reply.in_reply_to_status_id == tweet_id:
                logging.info("found reply: %s" % tweet_url(reply))
                yield reply
                # recursive magic to also get the replies to this reply
                for reply_to_reply in get_replies(reply):
                    yield reply_to_reply
            max_id = reply.id
        # a short page means there are no more results to fetch
        if len(replies) != 100:
            break


if __name__ == "__main__":
    logging.basicConfig(filename="replies.log", level=logging.INFO)
    tweets_file = sys.argv[1]
    for tweet in get_tweets(tweets_file):
        for reply in get_replies(tweet):
            print(reply.AsJsonString())
Hey,
I am trying to do something similar to this, but I already have all of the tweets in JSON files. Is there any way this could be modified to work with them?
Thanks!
@E123omega - this little script is for collecting JSON. What are you trying to do with the JSON you have?
@edsu The aim of the project is to write a report on how a particular Twitter helpdesk functions, so I am trying to organise the tweets in such a way that I can easily look at a thread.
We received the tweets already in JSON form, so I need to mimic what your script does, but working entirely from the JSON file.
All the script does is collect the tweets as JSON. But if you want to construct threads out of the messages you can use the `in_reply_to_status_id_str` and `id_str` values in the tweets to reconstruct the thread.
There is a utility here that helps with this if you are curious:
https://github.com/DocNow/twarc/blob/master/utils/network.py
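As a rough sketch of that idea, assuming the tweets are already stored as line-oriented JSON in the v1.1 format (so each one has `id_str` and `in_reply_to_status_id_str`), something like the following groups replies under their parents and walks a thread depth first. The function names `load_tweets`, `build_reply_index` and `walk_thread` are made up for illustration and are not part of twarc or of the script above.

```python
import json
from collections import defaultdict


def load_tweets(path):
    """Read a line-oriented JSON file, one tweet per line."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]


def build_reply_index(tweets):
    """Map each tweet id to the list of tweets that reply directly to it."""
    children = defaultdict(list)
    for tweet in tweets:
        parent_id = tweet.get("in_reply_to_status_id_str")
        if parent_id:
            children[parent_id].append(tweet)
    return children


def walk_thread(root_id, children, depth=0):
    """Yield (depth, tweet) for every reply below root_id, depth first."""
    for tweet in children.get(root_id, []):
        yield depth, tweet
        yield from walk_thread(tweet["id_str"], children, depth + 1)
```

Calling `build_reply_index(load_tweets("tweets.json"))` and then iterating over `walk_thread(some_id, children)` gives each reply together with its nesting depth, which is enough to print a thread as an indented list. Replies whose parent is missing from the file simply never show up under any root.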
Could you give an example for tweets_file?
I wonder what should be written in the file.
@Allen-Qiu If anyone is wondering, the file should be in JSONL format (line-oriented JSON, as the docstring describes). You can get such a file from twarc or tweepy, which output tweets in that format.
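For illustration, here is a minimal sketch of what such a file could look like and how to produce one by hand. The script above only reads `id` and `user.screen_name` from each tweet, so the two sample tweets below carry just those fields plus `text`. The ids, screen names and helper names (`write_jsonl`, `read_jsonl`) are invented for the example, and real tweets from the API carry many more fields.

```python
import json

# Two made-up tweets with only the fields replies.py actually reads
# (id and user.screen_name); real API output has many more fields.
sample_tweets = [
    {"id": 1001, "text": "first tweet", "user": {"screen_name": "example_user"}},
    {"id": 1002, "text": "second tweet", "user": {"screen_name": "another_user"}},
]


def write_jsonl(tweets, path):
    """Write one JSON object per line (line-oriented JSON)."""
    with open(path, "w") as fh:
        for tweet in tweets:
            fh.write(json.dumps(tweet) + "\n")


def read_jsonl(path):
    """Read the file back, yielding one dict per line."""
    with open(path) as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

After `write_jsonl(sample_tweets, "tweets.json")` the file holds two lines, each a complete JSON object, which is the shape `./replies.py tweets.json` expects.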
The Twitter API v2 supports this now using a `conversation_id` field. You can read more in the docs.
First, request the `conversation_id` field of the tweet.
https://api.twitter.com/2/tweets?ids=1225917697675886593&tweet.fields=conversation_id
Second, search for tweets using the `conversation_id` as the query.
https://api.twitter.com/2/tweets/search/recent?query=conversation_id:1225912275971657728
This is a minimal example, so you should add other fields to the URL as you need them.
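The two steps could be sketched in Python roughly as follows, using only the standard library. This assumes you have a v2 bearer token and that the lookup response has the tweet under `data[0]`. The helper names (`tweet_lookup_url`, `conversation_search_url`, `fetch_json`, `get_conversation`) are hypothetical, and error handling, pagination and rate limiting are left out.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.twitter.com/2"


def tweet_lookup_url(tweet_id):
    """Step 1: URL that asks for the conversation_id field of a tweet."""
    params = {"ids": str(tweet_id), "tweet.fields": "conversation_id"}
    return API + "/tweets?" + urllib.parse.urlencode(params)


def conversation_search_url(conversation_id):
    """Step 2: URL that searches recent tweets in that conversation."""
    params = {"query": "conversation_id:%s" % conversation_id}
    return API + "/tweets/search/recent?" + urllib.parse.urlencode(params)


def fetch_json(url, bearer_token):
    """GET a v2 endpoint with a bearer token and decode the JSON body."""
    headers = {"Authorization": "Bearer " + bearer_token}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def get_conversation(tweet_id, bearer_token):
    """Look up a tweet's conversation_id, then search for that conversation."""
    lookup = fetch_json(tweet_lookup_url(tweet_id), bearer_token)
    # Assumption: the looked-up tweet is the first item in "data".
    conversation_id = lookup["data"][0]["conversation_id"]
    return fetch_json(conversation_search_url(conversation_id), bearer_token)
```

The URL builders produce the same requests as the two URLs above, except that `urlencode` percent-encodes the colon in the query as `%3A`. A call such as `get_conversation("1225917697675886593", token)` would then return one page of search results as a dict.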
Absolutely, v2 is the way to go now! We have support in twarc for doing it too:
$ twarc2 conversation 1225912275971657728 > tweets.jsonl
or, if you have a file of tweet ids:
$ twarc2 conversations ids.txt > tweets.jsonl
Because it relies on the search API, it only works for tweet threads that were active in the last week. I'm assuming you have been trying to use it with some older threads? You may want to take a look at twint for scraping Twitter instead of using twarc, which relies on the API.