@edsu
Last active December 7, 2022 18:59
Try to get replies to a particular set of tweets, recursively.
#!/usr/bin/env python
"""
Twitter's API doesn't allow you to get replies to a particular tweet. Strange
but true. But you can use Twitter's Search API to search for tweets that are
directed at a particular user, and then search through the results to see if
any are replies to a given tweet. You probably are also interested in the
replies to any replies as well, so the process is recursive. The big caveat
here is that the search API only returns results for the last 7 days. So
you'll want to run this sooner rather than later.
replies.py will read a line oriented JSON file of tweets and look for replies
using the above heuristic. Any replies that are discovered will be written as
line oriented JSON to stdout:
./replies.py tweets.json > replies.json
It also writes a log to replies.log if you are curious what it is doing...which
can be handy since it will sleep for periods of time to work within the
Twitter API quotas.
PS. you'll need to:
pip install python-twitter
and then set the following environment variables for it to work:
- CONSUMER_KEY
- CONSUMER_SECRET
- ACCESS_TOKEN
- ACCESS_TOKEN_SECRET
"""
import sys
import json
import time
import logging
import urllib.parse
from os import environ as e

import twitter

t = twitter.Api(
    consumer_key=e["CONSUMER_KEY"],
    consumer_secret=e["CONSUMER_SECRET"],
    access_token_key=e["ACCESS_TOKEN"],
    access_token_secret=e["ACCESS_TOKEN_SECRET"],
    sleep_on_rate_limit=True
)

def tweet_url(t):
    return "https://twitter.com/%s/status/%s" % (t.user.screen_name, t.id)

def get_tweets(filename):
    for line in open(filename):
        yield twitter.Status.NewFromJsonDict(json.loads(line))

def get_replies(tweet):
    user = tweet.user.screen_name
    tweet_id = tweet.id
    max_id = None
    logging.info("looking for replies to: %s" % tweet_url(tweet))
    while True:
        q = urllib.parse.urlencode({"q": "to:%s" % user})
        try:
            replies = t.GetSearch(raw_query=q, since_id=tweet_id, max_id=max_id, count=100)
        except twitter.error.TwitterError as err:
            logging.error("caught twitter api error: %s", err)
            time.sleep(60)
            continue
        for reply in replies:
            logging.info("examining: %s" % tweet_url(reply))
            if reply.in_reply_to_status_id == tweet_id:
                logging.info("found reply: %s" % tweet_url(reply))
                yield reply
                # recursive magic to also get the replies to this reply
                for reply_to_reply in get_replies(reply):
                    yield reply_to_reply
            max_id = reply.id
        if len(replies) != 100:
            break

if __name__ == "__main__":
    logging.basicConfig(filename="replies.log", level=logging.INFO)
    tweets_file = sys.argv[1]
    for tweet in get_tweets(tweets_file):
        for reply in get_replies(tweet):
            print(reply.AsJsonString())
@MichaelCurrin

MichaelCurrin commented Jan 12, 2019

I can confirm the point made by @serdec: using raw_query meant the other fields like max_id were ignored, so I was stuck on the first page.

I took out the raw_query key and replaced it with the term key and value. This works great.

    term = "to:%s" % user
    replies = t.GetSearch(
                term=term,
                since_id=tweet_id,
                max_id=max_id,
                count=100,
            )

@MichaelCurrin

There's a problem with breaking out of the while loop: it happens too soon and will miss the last page of results, which will probably have fewer than 100 tweets.

Also bear in mind that the API's max ID filter is inclusive, which means the last tweet of page N will reappear at the start of page N+1, so you double-count and it's hard to know when you have the last page.

So my implementation uses one less than the last ID as the max ID, so that reply is excluded from the next page. Then I check for zero tweets on a page and break out of the while loop.

    ...
    page_index = 0
    while True:
        page_index += 1
        print(f"Page: {page_index}")

        try:
            replies = ...
        except twitter.error.TwitterError as e:
            ...

        if not replies:
            break        # <<<

        for reply in replies:
            ...

        max_id = reply.id - 1     # <<<

I'd also suggest that the recursive reply magic can be commented out if it's not needed, to avoid getting rate-limited too easily by frequent requests.
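
That fix can be sketched end-to-end against a stand-in for GetSearch (the search callable, the Tweet objects with an .id attribute, and the hard-coded page size of 100 are all assumptions for illustration, not part of the script above):

```python
def fetch_all(search):
    """Drain a search API whose max_id filter is inclusive, without
    double counting.

    `search(max_id)` is any callable imitating GetSearch: it returns up to
    100 results, newest first, with ids <= max_id (unbounded when max_id
    is None).
    """
    results = []
    max_id = None
    while True:
        page = search(max_id)
        if not page:
            # empty page: we have walked past the oldest result
            break
        results.extend(page)
        # max_id is inclusive, so step one below the oldest id on this
        # page to keep that tweet off the next page
        max_id = page[-1].id - 1
    return results
```

Checking for an empty page, rather than a short one, is what guarantees the final partial page is collected before the loop exits.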

@PAVITHRA-CP

I have a problem that, I have a file which looks like this:

['972651', '80080680482123777', '0.0']->['189397006', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['10678072', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['14569462', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['41634505', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['81232966', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['21282483', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['35165557', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['12735762', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['39076620', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['36841912', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['174692880', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['63007952', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['23500923', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['14287455', '80080680482123777', '1.8']
['972651', '80080680482123777', '0.0']->['166323176', '80080680482123777', '2.17']
['972651', '80080680482123777', '0.0']->['19543802', '80080680482123777', '2.68']
['972651', '80080680482123777', '0.0']->['25246700', '80080680482123777', '2.7']
['972651', '80080680482123777', '0.0']->['286219571', '80080680482123777', '2.85']
['972651', '80080680482123777', '0.0']->['22028700', '80080680482123777', '2.98']

The first value is a user id and the second is a tweet id; after the "->" symbol, the first value is the id of a user who responded to that same tweet.

I want to retrieve the corresponding responses to the source tweet from particular users.

Can anyone help me?

Thanks in advance!

@MerleLiuKun

@PAVITHRA-CP It looks like you want to get the conversation belonging to a particular tweet. You can search for the target tweet, or search for the user you are interested in (using the search/tweets endpoint), and set since_id to the tweet id. Maybe that will get you what you want.
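
For the edge file @PAVITHRA-CP shows above, a minimal parsing sketch (the function name and the assumption that each side of the arrow is a Python-style list literal are mine):

```python
import ast

def parse_reply_edge(line):
    """Split one "source -> reply" line into its useful ids.

    Each side is a list literal of [user_id, tweet_id, score]; the tweet
    id is the same on both sides, so return
    (source_user_id, tweet_id, reply_user_id).
    """
    left, right = line.strip().split("->")
    source = ast.literal_eval(left)
    reply = ast.literal_eval(right)
    return source[0], source[1], reply[0]
```

With those tuples you can filter the output of replies.py down to the particular users you care about.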

@fredwilliam

Hi, I have a small task based on this and I am paying for assistance. Please reach me at fred.haule@gmail.com. Thanks for sharing, great work!

@Aminaba2016

I keep getting a KeyError when I set my consumer key. Any workarounds?
@lakshadvani did you find the solution?

@pnija

pnija commented May 15, 2019

Is GetSearch the API equivalent of /timeline/home on the Twitter web site?

@Ms-Seeker

How would this be done in reverse? As in, you have a certain reply and want to find the ID of the original tweet it was in reply to.

in_reply_to_status_id is the attribute that gives the tweet id of the original tweet.

@edsu
Author

edsu commented Jun 25, 2019

Yes, that's the easy part, assuming that tweet hasn't been deleted. But finding out what tweets reply to a given tweet is currently not possible with Twitter's public API.

@edsu
Author

edsu commented Jun 25, 2019

FYI, this replies functionality is now part of the twarc utility.

https://github.com/docnow/twarc

@hjkgithub

Please write up some steps with hashtags, so we can read and understand them.

@fatimaikrams

It only gives replies for some tweets, and 0 replies for others. How can we get more than 15 replies through this code?

@edsu
Author

edsu commented Dec 27, 2019

Because it relies on the search API, it only works for tweet threads that were active in the last week. I'm assuming you have been trying to use it with some old threads? You may want to take a look at twint for scraping Twitter instead of using twarc, which relies on the API.

@Ms-Seeker

Ms-Seeker commented Dec 27, 2019 via email

@fatimaikrams

fatimaikrams commented Dec 27, 2019 via email

@E123omega

Hey,
I am trying to do something similar to this, but I already have all of the tweets in JSON files. Is there any way this could be modified to do that?
Thanks!

@edsu
Author

edsu commented May 10, 2020

@E123omega - this little script is for collecting JSON. What are you trying to do with the JSON you have?

@E123omega

E123omega commented May 10, 2020

@edsu The aim of the project is to make a report about the functioning of a particular Twitter helpdesk, so I am trying to organise the tweets in such a way that I can easily look at a thread.
We received the tweets already in JSON form, so I need to mimic what your script does but working entirely from the JSON files.

@edsu
Author

edsu commented May 10, 2020

All the script does is collect the tweets as JSON. But if you want to construct threads out of the messages, you can use the in_reply_to_status_id_str and id_str values in the tweets to reconstruct each thread.

There is a utility here that helps with this if you are curious:

https://github.com/DocNow/twarc/blob/master/utils/network.py
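
A minimal sketch of that reconstruction, assuming each tweet is a plain dict parsed from line-oriented JSON (id_str and in_reply_to_status_id_str are the v1.1 payload fields mentioned above; the function name is mine):

```python
from collections import defaultdict

def build_threads(tweets):
    """Group tweets into threads via id_str / in_reply_to_status_id_str.

    Returns (roots, children): roots are tweets whose parent is missing
    from the data set (thread starters), and children maps a tweet id to
    the list of direct replies to it.
    """
    by_id = {t["id_str"]: t for t in tweets}
    children = defaultdict(list)
    roots = []
    for t in tweets:
        parent = t.get("in_reply_to_status_id_str")
        if parent and parent in by_id:
            children[parent].append(t)
        else:
            # no parent in this data set: treat it as a thread root
            roots.append(t)
    return roots, children
```

Walking children depth-first from each root then yields a whole thread in order.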

@AbdullaRifai

Could you give an example of tweets_file? I wonder what should be written in the file.

@Allen-Qiu If anyone is wondering, the file should be in JSONL (as the code describes). You get that file from twarc or tweepy (that is the format output by those libraries).

@dimdenGD

The Twitter API v2 supports this now using a conversation_id field. You can read more in the docs.

First, request the conversation_id field of the tweet.

https://api.twitter.com/2/tweets?ids=1225917697675886593&tweet.fields=conversation_id

Second, then search tweets using the conversation_id as the query.

https://api.twitter.com/2/tweets/search/recent?query=conversation_id:1225912275971657728

This is a minimal example, so you should add other fields as you need to the URL.
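
The two requests can be sketched as helpers that just build the URLs (a sketch of the flow described above; you still have to send an OAuth 2.0 bearer token in the Authorization header, and paging through results with next_token is left out):

```python
import urllib.parse

API = "https://api.twitter.com/2"

def conversation_lookup_url(tweet_id):
    # Step 1: ask for the conversation_id field of a tweet.
    query = urllib.parse.urlencode(
        {"ids": tweet_id, "tweet.fields": "conversation_id"}
    )
    return "%s/tweets?%s" % (API, query)

def conversation_search_url(conversation_id):
    # Step 2: search recent tweets belonging to that conversation.
    query = urllib.parse.urlencode(
        {"query": "conversation_id:%s" % conversation_id}
    )
    return "%s/tweets/search/recent?%s" % (API, query)
```

Note the recent-search endpoint has the same seven-day window as the v1.1 search this gist relies on; the full-archive variant needs elevated access.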

@edsu
Author

edsu commented Aug 12, 2021

Absolutely, v2 is the way to go now! We have support in twarc for doing it too:

$ twarc2 conversation 1225912275971657728 > tweets.jsonl

or, if you have a file of tweet ids:

$ twarc2 conversations ids.txt > tweets.jsonl
