Last active February 1, 2025 19:01
# pylint: skip-file
import time
import re
import hashlib
import requests
import json

INSTAGRAM_URL = "https://www.instagram.com"
HASHTAG_ENDPOINT = "/graphql/query/?query_hash={}&variables={}"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"


def get_first_page(hashtag):
    # The tag page's HTML and cookies seed every later request
    return requests.get(INSTAGRAM_URL + "/explore/tags/{}/".format(hashtag), headers={"user-agent": USER_AGENT})


def get_csrf_token(cookies):
    return cookies.get("csrftoken")


def get_query_id(html):
    # Locate the TagPageContainer script, then pull the query id out of its source
    script_path = re.search(r'/static(.*)TagPageContainer\.js/(.*).js', html).group(0)
    script_req = requests.get(INSTAGRAM_URL + script_path)
    return re.findall('return e.tagMedia.byTagName.get\\(t\\).pagination},queryId:"([^"]*)"', script_req.text)[0]


def get_rhx_gis(html):
    return re.search('rhx_gis":"([^"]*)"', html).group(1)


def get_end_cursor_from_html(html):
    return re.search('end_cursor":"([^"]*)"', html).group(1)


def get_end_cursor_from_json(json_obj):
    return json_obj['data']['hashtag']['edge_hashtag_to_media']['page_info']['end_cursor']


def get_params(hashtag, end_cursor):
    return '{{"tag_name":"{}","first":50,"after":"{}"}}'.format(hashtag, end_cursor)


def get_ig_gis(rhx_gis, params):
    # x-instagram-gis is the MD5 of rhx_gis + ":" + params
    return hashlib.md5((rhx_gis + ":" + params).encode("utf-8")).hexdigest()


def get_posts_from_json(json_obj):
    edges = json_obj['hashtag']['edge_hashtag_to_media']['edges']
    return [o['node'] for o in edges]


def get_posts_from_html(html):
    json_str = re.search('window._sharedData = (.*);</script>', html).group(1)
    json_obj = json.loads(json_str)
    graphql = json_obj["entry_data"]["TagPage"][0]["graphql"]
    return get_posts_from_json(graphql)


def make_cookies(csrf_token):
    return {
        "ig_pr": "2",
        "csrftoken": csrf_token,
    }


def make_headers(ig_gis):
    return {
        "x-instagram-gis": ig_gis,
        "x-requested-with": "XMLHttpRequest",
        "user-agent": USER_AGENT
    }


def get_next_page(csrf_token, ig_gis, query_id, params):
    cookies = make_cookies(csrf_token)
    headers = make_headers(ig_gis)
    url = INSTAGRAM_URL + HASHTAG_ENDPOINT.format(query_id, params)
    req = requests.get(url, headers=headers, cookies=cookies)
    json_obj = req.json()
    end_cursor = get_end_cursor_from_json(json_obj)
    posts = get_posts_from_json(json_obj['data'])
    return posts, end_cursor


def scrape_hashtag(hashtag, sleep=3):
    """
    Yields scraped posts, one by one
    """
    first_page = get_first_page(hashtag)
    csrf_token = get_csrf_token(first_page.cookies)
    query_id = get_query_id(first_page.text)
    rhx_gis = get_rhx_gis(first_page.text)
    end_cursor = get_end_cursor_from_html(first_page.text)

    # Posts embedded in the first page's HTML
    home_posts = get_posts_from_html(first_page.text)
    for post in home_posts:
        yield post

    # Follow the cursor until the last page, pausing between requests
    while end_cursor is not None:
        params = get_params(hashtag, end_cursor)
        ig_gis = get_ig_gis(rhx_gis, params)
        posts, end_cursor = get_next_page(csrf_token, ig_gis, query_id, params)
        for post in posts:
            yield post
        time.sleep(sleep)


# main
for post in scrape_hashtag("summer"):
    print(post['id'])
    # do stuff

jslim89 commented Apr 12, 2018

Thanks for the great updated script.
I tried it and managed to get around 120 results in total, then ran into the rate limit and a 429 status:

cache-control: private, no-cache, no-store, must-revalidate
content-language: en
content-length: 45
content-type: application/json
date: Thu, 12 Apr 2018 07:07:37 GMT
expires: Sat, 01 Jan 2000 00:00:00 GMT
pragma: no-cache
set-cookie: csrftoken=eelevTUwlfyd7pCbBA7WLSUDLXOXm5iK; expires=Thu, 11-Apr-2019 07:07:37 GMT; Max-Age=31449600; Path=/; Secure
set-cookie: rur=PRN; Path=/
set-cookie: urlgen="{\"time\": 1523431786\054 \"\": 43350\054 \"\": 132890}:1f6WKX:DdDwzgKxaXzDK9Zl_-Y6WpYLpG0"; Path=/
status: 429
strict-transport-security: max-age=86400
vary: Cookie, Accept-Language
x-frame-options: SAMEORIGIN

This is the response header.

Any solution for this?

marcoqu commented Apr 12, 2018

No, you just need to wait a while when you get a 429, or slow down your requests
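
A minimal sketch of the "wait a while, or slow down" advice, assuming the caller passes in a `fetch` callable that returns an object with a `status_code` attribute (as `requests` responses have). `fetch_with_backoff`, `max_retries`, `base_delay`, and the default delays are hypothetical and not part of the gist:

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=30.0, sleep=time.sleep):
    """Call fetch() and retry with growing pauses while it returns HTTP 429.

    fetch is any zero-argument callable returning an object with a
    status_code attribute. Names and default delays are illustrative only.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        sleep(delay)  # wait before trying again
        delay *= 2    # back off further after each consecutive 429
    raise RuntimeError("still rate limited after {} retries".format(max_retries))
```

In the gist this could wrap the `requests.get` call inside `get_next_page`, for example `fetch_with_backoff(lambda: requests.get(url, headers=headers, cookies=cookies))`.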

Copy link

Trying to run this errors out with simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0). Adding print(next_page.status_code) to line 48 shows that IG is returning a 403 error
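
The JSONDecodeError here comes from parsing an error body that is not JSON. A small sketch that checks the status first, so a 403 or 429 surfaces as itself; `parse_json_response` is a hypothetical helper that takes the status code and body text (for a `requests` response, `req.status_code` and `req.text`):

```python
import json


def parse_json_response(status_code, text):
    """Return the decoded JSON body, or raise a descriptive error for non-200 replies."""
    if status_code != 200:
        # Surface the HTTP status instead of a confusing JSON parse failure
        raise RuntimeError("unexpected HTTP status {}: {!r}".format(status_code, text[:100]))
    return json.loads(text)
```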

devauxa commented Apr 13, 2018

@cyrian-1756 +1
I think the generation of x-instagram-gis has changed.
When I build my x-instagram-gis from rhx_gis + ":" + csrf_token + ":" + user_agent + ":" + params, the MD5 doesn't match the correct one.

devauxa commented Apr 13, 2018

it's working if you remove ":" + user_agent

marcoqu commented Apr 13, 2018

Updated, thanks @devauxa

Copy link

@marcoqu: Do you have any idea how to access the <username>/?__a=1 endpoint?

marcoqu commented Apr 14, 2018

@kuldeepaggarwal: Haven't tried yet, sorry.

marcoqu commented Apr 14, 2018

Copy link

@kuldeepaggarwal you need to use just the path, so for <username>/?__a=1 the param should be /<username>/

marcoqu commented Apr 17, 2018

updated: now x-instagram-gis is just rhx_gis + ":" + params
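
As a sketch of the computation described here: the header value is the hex MD5 digest of rhx_gis, a colon, and the signed payload. `compute_ig_gis` is a hypothetical name. The payload is assumed to be the JSON variables string for a GraphQL query or, following the earlier comment in this thread, the bare path for `?__a=1` requests. It uses `hashlib`, so it runs on Python 3:

```python
import hashlib


def compute_ig_gis(rhx_gis, payload):
    """Return the hex MD5 digest of 'rhx_gis:payload'."""
    return hashlib.md5((rhx_gis + ":" + payload).encode("utf-8")).hexdigest()
```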

marcoqu commented Apr 18, 2018

first value has to be at most 50.
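
One way to respect that limit is to clamp `first` when building the variables string. A minimal sketch: `build_params` and `MAX_FIRST` are hypothetical names, and the compact JSON layout mirrors the gist's `get_params`:

```python
import json

MAX_FIRST = 50  # upper bound reported in this thread


def build_params(hashtag, end_cursor, first=50):
    """Build the compact JSON 'variables' string, clamping first to MAX_FIRST."""
    variables = {
        "tag_name": hashtag,
        "first": min(first, MAX_FIRST),
        "after": end_cursor,
    }
    # Compact separators avoid spaces. The same string should be used for
    # both signing and the request, since the signature covers it verbatim.
    return json.dumps(variables, separators=(",", ":"))
```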

Copy link

@marcoqu cool, thanks

Kingson commented May 3, 2018

Use "" API to return json data.

marcoqu commented May 4, 2018

@Kingson that would be only for the first page, right?

ketankr9 commented May 11, 2018

Thanks a ton, saved a lot of time.
I modified it slightly to scrape the photo links from an Instagram user's public timeline.
Could you please explain "ig_pr": "2" in the cookies? I don't find it necessary, thanks.
Also, you could change while True: to while end_cursor != None: in case the scraper reaches the very last page (where end_cursor is None).

marcoqu commented May 16, 2018

@ketankr9 thanks for the suggestions. ig_pr was needed at some point, so I'm going to leave it there just in case.
I also integrated the loading of posts from the home page, thanks for that.

Copy link

Is anybody else getting a 500 status after roughly 70 posts? The same script worked fine 2 weeks ago ... Maybe they tightened their temporary IP blocking?!?

Copy link

Has somebody had problems with the scraper?

Basically, the cookies no longer contain the CSRF token, so the script breaks.
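
A defensive sketch for this failure: look for the token in the cookies first, then fall back to searching the page HTML, and fail with a clear message otherwise. The `"csrf_token":"..."` pattern in the HTML is an assumption about the page's embedded JSON, not something confirmed in this thread, and `find_csrf_token` is a hypothetical helper:

```python
import re


def find_csrf_token(cookies, html):
    """Return the CSRF token from cookies, else from the page HTML, else raise."""
    token = cookies.get("csrftoken")
    if token:
        return token
    # Assumption: the page embeds the token as "csrf_token":"<value>"
    match = re.search(r'"csrf_token":"([^"]*)"', html)
    if match:
        return match.group(1)
    raise ValueError("no CSRF token found in cookies or page HTML")
```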
