@cosmocatalano
Last active August 6, 2023 07:32
Quick-and-dirty Instagram web scrape, just in case you don't think you should have to make your users log in to deliver them public photos.
<?php
//returns a big old hunk of JSON from a non-private IG account page.
function scrape_insta($username) {
    $insta_source = file_get_contents('http://instagram.com/'.$username);
    if ($insta_source === false) {
        return null; //request failed
    }
    $shards = explode('window._sharedData = ', $insta_source);
    if (count($shards) < 2) {
        return null; //marker not found, e.g. the login page came back instead
    }
    $insta_json = explode(';</script>', $shards[1]);
    return json_decode($insta_json[0], TRUE);
}
//Supply a username
$my_account = 'cosmocatalano';
//Do the deed
$results_array = scrape_insta($my_account);
//An example of where to go from there
$latest_array = $results_array['entry_data']['ProfilePage'][0]['user']['media']['nodes'][0];
echo 'Latest Photo:<br/>';
echo '<a href="http://instagram.com/p/'.$latest_array['code'].'"><img src="'.$latest_array['display_src'].'"></a><br/>';
echo 'Likes: '.$latest_array['likes']['count'].' - Comments: '.$latest_array['comments']['count'].'<br/>';
/* BAH! An Instagram site redesign in June 2015 broke quick retrieval of captions, locations and some other stuff.
echo 'Taken at '.$latest_array['location']['name'].'<br/>';
//Heck, lets compare it to a useful API, just for kicks.
echo '<img src="http://maps.googleapis.com/maps/api/staticmap?markers=color:red%7Clabel:X%7C'.$latest_array['location']['latitude'].','.$latest_array['location']['longitude'].'&zoom=13&size=300x150&sensor=false">';
*/
@restyler

restyler commented Oct 8, 2020

According to Instagram's API documentation, they want you to have an API key for every user who wishes to pull their photos (keeping the API key in sandbox mode). Again, this seems unrealistic to me. You "can" submit your app to Instagram for review (which theoretically "may" let you pull photos for other users from the same API key), but I highly doubt they'd approve an app that pulls images off their servers (like the scripts mentioned above do). I also do not see this explicitly supported in their current API documentation. Nonetheless, I have submitted my app for review, so I will let you know how that turns out.

Were you able to find a good solution to this issue? What is the best way to bypass this login page? @bateller

I've hit the same issue and had to spend a fair amount of time on it for my own project.
Here is the code I came up with: https://github.com/restyler/instagram-php-scraper - it uses RapidAPI ( https://rapidapi.com/restyler/api/instagram40 ) to bypass IP restrictions.

@LabN36

LabN36 commented Oct 8, 2020

@restyler are you fetching the user's post details? I.e., if I provide an Instagram post link, does it return the path where the media is stored? Instagram usually detects datacenter IPs.
I can see you have a getMediaByUrl method, but I'm not sure how you're dealing with the IP. Please let me know. Thanks.

@restyler

restyler commented Oct 12, 2020

@restyler are you fetching the user's post details? I.e., if I provide an Instagram post link, does it return the path where the media is stored? Instagram usually detects datacenter IPs.
I can see you have a getMediaByUrl method, but I'm not sure how you're dealing with the IP. Please let me know. Thanks.

Yes. Technically there is a proxy method in the API which allows you to submit any instagram.com* link and get the raw HTML/JSON response, and there are helper endpoints like the getMediaByUrl you mentioned, if you don't need the raw response. I'd recommend using the helpers when feasible, because they use more optimisations on the API side.

To mitigate Instagram IP detection (on the API side) I use proxies which are usually not located in popular data center IP ranges.
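
For anyone who wants to try the proxy idea directly from PHP, here is a minimal sketch of routing file_get_contents through an HTTP proxy with a stream context. The helper name proxy_context, the proxy address and the credentials are all placeholders I made up for illustration; this is not the code used behind the API mentioned above.

```php
<?php
// Hypothetical helper: builds a stream context that sends HTTP requests through a proxy.
// The proxy address and credentials are placeholders, not values from this thread.
function proxy_context(string $proxy, ?string $user = null, ?string $pass = null) {
    $http = [
        'proxy' => $proxy,           // e.g. 'tcp://proxy.example.com:8080'
        'request_fulluri' => true,   // many HTTP proxies expect the full URI in the request line
        'timeout' => 10,             // seconds
    ];
    if ($user !== null) {
        // Basic proxy authentication; only added when credentials are supplied.
        $http['header'] = ['Proxy-Authorization: Basic '.base64_encode($user.':'.$pass)];
    }
    return stream_context_create(['http' => $http]);
}

// Usage (needs a working proxy, so it is left commented out):
// $ctx = proxy_context('tcp://proxy.example.com:8080', 'user', 'secret');
// $html = file_get_contents('http://instagram.com/cosmocatalano', false, $ctx);
```

Whether a given proxy actually avoids the block depends on its IP range, which this sketch cannot check.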

@LabN36

LabN36 commented Oct 12, 2020

To mitigate Instagram IP detection (on the API side) I use proxies which are usually not located in popular data center IP ranges.

@restyler thanks for replying, really appreciated. Can you tell me a little more about how you avoid getting blocked by Instagram? Are you using a third-party API or anything that provides a new IP on each request? From looking at your code, it seems like you're just asking the user for proxy credentials and connecting to that proxy server, if I'm not wrong. Please let me know. Thanks.

@ycaty

ycaty commented Oct 30, 2020

Hey, really enjoyed this post. I made a quick lil mockup breaking down how to scrape user tags without login.
https://gist.github.com/ycaty/23cf1c17e6bb6e353f5823b3392c1e01#file-instagram-user-tag-scraping-2020

By any chance does anyone happen to have a way to collect followers without logging in?

@rramoscabral

hey really enjoyed this post. i made a quick lil mockup on the break down of scraping user tags without login.
https://gist.github.com/levlet/23cf1c17e6bb6e353f5823b3392c1e01

By any chance does anyone happen to have a way to collect followers without logging in?

Page not found

@ycaty

ycaty commented Feb 2, 2021

hey really enjoyed this post. i made a quick lil mockup on the break down of scraping user tags without login.
https://gist.github.com/levlet/23cf1c17e6bb6e353f5823b3392c1e01
By any chance does anyone happen to have a way to collect followers without logging in?

Page not found

updated link
https://gist.github.com/ycaty/23cf1c17e6bb6e353f5823b3392c1e01#file-instagram-user-tag-scraping-2020

@Yashwanthd1998

Looks like Instagram is blocking scraping via file_get_contents/curl. Anyone got a solution? I wonder how online web scraping tools keep working without getting blocked.


ghost commented Aug 6, 2021

Hi 'Cosmocatalano' [nomen est omen?] :),
this is a very interesting solution. I have only tried it on localhost, so I have no problem with CORS. But the array names seem to have changed completely. The only one that is still the same seems to be 'entry_data'. Is the changed response still usable with alternative array 'names'? That would be very interesting.

Best regards and thanks
Axel Arnold Bangert
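
Since the key names keep changing, one way to cope is to walk the decoded array defensively instead of chaining indexes. The sketch below assumes a hypothetical helper called array_path; the sample data only mimics the path used in the gist and its values are made up, so it says nothing about what Instagram returns today.

```php
<?php
// Walks a decoded JSON array along a list of keys and returns null as soon as a key is missing.
function array_path($data, array $keys) {
    foreach ($keys as $key) {
        if (!is_array($data) || !array_key_exists($key, $data)) {
            return null;
        }
        $data = $data[$key];
    }
    return $data;
}

// Sample data shaped like the path used in the gist; the values are made up.
$sample = ['entry_data' => ['ProfilePage' => [['user' => ['media' => ['nodes' => [['code' => 'abc123']]]]]]]];
$old_path = ['entry_data', 'ProfilePage', 0, 'user', 'media', 'nodes', 0];

$node = array_path($sample, $old_path);
echo $node === null ? "path not found\n" : 'code: '.$node['code']."\n";
```

When a path comes back as null, printing array_keys() of the last level that still resolves is a quick way to see what the keys are called now.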

@skmachine

Looks like Instagram is blocking scraping via file_get_contents/curl. Anyone got a solution? I wonder how online web scraping tools keep working without getting blocked.

I guess it is just the right amount of good proxies. I am now using the https://rapidapi.com/neotank/api/instagram130 /proxy method to avoid dealing with proxies myself, because they fail all the time (for Instagram) and get a 302 redirect to login.
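
To tell a blocked request apart from a real profile page, the response headers can be checked for that redirect. This is a rough sketch: is_login_redirect is a made-up name, and the '/accounts/login' path is my assumption about where the redirect points, so adjust it to whatever Location header you actually see.

```php
<?php
// Returns true when a list of raw response header lines (the shape PHP puts in
// $http_response_header) contains a redirect whose Location points at a login page.
// The '/accounts/login' path is an assumption, not something confirmed in this thread.
function is_login_redirect(array $headers): bool {
    $redirect = false;
    $to_login = false;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+30[12378]\b#', $line)) {
            $redirect = true;   // status line of a redirect response
        } elseif (stripos($line, 'Location:') === 0 && stripos($line, '/accounts/login') !== false) {
            $to_login = true;   // redirect target looks like the login page
        }
    }
    return $redirect && $to_login;
}

// Usage (needs network access, so it is left commented out):
// $html = file_get_contents('http://instagram.com/cosmocatalano');
// if (is_login_redirect($http_response_header)) { echo "blocked, try another proxy\n"; }
```

Because file_get_contents follows redirects by default, the header list holds every response in the chain, which is why the function scans all lines rather than only the first status line.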