@eliasdabbas
Last active March 24, 2024 12:14
Get the most up-to-date list of IP addresses for crawler bots, belonging to Google and Bing.
import ipaddress
import requests
import pandas as pd
def bot_ip_addresses():
    """Return a DataFrame of individual IPv4 addresses used by Googlebot and Bingbot."""
    bots_urls = {
        'google': 'https://developers.google.com/search/apis/ipranges/googlebot.json',
        'bing': 'https://www.bing.com/toolbox/bingbot.json'
    }
    ip_addresses = []
    for bot, url in bots_urls.items():
        bot_resp = requests.get(url)
        for iprange in bot_resp.json()['prefixes']:
            # Entries without an 'ipv4Prefix' key are skipped
            network = iprange.get('ipv4Prefix')
            if network:
                # Expand the CIDR range into one row per address
                ip_list = [(bot, str(ip)) for ip in ipaddress.IPv4Network(network)]
                ip_addresses.extend(ip_list)
    return pd.DataFrame(ip_addresses, columns=['bot_name', 'ip_address'])
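
Note: the function above expands every range into individual addresses and keeps only IPv4. For a plain allow-list check, one lighter option is to keep the ranges as network objects and test membership directly. The following is a minimal sketch, not part of the original gist. It assumes that entries lacking `ipv4Prefix` carry an `ipv6Prefix` key instead; the helper names `prefixes_to_networks` and `is_listed` are hypothetical, and the sample payload uses documentation-only ranges rather than real bot ranges.

```python
import ipaddress

def prefixes_to_networks(payload):
    """Collect every IPv4 and IPv6 network listed in a parsed prefixes file."""
    networks = []
    for entry in payload.get('prefixes', []):
        # Assumption: each entry holds either an 'ipv4Prefix' or an 'ipv6Prefix' key
        cidr = entry.get('ipv4Prefix') or entry.get('ipv6Prefix')
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_listed(ip, networks):
    """Return True if the address falls inside any of the networks."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa)
    return any(address in network for network in networks)

# Illustrative payload only (documentation ranges, not real crawler ranges)
sample_payload = {
    'prefixes': [
        {'ipv4Prefix': '192.0.2.0/24'},
        {'ipv6Prefix': '2001:db8::/32'},
    ]
}
sample_networks = prefixes_to_networks(sample_payload)
print(is_listed('192.0.2.10', sample_networks))
print(is_listed('198.51.100.7', sample_networks))
```

In practice the payload would be the result of `bot_resp.json()` for each URL in `bots_urls`.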
@kstubs

kstubs commented Apr 25, 2023

Have you discovered any .json resources for other search engines like DuckDuckGo or Yahoo?
I appreciate your effort here. Even though I'm a C# guy, I'm looking for anyone who is doing something similar to what I need, which is a bot whitelist.

@eliasdabbas
Author

@kstubs
I've tried to do the same thing, but couldn't find a similar resource where the IPs are kept up to date and you can simply fetch the latest list. The closest I found is this help page:

https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot

I just tried and the IPs can be scraped with this simple code:

import requests
from bs4 import BeautifulSoup
resp = requests.get('https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/')

soup = BeautifulSoup(resp.text, 'lxml')

ddg_ip_list = [x.text for x in soup.select('.content li')]
ddg_ip_list
['20.191.45.212',
 '40.88.21.235',
 '40.76.173.151',
 '40.76.163.7',
 '20.185.79.47',
 '52.142.26.175',
 '20.185.79.15',
 '52.142.24.149',
 '40.76.162.208',
 '40.76.163.23',
 '40.76.162.191',
 '40.76.162.247']
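
Because the `.content li` selector depends on the page layout, it could also pick up list items that are not IP addresses if the page changes. A minimal sketch of a guard, assuming the scraped items are plain strings; `keep_valid_ips` is a hypothetical helper name and the sample list is made up for illustration:

```python
import ipaddress

def keep_valid_ips(candidates):
    """Keep only the strings that parse as IPv4 or IPv6 addresses."""
    valid = []
    for text in candidates:
        try:
            # ip_address raises ValueError for anything that is not an IP
            valid.append(str(ipaddress.ip_address(text.strip())))
        except ValueError:
            continue
    return valid

# Illustrative input: two addresses and one unrelated list item
scraped = ['20.191.45.212', ' 40.88.21.235\n', 'Some other list item']
print(keep_valid_ips(scraped))
```

Passing `ddg_ip_list` through a filter like this leaves a clean list unchanged and drops anything unexpected.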

Happy to look into others to make the list as comprehensive and up-to-date as possible.

Thanks!

@kstubs

kstubs commented Apr 26, 2023

I've created a DuckDuckGo prefixes file here:
https://jsoneditoronline.org/#left=cloud.511273c830ca42a488778345c096f6a5
Unfortunately I do not see a way to grab this content programmatically from this site, but you can at least consume it and use it locally.
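
For anyone who wants to build such a file locally instead, here is a minimal sketch. It assumes the file should mirror the `prefixes` / `ipv4Prefix` layout read by the gist's function, with each single address written as a /32 range. The name `ips_to_prefixes` is hypothetical, and only the first three addresses from the scraped list above are used.

```python
import ipaddress
import json

def ips_to_prefixes(ip_list):
    """Wrap single IPv4 addresses as /32 entries in a 'prefixes' layout."""
    return {
        'prefixes': [
            # IPv4Address validates each string before it is written out
            {'ipv4Prefix': f'{ipaddress.IPv4Address(ip)}/32'}
            for ip in ip_list
        ]
    }

ddg_sample = ['20.191.45.212', '40.88.21.235', '40.76.173.151']
ddg_prefixes = ips_to_prefixes(ddg_sample)

# Serialize for saving to disk; json.loads reads it back unchanged
ddg_json = json.dumps(ddg_prefixes, indent=2)
print(ddg_json)
```

A file in this shape can be parsed with the same `iprange.get('ipv4Prefix')` loop as the Google and Bing files, since a /32 network expands to exactly one address.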

@eliasdabbas
Author

That's cool.

The code I shared (or any equivalent in another language) can be used to grab the content programmatically from the page where the IPs are listed.

@kstubs

kstubs commented Jul 1, 2023

Nice! I'll consider scraping that page as well.
