@stephenhouser
Last active May 20, 2023 04:31
Bing-Image-Scraper

Bing Image Scraper Example

Example using Python to query and scrape Microsoft Bing image search.

Requirements

BeautifulSoup and requests packages are required to run this example.

If you are using the command line and pip you should be able to install these with:

pip install bs4
pip install requests

Running

To run, give the script an image search term. It will run a Bing image search for that term and save the discovered images into the Pictures subdirectory. In the following example we look for images of kittens.

python bing_image.py kitten
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests
import re
import sys
import os
import http.cookiejar
import json
import urllib.request, urllib.error, urllib.parse

def get_soup(url, header):
    return BeautifulSoup(urllib.request.urlopen(
        urllib.request.Request(url, headers=header)),
        'html.parser')

query = sys.argv[1]
query = query.split()
query = '+'.join(query)
url = "http://www.bing.com/images/search?q=" + query + "&FORM=HDRSC2"

# add the directory for your image here
DIR = "Pictures"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = get_soup(url, header)

ActualImages = []  # contains the link for large original images and the type of image
for a in soup.find_all("a", {"class": "iusc"}):
    mad = json.loads(a["mad"])
    turl = mad["turl"]
    m = json.loads(a["m"])
    murl = m["murl"]

    image_name = urllib.parse.urlsplit(murl).path.split("/")[-1]
    print(image_name)
    ActualImages.append((image_name, turl, murl))

print("there are a total of", len(ActualImages), "images")

if not os.path.exists(DIR):
    os.mkdir(DIR)

DIR = os.path.join(DIR, query.split()[0])
if not os.path.exists(DIR):
    os.mkdir(DIR)

# download the images
for i, (image_name, turl, murl) in enumerate(ActualImages):
    try:
        raw_img = urllib.request.urlopen(turl).read()
        cntr = len([i for i in os.listdir(DIR) if image_name in i]) + 1
        f = open(os.path.join(DIR, image_name), 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print("could not load : " + image_name)
        print(e)
@wisehackermonkey

wisehackermonkey commented Jun 23, 2020

Thanks @stephenhouser, this was exactly what I needed for a project.

Here's my version that only returns the first image URL as a tuple:

#!/usr/bin/env python3
# adapted from code by @stephenhouser on github
# https://gist.github.com/stephenhouser/c5e2b921c3770ed47eb3b75efbc94799
from bs4 import BeautifulSoup
import requests
import re
import sys
import os
import http.cookiejar
import json
import urllib.request, urllib.error, urllib.parse


def get_soup(url,header):
    return BeautifulSoup(urllib.request.urlopen(
        urllib.request.Request(url,headers=header)),
        'html.parser')

def bing_image_search(query):
    query= query.split()
    query='+'.join(query)
    url="http://www.bing.com/images/search?q=" + query + "&FORM=HDRSC2"

    #add the directory for your image here
    DIR="Pictures"
    header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = get_soup(url,header)
    image_result_raw = soup.find("a",{"class":"iusc"})

    m = json.loads(image_result_raw["m"])
    murl, turl = m["murl"],m["turl"]# mobile image, desktop image

    image_name = urllib.parse.urlsplit(murl).path.split("/")[-1]
    return (image_name,murl, turl)



if __name__ == "__main__":
    query = sys.argv[1]
    results = bing_image_search(query)
    print(results)

run example

python bing_image.py kitten

results

('tabby-kitten-small-xlarge.jpg', 'https://www.telegraph.co.uk/content/dam/Pets/spark/royal-canin/tabby-kitten-small-xlarge.jpg', 'http://tse1.mm.bing.net/th?id=OIP.OgqBDWRFUYWF0Wunyye_GgHaEo&pid=15.1')
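The function above only returns URLs; it doesn't save anything. A minimal sketch of a follow-up download step for that tuple (the `download_image` name and the User-Agent value are my own, not from the gist; some hosts may still block scripted requests):

```python
import os
import urllib.request

def download_image(result, dest_dir="."):
    """Save the full-size image from a (image_name, murl, turl) tuple.

    Uses the full-size URL (murl) and names the file after the last
    path component of that URL. Returns the path of the saved file.
    """
    image_name, murl, _turl = result
    # Some image hosts reject requests with no User-Agent header.
    req = urllib.request.Request(murl, headers={"User-Agent": "Mozilla/5.0"})
    path = os.path.join(dest_dir, image_name)
    with urllib.request.urlopen(req) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```

For example, `download_image(bing_image_search("kitten"), "Pictures")` would save the first result into the Pictures directory.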

@a19singh

Actually it didn't work for me; instead it showed this error:

Traceback (most recent call last):
  File "bing-image.py", line 30, in <module>
    mad = json.loads(a["mad"])
  File "C:\ProgramData\Anaconda3\lib\site-packages\bs4\element.py", line 1321, in __getitem__
    return self.attrs[key]
KeyError: 'mad'

I need some help getting past this.

@ShamsAnsari

ShamsAnsari commented Aug 30, 2020

Actually it didn't work for me; instead it showed this error:

Traceback (most recent call last):
  File "bing-image.py", line 30, in <module>
    mad = json.loads(a["mad"])
  File "C:\ProgramData\Anaconda3\lib\site-packages\bs4\element.py", line 1321, in __getitem__
    return self.attrs[key]
KeyError: 'mad'

I need some help getting past this.

You can fix this problem by replacing the code inside the for loop at line 29 with this:

    print(a)
    # mad = json.loads(a["mad"])
    # turl = mad["turl"]
    m = json.loads(a["m"])
    murl = m["murl"]
    turl = m["turl"]

    image_name = urllib.parse.urlsplit(murl).path.split("/")[-1]
    print(image_name)

    ActualImages.append((image_name, turl, murl))

@BasitJaved

This only downloads the first 35 images. Is there any way to increase it to 100 or more?

@JJwilkin

JJwilkin commented Apr 13, 2021

This only downloads the first 35 images. Is there any way to increase it to 100 or more?

I was wondering the same thing. It would also be useful to pull images from beyond the top results.

@michhar

michhar commented May 11, 2021

@BasitJaved and @JJwilkin: this is the snippet I changed to get multiple pages of results. The key was adding &pageNum={}. I don't think it guarantees the images are unique, but this might help.

...

query = sys.argv[1]
query= query.split()
query='+'.join(query)
#add the directory for your image here
DIR=os.path.join(os.getcwd(), "pictures")
ActualImages=[]# contains the link for Large original images, type of  image

for i in range(5):
    url="http://www.bing.com/images/search?q={}&pageNum={}&FORM=HDRSC2".format(query, i)
    print(url)
    header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = get_soup(url,header)

    for a in soup.find_all("a",{"class":"iusc"}):
        m = json.loads(a["m"])
        murl = m["murl"]
        turl = m["turl"]
        image_name = urllib.parse.urlsplit(murl).path.split("/")[-1]
        ActualImages.append((image_name, turl, murl))
...

@robosina

@michhar, all of the images on the later pages are duplicates (we don't get new images). I changed ActualImages to a set to avoid re-downloading the same photos.
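A minimal sketch of the de-duplication @robosina describes, keying on the full-size URL (the `add_image` helper name is my own, not from the gist):

```python
# Track full-size URLs we have already collected so that pages of
# duplicate results are skipped instead of being re-downloaded.
seen_murls = set()
ActualImages = []

def add_image(image_name, turl, murl):
    """Append a result only if its full-size URL (murl) is new.

    Returns True if the image was added, False if it was a duplicate.
    """
    if murl in seen_murls:
        return False
    seen_murls.add(murl)
    ActualImages.append((image_name, turl, murl))
    return True
```

Inside the paging loop above, calling `add_image(image_name, turl, murl)` instead of `ActualImages.append(...)` keeps only the first occurrence of each image.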
