my-1st-crawler: takes page source code from the clipboard and batch-downloads the images it finds, saving each one under a name filtered out of the markup.
# my-1st-crawler: batch-download images named in clipboard HTML (Python 2 / OS X)
import re
import urllib
from subprocess import check_output

from bs4 import BeautifulSoup

# Grab the page source that was copied to the clipboard.
html = check_output(["pbpaste"])
soup = BeautifulSoup(html, "html.parser")

# All elements carrying the target class (replace "WHATEVER" with the real class name).
stuff = soup(class_="WHATEVER")

# Dump one tag per line to a scratch file so each line can be scanned with regexes.
# (pprint.pformat returns a single string; looping over it walks characters, not lines --
# splitting the string with .splitlines() is what the earlier attempt was missing.)
with open('/Users/henry/Desktop/output.txt', 'w') as f:
    for tag in stuff:
        f.write(str(tag) + '\n')

with open('/Users/henry/Desktop/output.txt', 'r') as f:
    for line in f:
        url_match = re.search(r'http.*?\.jpg', line)
        name_match = re.search(r'"(.*?)"', line)
        if not url_match or not name_match:
            continue  # skip lines without an image URL or a quoted name
        a = url_match.group()              # image URL
        b = name_match.group(1) + '.jpg'   # file name taken from the first quoted attribute
        urllib.urlretrieve(a, b)
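
A shorter variant skips the scratch file and regexes and reads the URL and name straight from each tag's attributes. This is only a sketch under assumptions not in the original gist: it presumes the matched elements are img tags whose src attribute ends in .jpg and whose alt attribute holds the desired file name.

import urllib
from subprocess import check_output

from bs4 import BeautifulSoup

html = check_output(["pbpaste"])
soup = BeautifulSoup(html, "html.parser")

for tag in soup(class_="WHATEVER"):
    # 'src' and 'alt' are assumed attribute names; adjust them to the actual markup.
    url = tag.get('src')
    name = tag.get('alt')
    if url and name and url.endswith('.jpg'):
        urllib.urlretrieve(url, name + '.jpg')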