@lsfalimis
Created June 5, 2014 10:21
my-1st-crawler-alt.py: downloads the HTML document instead of reading it from the clipboard (I haven't tested it yet)
import pprint, re
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

html = urlopen('http://SOMEWEBSITE').read()
soup = BeautifulSoup(html, 'html.parser')
# calling the soup is shorthand for soup.find_all(); it returns a list of matching tags
stuff = soup(class_="WHATEVER")
# keep a pretty-printed copy of the matches for inspection
with open('/Users/henry/Desktop/output.txt', 'w') as f:
    pprint.pprint(stuff, f)
# I tried 'pprint.pformat', but it returns one string, so the loop walked
# characters instead of lines; looping over the tags themselves avoids that
for tag in stuff:
    line = str(tag)
    a = re.search(r'http.*?jpg', line)  # image URL
    b = re.search(r'".*?"', line)       # first quoted value, used as the file name
    if a and b:  # skip tags that have no jpg link or no quoted value
        urlretrieve(a.group(), b.group().strip('"') + '.jpg')
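The two regular expressions in the loop above can be tried on their own before anything is downloaded. The following is a minimal sketch rather than part of the gist: the helper name `extract_image`, the sample markup, and the example.com URL are all made up for illustration, and the real page may be laid out differently.

```python
import re

def extract_image(line):
    """Return (url, filename) for one line of tag markup, or None if either regex fails."""
    url = re.search(r'http.*?jpg', line)   # shortest run from 'http' to 'jpg'
    name = re.search(r'".*?"', line)       # first double-quoted value in the line
    if url is None or name is None:
        return None
    return url.group(), name.group().strip('"') + '.jpg'

# Hypothetical markup; the attribute order decides which value becomes the file name.
sample = '<img alt="cat" class="WHATEVER" src="http://example.com/a/cat.jpg"/>'
result = extract_image(sample)
print(result)
```

Because the file name comes from whichever quoted value appears first, a tag whose first attribute is `class` would be saved as `WHATEVER.jpg`, so it may be worth checking a few real lines this way first.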