@NamPNQ
Last active April 9, 2016 10:18
Download images and store in correct path via curl
# Needs GNU awk: the three-argument match() is a gawk extension.
gawk 'match($2, /images\/(.*)$/, a) { print $2 "\t" a[1] }' images.txt |
while read -r url store_path; do
  curl --create-dirs -o "${store_path}" "${url}"
done
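For reference, the URL-to-path mapping done by the awk filter can be sketched in Python. This is a minimal sketch, assuming each line of images.txt carries the URL in its second whitespace-separated field (as the `$2` in the awk script implies). The helper name `store_path_for` is hypothetical and not part of the original gist.

```python
import re

# Same pattern the awk filter applies to field 2: everything after "images/".
IMAGES_RE = re.compile(r"images/(.*)$")


def store_path_for(line):
    """Return (url, store_path) for one input line, or None if it does not match."""
    fields = line.split()
    if len(fields) < 2:
        return None
    url = fields[1]
    m = IMAGES_RE.search(url)
    if m is None:
        return None
    return url, m.group(1)
```

Lines whose second field contains no `images/` segment yield `None`, which matches the awk filter printing nothing for them.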
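# Crawl a site recursively and print every URL found under the base URL.
# Ported to Python 3 (print function, urllib.parse).
import sys
from urllib.parse import urljoin

import requests
from pyquery import PyQuery as pq

had_get = set()  # URLs that have already been requested


def get_urls(url):
    if url in had_get:
        return
    had_get.add(url)
    r = requests.get(url)
    if r.status_code != 200:
        return
    print(url)
    # Only HTML pages are parsed for further links.
    if not r.headers.get('content-type', '').startswith('text/html'):
        return
    doc = pq(r.text)
    for link in doc('a'):
        href = pq(link).attr('href')
        if not href:
            continue  # anchor without an href attribute
        # Resolve relative links against the current page, not the site root.
        real_link = urljoin(url, href)
        if real_link not in had_get and real_link.startswith(url_base):
            get_urls(real_link)


if __name__ == "__main__":
    url_base = sys.argv[1]
    get_urls(url_base)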
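
The link filter inside `get_urls` can be checked without any network access. The sketch below isolates that step using only the standard library. It assumes Python 3 (`urllib.parse`) and resolves each href against the page it was found on. The helper name `same_site_links` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def same_site_links(page_url, hrefs, url_base):
    """Resolve hrefs against page_url and keep the unique ones under url_base."""
    kept = []
    for href in hrefs:
        if not href:
            continue  # skip anchors that have no href
        absolute = urljoin(page_url, href)
        if absolute.startswith(url_base) and absolute not in kept:
            kept.append(absolute)
    return kept


links = same_site_links(
    "http://example.com/docs/a.html",
    ["b.html", "/docs/c.html", "http://other.org/x", None, "b.html"],
    "http://example.com/docs/",
)
print(links)
```

Links that leave the base URL, missing hrefs, and duplicates are dropped, so only the two pages under `http://example.com/docs/` remain.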