@ShaikeA
Created January 16, 2019 20:28
import grequests
from bs4 import BeautifulSoup

urls = ["https://www.google.com", "..."]  # all links to scrape

# Create pools of proxies and headers (create_pools() is defined elsewhere) and get the first ones
proxies_pool, headers_pool = create_pools()
current_proxy = next(proxies_pool)
current_headers = next(headers_pool)

# Build a generator of unsent requests for grequests.map(); with size=4, four requests are sent concurrently.
# Proxies and headers are per-request keyword arguments, so they are passed to grequests.get().
# Note that the same proxy and headers are used for all the requests below.
rs = (grequests.get(u,
                    proxies={"http": current_proxy, "https": current_proxy},
                    headers=current_headers)
      for u in urls)
pages = grequests.map(rs, size=4, exception_handler=exception_handler)

# Get BeautifulSoup objects for all successfully retrieved pages
soups = [BeautifulSoup(page.content, 'html.parser')
         if page is not None and page.status_code == 200 else "problem"
         for page in pages]
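# The snippet above calls create_pools() and exception_handler without defining them.
# A minimal sketch of what they might look like is below; the proxy addresses and
# User-Agent strings are placeholders you would supply yourself, and both helpers
# would need to be defined before the code above runs.
import itertools

def create_pools():
    # Cycle endlessly over your own proxies and header sets so next() always returns a value
    proxies = ["..."]       # e.g. "ip:port" strings of your proxies
    user_agents = ["..."]   # real User-Agent strings
    headers_list = [{"User-Agent": ua} for ua in user_agents]
    return itertools.cycle(proxies), itertools.cycle(headers_list)

def exception_handler(request, exception):
    # grequests.map() calls this for every request that raises; the corresponding
    # entry in pages stays None, which the list comprehension above guards against
    print("Request to {} failed: {}".format(request.url, exception))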