@gordonje
Last active February 19, 2020 20:56
A scraping script that runs in multiple, parallel processes
import requests

from time import sleep
from multiprocessing import Pool

session = None


def set_global_session():
    # Runs once in each worker process when the pool starts, so every
    # worker gets its own requests.Session and reuses its connections.
    global session
    if not session:
        session = requests.Session()


def cache_page(identifier):
    # Throttle each worker so we don't hammer the server.
    sleep(3)
    url = f'https://mycourts.in.gov/PORP/Search/Detail?ID={identifier}'
    r = session.get(url)
    with open(f'.cache/SearchDetail/{identifier}.html', 'wb') as file:
        file.write(r.content)
    print(f'Cached content from {url}')


if __name__ == "__main__":
    identifiers = range(1, 60000)
    # Pool was imported directly, so call it as Pool(...), not
    # multiprocessing.Pool(...).
    with Pool(initializer=set_global_session) as pool:
        pool.map(cache_page, identifiers)
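The key trick in the script above is the `initializer=` argument to `Pool`: it runs once per worker process, so each worker builds its own module-level `session` rather than sharing (or pickling) one across processes. A minimal, self-contained sketch of that pattern, with a stand-in string instead of a real `requests.Session` so it needs no network:

```python
import os
from multiprocessing import Pool

session = None  # per-process global; each worker process gets its own copy


def init_worker():
    # Runs once in each worker process when the pool starts.
    global session
    if session is None:
        # Stand-in for requests.Session(); tagged with the worker's PID.
        session = f'session-for-pid-{os.getpid()}'


def which_session(task_id):
    # Every task handled by the same worker sees that worker's session.
    return session


results = []
if __name__ == "__main__":
    with Pool(processes=2, initializer=init_worker) as pool:
        results = pool.map(which_session, range(8))
    # Expect at most 2 distinct session strings: one per worker process.
    print(sorted(set(results)))
```

With 2 workers handling 8 tasks, the 8 returned strings collapse to at most 2 distinct values, confirming each worker initialized its session exactly once and reused it for every task it handled.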