@eliasdabbas
Last active April 27, 2022 08:56
Crawl multiple websites with one for loop, while saving the output, logs, and job status separately for each website. Resume crawling at any time simply by re-running the same code.
from urllib.parse import urlsplit

import advertools as adv

sites = [
    'https://www.who.int',
    'https://www.nytimes.com',
    'https://www.washingtonpost.com',
]

for site in sites:
    # use the domain name to name this site's output file, log file, and job directory
    domain = urlsplit(site).netloc
    adv.crawl(site,
              output_file=domain + '.jl',
              follow_links=True,
              custom_settings={
                  # a separate log file per website
                  'LOG_FILE': domain + '.log',
                  # change this to any number of pages
                  'CLOSESPIDER_PAGECOUNT': 50,
                  # resume the same crawl jobs later
                  'JOBDIR': domain
              })
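Each output file is a standard JSON-lines file (one crawled page per line), so it can be inspected with pandas once a crawl has run. A minimal sketch, assuming the loop above has already produced www.who.int.jl in the working directory and that the default crawl columns (url, title, status) are present:

import pandas as pd

# each line in the .jl file is one crawled page
who = pd.read_json('www.who.int.jl', lines=True)
print(who.shape)
print(who[['url', 'title', 'status']].head())

Re-running the original loop resumes each crawl from its JOBDIR instead of starting over.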
@eliasdabbas (Author)

Directory structure after running the above code:

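Based on the code above, the working directory should end up with three files/folders per domain (exact contents will differ between runs), roughly:

www.who.int.jl               # crawl output, one JSON object per crawled page
www.who.int.log              # Scrapy log for this site
www.who.int/                 # JOBDIR: queued requests and seen URLs, used for resuming
www.nytimes.com.jl
www.nytimes.com.log
www.nytimes.com/
www.washingtonpost.com.jl
www.washingtonpost.com.log
www.washingtonpost.com/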

Line, word, character count for each file:

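The counts themselves are not reproduced here, but they can be regenerated locally. A quick sketch in Python (the glob patterns are an assumption; adjust them to wherever the files were written):

from pathlib import Path

# rough equivalent of `wc` for the crawl outputs and logs
for path in sorted(Path('.').glob('www.*.jl')) + sorted(Path('.').glob('www.*.log')):
    text = path.read_text()
    print(f'{text.count(chr(10)):>10} {len(text.split()):>10} {len(text):>12}  {path.name}')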
