Incremental crawling with advertools: crawl a set number of pages on each run without re-crawling the same pages.
import advertools as adv

adv.crawl(
    # start crawling from this URL (or list of URLs):
    url_list='https://en.wikipedia.org/wiki/Main_Page',
    # save the crawl output to this file:
    output_file='/home/user_name/wikipedia_en_crawl.jl',
    # should the crawler follow links?
    follow_links=True,
    # but don't follow all links, only links that match this regex:
    include_url_regex='https://en.wikipedia.org/wiki',
    custom_settings={
        # where to save the crawl job (this manages deduplication, and avoids re-crawling crawled pages):
        'JOBDIR': '/home/user_name/wikipedia_crawl_job',
        # after how many URLs should it stop crawling?
        'CLOSESPIDER_PAGECOUNT': 250,
        # where to save crawl logs:
        'LOG_FILE': '/home/user_name/wikipedia_en_crawl.log'
    })
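The output file is in JSON Lines format (one crawled page per line), so after each scheduled run you can load it with pandas to check how many pages have accumulated. A quick sketch, assuming pandas is installed and the same output path as above:

import pandas as pd

# each crawled page is stored as one JSON line; lines=True parses the .jl file
wiki_crawl = pd.read_json('/home/user_name/wikipedia_en_crawl.jl', lines=True)

# with JOBDIR handling deduplication, each URL should appear only once
print(wiki_crawl['url'].nunique(), 'pages crawled so far')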
@eliasdabbas (Author)

Create a virtual environment, let's call it virtual_env, and install advertools in it.
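
On a Linux server that might look like the following (a sketch; the paths are placeholders):

python3 -m venv /path/to/virtual_env
/path/to/virtual_env/bin/pip install advertools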

From the command line, run:

crontab -e

Then add the following line to the end of the file:

@hourly PATH=/path/to/virtual_env/bin; /path/to/virtual_env/bin/python /path/to/your_script.py

Note: Make sure you use the full path to your environment, your Python executable, and your script.

More on how to automate python scripts on a Linux server: https://bit.ly/476BSlt

In addition to @hourly, you can use @daily, @weekly, or @monthly.
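
To confirm that the scheduled runs are working, you can check the log file set in LOG_FILE above, for example:

tail /home/user_name/wikipedia_en_crawl.log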
