Last active: October 20, 2023 12:42
Incremental crawling with advertools: crawl a set number of pages on each run without re-crawling pages you have already crawled.
import advertools as adv

adv.crawl(
    # Start crawling from this URL (a list of URLs also works):
    url_list='https://en.wikipedia.org/wiki/Main_Page',
    # Save the crawl output to this file:
    output_file='/home/user_name/wikipedia_en_crawl.jl',
    # Should it follow links?
    follow_links=True,
    # But don't follow all links, only links that match this regex:
    include_url_regex='https://en.wikipedia.org/wiki',
    custom_settings={
        # Where to save the crawl job state (this manages deduplication and avoids re-crawling crawled pages):
        'JOBDIR': '/home/user_name/wikipedia_crawl_job',
        # After how many URLs should it stop crawling?
        'CLOSESPIDER_PAGECOUNT': 250,
        # Where to save crawl logs:
        'LOG_FILE': '/home/user_name/wikipedia_en_crawl.log'
    })
Create a virtual environment, let's say virtual_env.

From the command line run crontab -e, then add the following line to the end of the file:

@hourly PATH=/path/to/virtual_env/bin ; /path/to/virtual_env/bin/python /path/to/your_script.py

Note: Make sure you use the full path to your environment, Python, and your script.

More on how to automate Python scripts on a Linux server: https://bit.ly/476BSlt
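The full setup can be sketched as a few shell commands; every path here is illustrative (replace `/path/to/virtual_env` and `your_script.py` with your own):

```shell
# Create the virtual environment and install advertools into it:
python3 -m venv /path/to/virtual_env
/path/to/virtual_env/bin/pip install advertools

# Open your crontab for editing and add the @hourly line shown above:
crontab -e
```

Inside `your_script.py` you would put the `adv.crawl(...)` call from above; because `JOBDIR` persists between runs, each scheduled run picks up where the previous one stopped.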
In addition to @hourly, you can use @daily, @weekly, or @monthly.