This page gathers some thoughts on implementing URL prioritisation into wpull.
Prioritisation of URLs fulfilling a certain criterion would often be very useful for more efficient archiving. For example, when grabbing a dying website, the grab could first focus on the website's own resources and only retrieve external links (which are likely to stay around longer) later. Or if time may not be sufficient to archive a website in full, priorities would allow selectively grabbing certain, more important things first.
My idea is to add `--priority-regex`, `--priority-domain`, `--priority-hostname`, etc. options. In a sense, these would be similar to the URL filters `--accept-regex`, `--domains`, and `--hostnames`, respectively, but the `--priority-*` options could be used multiple times and would be applied in the order specified (and would obviously take two arguments: a regex/domain/hostname and the priority value). The first matching option defines the priority for a particular URL.
For example, `--priority-domain example.com 3 --priority-scheme ftp -1 --priority-regex ^https?://example.net/critical/ 2 --priority-domain example.net 1` would assign all URLs under `example.com` priority 3, FTP URLs priority -1, the `critical` directory on `example.net` priority 2, other pages on `example.net` priority 1, and any other URL priority 0 (the default value). Some example URLs, in the order they would be processed:
| URL | Priority | Rule |
|---|---|---|
| `https://example.com/foo.html` | 3 | `--priority-domain example.com 3` |
| `ftp://example.com/bar` | 3 | `--priority-domain example.com 3` |
| `https://example.net/critical/missile_codes` | 2 | `--priority-regex ^https?://example.net/critical/ 2` |
| `https://example.net/baz` | 1 | `--priority-domain example.net 1` |
| `https://example.org/` | 0 | (default) |
| `ftp://example.net/baz` | -1 | `--priority-scheme ftp -1` |
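The first-match-wins resolution described above can be sketched in Python as follows. This is only an illustration of the matching semantics, not wpull code; the helper names (`make_rules`, `get_priority`, `_host_matches`) are made up for this example:

```python
import re
from urllib.parse import urlsplit


def _host_matches(url, domain):
    """True if the URL's host is the given domain or a subdomain of it."""
    host = urlsplit(url).hostname or ''
    return host == domain or host.endswith('.' + domain)


def make_rules():
    """The example command line as an ordered list of (predicate, priority)
    pairs. Order matters: the first matching rule wins."""
    return [
        (lambda u: _host_matches(u, 'example.com'), 3),   # --priority-domain example.com 3
        (lambda u: urlsplit(u).scheme == 'ftp', -1),      # --priority-scheme ftp -1
        (lambda u: bool(re.match(r'^https?://example\.net/critical/', u)), 2),
                                                          # --priority-regex ^https?://example.net/critical/ 2
        (lambda u: _host_matches(u, 'example.net'), 1),   # --priority-domain example.net 1
    ]


def get_priority(rules, url, default=0):
    """Return the priority of the first matching rule, or the default."""
    for matches, priority in rules:
        if matches(url):
            return priority
    return default
```

Note in particular that `ftp://example.com/bar` gets priority 3, not -1: the domain rule precedes the scheme rule on the command line, so it wins.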
I have no idea, and I honestly haven't thought a lot about it yet. Plugins would need to be able to add and remove prioritisation rules on the fly. There should probably also be a way to trigger a re-prioritisation of all URLs already in the queue.
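One possible shape for such a plugin-facing API, just to make the idea concrete; everything here (class name, methods, the dict standing in for the URL table) is hypothetical and does not exist in wpull:

```python
class PriorityRuleRegistry:
    """Hypothetical mutable registry of prioritisation rules.

    Plugins add and remove named rules at runtime. Changing the rule set
    invalidates previously computed priorities, so after a change the queue
    owner would call reprioritise() on the URLs already queued.
    """

    def __init__(self):
        self._rules = []  # ordered list of (name, predicate, priority)

    def add_rule(self, name, matches, priority):
        self._rules.append((name, matches, priority))

    def remove_rule(self, name):
        self._rules = [r for r in self._rules if r[0] != name]

    def get_priority(self, url, default=0):
        for _name, matches, priority in self._rules:
            if matches(url):
                return priority
        return default

    def reprioritise(self, url_table):
        # url_table maps URL -> priority here; in wpull this would be the
        # on-disk URL table rather than an in-memory dict.
        for url in url_table:
            url_table[url] = self.get_priority(url)
```

A re-prioritisation pass like this is O(queued URLs × rules), which may matter for multi-million-URL grabs, so it should probably be an explicit trigger rather than something that runs on every rule change.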
There would be an object with a function calculating the priority of a given URL (`DemuxPriorityRule.get_priority` or similar), analogous to `DemuxURLFilter.test` checking whether a URL should be added to the queue. Under the hood, this could reuse the `*URLFilter` classes for actually handling the rules.
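A rough sketch of how that demux object might look. `RegexFilter` below is a minimal stand-in with a `test` method in the style of wpull's `*URLFilter` classes, not the real class, and `DemuxPriorityRule` is the proposed (not existing) object:

```python
import re


class RegexFilter:
    """Stand-in for a wpull-style URL filter: test() says whether a URL matches."""

    def __init__(self, pattern):
        self._pattern = re.compile(pattern)

    def test(self, url):
        return bool(self._pattern.search(url))


class DemuxPriorityRule:
    """Walks an ordered list of (url_filter, priority) pairs and returns the
    first match's priority, mirroring how DemuxURLFilter combines filters."""

    def __init__(self, rules):
        self._rules = rules

    def get_priority(self, url, default=0):
        for url_filter, priority in self._rules:
            if url_filter.test(url):
                return priority
        return default
```

Reusing the filter classes this way would keep the matching semantics of `--priority-*` identical to those of the corresponding `--accept-*`/`--domains`-style options.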
Related issues: