
@JustAnotherArchivist
Last active January 3, 2022 02:52
URL prioritisation in wpull

This page gathers some thoughts on implementing URL prioritisation in wpull.

Background

Prioritising URLs that fulfil a certain criterion would often make archival more efficient. For example, when grabbing a dying website, the grab could focus on the website's own resources first and only retrieve external links (which are likely to stay around longer) later. And if there may not be enough time to archive a website in full, priorities would allow selectively grabbing certain, more important things first.

Implementation

CLI options

My idea is to add --priority-regex, --priority-domain, --priority-hostname, etc. options. These would be similar to the URL filters --accept-regex, --domains, and --hostnames, respectively, but each --priority-* option could be used multiple times, would take two arguments (a regex/domain/hostname and a priority value), and would be applied in the order specified. The first matching option defines the priority of a particular URL.

For example, --priority-domain example.com 3 --priority-scheme ftp -1 --priority-regex ^https?://example.net/critical/ 2 --priority-domain example.net 1 would assign all URLs under example.com priority 3, FTP URLs priority -1, the critical directory on example.net priority 2, other pages on example.net priority 1, and any other URL priority 0 (default value). Some example URLs, in the order they would be processed:

| URL | Priority | Rule |
| --- | --- | --- |
| `https://example.com/foo.html` | 3 | `--priority-domain example.com 3` |
| `ftp://example.com/bar` | 3 | `--priority-domain example.com 3` |
| `https://example.net/critical/missile_codes` | 2 | `--priority-regex ^https?://example.net/critical/ 2` |
| `https://example.net/baz` | 1 | `--priority-domain example.net 1` |
| `https://example.org/` | 0 | (default) |
| `ftp://example.net/baz` | -1 | `--priority-scheme ftp -1` |
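The first-match semantics above can be sketched in Python. Everything here is illustrative, not wpull code: the rule list mirrors the example command line, and `get_priority` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Rules in CLI order; the first matching rule wins, default priority is 0.
RULES = [
    ("domain", "example.com", 3),
    ("scheme", "ftp", -1),
    ("regex", r"^https?://example\.net/critical/", 2),
    ("domain", "example.net", 1),
]

def matches(kind, arg, url):
    parts = urlsplit(url)
    if kind == "domain":
        host = parts.hostname or ""
        return host == arg or host.endswith("." + arg)
    if kind == "scheme":
        return parts.scheme == arg
    if kind == "regex":
        return re.search(arg, url) is not None
    return False

def get_priority(url):
    for kind, arg, priority in RULES:
        if matches(kind, arg, url):
            return priority
    return 0  # default priority
```

Note the ordering subtlety from the table: `ftp://example.com/bar` gets priority 3 because the example.com rule precedes the ftp rule, while `ftp://example.net/baz` gets -1 because the ftp rule precedes the example.net rule.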

Plugins

I have no idea, and I honestly haven't thought much about it yet. Plugins would need to be able to add and remove prioritisation rules on the fly. There should probably also be a way to trigger a reprioritisation of all URLs already in the queue.

Assigning priorities

There would be an object with a method that calculates the priority of a given URL (DemuxPriorityRule.get_priority or similar), analogous to DemuxURLFilter.test, which checks whether a URL should be added to the queue. Under the hood, this could reuse the *URLFilter classes to handle the actual rule matching.
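A minimal sketch of what such a demux object could look like, by analogy with DemuxURLFilter. The class and method names are assumptions, not actual wpull API; a real implementation would wrap the *URLFilter classes instead of the toy regex rule shown here.

```python
import re

class RegexPriorityRule:
    def __init__(self, pattern, priority):
        self._pattern = re.compile(pattern)
        self.priority = priority

    def test(self, url):
        # Same contract as a URL filter's test(): does this rule apply?
        return self._pattern.search(url) is not None


class DemuxPriorityRule:
    """Return the priority of the first matching child rule, else a default."""

    def __init__(self, rules, default=0):
        self._rules = list(rules)
        self._default = default

    def get_priority(self, url):
        for rule in self._rules:
            if rule.test(url):
                return rule.priority
        return self._default
```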



ae-s commented Jul 11, 2017

If this is going to add what is essentially a parallel ignore system, perhaps it would be better to rework the ignoracle into a thing that returns a priority, where there is a very-lowest priority defined as "ignore".
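This suggestion could be sketched as a single oracle that returns a priority, with a sentinel below every real priority meaning "don't fetch at all". The design and names here are assumptions for illustration, not ArchiveBot's actual ignoracle API.

```python
import re

# Sentinel: lower than any real priority, meaning "ignore entirely".
IGNORE = float("-inf")

def classify(url, ignore_patterns, priority_rules, default=0):
    if any(re.search(p, url) for p in ignore_patterns):
        return IGNORE
    for pattern, priority in priority_rules:
        if re.search(pattern, url):
            return priority
    return default
```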


ae-s commented Jul 11, 2017

Is a higher priority more urgent, or is lower more urgent? Important UX consideration.

@JustAnotherArchivist
Author

Hmm, that sounds interesting, but I'm not sure mixing the two concepts is a good idea. (Also, the ignoracle is an ArchiveBot thing while this will be raw wpull.)
What do others think about it?

In my opinion, higher priority value = more urgent.

@hannahwhy

Hmm, this doesn't seem like a parallel ignore system to me. It uses a similar matching system, but I understand it as a sort order and not a filter.

I like this idea in its outlines. Dynamic reprioritisation would be a very powerful tool: start a grab, start poking at the target in a web browser, find something important, reprioritise. It should be possible to do so even without hairy synchronisation code, as the wpull URL database acts as the synchronisation point (SQLite allows only a single writer, etc.).

@JustAnotherArchivist
Author

Correct. The existing options are "Should we grab this at all?", while this is the follow-up question "Ok, so we want this. Should we grab it now or later?".

Yeah, and even if you don't use SQLite (the --database-uri option allows any backend supported by SQLAlchemy), the code uses transactions throughout to keep everything consistent. So: start a transaction, iterate over all URLs with status todo, pass each to the get_priority function, and write the result back to the DB. This would happen somewhere inside wpull; a plugin would simply call a recalculate_priorities function to trigger it.
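The loop described above can be sketched against plain sqlite3. This is illustrative only: the table layout is invented, wpull's actual schema and SQLAlchemy session handling differ, and recalculate_priorities is the hypothetical hook named in the comment.

```python
import sqlite3

def recalculate_priorities(conn, get_priority):
    # One transaction: readers see either the old or the new priorities,
    # never a mix, which is the synchronisation point mentioned above.
    with conn:
        rows = conn.execute(
            "SELECT id, url FROM queue WHERE status = 'todo'").fetchall()
        conn.executemany(
            "UPDATE queue SET priority = ? WHERE id = ?",
            [(get_priority(url), id_) for id_, url in rows])
```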


Sanqui commented Jul 12, 2017

I'm a fan of this idea. Not sure the priority domain/hostname/scheme options aren't superfluous next to plain regexes, though. Careful about fluff.

@JustAnotherArchivist
Author

True, regexes could cover those things. But there's a performance perspective as well: checking whether a string is in the tuple ('http', 'https', 'ftp') is certainly much faster than matching the regex ^(https?|ftp):, for example. Similarly, the domain and hostname checks are simply string comparisons and would require quite complex regexes otherwise to handle authentication data and ports correctly.
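A quick micro-benchmark illustrates the point: tuple membership versus an equivalent anchored regex for the scheme check. Absolute timings vary by machine; only the comparison matters.

```python
import re
import timeit

scheme_re = re.compile(r"^(https?|ftp):")
schemes = ("http", "https", "ftp")

url = "https://example.com/foo.html"
scheme = url.split(":", 1)[0]

# Both checks give the same answer; the tuple lookup avoids the regex engine.
t_tuple = timeit.timeit(lambda: scheme in schemes, number=100_000)
t_regex = timeit.timeit(lambda: scheme_re.match(url) is not None, number=100_000)
print(f"tuple: {t_tuple:.4f}s  regex: {t_regex:.4f}s")
```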
