
@JustAnotherArchivist
Last active January 3, 2022 02:52
URL prioritisation in wpull

This page gathers some thoughts on implementing URL prioritisation in wpull.

Background

Prioritising URLs that fulfil a certain criterion would often make archival more efficient. For example, when grabbing a dying website, the grab could focus on the website's own resources first and only retrieve external links (which are likely to stay around longer) later. And if there may not be enough time to archive a website in full, priorities would allow selectively grabbing certain, more important things first.

Implementation

CLI options

My idea is to add --priority-regex, --priority-domain, --priority-hostname, etc. options. These would be similar to the URL filters --accept-regex, --domains, and --hostnames, respectively, but each --priority-* option could be used multiple times, would take two arguments (a regex/domain/hostname and a priority value), and would be applied in the order specified. The first matching option defines the priority of a particular URL.

For example, --priority-domain example.com 3 --priority-scheme ftp -1 --priority-regex ^https?://example.net/critical/ 2 --priority-domain example.net 1 would assign all URLs under example.com priority 3, FTP URLs priority -1, the critical directory on example.net priority 2, other pages on example.net priority 1, and any other URL priority 0 (default value). Some example URLs, in the order they would be processed:

| URL | Priority | Rule |
| --- | --- | --- |
| `https://example.com/foo.html` | 3 | `--priority-domain example.com 3` |
| `ftp://example.com/bar` | 3 | `--priority-domain example.com 3` |
| `https://example.net/critical/missile_codes` | 2 | `--priority-regex ^https?://example.net/critical/ 2` |
| `https://example.net/baz` | 1 | `--priority-domain example.net 1` |
| `https://example.org/` | 0 | (default) |
| `ftp://example.net/baz` | -1 | `--priority-scheme ftp -1` |
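The first-match semantics above can be sketched in Python. Everything here is illustrative, not wpull code: the rule list mirrors the example command line, and `get_priority` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Rules in CLI order; the first matching rule wins, default priority is 0.
RULES = [
    ("domain", "example.com", 3),
    ("scheme", "ftp", -1),
    ("regex", r"^https?://example\.net/critical/", 2),
    ("domain", "example.net", 1),
]

def matches(kind, arg, url):
    parts = urlsplit(url)
    if kind == "domain":
        host = parts.hostname or ""
        return host == arg or host.endswith("." + arg)
    if kind == "scheme":
        return parts.scheme == arg
    if kind == "regex":
        return re.search(arg, url) is not None
    return False

def get_priority(url):
    for kind, arg, priority in RULES:
        if matches(kind, arg, url):
            return priority
    return 0  # default priority
```

Note the ordering subtlety from the table: `ftp://example.com/bar` gets priority 3 because the example.com rule precedes the ftp rule, while `ftp://example.net/baz` gets -1 because the ftp rule precedes the example.net rule.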

Plugins

I have no idea, and I honestly haven't thought much about it yet. Plugins would need to be able to add and remove prioritisation rules on the fly. There should probably also be a way to trigger a reprioritisation of all URLs already in the queue.

Assigning priorities

There would be an object with a method that calculates the priority of a given URL (DemuxPriorityRule.get_priority or similar), analogous to DemuxURLFilter.test, which checks whether a URL should be added to the queue. Under the hood, this could reuse the *URLFilter classes to handle the actual rule matching.
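A minimal sketch of what such a demux object could look like, by analogy with DemuxURLFilter. The class and method names are assumptions, not actual wpull API; a real implementation would wrap the *URLFilter classes instead of the toy regex rule shown here.

```python
import re

class RegexPriorityRule:
    def __init__(self, pattern, priority):
        self._pattern = re.compile(pattern)
        self.priority = priority

    def test(self, url):
        # Same contract as a URL filter's test(): does this rule apply?
        return self._pattern.search(url) is not None


class DemuxPriorityRule:
    """Return the priority of the first matching child rule, else a default."""

    def __init__(self, rules, default=0):
        self._rules = list(rules)
        self._default = default

    def get_priority(self, url):
        for rule in self._rules:
            if rule.test(url):
                return rule.priority
        return self._default
```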



ae-s commented Jul 11, 2017

If this is going to add what is essentially a parallel ignore system, perhaps it would be better to rework the ignoracle into a thing that returns a priority, where there is a very-lowest priority defined as "ignore".
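This suggestion could be sketched as a single oracle that returns a priority, with a sentinel below every real priority meaning "don't fetch at all". The design and names here are assumptions for illustration, not ArchiveBot's actual ignoracle API.

```python
import re

# Sentinel: lower than any real priority, meaning "ignore entirely".
IGNORE = float("-inf")

def classify(url, ignore_patterns, priority_rules, default=0):
    if any(re.search(p, url) for p in ignore_patterns):
        return IGNORE
    for pattern, priority in priority_rules:
        if re.search(pattern, url):
            return priority
    return default
```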


ae-s commented Jul 11, 2017

Is a higher priority more urgent, or is lower more urgent? Important UX consideration.

@JustAnotherArchivist
Author

Hmm, that sounds interesting, but I'm not sure mixing the two concepts is a good idea. (Also, the ignoracle is an ArchiveBot thing while this will be raw wpull.)
What do others think about it?

In my opinion, higher priority value = more urgent.

@hannahwhy

Hmm, this doesn't seem like a parallel ignore system to me. It uses a similar matching system, but I understand it as a sort order and not a filter.

I like this idea in its outlines. Dynamic reprioritisation would be a very powerful tool: start a grab, start poking at the target in a web browser, find something important, reprioritise. It should be possible to do so even without hairy synchronisation code, as the wpull URL database acts as the synchronisation point (SQLite allows only a single writer, etc.).

@JustAnotherArchivist
Author

Correct. The existing options are "Should we grab this at all?", while this is the follow-up question "Ok, so we want this. Should we grab it now or later?".

Yeah, and even if you don't use SQLite (the --database-uri option allows any backend supported by SQLAlchemy), the code uses transactions throughout to keep everything consistent. So: start a transaction, iterate over all URLs with status todo, pass each to the get_priority function, and write the result back to the DB. This would happen somewhere inside wpull; a plugin would simply call a recalculate_priorities function to trigger it.
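The loop described above can be sketched against plain sqlite3. This is illustrative only: the table layout is invented, wpull's actual schema and SQLAlchemy session handling differ, and recalculate_priorities is the hypothetical hook named in the comment.

```python
import sqlite3

def recalculate_priorities(conn, get_priority):
    # One transaction: readers see either the old or the new priorities,
    # never a mix, which is the synchronisation point mentioned above.
    with conn:
        rows = conn.execute(
            "SELECT id, url FROM queue WHERE status = 'todo'").fetchall()
        conn.executemany(
            "UPDATE queue SET priority = ? WHERE id = ?",
            [(get_priority(url), id_) for id_, url in rows])
```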


Sanqui commented Jul 12, 2017

I'm a fan of this idea. Not sure the priority domain/hostname/scheme options aren't superfluous next to plain regexes, though. Careful about fluff.

@JustAnotherArchivist
Author

True, regexes could cover those things. But there's a performance perspective as well: checking whether a string is in the tuple ('http', 'https', 'ftp') is certainly much faster than matching the regex ^(https?|ftp):, for example. Similarly, the domain and hostname checks are simply string comparisons and would require quite complex regexes otherwise to handle authentication data and ports correctly.
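A quick micro-benchmark illustrates the point: tuple membership versus an equivalent anchored regex for the scheme check. Absolute timings vary by machine; only the comparison matters.

```python
import re
import timeit

scheme_re = re.compile(r"^(https?|ftp):")
schemes = ("http", "https", "ftp")

url = "https://example.com/foo.html"
scheme = url.split(":", 1)[0]

# Both checks give the same answer; the tuple lookup avoids the regex engine.
t_tuple = timeit.timeit(lambda: scheme in schemes, number=100_000)
t_regex = timeit.timeit(lambda: scheme_re.match(url) is not None, number=100_000)
print(f"tuple: {t_tuple:.4f}s  regex: {t_regex:.4f}s")
```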
