Link previews should not be generated by each instance
Currently, each Mastodon instance generates its own link preview, using the `LinkCrawlerWorker`.
The preview is generated right after creating the status on the original instance.
When a status is received through ActivityPub, the worker is launched after a random wait, up to 60s.
Link previews are cached locally for each instance, keyed by URL. So another status with the same URL will re-use the cached preview, if available.
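For reference, a rough sketch of that flow in a Rails/Sidekiq context (simplified; helpers like `first_link_in` and `fetch_preview_from` are placeholders, not the actual Mastodon internals):

```ruby
require 'sidekiq'

# Rough sketch of the current behaviour: one crawl per instance, scheduled
# with a random delay for remote statuses, and cached locally keyed by URL.
class LinkCrawlerWorker
  include Sidekiq::Worker

  def perform(status_id)
    status = Status.find(status_id)
    url    = first_link_in(status) # placeholder helper extracting the first URL
    return if url.nil?

    # Per-instance cache keyed by URL: another status with the same URL
    # re-uses the cached preview instead of hitting the site again.
    Rails.cache.fetch("preview_card/#{url}") do
      fetch_preview_from(url) # placeholder helper doing the actual HTTP fetch
    end
  end
end

# When a status arrives over ActivityPub, the crawl is delayed by up to 60s:
LinkCrawlerWorker.perform_in(rand(60).seconds, status.id)
```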
As the Fediverse grows, there are more and more instances. A single status can be federated to more than 1000 instances, generating as many requests to the URL contained in the status (thundering herd problem).
For example, this status (from a 30k follower account) generated more than 3000 hits on this URL.
This is a real issue reported by multiple people. I am starting to see missing previews in my feed, as some servers no longer respond in time to the requests that generate them, and some website owners are even blocking the Mastodon user agent because of this issue.
Rather than generating the preview when the status is first received, we could have a way to generate it when the status is read by a client for the first time.
This could maybe help spread the origin load over a longer period and avoid instances without active users generating a preview, reducing the total of needed queries.
However this may also not work for big accounts, as there is a good chance that their statuses will be read by at least one person on each instance in the minute following the post.
Also we need to define what "seen" means here:
- either it is when the instance first needs to send this preview to a client (because it requests it / it appears in a status list that needs to be sent), in which case generating the preview asynchronously will not work, as we need to send it right away
- or we need the clients to implement a way to fetch a status when it is actually displayed to the user, which (if I am not wrong) does not exist at the moment (a rough sketch of this follows below)
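As a rough illustration of that second interpretation (a hypothetical endpoint, not an existing Mastodon API), the first read would enqueue the crawl and later reads would hit the cache:

```ruby
# Hypothetical endpoint: the client asks for the preview card only when the
# status actually becomes visible. The first such request enqueues the crawl;
# later requests hit the cache. Instances with no active readers never crawl.
class PreviewCardsController < ApplicationController
  def show
    status = Status.find(params[:status_id])
    card   = Rails.cache.read("preview_card/#{status.link_url}") # link_url: placeholder

    if card
      render json: card
    else
      LinkCrawlerWorker.perform_async(status.id)
      head :accepted # 202: preview is being generated, the client may retry later
    end
  end
end
```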
Previews could be fetched only (or mostly) by the instance where the initial status has been created (the origin instance).
Then the preview is sent to the federated instances, included in the status payload.
The receiving instances would not need to generate the preview, only populate their local cache with it.
A big problem here is that the preview will most often not be generated yet when the status is sent to the federated instances, as federation happens right after the status is created, while preview generation is an asynchronous job.
It also means that a malicious instance could generate a wrong preview for this URL, and this preview would be populated in other instances' caches. So if this URL is used again in another (non-malicious) status, the malicious preview from the cache will be used (cache poisoning).
A mitigation for this could be to split the cache into 2:
- a local trusted cache, containing every preview for locally-generated (thus trusted) URLs
- an untrusted cache, containing previews received from other instances, not tied to the URL but to a specific status. This means the preview for this status will come from the origin instance, so if this instance is malicious it will only affect this specific status (see the sketch after this list)
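A minimal sketch of what that two-tier lookup could look like (hypothetical cache keys and helpers):

```ruby
# Hypothetical lookup order for the split cache: locally crawled previews are
# trusted and keyed by URL, previews received over federation are only ever
# used for the specific status they arrived with.
def preview_for(status)
  url = status.link_url # placeholder

  Rails.cache.read("trusted_preview/#{url}") ||                # crawled locally, URL-keyed
    Rails.cache.read("untrusted_preview/status/#{status.id}")  # received via federation, status-keyed
end

def store_received_preview(status, preview)
  # Never keyed by URL, so a malicious origin can only poison its own status.
  Rails.cache.write("untrusted_preview/status/#{status.id}", preview)
end

def store_local_preview(url, preview)
  Rails.cache.write("trusted_preview/#{url}", preview)
end
```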
This is similar to the previous solution, but this time the `LinkCrawlerWorker` will first request the preview from the status' origin instance. If it has already been generated, it will use it; otherwise it will generate it locally using the existing mechanism.
We could also have a special status value to indicate that the preview is being generated (queued) and the request should be retried in a few seconds.
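Roughly, the worker-side flow could look like this (the endpoint path and the `retry_later` / `crawl_locally` helpers are made up for illustration):

```ruby
require 'net/http'
require 'json'
require 'cgi'

# Hypothetical worker-side flow: ask the origin instance for the preview it
# should have generated, and fall back to the existing local crawl otherwise.
def fetch_preview_from_origin(status)
  uri = URI("#{status.origin_base_url}/api/v1/preview_cards?url=#{CGI.escape(status.link_url)}")
  res = Net::HTTP.get_response(uri)

  case res.code.to_i
  when 200 then JSON.parse(res.body)   # the origin already has the preview
  when 202 then retry_later(status)    # "queued": try again in a few seconds
  else          crawl_locally(status)  # fall back to the current local mechanism
  end
end
```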
This solution risks generating a lot of traffic for the origin instance, as many federated instances will request the preview in a short timespan (when the status propagates). The instance would in fact receive the same number of requests that the origin website currently receives, which could overwhelm Mastodon if it is not sized properly. But this can easily be mitigated by caching this endpoint (using the Rails cache or a CDN).
It has the same trust implications as the previous solution, with the same reasoning on whether this really matters.
Give the instance operator the ability to provide a list of "trusted" instances for link preview fetching (`TrustedPreviewSources`).
When a status is received, `FetchLinkCardService` can then call a specific endpoint on a `TrustedPreviewSource` to fetch the preview for this URL.
There are multiple ways of doing this:
- fetch the preview from `n` (2?) sources, use it if they both return the same preview (what does "same" mean?); a sketch of this option follows this list
- fetch from a random source, retry until it gets a preview, and fall back on a local fetch
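A sketch of the first option (hypothetical `fetch_preview` and `crawl_locally` helpers; "same" is interpreted naively here as identical normalized JSON):

```ruby
require 'json'

# Hypothetical: ask `n` trusted sources for the preview of a URL and only use
# it if they all agree. "Same" means identical JSON after sorting the keys,
# which a real implementation would need to define more carefully.
def consensus_preview(url, sources, n: 2)
  previews   = sources.sample(n).map { |source| fetch_preview(source, url) } # fetch_preview: placeholder HTTP call
  normalized = previews.compact.map { |preview| JSON.generate(preview.sort.to_h) }

  return nil unless normalized.size == n && normalized.uniq.size == 1

  previews.first
end

# Fall back to a local crawl if the sources disagree (or are unreachable).
preview = consensus_preview(url, trusted_preview_sources) || crawl_locally(url)
```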
Similar to the previous solution, we could imagine randomly fetching the preview locally as well and checking it against the one from the `TrustedPreviewSource` to see if it has been altered.
There should probably be a list of such default instances in the sample configuration, to encourage instance owners to configure some.
This will probably generate quite a lot of traffic to those trusted instances, but this endpoint can easily be cached (either using the Rails cache, or even better a CDN), as the response will always be the same for a given URL.
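For example, a cached endpoint on a trusted source could be as simple as this sketch (hypothetical controller; `crawl_preview_for` is a placeholder for the existing crawling logic):

```ruby
# Hypothetical preview endpoint on a trusted source: the response for a given
# URL effectively never changes, so it can be cached both server-side and by a
# CDN in front of the instance.
class PreviewsController < ApplicationController
  def show
    preview = Rails.cache.fetch("preview_card/#{params[:url]}", expires_in: 1.day) do
      crawl_preview_for(params[:url]) # placeholder for the existing crawl
    end

    expires_in 1.day, public: true # Cache-Control header, lets a CDN absorb repeated requests
    render json: preview
  end
end
```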
This is similar to the previous solution, but we have a separate service to generate (and cache) the previews. A few community-maintained (and trusted) instances of this service are configured by the instance owner.
It could more easily allow other services needing URL previews to use this service/API/protocol, including other Fediverse software.
Having a separate service for this will probably also help scaling and caching it (very easy to put a CDN in front of it).
This is probably the best long-term solution, but it would require the whole web ecosystem to implement it… I am including it so it is mentioned, but I don't consider it solves the issue, as it will take years to get done.
We design a web protocol for websites to generate their previews (oEmbed extension?) and sign them. Each instance would then only need to fetch the origin public key (static file) and validate that the received or fetched preview (using one of the above solutions) is correctly signed by the origin website.
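As a very rough sketch of the verification side (the key location and signature scheme here are entirely hypothetical):

```ruby
require 'openssl'
require 'net/http'
require 'base64'

# Hypothetical verification flow: the website publishes a public key at a
# well-known location (made-up path) and signs the preview payload it serves.
# Any instance can then check that a preview it received was not altered.
def preview_signed_by_origin?(preview_json, signature_b64, origin_host)
  key_pem = Net::HTTP.get(URI("https://#{origin_host}/.well-known/preview-key.pem"))
  key     = OpenSSL::PKey::RSA.new(key_pem)

  key.verify(OpenSSL::Digest.new('SHA256'), Base64.decode64(signature_b64), preview_json)
end
```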
A few other issues I have in mind:
- The Fediverse is not only Mastodon. Other software will need to be updated to stop originating those preview queries, and we probably want a solution they can implement as well, building on this work
Re: solution 6, I was wondering whether you might be able to reuse a fair chunk of the Subresource Integrity spec by having the server which publishes an image declare the hashes for previews and the source image, which would allow clients and potentially servers to retrieve the same payload from a shared cache or other service.
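As a rough sketch of what I mean, reusing the SRI `sha256-<base64>` notation (the check itself is hypothetical, not part of any existing spec for previews):

```ruby
require 'digest'
require 'base64'

# Hypothetical SRI-style check: the origin declares "sha256-<base64 digest>"
# for the preview (or image) bytes, so anyone fetching them from a shared
# cache can verify they were not tampered with before use.
def matches_integrity?(bytes, integrity)
  algo, expected = integrity.split('-', 2)
  raise ArgumentError, "unsupported algorithm #{algo}" unless algo == 'sha256'

  Base64.strict_encode64(Digest::SHA256.digest(bytes)) == expected
end

matches_integrity?('', 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=') # => true (digest of an empty payload)
```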
One other thing which comes to mind is the multihash format which the IPFS community created. I'm not sure about IPFS in general for this problem – it seems to have performance issues which might be a challenge for the Fediverse – but conceptually it seems to solve a similar problem, except that a Mastodon implementation could likely afford to be more lenient about bypassing the system, since the failure mode is fetching it from the origin server rather than someone's data being irrecoverably lost.