Link Expansion is used to produce a graph of links between content. An example can be seen by looking at the "links" section of https://www.gov.uk/api/content/world/all
You can access the expanded links on demand through the /v2/expanded-links/{content-id} endpoint.
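For illustration, fetching expanded links on demand might look something like the sketch below; the host is a placeholder and the `expanded_links` response key is assumed from the example above:

```python
import requests

content_id = "91b8ef20-74e7-4552-880c-50e6d73c2ff9"
# Placeholder host; the Publishing API is an internal service.
url = f"https://publishing-api.example/v2/expanded-links/{content_id}"
response = requests.get(url)
expanded_links = response.json().get("expanded_links", {})
```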
Most expanded link graphs are small and cause no performance problems. However, some represent a large taxonomy and are slow to load, e.g. /world/all (91b8ef20-74e7-4552-880c-50e6d73c2ff9), which has over 2000 items in its graph.
The process of building expanded links can be thought of as two distinct steps:
- Building a link graph - the process of working out the link types and content_ids into a graph structure
- Populating the link graph - the process of combining the link graph with the content of each edition
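As a rough illustration of the two steps, here is a minimal sketch in Python. LINK_SETS and EDITIONS are in-memory stand-ins for the real database tables, and the helper names are made up for this sketch:

```python
# Stand-in data stores; the real Publishing API reads these from Postgres.
LINK_SETS = {
    "root": {"ordered_related_items": ["child-1", "child-2"]},
    "child-1": {},
    "child-2": {},
}
EDITIONS = {
    "root": {"title": "Root"},
    "child-1": {"title": "Child 1"},
    "child-2": {"title": "Child 2"},
}

def build_link_graph(content_id, visited=None):
    """Step 1: work out the link types and content_ids into a graph structure."""
    visited = set() if visited is None else visited
    if content_id in visited:
        return {}  # guard against cycles in the graph
    visited.add(content_id)
    return {
        link_type: [
            {"content_id": cid, "links": build_link_graph(cid, visited)}
            for cid in linked_ids
        ]
        for link_type, linked_ids in LINK_SETS.get(content_id, {}).items()
    }

def populate_link_graph(graph):
    """Step 2: mix the link graph with the content of each edition."""
    return {
        link_type: [
            {**EDITIONS[node["content_id"]], "links": populate_link_graph(node["links"])}
            for node in nodes
        ]
        for link_type, nodes in graph.items()
    }
```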
At present the expanded-links endpoint caches for one hour.
We have access to memcached with the Publishing API, however each machine has its own instance and the cache isn't shared.
We have recently made performance improvements to building expanded links: 91b8ef20-74e7-4552-880c-50e6d73c2ff9 has gone from ~20 seconds to ~2 seconds.
With the recent performance improvements we can now generate expanded links within a request, but not consistently fast enough, so we would still benefit from some form of caching.
This is how the caching currently works, and the problem is that users can't see changes they've made until the cache expires, which in the worst case means waiting an hour.
One of the simpler options to resolve this problem is to build the expanded links in the downstream worker, as part of our existing process, and use the result to update what the expanded-links endpoint returns.
We'd likely store this information in the database, since storing it in memcached would mean running the build on every machine.
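A sketch of what that worker might look like, reusing the hypothetical build_link_graph/populate_link_graph functions from the earlier sketch and using sqlite3 as a stand-in for Postgres (the expanded_links table is an assumption):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE IF NOT EXISTS expanded_links (
    content_id TEXT PRIMARY KEY,
    payload TEXT NOT NULL,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)""")

def downstream_worker(content_id):
    """Rebuild expanded links after a change and persist them, so the
    endpoint can serve the stored copy instead of computing on demand."""
    expanded = populate_link_graph(build_link_graph(content_id))
    db.execute(
        "INSERT OR REPLACE INTO expanded_links (content_id, payload) VALUES (?, ?)",
        (content_id, json.dumps(expanded)),
    )
    db.commit()
```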
The problem with this is that the data returned from the endpoint might be stale, and there is no way for the client to know this.
If we are able to invalidate the cache when data changes then we can most likely return an accurate response from the endpoint. This is, however, also the most complicated approach.
There are two approaches defined for how to do this: Link graph cache and Expanded links cache. Both of them, however, require solving the same problem: Defining and Invalidating cache dependencies.
Beyond the approaches above, the following are ideas that are either new or haven't been pursued.
We have wondered whether it would be quicker to look up the individual links of each edition/link set using a key-value cache such as Redis, thereby trading a couple of hundred database queries for a couple of hundred cache lookups. The expectation was that this would perform similarly, so it hasn't been investigated.
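If it were investigated, it might take a shape like the following sketch, assuming a Redis instance and a made-up edition:&lt;content_id&gt; key scheme, with EDITIONS from the earlier sketch standing in for a database query:

```python
import json
import redis

r = redis.Redis()  # assumes a local Redis instance

def fetch_edition_cached(content_id):
    """Look up one edition in the cache, falling back to the database."""
    key = f"edition:{content_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    edition = EDITIONS[content_id]  # stand-in for a database query
    r.set(key, json.dumps(edition), ex=3600)
    return edition
```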
When we build an expanded link set we could store the most recent updated_at of the editions used to populate the link graph. Then, when we populate a link graph, we could first run a query to get the most recent updated_at across the entire set of editions and check whether this data is cached or not. updated_at might not be the most appropriate field for handling deletes, though.
We would however still need a link graph to use this.
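A minimal sketch of the idea, with UPDATED_AT standing in for the editions table; in reality the maximum would come from a single query such as SELECT MAX(updated_at) FROM editions WHERE content_id IN (...):

```python
UPDATED_AT = {"root": "2023-01-01", "child-1": "2023-02-01", "child-2": "2023-01-15"}
CACHE = {}

def collect_content_ids(graph):
    """Walk the link graph and gather every content_id it references."""
    ids = set()
    for nodes in graph.values():
        for node in nodes:
            ids.add(node["content_id"])
            ids |= collect_content_ids(node["links"])
    return ids

def fetch_expanded_links(content_id, graph):
    current_max = max(
        (UPDATED_AT.get(cid, "") for cid in collect_content_ids(graph)), default=""
    )
    entry = CACHE.get(content_id)
    if entry and entry["max_updated_at"] == current_max:
        return entry["expanded_links"]  # stored copy is still fresh
    expanded = populate_link_graph(graph)  # from the earlier sketch
    CACHE[content_id] = {"max_updated_at": current_max, "expanded_links": expanded}
    return expanded
```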
Obviously this brings back the question of whether we'd be better suited by a graph database, but the amount of change that would involve is extremely high. Even with a graph database we would still have the problem that the resources we return are large and require a lot of data.
Ideally, to solve this, we'd want a way to cache a key together with a collection of dependent keys. Any time we change something in the Publishing API we could then invalidate every cache entry that has the changed item as a dependency.
The way this has been explored so far is to store the dependencies in the database.
A couple of problems have been encountered so far:
- If we store the dependents in a Postgres array column (with a GIN index) it is quite slow to find the records that match and invalidate the cache (we may find it is fast enough)
- If we store the dependents in a separate table the inserts are slow (inserting perhaps 2000 rows rather than 1) but deletes are fast - we may be able to make the inserts much faster, though; both options are sketched below
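To make the trade-off concrete, here is a sketch of the separate-table option using sqlite3; the Postgres array column with a GIN index has no sqlite equivalent, so it appears only as commented DDL, and all table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Option 1 (Postgres only):
#   CREATE TABLE expanded_link_caches (
#     content_id uuid PRIMARY KEY,
#     dependent_content_ids uuid[],  -- GIN-indexed for containment lookups
#     payload jsonb
#   );
# Option 2 (portable, shown below): one row per dependency.
conn.executescript("""
CREATE TABLE cache_entries (content_id TEXT PRIMARY KEY, payload TEXT);
CREATE TABLE cache_dependencies (
  cache_content_id TEXT,
  dependent_content_id TEXT
);
CREATE INDEX idx_dependents ON cache_dependencies (dependent_content_id);
""")

def write_cache(content_id, payload, dependent_ids):
    conn.execute("INSERT OR REPLACE INTO cache_entries VALUES (?, ?)",
                 (content_id, payload))
    conn.execute("DELETE FROM cache_dependencies WHERE cache_content_id = ?",
                 (content_id,))
    # The slow part: one row per dependency, possibly ~2000 rows.
    conn.executemany("INSERT INTO cache_dependencies VALUES (?, ?)",
                     [(content_id, dep) for dep in dependent_ids])
    conn.commit()

def invalidate_for(changed_content_id):
    """Drop every cache entry that has the changed item as a dependency."""
    stale = [row[0] for row in conn.execute(
        "SELECT DISTINCT cache_content_id FROM cache_dependencies "
        "WHERE dependent_content_id = ?", (changed_content_id,))]
    conn.executemany("DELETE FROM cache_entries WHERE content_id = ?",
                     [(s,) for s in stale])
    conn.executemany("DELETE FROM cache_dependencies WHERE cache_content_id = ?",
                     [(s,) for s in stale])
    conn.commit()
```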
Links don't change very often, so maintaining a link graph cache would require less frequent invalidation than updating a cache on every content change.
When state-changing events occur in the Publishing API (PutContent, Publish, Unpublish, DiscardDraft) we can compare whether links have changed and invalidate caches accordingly. This is explored in When does data change.
We would then need a different method to cache the populated data, though.
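For illustration, the event-driven side might look like this sketch, reusing LINK_SETS and invalidate_for from the earlier sketches; the event names match those above, everything else is made up:

```python
PREVIOUS_LINKS = {}

def handle_event(event_type, content_id):
    """Called for PutContent, Publish, Unpublish and DiscardDraft."""
    new_links = LINK_SETS.get(content_id, {})  # stand-in for reading the link set
    if new_links != PREVIOUS_LINKS.get(content_id):
        PREVIOUS_LINKS[content_id] = new_links
        invalidate_for(content_id)  # drop cached graphs that depend on this item
```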
We could use the same approach as in Link graph cache but instead invalidate the cache of dependencies on every change. This would mean storing a bigger object in the cache and invalidating it frequently, but it would be a simpler concept than caching the graph and populated links separately.
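Reusing the pieces from the earlier sketches, the difference is only in what gets stored and when invalidate_for is called: the fully populated payload is cached and dropped on any change to a dependent edition.

```python
import json

def cache_expanded_links(content_id, graph):
    """Store the fully populated expanded links against all their dependencies."""
    payload = json.dumps(populate_link_graph(graph))
    write_cache(content_id, payload, collect_content_ids(graph))
```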