A few cases to remind ourselves what the behaviour of the whole pipeline should be when merging things:
- A:1 -> B:1, where A:1 is the canonical work, results in:
  - A:1 merged with B:1
  - B:1 redirected to A:1
- A:1 -> B:1 -> C:1, where A:1 is the canonical work, results in:
  - A:1 merged with B:1 and C:1
  - B:1 redirected to A:1
  - C:1 redirected to A:1
- Assuming the previous case has happened, an update comes in for B:2, which no longer links to C:1. This results in:
  - A:1 merged with B:2
  - B:2 redirected to A:1
  - C:1 no longer redirected
The matcher receives individual works and stores and updates the state of the graph.
- It receives A:1, which says that A:1 -> B:1
  - it stores the link A:1 -> B:1
  - it stores A:1 and B:1 as belonging to the same component AB
  - it sends [[A:1, B:1]] to the merger
- It receives B:1, which says that B:1 -> C:1
  - it stores the link B:1 -> C:1
  - it updates A:1, B:1 and C:1 as belonging to the same component ABC
  - it sends [[A:1, B:1, C:1]] to the merger
- It receives B:2, which breaks the link to C:1
  - it stores B:2 with no links, removing the link to C:1
  - it updates A:1 and B:2 as belonging to AB
  - it sends [[A:1, B:2], [C:1]] to the merger
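
The matcher walkthrough above can be sketched roughly as follows. This is a toy model, not the real matcher: the `Matcher` class, its in-memory dictionaries and the `receive` method are all hypothetical names, and it assumes that a work which has been linked to but not yet received is reported at version 1 (purely so that the output matches the walkthrough).

```python
class Matcher:
    """Toy matcher: stores the latest links per work and recomputes components."""

    def __init__(self):
        self.known = set()   # every work id seen so far, received or linked to
        self.versions = {}   # work id -> latest version received
        self.links = {}      # work id -> ids that its latest version links to

    def _components(self):
        # Treat links as undirected edges and collect the connected components.
        neighbours = {node: set() for node in self.known}
        for source, targets in self.links.items():
            for target in targets:
                neighbours[source].add(target)
                neighbours[target].add(source)
        seen, components = set(), []
        for start in sorted(neighbours):
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                node = stack.pop()
                if node not in component:
                    component.add(node)
                    stack.extend(neighbours[node] - component)
            seen |= component
            components.append(component)
        return components

    def receive(self, work_id, version, linked_ids=()):
        # Every work in the old component may be affected by this update.
        before = next((c for c in self._components() if work_id in c), set())
        self.known |= {work_id, *linked_ids}
        self.versions[work_id] = version
        self.links[work_id] = set(linked_ids)  # replaces links from older versions
        affected = before | {work_id} | set(linked_ids)
        # Assumption: works never received are reported at version 1.
        return [
            sorted(f"{i}:{self.versions.get(i, 1)}" for i in component)
            for component in self._components()
            if component & affected
        ]


matcher = Matcher()
print(matcher.receive("A", 1, ["B"]))  # [['A:1', 'B:1']]
print(matcher.receive("B", 1, ["C"]))  # [['A:1', 'B:1', 'C:1']]
print(matcher.receive("B", 2))         # [['A:1', 'B:2'], ['C:1']]
```

Replacing the stored links for a work on every update, rather than adding to them, is what allows B:2 to drop the link to C:1 and split the component.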
The merger takes the ids and versions sent by the matcher, reads the corresponding works from the recorder store, and decides if and how to merge them based on internal rules. If it decides to merge, it updates the works as follows:
- It fills the numberOfSources field on the work selected as the target with the number of works merged into it
- It modifies all the other works so that they are redirected, pointing to the target
If it decides not to merge, it sends the works unchanged.
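
As a rough illustration of the merger behaviour described above, here is a minimal sketch. The `Work` dataclass, the `merge` function and the `should_merge` rule are hypothetical; it assumes the first work in a component is the target, that numberOfSources counts only the works merged into the target (not the target itself), and that it is 0 before the merger fills it in.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Work:
    id: str
    version: int
    number_of_sources: int = 0           # assumption: 0 until the merger fills it in
    redirected_to: Optional[str] = None  # id of the target work, if redirected


def merge(
    works: List[Work],
    should_merge: Callable[[List[Work]], bool] = lambda ws: len(ws) > 1,
) -> List[Work]:
    """Return the works the merger would send on for one component."""
    if not should_merge(works):
        return list(works)  # no merge: the works are sent unchanged
    target, *others = works  # assumption: the first work is the canonical one
    merged_target = replace(target, number_of_sources=len(others))
    redirected = [replace(other, redirected_to=target.id) for other in others]
    return [merged_target] + redirected


for work in merge([Work("A", 1), Work("B", 1), Work("C", 1)]):
    print(work)
```

The real internal rules are stood in for by `should_merge`, which here just merges any component with more than one work.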
At some point (i.e. probably a few weeks ago) the merger used to send a merged boolean flag to the ingestor that basically indicated whether the work had been tampered with by the merger.
Because updates to works can get to the ingestor out of order, the ingestor assigns a version to each work. Works are ingested if their version is greater than or equal to the one already present in the index.
The version is calculated from the transformer version and the number of sources: transformerVersion * 1000 + numberOfSources.
This means that:
- a new transformerVersion always gets ingested
- a work with the same transformerVersion and more sources gets ingested if the index already holds a work with the same transformerVersion but fewer sources
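
To make the current rule concrete, here is a small sketch of the version calculation together with a greater-than-or-equal check. The `Index` class is a hypothetical in-memory stand-in for the real index, and `ingest_version` is just a name I made up for the formula above.

```python
class Index:
    """Toy index that only accepts versions >= the one it already holds."""

    def __init__(self):
        self.docs = {}  # work id -> (version, document)

    def ingest(self, work_id, version, document):
        current = self.docs.get(work_id)
        if current is not None and version < current[0]:
            return False  # older than what is indexed: rejected
        self.docs[work_id] = (version, document)
        return True


def ingest_version(transformer_version, number_of_sources):
    # The formula from the text: transformerVersion * 1000 + numberOfSources
    return transformer_version * 1000 + number_of_sources


index = Index()
print(index.ingest("A", ingest_version(1, 2), "A with 2 sources"))        # True
print(index.ingest("A", ingest_version(1, 1), "A with 1 source, late"))   # False
print(index.ingest("A", ingest_version(2, 0), "new transformerVersion"))  # True
```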
The ingestor used to calculate the version as transformerVersion * 10 + (merged ? 1 : 0).
This meant that:
- a new transformerVersion always got ingested
- an out-of-order update from the merger for the same transformerVersion might have been ingested incorrectly
- unlinking kind of worked because merged would have been set to true so, given Elasticsearch's greater-than-or-equal versioning, the unlinked version would have been ingested (provided there were no out-of-order message issues)
There are a number of issues with the current approach:
- We need to multiply transformerVersion by 1000 to make it take precedence over numberOfSources, because numberOfSources can be quite high (650 is the recorded max). If we don't do that, we run into data consistency problems in the index. As we add more sources, we will find ourselves incrementing this number regularly
- The unlinking case only works if the update that causes the unlink is on the target work, therefore incrementing transformerVersion. If, as in the example above, the update is on a work that gets redirected to the target, it won't be reflected in the API
- The ingestor is responsible for figuring out the version based on information passed on by the merger. This causes coupling between the merger and the ingestor. The ingestor is currently aware of merging happening at some point, which it shouldn't be. It also makes the versioning logic very hard to follow and modify
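
The first two issues can be seen with a bit of arithmetic. The sketch below assumes numberOfSources counts the works merged into the target (2 for A merged with B and C, 1 for A merged with B only) and that A stays at transformerVersion 1 because the update was on B; `ingest_version` and `MULTIPLIER` are hypothetical names.

```python
MULTIPLIER = 1000


def ingest_version(transformer_version, number_of_sources):
    return transformer_version * MULTIPLIER + number_of_sources


# Issue 1: once numberOfSources reaches the multiplier, it spills over into
# the part of the version that is meant to carry transformerVersion.
collision = ingest_version(1, 1000) == ingest_version(2, 0)  # both are 2000

# Issue 2: B:2 unlinks C:1, so A goes from 2 merged works to 1, but A's own
# transformerVersion does not change.
before_unlink = ingest_version(1, 2)  # A merged with B and C -> 1002
after_unlink = ingest_version(1, 1)   # A merged with B only  -> 1001
unlink_is_ingested = after_unlink >= before_unlink

print(collision)           # True
print(unlink_is_ingested)  # False: the unlinked A is rejected as stale
```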
Jamie and I came up with multiple ideas to tweak the current behaviour, each one with some problems:
- Go back to the merged flag: it has issues with multiple sources and with images versioning
- Add unlink to the version function, like (transformerVersion * 10000) + (unlinked ? 1 : 0) * 1000 + numberOfSources: bleurgh, and also, once something is ingested as unlinked, it's impossible to override it unless the transformerVersion changes
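
The "impossible to override" problem with the second idea can be checked numerically. This is only a sketch of the proposed formula; the function name and the example numberOfSources values are assumptions.

```python
def ingest_version(transformer_version, unlinked, number_of_sources):
    # (transformerVersion * 10000) + (unlinked ? 1 : 0) * 1000 + numberOfSources
    return transformer_version * 10000 + (1000 if unlinked else 0) + number_of_sources


unlinked_a = ingest_version(1, True, 1)   # A after the unlink          -> 11001
relinked_a = ingest_version(1, False, 2)  # A later merged again        -> 10002
bumped_a = ingest_version(2, False, 2)    # same, new transformerVersion -> 20002

print(relinked_a >= unlinked_a)  # False: the re-merge cannot replace the unlinked work
print(bumped_a >= unlinked_a)    # True: only a new transformerVersion gets through
```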
I think that the unlink case is not really solvable without keeping some state.
I also think the merger should be responsible for calculating the version, not the ingestor. It should just pass on a field mergedVersion (or something) for the ingestor to use.
A very vague idea is to have a key-value store in the merger where the merger stores sourceId -> mergeVersion, with an incremental mergeVersion:
- [[A:1, B:1]] from the matcher that results in B:1 merged into A:1 means that the merger stores and sends on:
  - A with version 1
  - B redirected and with version 1
- a subsequent [[A:1, B:1, C:1]] from the matcher that results in B:1 and C:1 merged into A:1 means that the merger stores and sends on:
  - A with version 2
  - B redirected and with version 2
  - C redirected and with version 1
- a subsequent [[A:1, B:2], [C:1]] (unlink) from the matcher that results in B:2 merged into A:1 means that the merger stores and sends on:
  - A with version 3
  - B redirected and with version 3
  - C no longer redirected and with version 2
This should solve most cases (I think?) but it complicates infrastructure and, more importantly, probably forces us to implement some locking logic around the key-value store.
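
A minimal sketch of the key-value idea, to show how the counters would evolve. Everything here is hypothetical: `MergeVersionStore` is an in-memory dictionary standing in for a real store, the `threading.Lock` only illustrates the kind of atomic read-and-increment a real store would need to provide, and the first work in each component is assumed to be the target.

```python
import threading


class MergeVersionStore:
    """Toy sourceId -> mergeVersion store with an atomic increment."""

    def __init__(self):
        self._versions = {}
        self._lock = threading.Lock()

    def next_version(self, source_id):
        # Read and increment under a lock so that two concurrent updates
        # never hand out the same mergeVersion for one sourceId.
        with self._lock:
            version = self._versions.get(source_id, 0) + 1
            self._versions[source_id] = version
            return version


def send_on(store, components):
    """Return what the merger would send for each work in the components."""
    sent = {}
    for component in components:
        target_id = component[0][0]  # assumption: the first work is the target
        for source_id, _source_version in component:
            sent[source_id] = {
                "mergeVersion": store.next_version(source_id),
                "redirectedTo": None if source_id == target_id else target_id,
            }
    return sent


store = MergeVersionStore()
print(send_on(store, [[("A", 1), ("B", 1)]]))
print(send_on(store, [[("A", 1), ("B", 1), ("C", 1)]]))
print(send_on(store, [[("A", 1), ("B", 2)], [("C", 1)]]))
```

Because each work's mergeVersion only ever goes up, the unlinked C (version 2) replaces the redirected C (version 1) in the index without the ingestor knowing anything about merging.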