Implement an efficient datastructure that allows FAST access to the nth
datom of a datomic index.
This allows getting the "most recent post", "most popular post" etc.
;; API: to implement:
(nth-datoms db :avet :post/date 2284) ;; == (drop 2284 (d/datoms ...))
(rdatoms db :avet :post/date 50) ;; 50 last datoms of the index
(rdatoms-skip db :avet :post/date 150 50) ;; skip 150 datoms, get 50.
We keep a datastructure on the client code that keeps a sketch (aka bookmarks) of the index.
Given such a sketch, we know which values we need to put into the (d/datoms :avet :post/date the-value)
.
We keep a histogram of values holding lower/upper bounds and the number of datoms in between.
This gets updated as transactions happen. If the buckets become too large over time, we
split the bucket into two (by reading back out the datoms
index and finding the middle value.
Parameters:
- Bucket size min (
:bmin
): 100 - Bucket size max (
:bmax
): 200
- Initally upon startup we iterate over the entire datoms index
- We create a new bucket every
:bmin
datoms
- We listen to the datomic transaction queue and update the histogram. Ie put the newly arrived item(s) into the correct bucket(s).
- If any bucket has above
:bmax
items: - Query the db (dbafter of transaction) and take all datoms from this bucket
(might be above
:bmax
size if multiple datoms have been added) - Split into multiple (>= 2) buckets with each bucket containing :bmin items
(the last bucket might have
<=:bmin items
)
Easy, take the last bucket and sum up the items until we have >= k
items, take the lower
value of that bucket and use it for d/datoms
. Return the subvec
such that we have the last k
datoms.
Either start from the beginning of the sketch or the end of the sketch until we have k
or N-k
items.
Similar to rdatoms
above. Then use that value to query d/datoms
and drop
datoms until the nth
is satisfied.
Similar...
The result might not be exact if the passed db is old and the datastructure has already been updated
from newly arrived transactions.
This is only a problem when using nth
and rdatoms-skip
and NOT rdatoms
since rdatoms
will get all the values until the end.
Similarly for an old db: If items have been removed the rdatoms
might get less datoms than
the histogram promised, in this case we have to requery the db with a lower bucket until we have
the promised k
items. This should happen rarely however.
A parameter in rdatoms-skip
and nth
could allow for "approx result ok/not-ok".
Alternatively we could also store the db
value of the last transaction and use that to query the item.
If nth
is not needed (often the case) and only rdatoms
, then we can just store the last few buckets instead.
This would reduce memory. Allow some parameters how many buckets we should keep maintained.
Eg: Keep a histogram of the last 10,000 items of an index only.