rauhs/rseek-index-datomic.md

## rseek-index-datomic.md

      
    Raw
  

              rseek-index-datomic.md
            
          
    Synopsis

Implement an efficient datastructure that allows FAST access to the nth datom of a datomic index.
This allows getting the "most recent post", "most popular post" etc.
;; API: to implement:
(nth-datoms db :avet :post/date 2284) ;; == (drop 2284 (d/datoms ...))
(rdatoms db :avet :post/date 50) ;; 50 last datoms of the index
(rdatoms-skip db :avet :post/date 150 50) ;; skip 150 datoms, get 50.
Design

We keep a datastructure on the client code that keeps a sketch (aka bookmarks) of the index.
Given such a sketch, we know which values we need to put into the (d/datoms :avet :post/date the-value).
Algorithm/Datastructure

We keep a histogram of values holding lower/upper bounds and the number of datoms in between.
This gets updated as transactions happen. If the buckets become too large over time, we
split the bucket into two (by reading back out the datoms index and finding the middle value.
Parameters:

Bucket size min (:bmin): 100
Bucket size max (:bmax): 200

Init


Initally upon startup we iterate over the entire datoms index
We create a new bucket every :bmin datoms

Updating


We listen to the datomic transaction queue and update the histogram.
Ie put the newly arrived item(s) into the correct bucket(s).
If any bucket has above :bmax items:
Query the db (dbafter of transaction) and take all datoms from this bucket
(might be above :bmax size if multiple datoms have been added)
Split into multiple (>= 2) buckets with each bucket containing :bmin items
(the last bucket might have <=:bmin items)

Querying

rdatoms, param: k

Easy, take the last bucket and sum up the items until we have >= k items, take the lower
value of that bucket and use it for d/datoms. Return the subvec such that we have the last k datoms.
nth-datoms, param: k

Either start from the beginning of the sketch or the end of the sketch until we have k or N-k items.
Similar to rdatoms above. Then use that value to query d/datoms and drop datoms until the nth is satisfied.
rdatoms-skip, params skip, k

Similar...
Problems:

The result might not be exact if the passed db is old and the datastructure has already been updated
from newly arrived transactions.
This is only a problem when using nth and rdatoms-skip and NOT rdatoms since rdatoms
will get all the values until the end.
Similarly for an old db: If items have been removed the rdatoms might get less datoms than
the histogram promised, in this case we have to requery the db with a lower bucket until we have
the promised k items. This should happen rarely however.
A parameter in rdatoms-skip and nth could allow for "approx result ok/not-ok".
Alternatively we could also store the db value of the last transaction and use that to query the item.
Variations:

If nth is not needed (often the case) and only rdatoms, then we can just store the last few buckets instead.
This would reduce memory. Allow some parameters how many buckets we should keep maintained.
Eg: Keep a histogram of the last 10,000 items of an index only.