Expiration of data in Mentat

@rnewman, created April 23, 2018 17:23

Some classes of applications, including browsers, generate an ever-growing set of data: visits to pages, plays of songs and videos, purchases, messages.

It's routine for these applications to scale by minimizing the 'working' data set: expiring old data, archiving subsets of it, or similar.

The simplest approach to scaling is to constrain queries more carefully, doing most of the work on only a subset of the candidates: for example, evaluating, ranking, and extracting only history entries with a visit within the last year, rather than ranking all history entries and applying a limit after ranking. Sometimes this kind of bounding is sufficient.
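As a minimal sketch of that bounding, here is plain Rust over an in-memory collection; the `HistoryEntry` type, its fields, and ranking by visit count are illustrative assumptions, not Mentat's schema or API:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical history-entry shape; field names are illustrative.
struct HistoryEntry {
    url: String,
    last_visit: SystemTime,
    visit_count: u64,
}

/// Rank only entries with a visit in the last year, rather than ranking
/// every entry and applying a limit afterwards.
fn top_recent(entries: &[HistoryEntry], limit: usize) -> Vec<&HistoryEntry> {
    let cutoff = SystemTime::now() - Duration::from_secs(365 * 24 * 60 * 60);
    let mut candidates: Vec<&HistoryEntry> = entries
        .iter()
        .filter(|e| e.last_visit >= cutoff) // bound the working set first...
        .collect();
    candidates.sort_by(|a, b| b.visit_count.cmp(&a.visit_count)); // ...then rank
    candidates.truncate(limit);
    candidates
}
```

In Mentat itself the bound would live in the query rather than in application code, but the effect is the same: the expensive evaluation and ranking run over a bounded candidate set.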

When it isn't, we need other mechanisms for reducing the working set.

Data in Mentat lives in these four places:

  1. In the local transaction log, where — apart from excision and sync-related transformations — it is immutable.
  2. In the local datoms table, which is an indexed 'roll-up' representation of all non-retracted datoms. This supports general-purpose querying, and also plays a role in applying transacted datoms, most notably upserts. (The relationship between the log and this roll-up is sketched after this list.)
  3. In local materialized views and caches derived from the datoms table. (These don't yet exist, but will.)
  4. On zero or more servers.
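To make the relationship between the first two concrete, here is a minimal sketch with illustrative types; Mentat's real storage is a set of SQLite tables, but the shape of the derivation is the same:

```rust
use std::collections::HashSet;

// Illustrative types, not Mentat's actual storage schema. The point is the
// relationship: the log is append-only, and the datoms "table" is the set
// of currently-asserted datoms derived from it.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Datom {
    e: u64,    // entity
    a: u64,    // attribute
    v: String, // value (simplified to a string here)
    tx: u64,   // transaction that asserted or retracted it
}

struct LogEntry {
    datom: Datom,
    added: bool, // true = assertion, false = retraction
}

/// Roll the log up into the current set of non-retracted datoms.
fn roll_up(log: &[LogEntry]) -> HashSet<Datom> {
    let mut datoms = HashSet::new();
    for entry in log {
        // Identity for the roll-up ignores tx: a later retraction removes
        // the datom asserted by an earlier transaction.
        let key = Datom { tx: 0, ..entry.datom.clone() };
        if entry.added {
            datoms.insert(key);
        } else {
            datoms.remove(&key);
        }
    }
    datoms
}
```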

We can imagine five ways to minimize the working data set in Mentat:

  1. Don't put it into the database in the first place, delaying any scaling problems. This includes not only omitting data altogether, but also storing entirely unrelated data in a separate database, which trades useful interrelation for scaling.
  2. Explicitly excise old entities or datoms. This is deeply consistent: all clients and the server will deterministically process excisions, converging on the same final state for the log, datoms, and any views. There are two downsides of this: the excised data is lost forever, as if it were never recorded; and it involves routinely mutating/replacing old server data, which limits our ability to put it 'on ice' to reduce storage costs.
  3. Archive to external storage or the server. We could spread storage across multiple databases, with infrequently accessed data no longer stored in the main log. This is a variant of regular excision with some different advantages and disadvantages, but more complexity, especially if the archive should be usable on its own (e.g., including the schema, page URLs, and titles).
  4. Implement forgetting. One can define a set of rules by which some datoms or entities are present in the log but are not materialized into the datoms table or materialized views (a sketch of such a rule follows this list). E.g., one might omit visits older than 12 months, reducing the size of the working set without losing other page metadata. This exclusion from datoms is perfectly safe for attributes that are cardinality-many and non-unique (and we don't have to worry about unique-value for ref-typed attributes), because those attributes don't play a role in transacting. However, excision and retraction must now examine the log directly, rather than relying on the datoms table, and the datoms table and the log no longer completely correspond, which can affect queries.
  5. Selectively materialize. The purpose of materialized views is to make queries faster and more predictable. If performance is our only concern (that is: we accept growing database size so long as it's not slow), then we can allow the log and the set of datoms to grow without bound, limiting data size only when we construct materialized views. In this approach we would exclude old visits at the point of constructing a view, not by removing them from the data (see the second sketch after this list). Because materialized views are an application construct, it's already expected that they represent only a subset of the data. This approach maintains the correspondence between the log and the datoms table, doesn't require touching long-term storage, and is therefore preferred.
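To make option 4 concrete, here is a hedged sketch of a forgetting rule applied while rolling the log up into datoms, reusing the `Datom` and `LogEntry` types from the sketch above; the attribute id and retention horizon are hypothetical, and this is not Mentat's transactor:

```rust
// Reuses Datom, LogEntry, and HashSet from the roll-up sketch above.
fn roll_up_with_forgetting(
    log: &[LogEntry],
    visit_date_attr: u64,    // illustrative attribute id, e.g. for a visit-date attribute
    oldest_retained_tx: u64, // assertions in older transactions are not materialized
) -> HashSet<Datom> {
    let mut datoms = HashSet::new();
    for entry in log {
        let key = Datom { tx: 0, ..entry.datom.clone() };
        let forgotten = entry.datom.a == visit_date_attr
            && entry.datom.tx < oldest_retained_tx;
        match (entry.added, forgotten) {
            (true, true) => {}                     // old visit: stays only in the log
            (true, false) => { datoms.insert(key); }
            (false, _) => { datoms.remove(&key); } // retractions still apply
        }
    }
    datoms
}
```

The log is untouched; only the derived working set shrinks, which is exactly why excision and retraction must consult the log directly under this scheme.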
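And a correspondingly hedged sketch of option 5, again reusing `Datom`: the log and the datoms table stay complete, and the cutoff is applied only when (re)building a view:

```rust
// The cutoff is applied only at view-construction time; the underlying
// datoms set is complete. The attribute id and string-valued dates are
// simplifications for this sketch.
fn build_recent_visit_view(
    datoms: &HashSet<Datom>,
    visit_date_attr: u64, // illustrative attribute id
    cutoff: &str,         // e.g. an ISO-8601 date string
) -> Vec<Datom> {
    datoms
        .iter()
        .filter(|d| d.a == visit_date_attr && d.v.as_str() >= cutoff)
        .cloned()
        .collect()
}
```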
rfk commented Apr 23, 2018

FWIW, on a quick read a combination of (4) and (1) sounds like the best option to me - (4) to keep good performance on an ongoing basis, and an occasional (1) to avoid "your firefox profile will grow without bound" situations. Leaving excised data sitting somewhere on the server doesn't seem terrible to me, if it makes (1) easier to implement.
