---
IIP: 2
title: Data Retention in Immudb
author: Farhan Khan <farhan@codenotary.com>
status: Proposal
category: Storage
created: 2022-10-18
---
### Abstract
This proposal describes a data retention strategy for `immudb`.
### Motivation
Immudb aims to be an enterprise-ready database, which means it needs to operate in long-lived environments (spanning years) where data is constantly added and updated. This also means that over time some data is no longer needed and should be dropped. Otherwise, the cost of the system grows indefinitely, as storage is not free.
### Goals
* Define the high level design (HLD) of how data retention would work
### Non-Goals
* Prescribe how exactly the Data Retention functionality has to be implemented
### Requirements
As a Database Admin:
* I want to be able to permanently delete entries based on time (older than).
* I want to be able to permanently delete history entries based on count (keep max 10 elements in history).
* I want to be able to permanently delete selected entries (older than).
## Proposal
### Architecture
Currently, immudb data storage on the filesystem is split into the following append-only files:
- [**Appendable Hash Tree**](https://github.com/codenotary/immudb-issues/blob/feat/analysis-of-proofs/PROOFS.md) (aht)
- **Commit log** (commit)
- **Timed B-Tree** (index)
- **Transaction log** (tx)
- **Value log** (val_x)
The immudb directory for each database looks as follows:
```
.
├── defaultdb
│   ├── aht
│   │   ├── commit
│   │   │   └── 00000000.di
│   │   ├── data
│   │   │   └── 00000000.dat
│   │   └── tree
│   │       └── 00000000.sha
│   ├── commit
│   │   └── 00000000.txi
│   ├── index
│   │   ├── commit
│   │   │   └── 00000000.ri
│   │   ├── history
│   │   │   └── 00000000.hx
│   │   └── nodes
│   │       └── 00000000.n
│   ├── tx
│   │   └── 00000000.tx
│   └── val_0
│       └── 00000000.val
├── immudb.identifier
```
When a transaction is stored in the database, the details of the transaction are stored in the `tx` log and the `commit` log. The ALH (accumulated linear hash) is stored in the `aht` log. The `index` stores the B-Tree index mapping each key to an offset in the value log. And finally, the `value` log is where the values are stored in an append-only format. This means deleting data requires deleting from multiple files. To provide a solution with the least amount of change that still delivers good value to the customer, the strategy proposed for the 1.5 release is time-based data retention built on compaction of the `aht`, `value`, and `index` logs.
#### AHT compaction
The AHT is a history tree which stores the ALH (accumulated linear hash) for each transaction in an append-only log. It is used for checking and deriving Merkle proofs for a transaction, and also for checking consistency proofs across the database. Truncation of the AHT can be done by deleting the logs up to a particular offset (or transaction). Truncating naively like this could corrupt the log, because the nodes required for incremental proofs (inclusion and consistency) for transactions beyond the offset can be lost. To solve this, when truncating the `aht` to a particular offset, the nodes required for insertion at that offset will be stored in a separate file. This keeps proofs intact.
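As a rough illustration, the sequence could look like the following Go sketch. The `hashTree` and `appender` interfaces and their methods are hypothetical stand-ins, not the actual immudb `ahtree` API:
```go
// Hypothetical interfaces for illustration only; the real immudb
// ahtree/appendable APIs differ.
type node interface{ Bytes() []byte }

type hashTree interface {
	// NodesUpTo returns the frontier nodes still needed to prove
	// transactions beyond upToTx (hypothetical helper).
	NodesUpTo(upToTx uint64) ([]node, error)
	// DiscardUpTo drops log segments up to upToTx (hypothetical helper).
	DiscardUpTo(upToTx uint64) error
}

type appender interface {
	Append(b []byte) (off int64, err error)
}

// truncateAHT persists the proof frontier to a separate file before
// discarding old segments, so proofs stay verifiable afterwards.
func truncateAHT(tree hashTree, upToTx uint64, backup appender) error {
	frontier, err := tree.NodesUpTo(upToTx)
	if err != nil {
		return err
	}
	for _, n := range frontier {
		if _, err := backup.Append(n.Bytes()); err != nil {
			return err
		}
	}
	return tree.DiscardUpTo(upToTx)
}
```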
#### Value-Log compaction
The value log stores all the values for a particular transaction (the value of each key) in an append-only file. The file is stored in chunks of 512 MB, so when the value log grows beyond 512 MB, a new chunk is created. For example:
```
0000000.val
0000001.val
0000002.val
...
```
![io log](/img/value-log-compaction.png)
The Timed B-Tree is used as an index, but it also preserves the history of revisions for a particular key. In the B-Tree, the value stored for a given key is the offset of the actual value in the value log.
For the 1.5 release, keeping the time frame in mind, we want to start with a basic deletion method that adds significant value for the customer, even though it might not be the optimal solution (the longer-term solutions are listed further below for future consideration). The approach is to truncate the value log and discard values up to a timeframe specified by the user.
In time-based retention, the user specifies how long data should be retained in immudb. It is a relative time span calculated with respect to the timestamp of the newest persisted transaction. This could be specified in options:
```go
Options{
	Dir:           "./data",
	Network:       "tcp",
	Address:       "0.0.0.0",
	Port:          3322,
	Config:        "configs/immudb.toml",
	RetentionTime: 15 * 24 * time.Hour, // when to remove old data, e.g. after 15 days
}
```
For example, if the retention period is 15d, a compaction process runs in the background every hour and selects the transaction ID beyond which data can be truncated, i.e. the point where the gap between a transaction's timestamp and the newest transaction's timestamp exceeds 15d. Once that transaction ID has been found, the values before it are deleted from the value log.
The oldest transaction ID to keep can be retrieved by using `immustore` to find the oldest transaction that lies beyond the retention data span. All the values in the value log are then discarded up to the particular offset of the key(s) in that transaction.
If the value is not present at the specified offset, the key is considered to be deleted.
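To make the flow concrete, here is a minimal sketch of the hourly background loop. It assumes a hypothetical `txStore` interface; the real `immustore` API will differ:
```go
import "time"

// Hypothetical store interface; method names are illustrative.
type txStore interface {
	LastCommittedTx() (id uint64, ts time.Time, err error)
	// FirstTxAfter returns the oldest transaction newer than ts.
	FirstTxAfter(ts time.Time) (id uint64, err error)
	// DiscardValuesBefore truncates the value log(s) up to txID.
	DiscardValuesBefore(txID uint64) error
}

// runRetention wakes up periodically (e.g. hourly), computes the cutoff
// relative to the newest transaction, and truncates everything older.
func runRetention(st txStore, retention, every time.Duration) {
	for range time.Tick(every) {
		_, newest, err := st.LastCommittedTx()
		if err != nil {
			continue
		}
		cutoff := newest.Add(-retention) // e.g. newest - 15d
		txID, err := st.FirstTxAfter(cutoff)
		if err != nil {
			continue
		}
		_ = st.DiscardValuesBefore(txID)
	}
}
```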
##### Discarding transactions
To increase write concurrency, `immudb` has an option to increase IO concurrency. When IO concurrency is increased, data is written to multiple value logs; there can be multiple value logs when the `MaxIOConcurrency` option is set.
```
│   ├── val_0
│   │   └── 00000000.val
│   ├── val_1
│   │   └── 00000000.val
│   └── val_2
│       └── 00000000.val
```
![io log](/img/concurrent-io-writes.png)
Each transaction can lie in any of the value logs. With multiple value logs, transactions are written to each of them in turn, as shown in the above figure (with an IO concurrency of 3). In such scenarios, a map from value log to offset needs to be maintained to determine the offset up to which the values in each value log can be truncated.
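A minimal sketch of that bookkeeping, with illustrative names (`discardPoints`, `record` are not existing immudb identifiers), might look as follows:
```go
import "sync"

// discardPoints tracks, per value log, the highest offset that is safe
// to truncate once the cut-off transaction is known.
type discardPoints struct {
	mu      sync.Mutex
	offsets map[int]int64 // value-log ID -> truncation offset
}

func newDiscardPoints() *discardPoints {
	return &discardPoints{offsets: make(map[int]int64)}
}

// record advances the truncation offset for a value log.
func (d *discardPoints) record(logID int, off int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if off > d.offsets[logID] {
		d.offsets[logID] = off
	}
}
```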
##### Buffer window
Currently the values in the value log are not strictly sorted monotonically by time. They are not *strictly* sorted because values are written to the value logs concurrently, so some records before the delete offset may belong to a future transaction. To avoid deleting data accidentally, a buffer window of one hour is kept before the deletion timestamp so that unordered future transactions are not accidentally deleted.
![io log](/img/value-log-ordering.png)
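In code, the cutoff computation is simply the retention span plus the buffer window; this sketch mirrors the one-hour buffer described above:
```go
import "time"

// bufferWindow guards against unordered future transactions that were
// written concurrently before the delete offset.
const bufferWindow = time.Hour

// truncationCutoff returns the timestamp before which values may be
// discarded: newest transaction time minus retention minus the buffer.
func truncationCutoff(newest time.Time, retention time.Duration) time.Time {
	return newest.Add(-retention).Add(-bufferWindow)
}
```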
#### Limitations
Currently, the schema of a SQL table is written to the value log as normal key-value pairs. Because of this, truncation can delete the schema, after which SQL transactions can no longer be used. SQL transactions are therefore not covered by this minimal design and should be disabled when data retention mode is enabled. This is a blocker in the current design; to avoid it, the schema would need to be persisted in a separate value log, which would cause breaking changes in the current storage architecture.
#### B-Tree
TODO
#### Estimated reduction
Using the latest `immudb` version, a test was run using the stresser tool to estimate the size of a particular database. Around 27 million transactions were inserted, and the following data was observed:
**8-byte keys, 256-byte values, 1 KV per transaction:**

| Directory | Key/Value Size | KV/Txn | 1 million TXN | 10 million TXN | 50 million TXN |
|-----------|----------------|--------|---------------|----------------|----------------|
| aht       | 8/256          | 1      | 378M          | 4.6G           | 23G            |
| commit    | 8/256          | 1      | 12M           | 126M           | 569M           |
| index     | 8/256          | 1      | 102M          | 1.1G           | 4.5G           |
| tx        | 8/256          | 1      | 177M          | 1.9G           | 8.6G           |
| val_0     | 8/256          | 1      | 246M          | 2.7G           | 12G            |

**8-byte keys, 512-byte values, 1 KV per transaction:**

| Directory | Key/Value Size | KV/Txn | 1 million TXN | 10 million TXN | 50 million TXN |
|-----------|----------------|--------|---------------|----------------|----------------|
| aht       | 8/512          | 1      | 378M          | 4.6G           | 24G            |
| commit    | 8/512          | 1      | 12M           | 126M           | 570M           |
| index     | 8/512          | 1      | 102M          | 1.1G           | 4.6G           |
| tx        | 8/512          | 1      | 177M          | 1.9G           | 8.6G           |
| val_0     | 8/512          | 1      | 490M          | 5.3G           | 24.5G          |

**8-byte keys, 256-byte values, 100 KVs per transaction:**

| Directory | Key/Value Size | KV/Txn | 1 million TXN | 10 million TXN | 100 million TXN |
|-----------|----------------|--------|---------------|----------------|-----------------|
| aht       | 8/256          | 100    | 378M          | 4.6G           |                 |
| commit    | 8/256          | 100    | 12M           | 127M           |                 |
| index     | 8/256          | 100    | 11G           | 52G            |                 |
| tx        | 8/256          | 100    | 5.5G          | 54.5G          |                 |
| val_0     | 8/256          | 100    | 24G           | 240G           |                 |
As we can see, the value log accounts for a large share of the total storage: ~41% in the 8/512 run (5.3G out of ~13G at 10 million transactions). The `aht` accounts for another significant share (~35% in the same run). Compacting these files allows a significant reduction of the overall storage space, depending on the data-retention period.
### Tasks
- #### Time based retention based on value-log truncation
The goal of this task is to implement data retention based on the value-log compaction proposal, allowing users to retain data for `n` days. The data-retention period should be provided in options, and the process should only be enabled if the value is explicitly set.
##### Acceptance criteria:
The specified `immudb` database should not contain data older than the specified `n` days.
##### Business value:
High; allows up to 40% compression of data in a single-KV-per-transaction setup.
##### Estimation/things to do:
- Add a compactor per database (1 day)
- Add deletion logic to fetch transactions older than `n` days and discard them from the value logs (2 days)
- Add comprehensive integration tests for the compactor (2 days)
- Code Reviews (1-2 days)
- Document configuration options, use-cases, optimization guide (1 day)
Total: 7-8 days
- #### Time based retention based on aht truncation
The goal of this task is to implement data retention based on the `aht` compaction proposal, allowing users to retain data for `n` days. The data-retention period should be provided in options, and the process should only be enabled if the value is explicitly set.
##### Acceptance criteria:
The specified `immudb` database should not contain data older than the specified `n` days.
##### Business value:
High; allows up to 40% compression of data in a single-KV-per-transaction setup.
##### Estimation/things to do:
- Add/Fetch discard points for `aht` per database (1 day)
- Add new appendable to store the discard points (or add to transaction headers) (3-4 days)
- Add comprehensive integration tests for the compactor (2 days)
- Code Reviews (1-2 days)
- Document configuration options, use-cases, optimization guide (1 day)
Total: 9-10 days
### Long Term Solutions (or Ideas)
- Add sql table schema separately in a different value file
- Implement Block based storage
![io log](/img/block-storage.png)
The idea of block-based storage is partially based on LSM trees and on how time-series databases store data. All the files are currently written to storage in chunks; a block is a grouping of multiple chunks. We can specify when a block is created based either on a timeframe (say every 4 hours) or on size. Here is an example of how two blocks would look over time:
```
.
├── b1
│   ├── tombstones
│   ├── commit
│   │   └── 00000000.txi
│   ├── index
│   │   ├── commit
│   │   │   └── 00000000.ri
│   │   ├── history
│   │   │   └── 00000000.hx
│   │   └── nodes
│   │       └── 00000000.n
│   ├── tx
│   │   └── 00000000.tx
│   └── val_0
│       └── 00000000.val
└── b2
    ├── tombstones
    ├── commit
    │   ├── 00000001.txi
    │   └── 00000002.txi
    ├── index
    │   ├── commit
    │   │   ├── 00000001.ri
    │   │   └── 00000002.ri
    │   ├── history
    │   │   ├── 00000001.hx
    │   │   └── 00000002.hx
    │   └── nodes
    │       ├── 00000001.n
    │       └── 00000002.n
    ├── tx
    │   ├── 00000001.tx
    │   └── 00000002.tx
    └── val_0
        ├── 00000001.val
        └── 00000002.val
```
A block on disk is a collection of chunks for a fixed time range, with its own index. It is a directory with multiple files inside it. Every block has a unique ID, a Universally Unique Lexicographically Sortable Identifier (ULID).
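For example, a time-sortable block ID could be generated with the `github.com/oklog/ulid/v2` package (shown here only as one possible choice, not a decided dependency):
```go
import (
	"crypto/rand"
	"time"

	"github.com/oklog/ulid/v2"
)

// newBlockID returns a time-sortable unique identifier for a block.
func newBlockID(t time.Time) string {
	return ulid.MustNew(ulid.Timestamp(t), rand.Reader).String()
}
```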
The samples in a block are immutable. Deletions on blocks can be handled via tombstones (special records) without touching the data, since re-writing a block on every delete request would bloat the block and lead to a lot of IO.
Each block will have a metadata file, which looks something like this, used for indexing and faster search:
![io log](/img/block-meta.png)
Over time, multiple blocks can be compacted into a larger block in the background, similar to how compaction happens in LSM trees. Tombstones are deletion markers: they tell us what time ranges of data to ignore during reads. The tombstone file is the only file in the block that is created and modified after the block has been written, in order to store the delete requests.
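A tombstone can be as small as a time range to skip at read time; this illustrative sketch assumes millisecond timestamps and made-up names:
```go
// tombstone marks a time range whose data is ignored during reads;
// the block data itself stays untouched.
type tombstone struct {
	MinTime, MaxTime int64 // inclusive bounds, Unix milliseconds
}

// visible reports whether a record at ts survives the tombstones.
func visible(ts int64, stones []tombstone) bool {
	for _, t := range stones {
		if ts >= t.MinTime && ts <= t.MaxTime {
			return false
		}
	}
	return true
}
```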
The compaction process consists of writing a new block from one or more existing blocks; at the end, the source blocks are deleted and the new compacted block is used in their place.
##### Time based retention
With this method, a block is deleted only when it lies completely beyond the retention period, not when just part of the block does.
For example, if the retention period is 15d, as soon as the gap between the oldest block's max time and the newest block's max time goes beyond 15d, the oldest block is deleted.
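The deletion check reduces to comparing block max-times, as in this sketch (`block` is an illustrative struct, not an existing immudb type):
```go
import "time"

// block carries only the time bounds needed for the retention check.
type block struct {
	MinTime, MaxTime time.Time
}

// expired reports whether the oldest block lies entirely beyond the
// retention span relative to the newest block.
func expired(oldest, newest block, retention time.Duration) bool {
	return newest.MaxTime.Sub(oldest.MaxTime) > retention
}
```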
Though we have not explored the details of this idea completely, a few open questions remain:
- How will the AHT for the entire database be handled? A universal hash tree?