Overview

Append-Only-Files is a key-value store where the key is an 8-byte long and the value is a serialized byte[]. It writes modified data to the end of the file and never overwrites existing data. In theory this makes it more durable than other transactional stores such as a write-ahead log.

Because older versions are kept around, it is very easy to implement snapshots and other advanced features.

Because data are never overwritten, disk space usage grows quickly over time. This is solved by background compaction, which defragments the store and discards unused versions of records. Compaction in this design can preserve old data based on their age (e.g. keep the last 3 months).

The major problem of an append-only store is the need for replay if the store crashes: all data must be read to reconstruct the index table (an array of record offsets). In this design we take snapshots of the index table periodically, so the worst case is replaying the small part of the log written since the last index snapshot.

Log structure

Storage is a log (stream) of instructions. Data are not stored in a single file; the log overflows into a new file every time the size limit (1MB) is exceeded. A log file can still exceed that limit if its last record is large (for example 20MB).

A log file contains the following instructions:

  • record - followed by recid, record size and the record binary data
  • tombstone - marks a deleted record, followed by recid
  • tx start - marks the start of a transaction
  • tx commit - marks a successful commit; all data since the last tx start are valid
  • tx rollback - marks a rollback; all data since the last tx start are invalid and must not update the index table with their offsets
  • file overflow - end of the current file, data continue in the next log file
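
To make the format concrete, here is a minimal sketch of how these instructions might be encoded. The opcode values, the LogOp/LogWriter names and the exact field layout are assumptions for illustration, not part of the actual design:

    import java.io.DataOutputStream;
    import java.io.IOException;

    // Hypothetical opcodes for the log instructions listed above.
    class LogOp {
        static final byte RECORD        = 1;  // recid, size, binary data
        static final byte TOMBSTONE     = 2;  // recid of deleted record
        static final byte TX_START      = 3;
        static final byte TX_COMMIT     = 4;
        static final byte TX_ROLLBACK   = 5;
        static final byte FILE_OVERFLOW = 6;  // log continues in the next file
    }

    class LogWriter {
        private final DataOutputStream out;

        LogWriter(DataOutputStream out) { this.out = out; }

        /** Appends a 'record' instruction: opcode, recid, size and the record bytes. */
        void writeRecord(long recid, byte[] data) throws IOException {
            out.writeByte(LogOp.RECORD);
            out.writeLong(recid);
            out.writeInt(data.length);
            out.write(data);
        }

        /** Appends a 'tombstone' instruction marking the record as deleted. */
        void writeTombstone(long recid) throws IOException {
            out.writeByte(LogOp.TOMBSTONE);
            out.writeLong(recid);
        }
    }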

Index table

The index table maps recids to offsets in the log: it says that a given record is stored in the Nth file at a given offset. Technically it maps long to long. It could also be represented as a long[] where the array index is derived from the key (recid) value.

In the current prototype the index table is a Map<Long,Long> constructed by replaying the entire log. Each 'record' instruction updates the map, and the last writer wins, so the map ends up pointing at the current valid version of every record. In the final design it should not live on the heap, so it will move into a binary file. Replaying the entire store is also not acceptable, so we need periodic snapshots.
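
As a rough illustration of that replay, the sketch below rebuilds the Map<Long,Long> from a single log file, reusing the hypothetical LogOp opcodes from the earlier sketch. The offset bookkeeping and the pack() encoding of (file, offset) into one long are assumptions:

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    class LogReplay {

        /** Replays one log file and applies committed changes to the index table.
         *  The last committed write for a recid wins; a tombstone is stored as 0. */
        static void replayFile(long fileNum, DataInputStream in, Map<Long, Long> index)
                throws IOException {
            Map<Long, Long> uncommitted = new HashMap<>();
            long offset = 0;
            try {
                while (true) {
                    byte op = in.readByte();
                    long instrOffset = offset;
                    offset += 1;
                    switch (op) {
                        case LogOp.RECORD: {
                            long recid = in.readLong();
                            int size = in.readInt();
                            in.skipBytes(size);                  // payload not needed for the index
                            offset += 8 + 4 + size;
                            uncommitted.put(recid, pack(fileNum, instrOffset));
                            break;
                        }
                        case LogOp.TOMBSTONE:
                            uncommitted.put(in.readLong(), 0L);  // 0 marks a deleted record
                            offset += 8;
                            break;
                        case LogOp.TX_START:
                            break;                               // nothing to do
                        case LogOp.TX_COMMIT:
                            index.putAll(uncommitted);           // changes become visible
                            uncommitted.clear();
                            break;
                        case LogOp.TX_ROLLBACK:
                            uncommitted.clear();                 // discard changes since tx start
                            break;
                        case LogOp.FILE_OVERFLOW:
                            return;                              // continue with the next log file
                        default:
                            throw new IOException("unknown instruction " + op);
                    }
                }
            } catch (EOFException e) {
                // reached the unfinished tail of the last log file
            }
        }

        /** Packs file number and offset into a single long (illustrative encoding). */
        static long pack(long fileNum, long offset) {
            return (fileNum << 40) | offset;
        }
    }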

So in the final design the index table will be periodically stored in separate files. Their numbering will follow the numbering of the log files, which makes replay easy to follow. Index files will be read-only. The difference between the index file and the modified records will be maintained in memory as a Map<Long,Long>.

Periodic snapshots of the index table will be made, in two forms:

At short intervals, dump the difference between the index file and the modified records to disk. That means writing the Map<Long,Long> to disk, probably on every commit or on file overflow.

At longer intervals, create a new complete index file. That means copying the index file to a new location and applying the modifications from the map. This could be done on a time interval, when the modification map reaches a certain size, or after every N log files.
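
A rough sketch of this longer-interval snapshot, assuming the index file is nothing more than a flat sequence of (recid, offset) pairs; the file layout and class name are illustrative only:

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    class IndexSnapshot {

        /** Writes a new complete index file: copy the old one and apply the
         *  in-memory modifications (the Map<Long,Long> diff). The layout is
         *  assumed to be a flat sequence of (recid, offset) pairs. */
        static void writeFullIndex(File oldIndex, File newIndex, Map<Long, Long> diff)
                throws IOException {
            Map<Long, Long> merged = new TreeMap<>();

            // read the previous read-only index file, if any
            if (oldIndex.exists()) {
                try (DataInputStream in = new DataInputStream(
                        new BufferedInputStream(new FileInputStream(oldIndex)))) {
                    long entries = oldIndex.length() / 16;
                    for (long i = 0; i < entries; i++) {
                        merged.put(in.readLong(), in.readLong());
                    }
                }
            }
            merged.putAll(diff);   // modified records win over the old snapshot

            // write the merged result; the new file is read-only from now on
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(newIndex)))) {
                for (Map.Entry<Long, Long> e : merged.entrySet()) {
                    out.writeLong(e.getKey());
                    out.writeLong(e.getValue());
                }
            }
        }
    }

Once the new file is synced, the in-memory diff map can be cleared and the old index file removed when no snapshot needs it anymore.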

Replay and index reconstruction

Replay should produce the last index table file plus an on-heap map of modifications. This is done in these steps:

  1. find the last successful commit and determine where the log files end
  2. find the index file matching the last commit in the log
  3. find the last successful difference-map snapshot
  4. replay the log files written since the last difference-map snapshot

This should be very fast; in the worst case only a few MB of log files need to be replayed.
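
Put together, startup could look roughly like the sketch below. It assumes index and log files are numbered consistently (index.N, log.N), reuses the hypothetical LogReplay from the earlier sketch, and folds steps 1-3 into two parameters for brevity; locating the last successful commit and diff snapshot is omitted:

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    class Recovery {

        /** Startup sketch: load the newest index snapshot, then replay only the
         *  log files written after it. Uncommitted instructions at the tail are
         *  ignored automatically, because replay applies changes only on 'tx commit'. */
        static Map<Long, Long> recover(File dir, long lastIndexFileNum, long lastLogFileNum)
                throws IOException {
            Map<Long, Long> index = new HashMap<>();

            // load the most recent full index snapshot (flat (recid, offset) pairs)
            File indexFile = new File(dir, "index." + lastIndexFileNum);
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(indexFile)))) {
                long entries = indexFile.length() / 16;
                for (long i = 0; i < entries; i++) {
                    index.put(in.readLong(), in.readLong());
                }
            }

            // replay only the log files written after the snapshot
            for (long n = lastIndexFileNum + 1; n <= lastLogFileNum; n++) {
                File logFile = new File(dir, "log." + n);
                try (DataInputStream in = new DataInputStream(
                        new BufferedInputStream(new FileInputStream(logFile)))) {
                    LogReplay.replayFile(n, in, index);
                }
            }
            return index;
        }
    }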

Snapshots

Each snapshot is identified by the log-file-end pointer at the time the snapshot was made. I find timestamps unreliable, so I am not going to use them here.

Each snapshot needs an index table; this is replayed from the last index snapshot up to the log-file-end pointer of the snapshot.

Compaction

Compaction processes individual log files and eventually deletes them. It runs as a background thread. It determines what part of a log file is current and what part of the data is no longer used. The exact procedure:

  1. replay the individual log file
  2. if a record's offset matches the index table, the record is current
  3. if the fraction of current data falls below a certain ratio (30%), reinsert all current data at the end of the log
  4. the file now contains only old data, so it can be deleted
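
A sketch of this procedure, reusing the hypothetical LogOp opcodes and LogReplay.pack() from the earlier sketches; the Store interface, the 30% threshold constant and the way liveness is detected (the index still points at this file and offset) are assumptions:

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class Compaction {

        /** Invented stand-in for the real store; reinsertion goes through the
         *  normal write path, which appends to the log and updates the index. */
        interface Store {
            byte[] get(long recid);
            void update(long recid, byte[] data) throws IOException;
        }

        static void compactFile(long fileNum, File logFile, Map<Long, Long> index, Store store)
                throws IOException {
            List<Long> liveRecids = new ArrayList<>();
            long totalRecords = 0;

            // pass 1: replay the file and check which record versions the index still points at
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(logFile)))) {
                long offset = 0;
                while (true) {
                    byte op;
                    try { op = in.readByte(); } catch (EOFException e) { break; }
                    long instrOffset = offset;
                    offset += 1;
                    if (op == LogOp.RECORD) {
                        long recid = in.readLong();
                        int size = in.readInt();
                        in.skipBytes(size);
                        offset += 8 + 4 + size;
                        totalRecords++;
                        Long current = index.get(recid);
                        if (current != null && current == LogReplay.pack(fileNum, instrOffset)) {
                            liveRecids.add(recid);          // index still points here -> current
                        }
                    } else if (op == LogOp.TOMBSTONE) {
                        in.readLong();
                        offset += 8;
                    } else if (op == LogOp.FILE_OVERFLOW) {
                        break;
                    }
                    // tx start/commit/rollback carry no payload, nothing to skip
                }
            }

            // pass 2: only compact files that are mostly stale (below a 30% live ratio)
            if (totalRecords == 0 || (double) liveRecids.size() / totalRecords < 0.30) {
                for (long recid : liveRecids) {
                    store.update(recid, store.get(recid));  // re-append the latest version
                }
                if (!logFile.delete()) {
                    throw new IOException("could not delete " + logFile);
                }
            }
        }
    }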

Optionally, compaction could discard only data older than a certain age. In that case it behaves similarly to a snapshot: it takes a log-file-end pointer as a parameter, the index table is replayed to match the required age, and so on.

However, data that are old but still current for a snapshot cannot be moved to the end of the log file, since they would become invisible to the snapshot. I am not sure how to handle this case yet; perhaps create a second version of the log file (LOG_N_B) with the updated structure.

Locking

The log file should allow parallel writes and reads. The lock structure will be similar to StoreWAL:

  1. a global structural lock for increasing the log-file-end pointer and for file overflows
  2. a global commit lock to prevent rollback and commit from running at the same time
  3. segmented read-write recid locks to prevent reads of a record while it is being updated

The major challenge is to minimize the work done under the global structural lock. For example, when a new record is inserted, the structural lock is held only to advance the end-of-log pointer; the data itself is written outside the structural lock.
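
A minimal sketch of this lock layout, assuming 16 lock segments and an invented LogWriterStub interface; the real write path will differ:

    import java.util.concurrent.locks.ReentrantLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class StoreLocks {

        // 1. structural lock: guards the log-file-end pointer and file overflow
        final ReentrantLock structuralLock = new ReentrantLock();
        // 2. commit lock: commit and rollback never run concurrently
        final ReentrantLock commitLock = new ReentrantLock();
        // 3. segmented read-write recid locks: readers of a record wait only
        //    while that record's segment is being updated
        final ReentrantReadWriteLock[] recidLocks = new ReentrantReadWriteLock[16];

        StoreLocks() {
            for (int i = 0; i < recidLocks.length; i++) {
                recidLocks[i] = new ReentrantReadWriteLock();
            }
        }

        ReentrantReadWriteLock lockFor(long recid) {
            // segment count is a power of two, so this is a cheap bit mask
            return recidLocks[(int) (recid & (recidLocks.length - 1))];
        }

        /** Invented stand-in for the log writer used by the insert sketch below. */
        interface LogWriterStub {
            long reserve(int bytes);                 // advance the end-of-log pointer
            void writeRecordAt(long offset, long recid, byte[] data);
        }

        /** Insert sketch: only the end-of-log pointer moves under the structural
         *  lock; the record bytes are written outside of it. */
        long insert(long recid, byte[] data, LogWriterStub log) {
            long writeOffset;
            structuralLock.lock();
            try {
                writeOffset = log.reserve(1 + 8 + 4 + data.length);
            } finally {
                structuralLock.unlock();
            }
            ReentrantReadWriteLock lock = lockFor(recid);
            lock.writeLock().lock();
            try {
                log.writeRecordAt(writeOffset, recid, data);
            } finally {
                lock.writeLock().unlock();
            }
            return writeOffset;
        }
    }

Using a power-of-two number of segments keeps the recid-to-segment mapping a cheap bit mask.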

Questions:

Should the index-table diff be in a separate file? A file-sync on an additional file could add some performance overhead. Perhaps add a new log-file instruction which would contain the index-file diff after every commit. This instruction could be allowed only at the beginning of a log file, to keep index replay fast.

Rollback: The current design has a rollback instruction which invalidates all data since the last tx-start. However, perhaps it would be better on rollback to revert the log-file-end pointer back to the tx-start and discard the uncommitted data. It would make replay simpler; I am not sure about the durability impact.

Per-file commit: I am not sure how file sync behaves. One option could be to overflow the log file on each commit.

1MB dumps: If transactions are disabled, we could keep the current log file in memory and only dump it after the log file overflows.

Async sync: We are dealing with many small files. If transactions are disabled, we need a way to file-sync them asynchronously.

File system and many files in a single dir: EXT4 and other older file systems do not handle 1M+ files in a single directory well. We need a new subdirectory for every 10,000 files.
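
A tiny sketch of such a layout, assuming 10,000 log files per subdirectory; the naming scheme is made up:

    import java.io.File;

    class LogFileNaming {

        /** Places log file N into a subdirectory holding at most 10,000 files,
         *  e.g. file 123456 ends up in <base>/12/log.123456. */
        static File logFile(File baseDir, long fileNum) {
            File subDir = new File(baseDir, Long.toString(fileNum / 10_000));
            subDir.mkdirs();                       // create the bucket directory if needed
            return new File(subDir, "log." + fileNum);
        }
    }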
