@fbeauchamp
Last active May 27, 2022 16:15
The problem

  • Merging is resource-intensive
  • Multiple processes modify the VHD directory, creating some interesting race conditions (hi hanjo)

The solution

  • don't move files, only move index entries
  • use an existing solution to handle concurrency: a database

database schema

disk (id*, label, status)   // status is used by the merge / vacuum processes below (DELETING, ...)
blockAddress (id*, hash, diskId, offset)   // (diskId, offset) is unique
// block doesn't use a surrogate id: this ensures we never update the hash of an existing block
// also, SQLite requires the target of a foreign key to have a unique constraint,
// so blockAddress can't reference blockStorage directly; it references block instead
block (hash*)
blockStorage (hash, storage, blockStatusId)   // (hash, storage) is unique
blockStatus (id*, label)   // UPLOADING, CREATING, CREATED, DELETING, ...
Storage (id*, label)
Lock (path*, from, by, taskType {UPLOAD, DELETE, ...})   // locks are handled in the database; from is a datetime, by is the process holding the lock
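
As a minimal sketch, this is what the schema could look like in SQLite DDL; the column types, the disk.status column and the quoting of reserved words ("offset", "from") are assumptions, not a final design:

CREATE TABLE disk (
  id INTEGER PRIMARY KEY,
  label TEXT,
  status TEXT                        -- e.g. 'OK', 'DELETING'
);

CREATE TABLE block (
  hash TEXT PRIMARY KEY              -- content hash, never updated
);

CREATE TABLE blockAddress (
  id INTEGER PRIMARY KEY,
  hash TEXT NOT NULL REFERENCES block(hash),
  diskId INTEGER NOT NULL REFERENCES disk(id),
  "offset" INTEGER NOT NULL,
  UNIQUE (diskId, "offset")
);

CREATE TABLE blockStatus (
  id INTEGER PRIMARY KEY,
  label TEXT                         -- UPLOADING, CREATING, CREATED, DELETING, ...
);

CREATE TABLE Storage (
  id INTEGER PRIMARY KEY,
  label TEXT
);

CREATE TABLE blockStorage (
  hash TEXT NOT NULL REFERENCES block(hash),
  storage INTEGER NOT NULL REFERENCES Storage(id),
  blockStatusId INTEGER REFERENCES blockStatus(id),
  UNIQUE (hash, storage)
);

CREATE TABLE Lock (
  path TEXT PRIMARY KEY,             -- the locked block hash or file path
  "from" TEXT,                       -- datetime the lock was taken
  by TEXT,                           -- process holding the lock
  taskType TEXT                      -- UPLOAD, DELETE, ...
);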

upload process

save vhd metadata in database

for each block of the vhd
    if the block does not exist in the DB with a success state
        obtain a file lock (handling stale locks)
        if the block is still missing
            create it with status UPLOADING, its hash and size
            upload it (refreshing the lock every minute)
            update its status to ok (CREATED)
        else
            // already uploaded by another backup, nothing to do
        dispose of the lock
    add the block to the blockAddress table
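
A sketch of the corresponding SQL for the block creation and status transition; the named parameters :hash, :storageId, :diskId and :offset are assumptions, and ON CONFLICT needs SQLite 3.24+:

-- "create it with status UPLOADING"
INSERT INTO block (hash) VALUES (:hash)
  ON CONFLICT (hash) DO NOTHING;
INSERT INTO blockStorage (hash, storage, blockStatusId)
  VALUES (:hash, :storageId,
          (SELECT id FROM blockStatus WHERE label = 'UPLOADING'));

-- ... upload the block file, refreshing the lock every minute ...

-- "update its status to ok"
UPDATE blockStorage
   SET blockStatusId = (SELECT id FROM blockStatus WHERE label = 'CREATED')
 WHERE hash = :hash AND storage = :storageId;

-- "add the block to the blockAddress table"
INSERT INTO blockAddress (hash, diskId, "offset")
VALUES (:hash, :diskId, :offset);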

Also upload a flat file (blockIndex => hash) plus the VHD metadata, to ensure restorability even if the database is broken.

The flat file also contains the ancestors' blocks, ensuring it is not modified when ancestors are merged. Its maximum size would be 32 MB for a 2 TB VHD. It can be generated in one query using the WITH keyword: https://www.sqlite.org/lang_with.html
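
For example, the index could be produced with a recursive query like the sketch below; it assumes a disk.parentId column (not shown in the schema above) linking a child to its parent, and relies on SQLite's bare-column-with-MIN() behaviour to keep, for each offset, the block of the disk closest to the child:

WITH RECURSIVE chain(id, depth) AS (
    SELECT :diskId, 0
  UNION ALL
    SELECT d.parentId, c.depth + 1
    FROM disk d JOIN chain c ON d.id = c.id
    WHERE d.parentId IS NOT NULL
)
SELECT ba."offset", ba.hash, MIN(c.depth)
FROM blockAddress ba
JOIN chain c ON ba.diskId = c.id
GROUP BY ba."offset"
ORDER BY ba."offset";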

The flat file should have the *.vhd extension and live in the same path as today's backups, allowing it to be restored from an XO even if it doesn't have the database, as long as the installation has a VHD class able to read it.

Merging process

BEGIN TRANSACTION

UPDATE blockAddress
SET diskId = <childDiskId>
WHERE id IN (  -- use the blockAddress id to target exactly the parent's rows
  SELECT parent.id
    FROM blockAddress parent
      LEFT JOIN blockAddress child
        ON child.diskId = <childDiskId>
          AND child.offset = parent.offset
    WHERE parent.diskId IN (<list of parent disk ids>)
      AND child.diskId IS NULL -- no block already at this offset on the child
)

-- at this point, all the blocks still referenced by the parent's blockAddress are blocks that should be deleted
UPDATE disk SET status = 'DELETING' WHERE id = <parentDiskId>
DELETE FROM blockAddress WHERE diskId = <parentDiskId>

COMMIT

// at this point the UI is ok: from the user's point of view the merge is done

launch the vacuuming

// this is when we actually get the space back

restoring backup process

deleting backup process

  • predict the space freed by querying the database for blocks only used by this disk and not chained to a child (see the query sketch below)
  • delete the disk and the blockAddress rows linked to the backup
  • launch the vacuuming
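
A possible sketch of the space prediction, assuming a named parameter :diskId and the 2 MB fixed block size of VHD directories (compression would make the real figure smaller):

SELECT COUNT(*) * 2 * 1024 * 1024 AS freedBytes
FROM block b
WHERE EXISTS (SELECT 1 FROM blockAddress ba
              WHERE ba.hash = b.hash AND ba.diskId = :diskId)
  AND NOT EXISTS (SELECT 1 FROM blockAddress ba
                  WHERE ba.hash = b.hash AND ba.diskId <> :diskId);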

vacuuming

after merge / delete backup

  • list all disks with status DELETING
  • remove the disks' metadata and files
  • delete the disk records from the database
  • obtain a lock on a batch of blocks without any blockAddress (see the query sketch below)
  • delete the files that were successfully locked
  • remove the block records from the database
  • dispose of the locks

This process is the only one that can delete data from the remote. It should run in an isolated process with specific permissions, and ensure as much data consistency as possible before deleting any file.
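
The batch of deletable blocks can be found with a simple anti-join; the batch size of 100 is an arbitrary assumption:

SELECT b.hash
FROM block b
LEFT JOIN blockAddress ba ON ba.hash = b.hash
WHERE ba.hash IS NULL
LIMIT 100;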

Locking

obtaining a lock

INSERT INTO Lock (path, taskType) -- the task type lets us restart a broken locked process
SELECT hash, 'DELETING' /* or 'UPLOADING' */ FROM block
WHERE hash = ?
RETURNING path

The uniqueness constraint on the primary key makes this return an error if the lock is already held by anyone else.

stale lock hunter

  • get the oldest lock (see the query sketch below)
  • if the lock is a DELETING lock
    • if the delete succeeded: remove the lock
    • else
      • if the file is ok: remove the lock
      • else: mark the vhds depending on this block as broken (theoretically none)
  • else
    • if the file is ok on a remote (correct size and hash)
      • remove the lock
    • else
      • delete both if the file is incomplete
      • delete only the lock if the file is complete
      • mark the vhds depending on this block as broken
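
The "get the oldest lock" step could look like the query below; the one-hour threshold and the ISO-8601 datetime format are assumptions, anything comfortably larger than the one-minute lock refresh of the upload process would work:

SELECT path, "from", by, taskType
FROM Lock
WHERE "from" < datetime('now', '-1 hour')
ORDER BY "from" ASC;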

Extra benefit 1 : deduplication

Using the hash of the block content as the block key gives us deduplication for free.
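
As an illustration, the effective dedup ratio can be read directly from the database by comparing logical block references with physical blocks stored:

SELECT CAST(COUNT(*) AS REAL) / (SELECT COUNT(*) FROM block) AS dedupRatio
FROM blockAddress;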

Extra benefit 2 : storage tiering

Tracking which blocks are on each remote enables us to define scenarios where we upload to a faster SR and then move the data to a slower one (think of a local backup uploaded to S3, then to Glacier).
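
For instance, the blocks still to be copied from the fast remote to the slow one can be listed with a single anti-join; :fastStorageId and :slowStorageId are hypothetical parameters:

SELECT fast.hash
FROM blockStorage fast
LEFT JOIN blockStorage slow
  ON slow.hash = fast.hash AND slow.storage = :slowStorageId
WHERE fast.storage = :fastStorageId
  AND slow.hash IS NULL;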

Extra benefit 3 : immutability

  • blocks are never modified or moved, and can be easily checked against corruption or external modification
  • indexes are not modified by merging

Since no file modification occurs anymore, we can rely on S3 object lock to give a strong guarantee to our users, with legal value. It's a key feature of ransomware mitigation tools.

Users can also activate legal hold, preventing us from deleting any file on the remote. We can then have an API that extracts the list of files to delete in an AWS-compatible format, allowing the user to remove unneeded files with an external tool without ever granting that authorization to XO.

Extra benefit 4 : data viz

Having the data in a database allows us to show some fancy visualisations of job processes / progress, without even reworking the Tasks.

tradeoffs

  • database structure changes mean migrations, and those are tedious, especially on a 10 GB+ database
  • backing up this database can't rely on the current backup process. We need to run a database export regularly, plus a reindex process that can rebuild the database from the remote listing and the BATs
  • the smaller the block, the bigger the database. With 2 MB blocks like VHD that means 500 K records per TB, but with 1 KB blocks it means 1 G records => not really usable as is for NBD
  • SQLite doesn't really like remote access, which means we have to think of something for the proxy, or use a full-fledged database (PostgreSQL for example)
  • backup speed may be slower when writing to an SR with low concurrency, but I'm betting that dedup + compression + no merge will speed up the process globally

preliminary tests

SQLite handles a database with ~10 M blocks quite well, which means 20 TB of backups (and with a dedup ratio of 5 and a compression ratio of 2, the user sees it as 200 TB of backups). The resulting SQLite file is 15 GB, so about 10,000 times smaller than the saved storage.

I would advise using a client-server database like PostgreSQL; this will allow more flexibility and more security.

disaster recovery

Let's take a scenario where the database is corrupted by bad hardware, human error or ransomware.

We assume that the file storage is safe with object lock

  1. list all blocks per remote and fill block and blockStorage
  2. list all vm -> disks and fill disk table
  3. list all BAT and fill blockAddress
  4. if a blockAddress references a missing block => mark the VHD as broken
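
Step 4 could be a single statement once block has been refilled from the remote listing; the 'BROKEN' status label is an assumption:

UPDATE disk
SET status = 'BROKEN'
WHERE id IN (
  SELECT ba.diskId
  FROM blockAddress ba
  LEFT JOIN block b ON b.hash = ba.hash
  WHERE b.hash IS NULL
);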

file level restoration

steps

  1. choose the database (in fact it will be PostgreSQL)
  2. create a subclass of VhdAbstract called VhdDirectoryIndexed reading and writing from the database
  3. update openVhd
  4. update createVhdDirectoryFromStream
  5. test backups, restoration, dedup. Can be released in Alpha
  6. implement a merge fast track if both VHDs are VhdDirectoryIndexed
  7. implement vacuum
  8. measure speed gain against 4. Test with immutable FS. Can be released in Beta
  9. implement data tiering (replication)
  10. implement data buffering (backup to a fast remote, then moving all data to a slower and cheaper one)
  11. implement restoration without database (disaster recovery)
  12. implement file level restoration
fbeauchamp commented May 27, 2022

Alternative: don't use the VHD format (with its parsing), but directly the block API vatesfr/xen-orchestra#3123

  • 64 KB blocks means 15 M blocks / TB
  • compression may not be needed with the smaller block size
