@fbeauchamp
Last active May 27, 2022 16:15
The problem

  • Merging is resource-intensive
  • Multiple processes modify the VHD directory, creating some interesting race conditions (hi hanjo)

The solution

  • don't move files, only move index entries
  • use an existing solution to handle concurrency: a database

database schema

disk (id*, label, status)   // status is used by the merge / vacuum processes below (DELETING, ...)
blockAddress (id*, hash, diskId, offset)   // (diskId, offset) is unique
// block doesn't use a surrogate id: this ensures we never update the hash of an existing block
// also, SQLite requires the target of a foreign key to have a unique constraint,
// so blockAddress can't reference blockStorage directly; it references block instead
block (hash*)
blockStorage (hash, storage, blockStatusId)   // (hash, storage) is unique
blockStatus (id*, label)   // UPLOADING, CREATING, CREATED, DELETING, ...
Storage (id*, label)
Lock (path*, from, by, taskType {UPLOAD, DELETE, ...})   // locks are handled in the database; from is a datetime, by is the process holding the lock
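
As a minimal sketch, this is what the schema could look like in SQLite DDL; the column types, the disk.status column and the quoting of reserved words ("offset", "from") are assumptions, not a final design:

CREATE TABLE disk (
  id INTEGER PRIMARY KEY,
  label TEXT,
  status TEXT                        -- e.g. 'OK', 'DELETING'
);

CREATE TABLE block (
  hash TEXT PRIMARY KEY              -- content hash, never updated
);

CREATE TABLE blockAddress (
  id INTEGER PRIMARY KEY,
  hash TEXT NOT NULL REFERENCES block(hash),
  diskId INTEGER NOT NULL REFERENCES disk(id),
  "offset" INTEGER NOT NULL,
  UNIQUE (diskId, "offset")
);

CREATE TABLE blockStatus (
  id INTEGER PRIMARY KEY,
  label TEXT                         -- UPLOADING, CREATING, CREATED, DELETING, ...
);

CREATE TABLE Storage (
  id INTEGER PRIMARY KEY,
  label TEXT
);

CREATE TABLE blockStorage (
  hash TEXT NOT NULL REFERENCES block(hash),
  storage INTEGER NOT NULL REFERENCES Storage(id),
  blockStatusId INTEGER REFERENCES blockStatus(id),
  UNIQUE (hash, storage)
);

CREATE TABLE Lock (
  path TEXT PRIMARY KEY,             -- the locked block hash or file path
  "from" TEXT,                       -- datetime the lock was taken
  by TEXT,                           -- process holding the lock
  taskType TEXT                      -- UPLOAD, DELETE, ...
);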

upload process

save vhd metadata in database

for each block of the vhd
    if the block does not exist in the DB with a success state
        obtain a file lock (handling stale locks)
        if the block is still missing
            create it with status UPLOADING, its hash and size
            upload it (refreshing the lock every minute)
            update its status to ok (CREATED)
        else
            // already uploaded by another backup, nothing to do
        dispose of the lock
    add the block to the blockAddress table
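
A sketch of the corresponding SQL for the block creation and status transition; the named parameters :hash, :storageId, :diskId and :offset are assumptions, and ON CONFLICT needs SQLite 3.24+:

-- "create it with status UPLOADING"
INSERT INTO block (hash) VALUES (:hash)
  ON CONFLICT (hash) DO NOTHING;
INSERT INTO blockStorage (hash, storage, blockStatusId)
  VALUES (:hash, :storageId,
          (SELECT id FROM blockStatus WHERE label = 'UPLOADING'));

-- ... upload the block file, refreshing the lock every minute ...

-- "update its status to ok"
UPDATE blockStorage
   SET blockStatusId = (SELECT id FROM blockStatus WHERE label = 'CREATED')
 WHERE hash = :hash AND storage = :storageId;

-- "add the block to the blockAddress table"
INSERT INTO blockAddress (hash, diskId, "offset")
VALUES (:hash, :diskId, :offset);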

Also upload a flat file (blockIndex => hash) plus the VHD metadata, to ensure restorability even if the database is broken.

The flat file also contains the ancestors' blocks, ensuring it is not modified when ancestors are merged. Its maximum size would be 32 MB for a 2 TB VHD. It can be generated in one query using the WITH keyword: https://www.sqlite.org/lang_with.html
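
For example, the index could be produced with a recursive query like the sketch below; it assumes a disk.parentId column (not shown in the schema above) linking a child to its parent, and relies on SQLite's bare-column-with-MIN() behaviour to keep, for each offset, the block of the disk closest to the child:

WITH RECURSIVE chain(id, depth) AS (
    SELECT :diskId, 0
  UNION ALL
    SELECT d.parentId, c.depth + 1
    FROM disk d JOIN chain c ON d.id = c.id
    WHERE d.parentId IS NOT NULL
)
SELECT ba."offset", ba.hash, MIN(c.depth)
FROM blockAddress ba
JOIN chain c ON ba.diskId = c.id
GROUP BY ba."offset"
ORDER BY ba."offset";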

The flat file should have the *.vhd extension and live in the same path as today's backups, allowing it to be restored from an XO even if it doesn't have the database, as long as the installation has a VHD class able to read it.

Merging process

BEGIN TRANSACTION

UPDATE blockAddress
SET diskId = <childDiskId>
WHERE id IN (  -- use the blockAddress id to target exactly the parent's rows
  SELECT parent.id
    FROM blockAddress parent
      LEFT JOIN blockAddress child
        ON child.diskId = <childDiskId>
          AND child.offset = parent.offset
    WHERE parent.diskId IN (<list of parent disk ids>)
      AND child.diskId IS NULL -- no block already at this offset on the child
)

-- at this point, all the blocks still referenced by the parent's blockAddress are blocks that should be deleted
UPDATE disk SET status = 'DELETING' WHERE id = <parentDiskId>
DELETE FROM blockAddress WHERE diskId = <parentDiskId>

COMMIT

// at this point the UI is ok: from the user's point of view the merge is done

launch the vacuuming

// this is when we actually get the space back

restoring backup process

deleting backup process

  • predict the space freed by querying the database for blocks only used by this disk and not chained to a child (see the query sketch below)
  • delete the disk and the blockAddress rows linked to the backup
  • launch the vacuuming
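
A possible sketch of the space prediction, assuming a named parameter :diskId and the 2 MB fixed block size of VHD directories (compression would make the real figure smaller):

SELECT COUNT(*) * 2 * 1024 * 1024 AS freedBytes
FROM block b
WHERE EXISTS (SELECT 1 FROM blockAddress ba
              WHERE ba.hash = b.hash AND ba.diskId = :diskId)
  AND NOT EXISTS (SELECT 1 FROM blockAddress ba
                  WHERE ba.hash = b.hash AND ba.diskId <> :diskId);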

vacuuming

after merge / delete backup

  • list all disks with status DELETING
  • remove the disks' metadata and files
  • delete the disk records from the database
  • obtain a lock on a batch of blocks without any blockAddress (see the query sketch below)
  • delete the files that were successfully locked
  • remove the block records from the database
  • dispose of the locks

This process is the only one that can delete data from the remote. It should run in an isolated process with specific permissions, and ensure as much data consistency as possible before deleting any file.
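
The batch of deletable blocks can be found with a simple anti-join; the batch size of 100 is an arbitrary assumption:

SELECT b.hash
FROM block b
LEFT JOIN blockAddress ba ON ba.hash = b.hash
WHERE ba.hash IS NULL
LIMIT 100;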

Locking

obtaining a lock

INSERT INTO Lock (path, taskType) -- the task type lets us restart a broken locked process
SELECT hash, 'DELETING' /* or 'UPLOADING' */ FROM block
WHERE hash = ?
RETURNING path

The uniqueness constraint on the primary key makes this return an error if the lock is already held by anyone else.

stale lock hunter

  • get the oldest lock (see the query sketch below)
  • if the lock is a DELETING lock
    • if the delete succeeded: remove the lock
    • else
      • if the file is ok: remove the lock
      • else: mark the vhds depending on this block as broken (theoretically none)
  • else
    • if the file is ok on a remote (correct size and hash)
      • remove the lock
    • else
      • delete both if the file is incomplete
      • delete only the lock if the file is complete
      • mark the vhds depending on this block as broken
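
The "get the oldest lock" step could look like the query below; the one-hour threshold and the ISO-8601 datetime format are assumptions, anything comfortably larger than the one-minute lock refresh of the upload process would work:

SELECT path, "from", by, taskType
FROM Lock
WHERE "from" < datetime('now', '-1 hour')
ORDER BY "from" ASC;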

Extra benefit 1 : deduplication

Using the hash of the block content as the block key gives us deduplication for free.
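
As an illustration, the effective dedup ratio can be read directly from the database by comparing logical block references with physical blocks stored:

SELECT CAST(COUNT(*) AS REAL) / (SELECT COUNT(*) FROM block) AS dedupRatio
FROM blockAddress;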

Extra benefit 2 : storage tiering

Tracking which blocks are on each remote enables us to define scenarios where we upload to a faster SR and then move the data to a slower one (think of a local backup uploaded to S3, then to Glacier).
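
For instance, the blocks still to be copied from the fast remote to the slow one can be listed with a single anti-join; :fastStorageId and :slowStorageId are hypothetical parameters:

SELECT fast.hash
FROM blockStorage fast
LEFT JOIN blockStorage slow
  ON slow.hash = fast.hash AND slow.storage = :slowStorageId
WHERE fast.storage = :fastStorageId
  AND slow.hash IS NULL;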

Extra benefit 3 : immutability

  • blocks are never modified or moved, and can be easily checked against corruption or external modification
  • indexes are not modified by merging

Since no file modification occurs anymore, we can rely on S3 object lock to give a strong guarantee to our users, with legal value. It's a key feature of ransomware mitigation tools.

Users can also activate legal hold, preventing us from deleting any file on the remote. We can then have an API that extracts the list of files to delete in an AWS-compatible format, allowing the user to remove unneeded files with an external tool without ever granting that authorization to XO.

Extra benefit 4 : data viz

Having the data in a database allows us to show some fancy visualisations of job processes / progress, without even reworking the Tasks.

tradeoffs

  • database structure changes mean migrations, and those are tedious, especially on a 10 GB+ database
  • backing up this database can't rely on the current backup process. We need to run a database export regularly, plus a reindex process that can rebuild the database from the remote listing and the BATs
  • the smaller the block, the bigger the database. With 2 MB blocks like VHD that means 500 K records per TB, but with 1 KB blocks it means 1 G records => not really usable as is for NBD
  • SQLite doesn't really like remote access, which means we have to think of something for the proxy, or use a full-fledged database (PostgreSQL for example)
  • backup speed may be slower when writing to an SR with low concurrency, but I'm betting that dedup + compression + no merge will speed up the process globally

preliminary tests

SQLite handles a database with ~10 M blocks quite well, which means 20 TB of backups (and with a dedup ratio of 5 and a compression ratio of 2, the user sees it as 200 TB of backups). The resulting SQLite file is 15 GB, so about 10,000 times smaller than the saved storage.

I would advise using a client-server database like PostgreSQL; this will allow more flexibility and more security.

disaster recovery

Let's take a scenario where the database is corrupted by bad hardware, human error or ransomware.

We assume that the file storage is safe with object lock

  1. list all blocks per remote and fill block and blockStorage
  2. list all vm -> disks and fill disk table
  3. list all BAT and fill blockAddress
  4. if a blockAddress references a missing block => mark the VHD as broken
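
Step 4 could be a single statement once block has been refilled from the remote listing; the 'BROKEN' status label is an assumption:

UPDATE disk
SET status = 'BROKEN'
WHERE id IN (
  SELECT ba.diskId
  FROM blockAddress ba
  LEFT JOIN block b ON b.hash = ba.hash
  WHERE b.hash IS NULL
);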

file level restoration

steps

  1. choose the database (in fact it will be PostgreSQL)
  2. create a subclass of VhdAbstract called VhdDirectoryIndexed reading and writing from the database
  3. update openVhd
  4. update createVhdDirectoryFromStream
  5. test backups, restoration, dedup. Can be released in Alpha
  6. implement a merge fast track if both VHDs are VhdDirectoryIndexed
  7. implement vacuum
  8. measure speed gain against 4. Test with immutable FS. Can be released in Beta
  9. implement data tiering (replication)
  10. implement data buffering (backup to a fast remote, then moving all data to a slower and cheaper one)
  11. implement restoration without database (disaster recovery)
  12. implement file level restoration
fbeauchamp commented May 27, 2022

Alternative: don't use the VHD format (with its parsing), but directly the block API vatesfr/xen-orchestra#3123

  • 64 KB blocks means 15 M blocks / TB
  • compression may not be needed with the smaller block size
