WIP: This is work in progress. Final implementation may diverge from what is described here.
The goal of this change is to make disk usage scanning less resource intensive and scale better on large installations.
We want to update totals while the scan is going on instead of having to wait for a full scan.
The derived goal is to be able to control the number of 'stat' operations on directory elements by only selectively doing a deep scan below a certain level.
This will enable us to do less intensive scanning and still have fairly accurate disk utilization numbers.
We want to do progressive directory scanning that relies more on deterministic eventual consistency.
Everything is done on separate buckets. Buckets are divided among the drives in a set. This allows us to run less aggressive scans while retaining the overall speed of the current scan.
In each bucket, each folder down to a certain level is scanned and assigned a 128-bit hash based purely on its path.
```
bucket/folder1/folderA/
bucket/folder2/folderB/folderX/
bucket/folder2/folderB/folderY/folderE/
bucket/folder3/
bucket/file.ext
```
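The hash function itself is not specified here; as one sketch, stdlib MD5 already yields the 16 bytes needed (collision resistance is irrelevant for this purpose, only a uniform, deterministic mapping):

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// pathHash derives a stable 128-bit identifier from a directory path.
// Any 16-byte hash with a uniform distribution would do; MD5 is used
// here only because it is in the standard library.
func pathHash(path string) [16]byte {
	return md5.Sum([]byte(path))
}

func main() {
	h := pathHash("bucket/folder2/folderB/folderX/")
	// The same path always yields the same hash.
	fmt.Println(h == pathHash("bucket/folder2/folderB/folderX/"))
}
```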
Each directory is represented in memory as:

```go
type Directory struct {
	OwnSize                  int64
	OwnObjectsCount          uint64
	OwnObjectsSizesHistogram [dataUsageBucketMax]uint64
	Children                 [][16]byte
}
```

Entries are kept as a `map[[16]byte]Directory` that should be able to quickly serialize to disk in the bucket. The aim is not to keep all folders in this struct, but only folders down to a specific level, level 2.
This means that disk usage will be precise down to 1 level, but `folderA` and `folderB` will not contain any children; they will only contain totals for everything below them. The hash of `folderA`/`folderB` will then determine in which run they are recalculated.

Scans will be done per level instead of fully recursively. If we decide to no longer add children, totals will be the total size including aggregated children.
We define `cycles`, the number of cycles after which we want fully correct disk usage information on unmodified content. A reasonable number for this could be 10 or 16.

The scan will be done like this per bucket:

`cycle = (lastCycle + 1) % cycles`. Initialize to 0 if no history is stored in the bucket. If there is no history this value will not be used anyway.
- Scan the top-level directory.
  - Collect stat info on all objects found at the top level (FS only) into `OwnSize`.
  - Update the `ChildTotal` entry with the total of all known children.
- Scan all subdirectories (`folder1`, `folder2`, etc.) unconditionally, maybe starting with unknown folders.
  - If a file: stat it, add to its own `OwnSize` and continue.
  - If a known subdirectory: use `hash % cycles == cycle` to determine whether to rescan. If not, simply add the stored total to `ChildTotal` and continue.
  - All new folders are scanned to full depth and added to `ChildTotal`.
When done, return the found hashes and total size to the parent, which will update its own total. The scanner can optionally save the new state as it receives updates from subfolders so progress isn't lost on a server restart.

When the bucket is complete, set `lastCycle = cycle`.
This will ensure that new/removed folders on 2 levels are reflected quickly in a scan and that 'deeper' changes are propagated within `cycles` runs.
For XL this means objects at the top level of a bucket will be exact, objects with prefixes will be lazily updated, and new/deleted objects at level 2 will be accounted for instantly. Same with directories at these two levels.
Every time a prefix folder has been scanned the total of the bucket can be updated.
On server startup, or when required, either by a scan or a request, load the history from the current bucket, and the total size of the bucket is known.
The data is self-healing, since invalid or missing data will simply be recreated on the next scan. Old entries can be removed by simply keeping the children of the top level and their children; anything else is obsolete.
Disk usage calculation can still run on random XL node/disk similar to how it does now.
This will enable a less monolithic disk space calculation while picking up new/removed prefixes. Disk space stats will be available at server startup and scan progress can be saved for large cycle times.
This should allow us to do less aggressive disk scanning at a slower overall speed, but have it run continuously instead.
A thing to remember is that XL "eats" a directory level, since data is stored in a directory and not in a file at the prefix level.
Separating `OwnSize` and `ChildTotal` is only really needed for the bucket, but there isn't much harm in it.
Disk/memory usage of a bucket with N directories, each with an average of M folders: `bytes = 40 + (16 * N) + (40 * M * N)`. Memory will additionally require `M * N` `map[[16]byte]` entries.

Example: a bucket with 256 prefixes, 1000 objects in each: `bytes = 40 + (16 * 256) + (40 * 256 * 1000) = ~10MB`.
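The estimate above can be turned into a tiny helper; the constants come straight from the formula:

```go
package main

import "fmt"

// cacheBytes estimates the serialized size of a bucket cache with n
// directories, each holding an average of m entries: 40 header bytes,
// 16 bytes per directory hash, and ~40 bytes of per-entry bookkeeping.
func cacheBytes(n, m int64) int64 {
	return 40 + 16*n + 40*m*n
}

func main() {
	// The example from the text: 256 prefixes, 1000 entries each.
	fmt.Println(cacheBytes(256, 1000)) // ~10MB
}
```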
This could include a full scan one level deeper, but that could lead to a lot of data to keep in memory for known directories as per the calculation above.
String hashes could be reduced to 64 bits to halve the memory/disk used. The implementation should make it easy to adjust the hash size. With 8 bytes the probability of a random collision with 250,000 directories is ~1/590,298,172 (if my math is correct), so 8 bytes could indeed be enough for this. Adding an order of magnitude (10x) more directories increases the collision probability by 2 orders of magnitude, so there it begins getting suspect.
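The quoted odds follow from the standard birthday bound, which can be checked quickly:

```go
package main

import (
	"fmt"
	"math"
)

// collisionOdds returns the approximate birthday-bound probability that
// any two of n random b-bit hashes collide: p ~= n(n-1) / (2 * 2^b).
func collisionOdds(n float64, bits int) float64 {
	return n * (n - 1) / (2 * math.Pow(2, float64(bits)))
}

func main() {
	p := collisionOdds(250000, 64)
	// Roughly 1 in 590 million for 250k directories with 64-bit hashes.
	fmt.Printf("1 in %.0f\n", 1/p)
}
```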
A dynamic throttling setting could be put in to allow tweaking the sleep coefficient of each crawler, so it can be reduced for users who are not as interested in recently updated disk usage.
By monitoring PUT/DELETE operations it is possible to track changed paths and keep a bloom filter for this data. This can help prioritize paths to scan.
The bloom filter can identify paths that have not changed, and the few collisions will only result in a marginal extra workload. This can be implemented at either bucket or bucket + (1 prefix level) granularity with reasonable performance.
The bloom filter is set to have a false positive rate at 1% at 250k entries. A bloom table of this size is about 680 bytes when serialized.
To not force a full scan of all paths that have changed, `shortCycle` bloom filters would need to be kept, so we guarantee that dirty paths have been scanned within `shortCycle` runs. Until `shortCycle` bloom filters have been collected, all paths are considered dirty.
Bloom filters have the advantage they can be created in a distributed manner and merged later, so there is no need for a centralized bloom filter. This allows servers to independently keep bloom filters and no interserver communication is needed.
When starting scans the bloom filters will need to be collected and reset consistently. When collecting bloom filters the merged bloom filter should be saved and kept around in case of an interrupted operation. Individual servers are responsible for persisting the bloom filter on a regular basis and on shutdown. Each server has a single global bloom filter across all sets.
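A minimal illustration of the add/test/merge mechanics (this is a toy fixed-size filter with assumed hashing, not the production implementation, which would be sized for the stated 1% false-positive rate):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a tiny fixed-size bloom filter used only to illustrate the
// operations; merging works because all servers share size and hashes.
type bloom struct{ bits [1 << 12]byte } // 32768 bits

// locations derives 4 bit positions from a path using double hashing.
func (b *bloom) locations(path string) [4]uint32 {
	h := fnv.New64a()
	h.Write([]byte(path))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	var loc [4]uint32
	for i := range loc {
		loc[i] = (h1 + uint32(i)*h2) % (uint32(len(b.bits)) * 8)
	}
	return loc
}

// Add records a changed path.
func (b *bloom) Add(path string) {
	for _, l := range b.locations(path) {
		b.bits[l/8] |= 1 << (l % 8)
	}
}

// Test reports whether a path may have changed (false = definitely not).
func (b *bloom) Test(path string) bool {
	for _, l := range b.locations(path) {
		if b.bits[l/8]&(1<<(l%8)) == 0 {
			return false
		}
	}
	return true
}

// Merge ORs another server's filter into b; this is what lets each
// server track changes independently with no central coordination.
func (b *bloom) Merge(other *bloom) {
	for i := range b.bits {
		b.bits[i] |= other.bits[i]
	}
}

func main() {
	var serverA, serverB bloom
	serverA.Add("bucket/folder1/")
	serverB.Add("bucket/folder2/folderB/")

	serverA.Merge(&serverB)
	fmt.Println(serverA.Test("bucket/folder1/"), serverA.Test("bucket/folder2/folderB/"))
}
```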
The usage scan leader will also be in charge of keeping track of bloom filter global state.
The global state is indexes `X` and `Y`. `Y` is the index of the current scan; this is the bloom index that is currently being written to by all servers. `X` is the oldest bloom index the crawler master is interested in. Typically this will be `Y - cycles`.
* Scan start.
* Each server returns a merged bloom filter from index X up to (but not including) index Y.
* Each server starts a new bloom filter with index Y if it doesn't exist (likely).
* (Each server can discard bloom filters up to index X.)
* Scan runs using the merged bloom filters.
* On a successful scan, the main server increments the global index Y and the lower index X.
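One plausible reading of the index bookkeeping on the scan leader (the names and the exact increment rule are assumptions, since the text is terse): `Y` advances after every successful scan, and `X` trails it by `cycles` once enough history exists.

```go
package main

import "fmt"

// cycles is the window of bloom filter indexes the leader keeps around.
const cycles = 16

// bloomState tracks the global indexes kept by the scan leader: Y is the
// index currently being written to, X the oldest index of interest.
type bloomState struct {
	X, Y uint64
}

// advance moves the state forward after a successful scan: writers switch
// to a fresh index and the window of interesting filters slides with it.
func (s *bloomState) advance() {
	s.Y++
	if s.Y > cycles {
		s.X = s.Y - cycles
	}
}

func main() {
	var s bloomState
	for i := 0; i < 20; i++ {
		s.advance()
	}
	fmt.Println(s.X, s.Y) // after 20 scans: X trails Y by `cycles`
}
```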
This ensures that scans are resumable and that servers will be kept in sync. It also ensures that calls to switch bloom filters are idempotent, since calls with index `Y` will just continue to add data to the existing bloom filter.

As an error recovery mechanism the servers should be able to return the current `Y`, so a master server can recover its state if it should be lost.
```go
type BloomFilterResponse struct {
	// Current index being written to.
	CurrentIdx uint64
	// Oldest index in the returned bloom filter.
	OldestIdx uint64
	// Newest index in the returned bloom filter.
	NewestIdx uint64
	// Are all indexes between oldest and newest filled?
	Complete bool
	// Binary data of the bloom filter.
	Filter []byte
}

// CycleBloomFilter will cycle the bloom filter to start recording to index y if not already.
// The response will contain a bloom filter starting at index x up to, but not including, index y.
// If y is 0, the response will not update y, but return the currently recorded information
// from the current x to y-1.
func CycleBloomFilter(ctx context.Context, x, y uint64) (*BloomFilterResponse, error) {
	return nil, nil
}
```
To accommodate more diverse data structures it could be helpful to have the levels be adaptive to the number of folders found.