WIP: This is work in progress. Final implementation may diverge from what is described here.
The goal of this change is to make disk usage scanning less resource intensive and scale better on large installations.
We want to update totals while the scan is going on instead of having to wait for a full scan.
The derived goal is to be able to control the number of 'stat' operations on directory elements by only selectively doing a deep scan below a certain level.
This will enable us to do less intensive scanning and still have fairly accurate disk utilization numbers.
We want to do progressive directory scanning that relies more on deterministic eventual consistency.
Everything is done on separate buckets. Buckets are divided among the drives in a set. This allows us to run less aggressive scans while retaining the overall speed of the current scan.
In each bucket, each folder down to a certain level is scanned and assigned a 128-bit hash based purely on its path.
```
bucket/folder1/folderA/
bucket/folder2/folderB/folderX/
bucket/folder2/folderB/folderY/folderE/
bucket/folder3/
bucket/file.ext
```
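The hash function itself is not specified here; as one sketch, stdlib MD5 already yields the 16 bytes needed (collision resistance is irrelevant for this purpose, only a uniform, deterministic mapping):

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// pathHash derives a stable 128-bit identifier from a directory path.
// Any 16-byte hash with a uniform distribution would do; MD5 is used
// here only because it is in the standard library.
func pathHash(path string) [16]byte {
	return md5.Sum([]byte(path))
}

func main() {
	h := pathHash("bucket/folder2/folderB/folderX/")
	// The same path always yields the same hash.
	fmt.Println(h == pathHash("bucket/folder2/folderB/folderX/"))
}
```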
Each directory is represented in memory as:

```go
type Directory struct {
	OwnSize                  int64
	OwnObjectsCount          uint64
	OwnObjectsSizesHistogram [dataUsageBucketMax]uint64
	Children                 [][16]byte
}
```

Entries are kept as a `map[[16]byte]Directory` that should be able to quickly serialize to disk in the bucket. The aim is not to keep all folders in this struct, but only folders down to a specific level, level 2.
This means that disk usage will be precise down to 1 level, but `folderA` and `folderB` will not contain any children; they will only contain totals for everything below them. The hash of `folderA`/`folderB` will then determine in which run they are recalculated.

Scans will be done per level instead of fully recursively. If we decide to no longer add children, totals will be the total size including aggregated children.
We define `cycles`, the number of cycles after which we want fully correct disk usage information on unmodified content. A reasonable number for this could be 10 or 16.

The scan will be done like this per bucket:

`cycle = (lastCycle + 1) % cycles`. Initialize to 0 if no history is stored in the bucket. If there is no history this value will not be used anyway.
- Scan the top-level directory.
  - Collect stat info on all objects found at the top level (FS only) into `OwnSize`.
  - Update the `ChildTotal` entry with the total of all known children.
- Scan all subdirectories (`folder1`, `folder2`, etc.) unconditionally, maybe starting with unknown folders.
  - If a file: stat it, add to its own `OwnSize` and continue.
  - If a known subdirectory: use `hash % cycles == cycle` to determine whether to rescan. If not, simply add the stored total to `ChildTotal` and continue.
  - All new folders are scanned to full depth and added to `ChildTotal`.
When done, return the found hashes and total size to the parent, which will update its own total. The scanner can optionally save the new state as it receives updates from subfolders so progress isn't lost on a server restart.

When the bucket is complete, set `lastCycle = cycle`.
This will ensure that new/removed folders on 2 levels are reflected quickly in a scan and that 'deeper' changes are propagated within `cycles` runs.
For XL this means objects at the top level of a bucket will be exact, objects with prefixes will be lazily updated, and new/deleted objects at level 2 will be accounted for instantly. Same with directories at these two levels.
Every time a prefix folder has been scanned the total of the bucket can be updated.
On server startup, or when required, either by a scan or a request, load the history from the current bucket, and the total size of the bucket is known.
The data is self-healing, since invalid or missing data will simply be recreated on the next scan. Old entries can be removed by simply keeping the children of the top level and their children; anything else is obsolete.
Disk usage calculation can still run on random XL node/disk similar to how it does now.
This will enable a less monolithic disk space calculation while picking up new/removed prefixes. Disk space stats will be available at server startup and scan progress can be saved for large cycle times.
This should allow us to do less aggressive disk scanning at a slower overall speed, but have it run continuously instead.
A thing to remember is that XL "eats" a directory level, since data is stored in a directory and not in a file at the prefix level.
Separating `OwnSize` and `ChildTotal` is only really needed for the bucket, but there isn't much harm in it.
Disk/memory usage of a bucket with N directories, each with an average of M folders: `bytes = 40 + (16 * N) + (40 * M * N)`. Memory will additionally require `M * N` `map[[16]byte]` entries.

Example: a bucket with 256 prefixes, 1000 objects in each: `bytes = 40 + (16 * 256) + (40 * 256 * 1000) = ~10MB`.
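The estimate above can be turned into a tiny helper; the constants come straight from the formula:

```go
package main

import "fmt"

// cacheBytes estimates the serialized size of a bucket cache with n
// directories, each holding an average of m entries: 40 header bytes,
// 16 bytes per directory hash, and ~40 bytes of per-entry bookkeeping.
func cacheBytes(n, m int64) int64 {
	return 40 + 16*n + 40*m*n
}

func main() {
	// The example from the text: 256 prefixes, 1000 entries each.
	fmt.Println(cacheBytes(256, 1000)) // ~10MB
}
```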
This could include a full scan one level deeper, but that could lead to a lot of data to keep in memory for known directories as per the calculation above.
String hashes could be reduced to 64 bits to halve the memory/disk used. The implementation should make it easy to adjust the hash size. With 8 bytes the probability of a random collision with 250,000 directories is ~1/590,298,172 (if my math is correct), so 8 bytes could indeed be enough for this. Adding an order of magnitude (10x) more directories increases the collision probability by 2 orders of magnitude, so there it begins getting suspect.
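The quoted odds follow from the standard birthday bound, which can be checked quickly:

```go
package main

import (
	"fmt"
	"math"
)

// collisionOdds returns the approximate birthday-bound probability that
// any two of n random b-bit hashes collide: p ~= n(n-1) / (2 * 2^b).
func collisionOdds(n float64, bits int) float64 {
	return n * (n - 1) / (2 * math.Pow(2, float64(bits)))
}

func main() {
	p := collisionOdds(250000, 64)
	// Roughly 1 in 590 million for 250k directories with 64-bit hashes.
	fmt.Printf("1 in %.0f\n", 1/p)
}
```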
A dynamic throttling setting could be put in to allow tweaking the sleep coefficient of each crawler, so it can be reduced for users who are not as interested in recently updated disk usage.
By monitoring PUT/DELETE operations it is possible to track changed paths and keep a bloom filter for this data. This can help prioritize paths to scan.
The bloom filter can identify paths that have not changed, and the few collisions will only result in a marginal extra workload. This can be implemented at either bucket or bucket + (1 prefix level) granularity with reasonable performance.
The bloom filter is set to have a false positive rate at 1% at 250k entries. A bloom table of this size is about 680 bytes when serialized.
To not force a full scan of all paths that have changed, `shortCycle` bloom filters would need to be kept, so we guarantee that dirty paths have been scanned within `shortCycle` runs. Until `shortCycle` bloom filters have been collected, all paths are considered dirty.
Bloom filters have the advantage they can be created in a distributed manner and merged later, so there is no need for a centralized bloom filter. This allows servers to independently keep bloom filters and no interserver communication is needed.
When starting scans the bloom filters will need to be collected and reset consistently. When collecting bloom filters the merged bloom filter should be saved and kept around in case of an interrupted operation. Individual servers are responsible for persisting the bloom filter on a regular basis and on shutdown. Each server has a single global bloom filter across all sets.
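A minimal illustration of the add/test/merge mechanics (this is a toy fixed-size filter with assumed hashing, not the production implementation, which would be sized for the stated 1% false-positive rate):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a tiny fixed-size bloom filter used only to illustrate the
// operations; merging works because all servers share size and hashes.
type bloom struct{ bits [1 << 12]byte } // 32768 bits

// locations derives 4 bit positions from a path using double hashing.
func (b *bloom) locations(path string) [4]uint32 {
	h := fnv.New64a()
	h.Write([]byte(path))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	var loc [4]uint32
	for i := range loc {
		loc[i] = (h1 + uint32(i)*h2) % (uint32(len(b.bits)) * 8)
	}
	return loc
}

// Add records a changed path.
func (b *bloom) Add(path string) {
	for _, l := range b.locations(path) {
		b.bits[l/8] |= 1 << (l % 8)
	}
}

// Test reports whether a path may have changed (false = definitely not).
func (b *bloom) Test(path string) bool {
	for _, l := range b.locations(path) {
		if b.bits[l/8]&(1<<(l%8)) == 0 {
			return false
		}
	}
	return true
}

// Merge ORs another server's filter into b; this is what lets each
// server track changes independently with no central coordination.
func (b *bloom) Merge(other *bloom) {
	for i := range b.bits {
		b.bits[i] |= other.bits[i]
	}
}

func main() {
	var serverA, serverB bloom
	serverA.Add("bucket/folder1/")
	serverB.Add("bucket/folder2/folderB/")

	serverA.Merge(&serverB)
	fmt.Println(serverA.Test("bucket/folder1/"), serverA.Test("bucket/folder2/folderB/"))
}
```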
The usage scan leader will also be in charge of keeping track of bloom filter global state.
The global state is indexes `X` and `Y`. `Y` is the index of the current scan; this is the bloom index that is currently being written to by all servers. `X` is the oldest bloom index the crawler master is interested in. Typically this will be `Y - cycles`.
* Scan start.
* Each server returns a merged bloom filter from index X up to (but not including) index Y.
* Each server starts a new bloom filter with index Y if it doesn't exist (likely).
* (Each server can discard bloom filters up to index X.)
* Scan runs using the merged bloom filters.
* On a successful scan, the main server increments the global index Y and the lower index X.
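One plausible reading of the index bookkeeping on the scan leader (the names and the exact increment rule are assumptions, since the text is terse): `Y` advances after every successful scan, and `X` trails it by `cycles` once enough history exists.

```go
package main

import "fmt"

// cycles is the window of bloom filter indexes the leader keeps around.
const cycles = 16

// bloomState tracks the global indexes kept by the scan leader: Y is the
// index currently being written to, X the oldest index of interest.
type bloomState struct {
	X, Y uint64
}

// advance moves the state forward after a successful scan: writers switch
// to a fresh index and the window of interesting filters slides with it.
func (s *bloomState) advance() {
	s.Y++
	if s.Y > cycles {
		s.X = s.Y - cycles
	}
}

func main() {
	var s bloomState
	for i := 0; i < 20; i++ {
		s.advance()
	}
	fmt.Println(s.X, s.Y) // after 20 scans: X trails Y by `cycles`
}
```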
This ensures that scans are resumable and that servers will be kept in sync. It also ensures that calls to switch bloom filters are idempotent, since calls with index `Y` will just continue to add data to the existing bloom filter.

As an error recovery mechanism the servers should be able to return the current `Y`, so a master server can recover its state if it should be lost.
```go
type BloomFilterResponse struct {
	// Current index being written to.
	CurrentIdx uint64
	// Oldest index in the returned bloom filter.
	OldestIdx uint64
	// Newest index in the returned bloom filter.
	NewestIdx uint64
	// Are all indexes between oldest and newest filled?
	Complete bool
	// Binary data of the bloom filter.
	Filter []byte
}

// CycleBloomFilter will cycle the bloom filter to start recording to index y if not already.
// The response will contain a bloom filter starting at index x up to, but not including, index y.
// If y is 0, the response will not update y, but return the currently recorded information
// from the current x to y-1.
func CycleBloomFilter(ctx context.Context, x, y uint64) (*BloomFilterResponse, error) {
	return nil, nil
}
```
To accommodate more diverse data structures it could be helpful to have the levels be adaptive to the number of folders found.