The main goal of this change is to make healing less resource intensive by doing healing checks continuously instead of running at "full" speed at fixed 30-day intervals.
Furthermore, we would like heal checks to be resumable, so an aborted heal check on a server restart doesn't mean healing has to start all over again or be postponed 30 days.
We would also like to utilize the crawling functionality to avoid the resource-intensive bucket listing of content when scanning.
Heal checks should be made at a reduced rate compared to usage scans. We target that objects are heal scanned once in every 1024 cycles.
This means that with an average scan cycle time of one hour, every object is checked roughly once every 43 days (1024 hours).
This means that some size checks would also do a heal check. Heal checks should have a blocking mode, so crawl speed can be throttled if healing is slow.
Heal checks necessarily access all shards, so they add significant load.
Since we now record the full path of elements up to 2 levels deep, we can precisely identify what has been removed since the last run on the first 2 object prefix levels.
This means that if we see files missing, we check the recorded set to see whether the file was deliberately removed. If it wasn't, we immediately queue healing for this item.
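A minimal sketch of that decision, assuming set membership lookups (the names `recordedPaths`, `removedPaths`, and `decideMissing` are hypothetical, not the actual implementation):

```go
package main

import "fmt"

// decideMissing reports whether a path that was recorded in the previous
// cycle but is now missing should be healed: if the path was not
// deliberately removed, its absence indicates lost data.
func decideMissing(recordedPaths, removedPaths map[string]bool, path string) bool {
	return recordedPaths[path] && !removedPaths[path]
}

func main() {
	recorded := map[string]bool{"bucket/prefix/obj": true}
	removed := map[string]bool{} // nothing was deliberately deleted
	if decideMissing(recorded, removed, "bucket/prefix/obj") {
		fmt.Println("queue heal for bucket/prefix/obj")
	}
}
```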
Full disk healing should remain as it is, meaning that when we detect a new drive and kick off full disk healing, we should not do additional heal checks.
Disks that are currently being healed should not be picked up for crawling, so we need to record that information so this can be avoided. This may be useful for other operations as well, so the disk is never picked for reading until healed.
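Recording that state could be as simple as an atomic flag per disk that crawling (and later, reads) consult; a sketch under that assumption, not the actual MinIO implementation:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// diskState tracks whether a disk is currently being healed. The crawler
// checks this flag so a partially restored disk is never selected.
type diskState struct {
	healing atomic.Bool
}

func (d *diskState) setHealing(on bool) { d.healing.Store(on) }
func (d *diskState) okForCrawl() bool   { return !d.healing.Load() }

func main() {
	var d diskState
	d.setHealing(true)
	fmt.Println("crawlable:", d.okForCrawl()) // crawlable: false
}
```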
This is the tricky part: we skip significant parts of the tree with bloom filters, and we would like to keep heal checks consistent with that.
For dirty paths we make a predictable determination of whether objects should be heal checked.
This means that objects at the bucket+prefix level have a 1 in 1024 chance of being picked for a heal check, since these objects will always be scanned when the path is dirty.
Objects at deeper levels have a 1 in 64 chance of being picked. Combined with the 1 in 16 chance of the folder being picked, this gives a 1 in 1024 chance that any given object is picked for a heal check.
For clean paths, there is now a 1 in 32 chance that a clean bucket will be picked up for a heal check. Objects at the bucket and first prefix levels have a 1 in 32 chance of being heal checked. This should still keep cycle times reasonable, as non-selected objects are completely skipped.
For deeper directories there is also a 1 in 32 chance they will be selected, but all files within such a subtree will be heal checked.
This allows us to do less traversal and keep cycle times reasonable.
Overview of probabilities
| Depth → | Bucket | Prefix 1 | Prefix 2+ | Notes |
|---|---|---|---|---|
| Folder crawl chance, clean | 1/32 | 1/1 | 1/32 | Overrides bloom filter |
| Folder crawl chance, dirty | 1/1 | 1/1 | 1/16 | Existing probabilities |
| Object heal chance, clean | 1/32 | 1/32 | 1/1 | |
| Object heal chance, dirty | 1/1024 | 1/1024 | 1/64 | |
| Combined chance | 1/1024 | 1/1024 | 1/1024 | Must be the same for dirty and clean, and for objects at all levels |
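The combined row can be sanity-checked by multiplying the 1-in-N denominators along each path; a quick check using the values from the table:

```go
package main

import "fmt"

// combined multiplies the denominators of 1-in-N probabilities along a
// path, giving the denominator of the combined selection probability.
func combined(denoms ...int) int {
	p := 1
	for _, d := range denoms {
		p *= d
	}
	return p
}

func main() {
	// Dirty, Prefix 2+: folder picked 1/16, object picked 1/64.
	fmt.Println(combined(16, 64)) // 1024
	// Clean, Bucket: bucket picked 1/32, object picked 1/32.
	fmt.Println(combined(32, 32)) // 1024
	// Clean, Prefix 2+: bucket 1/32, prefix 1 at 1/1, subtree 1/32, object 1/1.
	fmt.Println(combined(32, 1, 32, 1)) // 1024
}
```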
The actual selection will not be random, but based on the hash of the path.
If a parent path is dirty but the current one isn't, it will be picked 1 in 32 times, and objects at that level will be picked with the same probability.
Buckets/paths that change dirty/clean status between cycles will skew probabilities slightly for individual objects, but the overall rate will remain the same.
Paths going in and out of the bloom filter may put individual heal checks more than 1024 cycles apart, but on average the time between heal checks should remain the same.
No, it will be suboptimal @fwessels
First of all, its I/O should be dedicated to restoring data. Second, object listing from it will be useless/unreliable anyway, since it only has partial data. Even assuming read and write speeds are equal, with this disk restoring at 100% I/O, the average load on the other disks will only be (n+1)/(2n), where n is the number of data shards in a set, so they will be less loaded than the restoring disk and presumably faster.
Writes are usually slower; in that case the load on the other disks will be even lower compared to the disk being restored.
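To put numbers on the (n+1)/(2n) estimate (the function name is illustrative, and the shard counts below are just examples):

```go
package main

import "fmt"

// otherDiskLoad returns the average load on the non-healing disks while
// one disk restores at 100% I/O, per the (n+1)/(2n) estimate, where n is
// the number of data shards in a set.
func otherDiskLoad(n float64) float64 {
	return (n + 1) / (2 * n)
}

func main() {
	for _, n := range []float64{4, 8, 12} {
		fmt.Printf("n=%2.0f data shards: %.3f average load\n", n, otherDiskLoad(n))
	}
}
```

The load approaches 50% as n grows, so the larger the erasure set, the less the remaining disks are affected by one disk restoring.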