The main goal of this change is to make healing less resource intensive by doing healing checks continuously instead of running at "full" speed at fixed 30-day intervals.
Furthermore, we would like heal checks to be resumable, so an aborted heal check on a server restart doesn't mean healing has to start all over again or be postponed 30 days.
We would also like to utilize the crawling functionality to avoid the resource-intensive bucket listing of content when scanning.
Heal checks should be made at a reduced rate compared to usage scans. We target that objects are heal scanned once in every 1024 cycles.
This means that with an average scan cycle time of one hour, every object is checked roughly once every 43 days (1024 hours).
This means that some size checks would also do a heal check. Heal checks should have a blocking mode, so crawl speed can be throttled if healing is slow.
Heal checks necessarily access all shards, so they add significant load.
Since we now record the full path of elements up to 2 levels deep, we can precisely identify what has been removed since the last run on the first 2 object prefix levels.
This means that if we see files missing, we check the recorded set to see whether the file was deliberately removed. If it wasn't, we immediately queue healing for this item.
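A minimal sketch of that decision, assuming set membership lookups (the names `recordedPaths`, `removedPaths`, and `decideMissing` are hypothetical, not the actual implementation):

```go
package main

import "fmt"

// decideMissing reports whether a path that was recorded in the previous
// cycle but is now missing should be healed: if the path was not
// deliberately removed, its absence indicates lost data.
func decideMissing(recordedPaths, removedPaths map[string]bool, path string) bool {
	return recordedPaths[path] && !removedPaths[path]
}

func main() {
	recorded := map[string]bool{"bucket/prefix/obj": true}
	removed := map[string]bool{} // nothing was deliberately deleted
	if decideMissing(recorded, removed, "bucket/prefix/obj") {
		fmt.Println("queue heal for bucket/prefix/obj")
	}
}
```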
Full disk healing should remain as it is, meaning that when we detect a new drive and kick off full disk healing, we should not do additional heal checks.
Disks that are currently being healed should not be picked up for crawling, so we need to record that information so this can be avoided. This may be useful for other operations as well, so the disk is never picked for reading until healed.
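Recording that state could be as simple as an atomic flag per disk that crawling (and later, reads) consult; a sketch under that assumption, not the actual MinIO implementation:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// diskState tracks whether a disk is currently being healed. The crawler
// checks this flag so a partially restored disk is never selected.
type diskState struct {
	healing atomic.Bool
}

func (d *diskState) setHealing(on bool) { d.healing.Store(on) }
func (d *diskState) okForCrawl() bool   { return !d.healing.Load() }

func main() {
	var d diskState
	d.setHealing(true)
	fmt.Println("crawlable:", d.okForCrawl()) // crawlable: false
}
```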
This is the tricky part: we skip significant parts of the tree with bloom filters, and we would like to keep heal checks consistent with that.
For dirty paths we make a predictable determination of whether objects should be heal checked.
This means that objects at the bucket+prefix level have a 1 in 1024 chance of being picked for a heal check, since these objects will always be scanned when the path is dirty.
Objects at deeper levels have a 1 in 64 chance of being picked. Combined with the 1 in 16 chance of the folder being picked, this gives a 1 in 1024 chance that any given object is picked for a heal check.
For clean paths, there is now a 1 in 32 chance that a clean bucket will be picked up for a heal check. Objects at the bucket and first prefix levels have a 1 in 32 chance of being heal checked. This should still keep cycle times reasonable, as non-selected objects are completely skipped.
For deeper directories there is also a 1 in 32 chance they will be selected, but all files within such a subtree will be heal checked.
This allows us to do less traversal and keep cycle times reasonable.
Overview of probabilities
| Depth → | Bucket | Prefix 1 | Prefix 2+ | Notes |
|---|---|---|---|---|
| Folder crawl chance, clean | 1/32 | 1/1 | 1/32 | Overrides bloom filter |
| Folder crawl chance, dirty | 1/1 | 1/1 | 1/16 | Existing probabilities |
| Object heal chance, clean | 1/32 | 1/32 | 1/1 | |
| Object heal chance, dirty | 1/1024 | 1/1024 | 1/64 | |
| Combined chance | 1/1024 | 1/1024 | 1/1024 | Must be the same for dirty and clean, and for objects at all levels |
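The combined row can be sanity-checked by multiplying the 1-in-N denominators along each path; a quick check using the values from the table:

```go
package main

import "fmt"

// combined multiplies the denominators of 1-in-N probabilities along a
// path, giving the denominator of the combined selection probability.
func combined(denoms ...int) int {
	p := 1
	for _, d := range denoms {
		p *= d
	}
	return p
}

func main() {
	// Dirty, Prefix 2+: folder picked 1/16, object picked 1/64.
	fmt.Println(combined(16, 64)) // 1024
	// Clean, Bucket: bucket picked 1/32, object picked 1/32.
	fmt.Println(combined(32, 32)) // 1024
	// Clean, Prefix 2+: bucket 1/32, prefix 1 at 1/1, subtree 1/32, object 1/1.
	fmt.Println(combined(32, 1, 32, 1)) // 1024
}
```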
The actual selection will not be random, but based on the hash of the path.
If a parent path is dirty but the current one isn't, it will be picked 1 in 32 times, and objects at that level will be picked with the same probability.
Buckets/paths that change dirty/clean status between cycles will skew probabilities slightly for individual objects, but the overall rate will remain the same.
Paths going in and out of the bloom filter may put individual heal checks more than 1024 cycles apart, but on average the time between heal checks should remain the same.
No, it will be suboptimal @fwessels
First of all, its I/O should be dedicated to restoring data. Second, object listing from it will be useless/unreliable anyway, since it only has partial data. Even assuming read and write speeds are equal, with this disk restoring at 100% I/O, the average load on the other disks will only be (n+1)/(2n), where n is the number of data shards in a set, so they will be less loaded than the restoring disk and presumably faster.
Writes are usually slower; in that case the load on the other disks will be even lower compared to the disk being restored.
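To put numbers on the (n+1)/(2n) estimate (the function name is illustrative, and the shard counts below are just examples):

```go
package main

import "fmt"

// otherDiskLoad returns the average load on the non-healing disks while
// one disk restores at 100% I/O, per the (n+1)/(2n) estimate, where n is
// the number of data shards in a set.
func otherDiskLoad(n float64) float64 {
	return (n + 1) / (2 * n)
}

func main() {
	for _, n := range []float64{4, 8, 12} {
		fmt.Printf("n=%2.0f data shards: %.3f average load\n", n, otherDiskLoad(n))
	}
}
```

The load approaches 50% as n grows, so the larger the erasure set, the less the remaining disks are affected by one disk restoring.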