crawling heal checks
@klauspost, last active July 17, 2020

Continuous Healing

Goal

The main goal of this change is to make healing less resource intensive by continuously performing healing checks instead of running them at fixed 30 day intervals at "full" speed.

Furthermore, we would like heal checks to be more resumable, so that a heal check aborted by a server restart doesn't mean healing has to start all over again or be postponed for 30 days.

We would also like to utilize the crawling functionality, avoiding the resource intensive bucket listing of content when scanning.

Design

Healing checks should be made at a reduced rate compared to usage scans. We target heal scanning each object once in every 1024 cycles.

This means that if the average scan cycle time is one hour, every object is heal checked roughly once every 43 days (1024 hours).

This means that some usage (size) scans will also perform a heal check. The heal check should block the crawl, so crawl speed can adjust automatically when healing is slow.

Heal checks of course need to access all shards, so they add significant load.
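
As a rough sketch of the blocking behavior (scanObject and healObject are illustrative names, not MinIO's actual API):

```go
package heal

import "context"

// healObject stands in for the real heal operation, which must read all
// shards of the object (hence the significant load).
func healObject(ctx context.Context, bucket, object string) error {
	// ... verify and restore all shards of the object ...
	return nil
}

// scanObject is one crawl step: usage statistics are always updated, while
// the heal check only runs for the selected subset of objects (see the
// probabilities section below) and blocks until done.
func scanObject(ctx context.Context, bucket, object string, selected bool) error {
	// ... update data usage statistics for the object ...
	if !selected {
		return nil
	}
	// Blocking call: the crawler waits for the heal check, so a slow heal
	// slows the crawl instead of queueing unbounded background work.
	return healObject(ctx, bucket, object)
}
```

Because the heal check is synchronous, a set under heavy heal load automatically crawls slower.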

Quick detection of deleted vs missing files.

Since we now record the full path of elements up to 2 levels deep, we can precisely identify what has been removed since the last run at the first 2 object prefix levels.

This means that if we see files missing on a drive, we check the recorded set to determine whether the file was deliberately removed. If it was not, we immediately heal the item.
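
Assuming the recorded paths are kept as simple sets, the decision could look roughly like this (all names and structures here are hypothetical):

```go
package heal

// classify tells apart "deleted" and "missing" entries. prev holds the
// paths recorded on the previous cycle, cur the union of paths seen across
// the set this cycle, and onDisk the paths present on one specific drive.
func classify(prev, cur, onDisk map[string]struct{}) (deleted, needHeal []string) {
	for name := range prev {
		if _, stillExists := cur[name]; !stillExists {
			// Gone from the whole set since last cycle: a removal.
			deleted = append(deleted, name)
			continue
		}
		if _, present := onDisk[name]; !present {
			// Still in the set but absent on this drive: heal immediately.
			needHeal = append(needHeal, name)
		}
	}
	return deleted, needHeal
}
```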

Full disk healing

Full disk healing should remain as it is, meaning that when we detect a new drive and kick off full disk healing, we should not do additional heal checks.

Disks that we are currently healing should not be picked up for crawling, so we need to record that information so they can be skipped; a sketch of this follows below. This may be useful for other operations as well, so the disk is never picked for reading until healed.
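
One possible shape for that record, sketched with a simple in-memory flag; a real implementation would also need to persist the state so it survives restarts:

```go
package heal

import "sync/atomic"

// disk carries a healing marker alongside whatever else identifies a drive.
type disk struct {
	path    string
	healing atomic.Bool // set while full disk healing is in progress
}

// disksForCrawl filters out drives that are mid-heal so the crawler (and,
// if desired, reads) never picks them until healing completes.
func disksForCrawl(disks []*disk) (ok []*disk) {
	for _, d := range disks {
		if !d.healing.Load() {
			ok = append(ok, d)
		}
	}
	return ok
}
```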

Healing and bloom filters?

This is the tricky part. We skip significant parts of the tree using bloom filters, and we would like to keep that behavior consistent.

For dirty paths we make a predictable (deterministic) determination of whether objects should be heal checked.

This means that objects at the bucket and first prefix level have a 1 in 1024 chance of being picked for a heal check, since these objects are always scanned when their path is dirty.

Objects at deeper levels have a 1 in 64 chance of being picked. Combined with the 1 in 16 chance of the folder being picked, this gives any given object a 1 in 1024 chance of a healing check.

For clean paths, there is now a 1 in 32 chance that a clean bucket will be picked for a heal check. Objects at the bucket and first prefix level have a 1 in 32 chance of being heal checked. This should still keep cycle times reasonable, as non-selected objects are skipped completely.

For deeper directories there is also a 1 in 32 chance of being selected, but all files within the selected subtree will be heal checked.

This allows for less traversal while keeping cycle times reasonable.

Overview of probabilities

| Depth → | Bucket | Prefix 1 | Prefix 2+ | Notes |
|---|---|---|---|---|
| Folder crawl chance, clean | 1/32 | 1/1 | 1/32 | Overrides bloom filter |
| Folder crawl chance, dirty | 1/1 | 1/1 | 1/16 | Existing probabilities |
| Object heal chance, clean | 1/32 | 1/32 | 1/1 | |
| Object heal chance, dirty | 1/1024 | 1/1024 | 1/64 | |
| Combined chance | 1/1024 | 1/1024 | 1/1024 | Must be the same for dirty and clean, and for objects at all levels |

The actual selection will not be random, but will be based on the hash of the path.
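
A sketch of what hash-based selection could look like, using the divisors from the table above (stdlib FNV stands in for whatever hash is actually used; all names are illustrative):

```go
package heal

import "hash/fnv"

// objectHealDivisor returns how rarely an object at the given depth is heal
// checked, per the table above; folder-level filtering happens separately.
func objectHealDivisor(depth int, dirty bool) uint64 {
	if dirty {
		if depth <= 1 {
			return 1024 // bucket and prefix-1 objects
		}
		return 64 // deeper objects; folders are already filtered 1/16
	}
	if depth <= 1 {
		return 32 // combined with the 1/32 clean bucket crawl: 1/1024
	}
	return 1 // clean deep subtree: folder picked 1/32, then heal all files
}

// pickForHeal selects an object when its path hash lands in the current
// cycle's slot.
func pickForHeal(path string, depth int, dirty bool, cycle uint64) bool {
	h := fnv.New64a()
	h.Write([]byte(path))
	d := objectHealDivisor(depth, dirty)
	return h.Sum64()%d == cycle%d
}
```

Note that with this scheme each object is picked exactly once every d cycles rather than 1 in d on average, so the interval between heal checks of a given object is fixed.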

If a parent path is dirty but the current one isn't, it will be picked 1 in 32 times, and objects at that level will be picked with the same probability.

Buckets/paths that change dirty/clean status between cycles will skew probabilities slightly for individual objects, but overall the probabilities remain the same.

Paths going in and out of the bloom filter may push heal checks further apart than 1024 cycles, but on average the time between heal checks should remain the same.

@fwessels

Looks pretty good, a couple of minor remarks:

> Disks that we are currently healing should not be picked up for crawling, so we need to record that information so they can be skipped. This may be useful for other operations as well, so the disk is never picked for reading until healed.

Not sure we need to keep track of disks being healed to exclude them from reading -- if a disk is half healed, it is OK to help out with reading again...

@klauspost (Author)

> if a disk is half healed, it is ok to help out with reading again...

No, it will be suboptimal @fwessels

First of all, its I/O should be dedicated to restoring data. Second, object listing from it will be useless/unreliable anyway, since it only has partial data. Even assuming read and write speeds are the same, with this disk restoring at 100% I/O, the average load on the other disks will only be (n+1)/(2n), where n is the number of data shards in a set; for example, with n = 8 the other disks average 9/16 ≈ 56% load. So they will be less loaded than the disk being restored and presumably faster.

Writes are usually slower; in that case the load on the other disks will be even lower compared to the disk being restored.

@fwessels

For sure it is suboptimal from a performance point of view.

Just not sure the additional code complexity outweighs the benefits for the rare event of swapping out a disk.

@klauspost (Author)

It is required for this; otherwise there would be 2 processes trying to heal the same data, so we will need to add the code anyway.

@abperiasamy

It is important to have one universal crawler for healing, ILM, accounting, and any other future needs. While the requirements differ between the healer and the accounting subsystem, it is possible to converge on a single crawler. It is important to have only one crawler scan at any given time. Different subsystems will register their callbacks and can choose to no-op. Some tasks are best left to MRF, when the list is known. Missing or partial objects are also detected by the quorum check. The healer takes care of clearing partial objects. The key idea is to separate the crawler and the subsystem code. The crawler should focus on efficient and smart scans. Subsystems will decide in their callbacks whether to take action.

Let's discuss more on the phone.
