The goal of this feature is to provide an object listing capability that scales to millions of objects.
The main change is that RAM usage should not grow with the number of objects beyond a fixed bound.
We want to share information about listings between requests, so that overlapping requests perform less duplicate work.
We want these caches to be available with as few lookups/IO as possible. Preferably a single lookup to determine if any path is cached.
This feature will only be available for distributed data.
We want to keep a per SET cache of metadata. When listing across multiple sets/zones this data is merged to provide the final output.
A list of available caches is kept per CLUSTER. This does not contain the scan data itself, only which data is available.
Each cache entry contains the following information:
- Unique ID of the scan (provided by the scan initiator).
- Bucket of the scan.
- Is this a recursive scan?
- Timestamp of scan start/end.
- Root of the scan, e.g. prefix/prefix2.
- Status of the scan: Running/Finished/Errored.
- Error, if any.
- Last time the cache was handed out.
- Last time a scanner sent an update.
- Bloom filter cycle when the scan started/ended.
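A minimal sketch of what such an entry could look like as a Go struct; the field names and types here are assumptions for illustration, not the final implementation:

```go
package listcache

import "time"

// scanStatus is the lifecycle state of a scan.
type scanStatus int

const (
	scanRunning scanStatus = iota
	scanFinished
	scanErrored
)

// metacacheEntry is the per-cluster bookkeeping record for one scan.
// It describes which cached data is available, not the data itself.
type metacacheEntry struct {
	ID          string     // unique ID provided by the scan initiator
	Bucket      string     // bucket of the scan
	Root        string     // root of the scan, e.g. prefix/prefix2
	Recursive   bool       // is this a recursive scan?
	Status      scanStatus // running/finished/errored
	Error       string     // error, if any
	Started     time.Time  // timestamp of scan start
	Ended       time.Time  // timestamp of scan end
	LastHandout time.Time  // last time the cache was handed out
	LastUpdate  time.Time  // last time a scanner sent an update
	StartCycle  uint64     // bloom filter cycle when the scan started
	EndCycle    uint64     // bloom filter cycle when the scan ended
}
```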
Scanning this list makes it quick to identify whether cached data may be available for listing a given bucket+prefix.
The bloom filter index will help identify if a rescan is required to get current data.
The unique ID will be returned for continuation tokens. If this token is present, the list will be used unconditionally, regardless of whether updates have been made. This allows any server in the cluster to fulfill the request.
When a new scan is started, an entry is added to the list. The list data is protected by locks.
Each SET will contain a cache for each scan with the collected list of METADATA for all objects in the given bucket/prefix in the set.
This will be stored as a sorted STREAM of object path (string) and raw content of xl.meta (binary). See Appendix A for size estimates.
This allows quickly scanning through the file to identify the interesting parts and decode only those.
All operations below the zone level will operate on this data type, so results sent back to callers will be in (string, metadata) format.
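A rough sketch of how such a stream could be written and scanned, assuming a simple length-prefixed layout (the actual serialization is not decided here); the point is that a reader can check the path and skip the metadata without decoding it:

```go
package listcache

import (
	"encoding/binary"
	"io"
)

// writeEntry appends one (object path, raw xl.meta) pair to the stream,
// using length prefixes so readers can skip metadata they do not need.
func writeEntry(w io.Writer, path string, meta []byte) error {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[:4], uint32(len(path)))
	binary.LittleEndian.PutUint32(hdr[4:], uint32(len(meta)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	if _, err := io.WriteString(w, path); err != nil {
		return err
	}
	_, err := w.Write(meta)
	return err
}

// readEntry returns the next pair. Callers can inspect the path and
// discard the metadata without decoding it when it is not interesting.
func readEntry(r io.Reader) (path string, meta []byte, err error) {
	var hdr [8]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return "", nil, err
	}
	p := make([]byte, binary.LittleEndian.Uint32(hdr[:4]))
	m := make([]byte, binary.LittleEndian.Uint32(hdr[4:]))
	if _, err = io.ReadFull(r, p); err != nil {
		return "", nil, err
	}
	if _, err = io.ReadFull(r, m); err != nil {
		return "", nil, err
	}
	return string(p), m, nil
}
```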
A cache is invalidated when the bloom cycle recorded in it is no longer being tracked.
This is the only rule that invalidates a cache. A cache is kept around until that condition is met, but it may be replaced with a more recent one.
We will keep caches around even if the root path is marked dirty in the bloom filter, since there may come a request for a path deeper than the root which may not be dirty.
The same path can be scanned while the previous result is being served.
Partial listing can be done 2 ways:
- Request files that are sorted after a given object name. This can be used for continuing list scans.
- Request all files that have a specified prefix only.
Each request will have a maximum number of entries returned. The caller can specify if the total should include or exclude versioned files that are marked as deleted.
In both cases the results can be produced with a single read through the list, filtering for the requested elements. Since the list is ordered per set, it is fast to skip forward to the interesting data and stop once a condition is met. This is a fully streaming operation and never requires more than the number of requested objects to be kept in memory.
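A sketch of that single pass, reusing the readEntry helper from the stream sketch above; the entry type and function name are placeholders:

```go
package listcache

import (
	"io"
	"strings"
)

// entry is one cached listing result: object path plus raw xl.meta.
type entry struct {
	Path string
	Meta []byte
}

// filterList does a single streaming pass over a sorted per-set cache,
// returning up to limit entries that sort after marker and match prefix.
// It never keeps more than limit entries in memory.
func filterList(r io.Reader, prefix, marker string, limit int) ([]entry, error) {
	out := make([]entry, 0, limit)
	for len(out) < limit {
		path, meta, err := readEntry(r) // see the stream sketch above
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if path <= marker {
			continue // fast-forward to the continuation point
		}
		if !strings.HasPrefix(path, prefix) {
			if path > prefix {
				break // sorted input: nothing later can match the prefix
			}
			continue
		}
		out = append(out, entry{Path: path, Meta: meta})
	}
	return out, nil
}
```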
For merging data from multiple sets/zones, the maximum number of objects is requested from each, and a simple merge sort provides a consistent listing without ever holding more than 2x the number of requested objects in memory.
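A sketch of such a merge over the per-set results (each already sorted and capped at the requested limit), reusing the entry type from above. A fully streaming variant would pull incrementally from each set instead of taking complete slices:

```go
package listcache

// mergeSets merges per-set results that are each already sorted by path
// and capped at the requested limit, producing at most limit entries.
// Duplicate paths keep the first occurrence here; zone-level conflict
// resolution (newest entry wins) is a separate step.
func mergeSets(sets [][]entry, limit int) []entry {
	out := make([]entry, 0, limit)
	idx := make([]int, len(sets))
	for len(out) < limit {
		best := -1
		for i, s := range sets {
			if idx[i] >= len(s) {
				continue
			}
			if best == -1 || s[idx[i]].Path < sets[best][idx[best]].Path {
				best = i
			}
		}
		if best == -1 {
			break // all inputs exhausted
		}
		next := sets[best][idx[best]]
		idx[best]++
		if len(out) > 0 && out[len(out)-1].Path == next.Path {
			continue // skip duplicate of the same object from another set
		}
		out = append(out, next)
	}
	return out
}
```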
The handler for all zones will manage initiating scans and will handle results and keep the status updated. If the client disconnects a scan will continue running until done.
The scans do not distinguish between versioned/non-versioned requests. Version information is always stored in the cache and can optionally be discarded by the caller.
The scan for each set will be similar to how listing is done now, except that results are collected on disk instead of being kept in memory and forwarded. Results are stored on disk in the order they are found.
Once the (unordered) results have been written to disk, they are read back in chunks of at most 10,000 objects, sorted, and stored. If this results in more than one list, the lists are merged into a single final list. This keeps the total memory footprint at no more than 10,000 objects for any bucket size.
At no point will all objects be required to be in memory.
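A sketch of the chunked sort, reusing the readEntry helper and entry type from above; writeRun is a hypothetical callback that persists one sorted run, and the runs would then be merged into the final list (for example with a merge like the one sketched earlier):

```go
package listcache

import (
	"io"
	"sort"
)

const sortChunkSize = 10000 // max objects held in memory while sorting

// sortRuns reads back unordered scan results, sorts them in chunks of at
// most sortChunkSize entries and hands each sorted chunk to writeRun.
// writeRun must copy the slice if it retains it, since the buffer is
// reused. Memory use stays bounded by the chunk size for any bucket size.
func sortRuns(r io.Reader, writeRun func([]entry) error) error {
	chunk := make([]entry, 0, sortChunkSize)
	flush := func() error {
		if len(chunk) == 0 {
			return nil
		}
		sort.Slice(chunk, func(i, j int) bool { return chunk[i].Path < chunk[j].Path })
		if err := writeRun(chunk); err != nil {
			return err
		}
		chunk = chunk[:0]
		return nil
	}
	for {
		path, meta, err := readEntry(r)
		if err == io.EOF {
			return flush()
		}
		if err != nil {
			return err
		}
		chunk = append(chunk, entry{Path: path, Meta: meta})
		if len(chunk) == sortChunkSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
}
```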
This should offer the same level of performance as the existing ListObjects for cold requests. We need to crawl the folder structure, so not much will change there. Resuming a list will be similar to now, except that data should not be evicted. A bloom cycle is at least 5 minutes and the bloom filter tracks at least 16 cycles, so 16x5 = 80 minutes is the earliest data will be evicted.
The only other way specific data could be evicted would be a new scan being initiated on the same path, replacing the old scan. In that case I propose we simply switch to returning data from the new scan, so changes from after the first scan will become visible. This is, as far as I can tell, the same behaviour as we have currently.
There are a few choices we have to make to trade off consistency vs cache hits. Let's set up a scenario:
Path bucket/prefix is listed by client A. We have no cache. When we start, the bloom filter for bucket/prefix is dirty. Path bucket/prefix is then requested by client B.
Since the bloom filter is dirty we have no information on whether changes have occurred between client A and B requesting.
In which (if any) cases should the request from client B use the cache from the previous request?
- If the listing for client A hasn't finished generating yet?
- Never? Will ensure that client B will get all events that happened before the request was sent.
- Always? As long as we are in the same bloom cycle as when the listing from client A started? This will give quite a long time-to-consistency but obviously a much better cache hit ratio.
Having full "list after update" consistency will in many cases make it impossible to reuse cached data across requests unless the bloom filter is perfectly clean for that path. So doing 5 ListObjects
on the same path will start 5 scans. Allowing for re-using data from in-progress scans will reduce that scenario.
A "Metadata change list" as discussed below would enable full consistency and using the cache, but that may be a bit too complex for initial implementation. However it could be added in the future.
When crawling a set of 4-16 disks we have the option to crawl less than all disks and merge the results.
For the initial implementation we will crawl 3 randomly chosen disks that are not in a healing state.
We have the option to query more disks, use only the fastest results and cancel the rest. This has the downside that we would favor disks with fewer entries, so we may be amplifying bad results.
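A sketch of the disk selection, assuming healing state is exposed per disk; the disk type and field names are placeholders:

```go
package listcache

import "math/rand"

// disk is a minimal stand-in for a set member; Healing is an assumption
// about how healing state would be exposed.
type disk struct {
	ID      string
	Healing bool
}

// pickListingDisks returns up to want randomly chosen disks from the set
// that are not currently in a healing state.
func pickListingDisks(set []disk, want int) []disk {
	candidates := make([]disk, 0, len(set))
	for _, d := range set {
		if !d.Healing {
			candidates = append(candidates, d)
		}
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if len(candidates) > want {
		candidates = candidates[:want]
	}
	return candidates
}
```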
It is expected there will be conflicting information returned by drives. This will mainly be caused by each walker getting to a specific entry at different times. Other causes are unsynchronized changes requiring healing.
We don't have much information on directories. Directories are therefore determined by a 50% vote among the queried drives, meaning at least 50% should have the directory.
If any objects are returned from within a directory, it is always added (this should be determinable by inspecting the previous entry and deferring the decision until the next element has been resolved). This will only work for recursive listings.
Disagreement is resolved in the following order:
- If there is disagreement, but the mod time of the latest version is later than the query start time, always include the latest.
- If the object is versioned, check whether one is a later version of the other, and choose that one.
- If some disks report that an object exists and others do not, resolve it by requesting versioned FileInfo from the set and assume that to be the truth. Generate xlMetaV2 from that, or load it.
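A sketch of that resolution order; the drivesEntry fields are assumptions about what could be read from each drive's report, and the versioned comparison (second rule) is only noted since it needs the decoded version history:

```go
package listcache

import "time"

// drivesEntry is a drive-level view of one object as reported during a
// walk; the fields are assumptions about what xl.meta exposes.
type drivesEntry struct {
	Exists    bool
	ModTime   time.Time // mod time of the latest version
	VersionID string
}

// resolveEntry applies the resolution order for one object name that the
// queried drives disagree on. include means keep the latest report as-is;
// askSet means request versioned FileInfo from the set and trust that.
func resolveEntry(reports []drivesEntry, queryStart time.Time) (include bool, askSet bool) {
	var latest drivesEntry
	exists, missing := 0, 0
	for _, r := range reports {
		if !r.Exists {
			missing++
			continue
		}
		exists++
		if r.ModTime.After(latest.ModTime) {
			latest = r
		}
	}
	switch {
	// 1. Mod time of the latest version is after the query start:
	//    always include the latest.
	case exists > 0 && latest.ModTime.After(queryStart):
		return true, false
	// 2. All drives have the object: for versioned objects, checking
	//    whether one report is simply a later version of another would
	//    go here (needs the decoded version history, omitted).
	case exists > 0 && missing == 0:
		return true, false
	// 3. Some disks report the object, others do not: request versioned
	//    FileInfo from the set and treat that as the truth.
	case exists > 0 && missing > 0:
		return false, true
	default:
		return false, false
	}
}
```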
The only conflicts that can occur at a zone level will be objects that exist in two different zones. These conflicts are resolved by selecting the newest entry (and maybe logging an error).
As described the cache state needs to be retained per bucket.
However, considering that it should now also be updated during a listing, I think I will have to revise my plan of implementing it through pure LOCKING and a cluster-wide object per bucket.
We need a system that can maintain a per bucket cache state in memory and only occasionally persist the state to disk. The operations will be quick and be an authority on the current state. Thankfully the operations are neither CPU nor memory intensive, since it just retains an index.
What seems to make sense is to use a deterministic host (peerRESTClient) based on the bucket and have each host be 'responsible' for maintaining a number of buckets. It is important to note that, contrary to the existing implementation, it will only maintain the state. The listing itself is still handled by each host; this server will only coordinate cache requests and updates.
This means that this host will be asked to check whether a specific path in a bucket has cached information, and to retain information about the caches currently being built and their completion state.
This should allow us to scalably and reliably share information about the caches across requests without having individual requests fight over locks.
If a server is unavailable, we will have to assume no cache. It should of course be very rare that a server is completely down for a longer time.
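A sketch of the deterministic host selection, assuming a stable, ordered list of cluster hosts; the function name is a placeholder:

```go
package listcache

import "hash/crc32"

// bucketHost deterministically maps a bucket to the host that coordinates
// its listing-cache state, assuming a stable, ordered list of cluster
// hosts. If that host is unavailable, the caller assumes no cache exists.
func bucketHost(bucket string, hosts []string) string {
	if len(hosts) == 0 {
		return ""
	}
	idx := crc32.ChecksumIEEE([]byte(bucket)) % uint32(len(hosts))
	return hosts[idx]
}
```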
Reading from cache:
- The request comes in with a given ID. This is centrally controlled.
- We attempt to load the file metadata for the ID.
- If it exists we use the metadata to determine which part we should start at.
- If not done yet, wait and retry.
- Load the actual data and filter out the results we want.
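A sketch of that flow; loadPartMeta and loadPart are hypothetical helpers for reading the cache metadata and its parts, and the retry loop covers listings that are still being built:

```go
package listcache

import (
	"context"
	"errors"
	"time"
)

// errNotReady signals that the metadata or part has not been written yet.
var errNotReady = errors.New("cache part not ready")

// serveFromCache answers a continuation request for a given cache ID.
// loadPartMeta uses the cache metadata to find which part the marker
// starts in; loadPart loads that part so the caller can filter it
// (e.g. with filterList above). While the listing is still being built
// we wait and retry.
func serveFromCache(ctx context.Context, id, marker string,
	loadPartMeta func(id, marker string) (part int, err error),
	loadPart func(id string, part int) ([]entry, error)) ([]entry, error) {

	for {
		part, err := loadPartMeta(id, marker)
		if err == nil {
			entries, err := loadPart(id, part)
			if err == nil {
				return entries, nil
			}
			if !errors.Is(err, errNotReady) {
				return nil, err
			}
		} else if !errors.Is(err, errNotReady) {
			return nil, err
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(100 * time.Millisecond): // not done yet: wait and retry
		}
	}
}
```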
Building cache:
- Start listing.
- When enough entries have been listed, return results to the initiating request at once.
- When 5000 entries have been listed save them as the first part.
- For each 5000 entries create a new object (id-partnum)
- Update the metadata of the first object with the name of the first+last entry.
- Continue until the listing is complete. Rather simple, but it requires some 'plumbing', and error handling is tricky. I think on errors the parts are just deleted, but maybe we want to communicate it upstream.
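A sketch of the part writing; savePart and updateFirstMeta are hypothetical helpers standing in for the object writes described in the list above:

```go
package listcache

// partSize is the number of entries stored per cache part object.
const partSize = 5000

// buildCache consumes a stream of listed entries and persists them in
// parts of partSize objects (id-partnum). savePart stores one part;
// updateFirstMeta records the first and last entry of each part on the
// first object's metadata so readers can locate the part a continuation
// marker falls into.
func buildCache(in <-chan entry,
	savePart func(partNum int, entries []entry) error,
	updateFirstMeta func(partNum int, first, last string) error) error {

	part := make([]entry, 0, partSize)
	partNum := 0
	flush := func() error {
		if len(part) == 0 {
			return nil
		}
		if err := savePart(partNum, part); err != nil {
			return err
		}
		if err := updateFirstMeta(partNum, part[0].Path, part[len(part)-1].Path); err != nil {
			return err
		}
		partNum++
		part = part[:0]
		return nil
	}
	for e := range in {
		part = append(part, e)
		if len(part) == partSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	return flush()
}
```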
It would be possible to track metadata updates and deletes performed within a bloom cycle and store these as append-only streams for indexed paths.
This would make it possible to reconstruct current state by reading the existing cache and merge in changes to objects from the cache bloom cycle until now.
This design will be easy to extend to return objects with specific properties and allow quick metadata filtering, effectively allowing database-like operations.
The crawler, by design, only checks one disk when crawling and therefore doesn't know enough about the disk state to provide information reliable enough for listing without having to double check data. This defeats the purpose of the cache.
The MinIO server is by design not made to contain a persistent database of objects. We will not reconsider this design choice, since a persistent database would bring its own downsides.
Did a small test. I utilized the crawler just to collect metadata to get an idea of sizes.
It is stored as [string(objectpath), []byte(metadata-as-stored-on-disk), string(objectpath2), ...].
That should allow us to quickly scan object names and only decode what we are interested in, and everything is fully streaming. I made it so all objects below a given level are combined into one file; here it means that all files with the bucket/go_113/src/ prefix are in a single file.
I added S2 compression and it reduces the total by ~50% which is fine since it is basically free. Encryption is enabled, so metadata is about 900 bytes per file due to keys. Versions should compress even better. zstandard compressed size is also included, but uses more CPU.
go/src folder, 6,119 Files, 729 Folders
-rwxrwxrwx 1 klaus klaus 5708590 Aug 27 16:37 .prefix.meta
-rwxrwxrwx 1 klaus klaus 2304935 Aug 27 16:37 .prefix.meta.s2
-rwxrwxrwx 1 klaus klaus 1719831 Aug 27 16:37 .prefix.meta.zst