After implementing the metacache for improved listings, we will investigate the pros and cons and look for possible modifications to the current listing code.
Some of the features of v2 have proved counterproductive (as is typical for a v2) and we should evaluate which modifications would make sense long-term.
- Investigate possible improvements to the current listing code.
- Investigate possible simplifications and their tradeoffs.
While the cache appears to be working fine in most setups, there are some failure scenarios where MinIO doesn't react very well.
Some clients will request listings even when they are only interested in one or a few entries.
The listing continues even if the client has stopped fetching results, wasting significant amounts of IO.
For non-recursive listings this is not a big problem except in huge folders, but for recursive listings this can lead to a lot of wasted IO.
Requests do not do early filtering. We optimistically build a cache for the top level of the request in case other clients are interested. While this is sometimes true, it wastes significant resources.
A request for bucket/prefix/file
will result in a full scan of the prefix, even if the client is really only interested in one result.
Only a limited number of folders is actually of interest, and we can reduce IO in those cases.
Stateful listings are meant to allow listings to resume quickly, and to enable resuming on any machine in the cluster.
Listings are stored in blocks of 5000 entries per block, so listings can resume from in-progress listings.
Listings are stored on each set and forwarded/combined as needed.
Blocks are written to all sets along with the listing information. Blocks hold 5K entries and are typically 300-400KB each.
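As a sketch of what such a block might carry, and how a listing could be chopped into blocks (types, field names, and the configurable block size are hypothetical, not the actual metacache structures):

```go
package main

import "fmt"

// listBlock holds one block of a persisted listing (5000 entries per
// block in this design) plus what is needed to resume after it.
// Hypothetical sketch, not the actual metacache types.
type listBlock struct {
	ID          string   // listing ID shared by all blocks
	N           int      // block sequence number
	Entries     []string // sorted object names
	ResumeAfter string   // the walk restarts after this entry
	EOS         bool     // set on the final block: end of stream
}

// splitIntoBlocks chops a sorted listing into blocks of blockSize
// entries, recording resume information on each.
func splitIntoBlocks(id string, entries []string, blockSize int) []listBlock {
	var blocks []listBlock
	for i := 0; i < len(entries); i += blockSize {
		end := i + blockSize
		if end > len(entries) {
			end = len(entries)
		}
		blocks = append(blocks, listBlock{
			ID:          id,
			N:           len(blocks),
			Entries:     entries[i:end],
			ResumeAfter: entries[end-1],
			EOS:         end == len(entries),
		})
	}
	return blocks
}

func main() {
	for _, b := range splitIntoBlocks("list-1", []string{"a", "b", "c", "d", "e"}, 2) {
		fmt.Printf("block %d: %v resume after %q eos=%v\n", b.N, b.Entries, b.ResumeAfter, b.EOS)
	}
}
```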
- Allows filtering more aggressively in the list builder: forward to the item and only return entries matching the prefix.
- Drop bloom filters and no longer consider them, since there is no sharing any more.
When listing on disk we will start at the specified offset instead of listing the entire directory that contains the requested entry.
If all results are returned nothing will be persisted to disk.
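The "forward to the item, only return matching prefix" behavior over one sorted directory listing could look roughly like this (a minimal in-memory sketch; the function name and shape are hypothetical, and a marker is treated as exclusive, S3-style):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// filterForward returns the sorted entries strictly after marker that
// match prefix, stopping as soon as the sorted input can no longer
// match. Hypothetical sketch of early filtering in the list builder.
func filterForward(entries []string, prefix, marker string) []string {
	// Seek directly to the marker instead of scanning from the start.
	i := sort.SearchStrings(entries, marker)
	if i < len(entries) && entries[i] == marker {
		i++ // the marker itself was already returned
	}
	var out []string
	for ; i < len(entries); i++ {
		e := entries[i]
		if !strings.HasPrefix(e, prefix) {
			if e > prefix {
				break // sorted input: nothing later can match
			}
			continue
		}
		out = append(out, e)
	}
	return out
}

func main() {
	fmt.Println(filterForward([]string{"a/1", "a/2", "a/3", "b/1"}, "a/", "a/1"))
}
```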
Don't store individual set listings, but store the cluster-wide merged listing instead. The server initiating the listing (or continuing it, see below) is responsible for merging and storing the merged list.
This still allows listings to resume from any server and reduces read amplification on reads.
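The merge itself is an ordinary k-way merge of the per-set sorted streams. A minimal in-memory sketch follows; the real code would merge streaming results, and the function name is hypothetical:

```go
package main

import "fmt"

// mergeSets merges the sorted listings from each erasure set into one
// cluster-wide sorted listing, dropping duplicate names. Hypothetical
// sketch of storing the merged listing instead of per-set listings.
func mergeSets(sets ...[]string) []string {
	idx := make([]int, len(sets))
	var merged []string
	for {
		best := -1
		for s, i := range idx {
			if i >= len(sets[s]) {
				continue // this set is exhausted
			}
			if best == -1 || sets[s][i] < sets[best][idx[best]] {
				best = s
			}
		}
		if best == -1 {
			return merged // all sets exhausted
		}
		e := sets[best][idx[best]]
		idx[best]++
		// The same name can appear on several sets; keep one copy.
		if len(merged) == 0 || merged[len(merged)-1] != e {
			merged = append(merged, e)
		}
	}
}

func main() {
	fmt.Println(mergeSets([]string{"a", "c"}, []string{"b", "c", "d"}))
}
```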
Continuing listings is relatively expensive, since some state has to be re-created to reach the listing position.
However, listings should stop when a given number of entries have been listed, say the limit of the first list request.
If the listing is continued, meaning we see the list ID as a continuation, we will resume the listing until it is done.
This will keep the listing going when a lot of objects have been requested.
Each block will contain information on how to resume listing.
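One way a continuation could use that resume information: check whether the marker falls inside an already-persisted block, and otherwise restart the walk after the last persisted entry. All types and names below are hypothetical sketches under that assumption.

```go
package main

import "fmt"

// listBlock is a hypothetical persisted listing block with resume info.
type listBlock struct {
	Entries     []string
	ResumeAfter string // the walk restarts after this entry
	EOS         bool   // set on the final block: end of stream
}

// resumePoint reports where a continuation with the given marker can
// pick up: either inside an already persisted block (needWalk=false),
// or by restarting the walk after restartAfter (needWalk=true).
func resumePoint(blocks []listBlock, marker string) (blockIdx int, restartAfter string, needWalk bool) {
	for i, b := range blocks {
		if n := len(b.Entries); n > 0 && marker < b.Entries[n-1] {
			return i, "", false // the marker falls inside this block
		}
	}
	if len(blocks) == 0 {
		return 0, marker, true // nothing persisted: walk from the marker
	}
	last := blocks[len(blocks)-1]
	return len(blocks), last.ResumeAfter, !last.EOS
}

func main() {
	blocks := []listBlock{
		{Entries: []string{"a", "b"}, ResumeAfter: "b"},
		{Entries: []string{"c", "d"}, ResumeAfter: "d"},
	}
	fmt.Println(resumePoint(blocks, "c"))
	fmt.Println(resumePoint(blocks, "d"))
}
```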
Since we no longer will be sharing caches, the only thing needed by the manager is to expire and delete outdated listings. It would be reasonable for the manager to control resuming a listing, but with the actual listing being done by the host receiving the request.
Lists can reasonably be removed after 30 minutes of inactivity.
Keeping stateful listings on disk allows unlimited continuation and re-requests of listings. This uses disk IOPS, but allows for clean operations.
An alternative would be to keep listings in memory.
When servers restart, the in-memory data would be lost and all listings would have to resume, with the slight penalty of restoring that state. I worry about a potential "stampeding herd" problem when servers restart, since all listing requests would have to restart at the same time.
Also, this doesn't scale well, since memory is a finite resource, and likely a scarcer one than disk space.
The proposal requires little new code; mainly it will be a reorganization and cleanup of the existing code.
The only "new" feature would be that blocks carry resume information, so any server can pick up a listing.