
@klauspost
Last active June 15, 2021 11:37
Listing v3

Listing revamp

After implementing the metacache for improved listings, we will investigate the pros and cons and look for possible modifications to the current listing code.

Some of the features of v2 have proved counterproductive (as is typical for a v2), and we should evaluate which modifications would make sense long-term.

Goal

  • Investigate possible improvements to the current listing code.
  • Investigate possible simplifications and their tradeoffs.

Background

While the cache appears to be working fine in most setups, there are some failure scenarios where MinIO doesn't react very well.

A) Huge recursive listings are requested, but abandoned after first results are returned.

Some clients will request listings even when they are only interested in one or a few entries.

The listing continues even if the client has stopped fetching results, wasting significant amounts of IO.

For non-recursive listings this is not a big problem except in huge folders, but for recursive listings this can lead to a lot of wasted IO.
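A minimal sketch of how the walk could honor client cancellation, assuming a hypothetical `walk` function that streams entries over a channel; once the request context is canceled, the scan stops instead of burning IO on the rest of the tree:

```go
package main

import (
	"context"
	"fmt"
)

// walk is a hypothetical lister that streams entries over a channel.
// It checks the context on every send, so an abandoned request stops
// the scan instead of walking the rest of the tree.
func walk(ctx context.Context, entries []string, out chan<- string) {
	defer close(out)
	for _, e := range entries {
		select {
		case out <- e:
		case <-ctx.Done():
			return // client went away: stop wasting IO
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan string)
	go walk(ctx, []string{"a", "b", "c", "d"}, out)

	// Simulate a client that reads one result and disconnects.
	fmt.Println(<-out)
	cancel()
}
```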

B) Requests do not do early filtering.

We optimistically build a cache for the top level of the request in case others would be interested. While this is sometimes true, it wastes significant resources.

Requests for bucket/prefix/file will result in a full scan of the prefix even if the client is really only interested in one result.

Only a limited number of folders would actually be interesting, and we can reduce IO in those cases.
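A sketch of that early filtering, using a hypothetical `shouldDescend` helper: the walker only recurses into directories that lie on the requested prefix path, rather than scanning the whole top level:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldDescend reports whether dir can contain entries matching prefix.
// Hypothetical helper: the walker calls this before recursing, so only
// directories on the prefix path are scanned.
func shouldDescend(dir, prefix string) bool {
	return strings.HasPrefix(prefix, dir) || strings.HasPrefix(dir, prefix)
}

func main() {
	// A request for bucket/prefix/file only descends into "prefix/".
	for _, dir := range []string{"other/", "prefix/", "prefix/sub/"} {
		fmt.Println(dir, shouldDescend(dir, "prefix/file"))
	}
}
```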

C) Listing results in disk IO

Stateful listings are meant to allow listings to be resumed quickly, and to enable resuming on any machine in the cluster.

Listings are stored in blocks of 5000 entries each, so in-progress listings can be resumed.

Listings are stored on each set and forwarded/combined as needed.

Blocks are written to all sets along with the listing information. Each block holds 5000 entries and is typically 300-400 KB.
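For illustration, a persisted block could look something like the sketch below; the type and field names are assumptions, not MinIO's actual on-disk format:

```go
package metacache

// blockEntries matches the proposal: 5000 entries per persisted block.
const blockEntries = 5000

// block is a sketch of a persisted listing block; type and field names
// are assumptions, not MinIO's actual on-disk format.
type block struct {
	ID   string // listing ID the block belongs to
	Num  int    // sequence number within the listing
	Data []byte // up to blockEntries serialized entries, typically 300-400 KB
	EOS  bool   // end of stream: set on the final block
}
```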

Proposal

X) Drop cache sharing.

  • Allows filtering more aggressively in the list builder: forward to the requested item and only return entries matching the prefix.
  • Drop bloom filters and do not consider them, since there is no sharing anymore.

When listing on disk, we will start at the specified offset instead of listing the entire directory that contains the requested entry.
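A sketch of seeking to that offset in a sorted directory listing instead of re-reading it from the top; `sort.SearchStrings` does the binary search:

```go
package main

import (
	"fmt"
	"sort"
)

// seekAfter returns the entries of a sorted directory listing that come
// after marker, so the scan starts at the requested offset instead of
// re-reading the directory from the top.
func seekAfter(sorted []string, marker string) []string {
	i := sort.SearchStrings(sorted, marker)
	if i < len(sorted) && sorted[i] == marker {
		i++ // the marker itself was already returned to the client
	}
	return sorted[i:]
}

func main() {
	entries := []string{"a", "b", "c", "d"}
	fmt.Println(seekAfter(entries, "b")) // [c d]
}
```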

X) No state saved if all results are returned.

If all results are returned nothing will be persisted to disk.
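A sketch of that rule, with hypothetical `persist` and `newListID` stand-ins: state is only written when the response is truncated, and a complete response returns no continuation token:

```go
package main

import "fmt"

// persist and newListID are hypothetical stand-ins for writing metacache
// blocks and generating a listing ID.
func persist(entries []string) {}
func newListID() string        { return "list-1" }

// finish applies the rule: a complete response saves nothing to disk and
// returns no continuation token; only truncated listings persist state.
func finish(entries []string, limit int) (resp []string, token string) {
	if len(entries) <= limit {
		return entries, "" // all results returned: nothing persisted
	}
	persist(entries[limit:]) // hypothetical: write the remainder as blocks
	return entries[:limit], newListID()
}

func main() {
	resp, token := finish([]string{"a", "b"}, 100)
	fmt.Println(resp, token == "") // [a b] true
}
```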

X) Keep merged list only.

Don't store individual set listings, but store the cluster-wide merged listing instead. The server initiating the listing (or continuing it, see below) is responsible for merging and storing the merged list.

This still allows listings to resume from any server and reduces read amplification on reads.
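A sketch of the cluster-wide merge: per-set listings arrive sorted, so the initiating server can combine them with a simple k-way merge, collapsing objects reported by more than one set:

```go
package main

import "fmt"

// mergeSorted merges per-set sorted listings into one cluster-wide list,
// a sketch of the merged listing the initiating server would persist.
// Duplicates (the same object reported by several sets) are collapsed.
func mergeSorted(lists ...[]string) []string {
	var out []string
	for {
		best := -1
		for i, l := range lists {
			if len(l) == 0 {
				continue
			}
			if best < 0 || l[0] < lists[best][0] {
				best = i
			}
		}
		if best < 0 {
			return out // all input lists drained
		}
		e := lists[best][0]
		lists[best] = lists[best][1:]
		if len(out) == 0 || out[len(out)-1] != e {
			out = append(out, e)
		}
	}
}

func main() {
	fmt.Println(mergeSorted(
		[]string{"a", "c", "e"},
		[]string{"b", "c", "d"},
	)) // [a b c d e]
}
```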

X) Stop/block listing

Continuing listings is relatively expensive, since some state has to be re-created to reach the listing position.

However, listings should stop when a given number of entries have been listed, say the limit of the first list request.

If the listing is continued, meaning we see the list ID as a continuation, we will resume the listing until it is done.

This will keep the listing going when a lot of objects have been requested.

Each block will contain information on how to resume listing.
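For illustration, the per-block resume information could look something like this; the field names are assumptions. With this embedded in every block, any server that reads the block can continue the listing:

```go
package metacache

// listResume is a sketch of the resume information each block could
// carry; field names are assumptions. Any server reading the block can
// continue the listing from LastEntry.
type listResume struct {
	ID        string // listing ID, doubles as the continuation token
	Bucket    string // bucket being listed
	Prefix    string // prefix filter for the listing
	LastEntry string // scan position: resume after this entry
	Done      bool   // set once the full listing has completed
}
```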

X) Cache manager simplifications

Since we will no longer be sharing caches, the only thing the manager needs to do is expire and delete outdated listings. It would be reasonable for the manager to control resuming a listing, but with the actual listing being done by the host receiving the request.

Lists can reasonably be removed after 30 minutes of inactivity.
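A minimal sketch of the simplified manager's expiry pass, assuming a hypothetical map of last-access times per listing ID:

```go
package main

import (
	"fmt"
	"time"
)

// expireAfter is the inactivity window from the proposal.
const expireAfter = 30 * time.Minute

// manager tracks last access per listing ID; a minimal sketch of the
// simplified cache manager, whose only remaining job is expiry.
type manager struct {
	lastAccess map[string]time.Time
}

// expire deletes listings that have been inactive for expireAfter.
func (m *manager) expire(now time.Time) {
	for id, t := range m.lastAccess {
		if now.Sub(t) > expireAfter {
			delete(m.lastAccess, id) // deleting during range is safe in Go
			fmt.Println("expired listing", id)
		}
	}
}

func main() {
	m := &manager{lastAccess: map[string]time.Time{
		"fresh": time.Now(),
		"stale": time.Now().Add(-time.Hour),
	}}
	m.expire(time.Now()) // expired listing stale
}
```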

Discussion

Stateful vs In-memory

Keeping stateful listings on disk allows unlimited continuation and re-requests of listings. This uses disk IOPS, but allows for clean operations.

An option would be to have listings in memory.

When servers restart, the in-memory data would be lost and all listings would have to resume, with the slight penalty of restoring that state. I am worried about the potential for a "stampeding herd" problem when servers restart, since listing requests would all have to restart at the same time.

Also, this doesn't scale too well, as memory is a finite resource, and likely a scarcer one than disk space.

Implementation

The proposal requires little in the way of new code; mainly it will be a reorganization and cleanup of the existing code.

The only "new" feature would be blocks have resume information within them, so anyone can pick up listing.
