@antirez
Last active September 9, 2019 09:00
New keyspace representation and expire algorithm of Redis: preliminary analysis

Preliminary analysis of an alternative implementation of Redis expires

Background

A couple of months ago I started experimenting with replacing two key parts of the Redis internals:

  1. The in memory representation of the keyspace.
  2. The algorithm used by Redis in order to expire keys.

The first task is actually needed in order to perform the second, but in general the goal of the experiment was to see whether it was possible to pay off some technical debt related to the way the expire algorithm worked, or whether what we had in place was already optimal: I started the work knowing in advance that I risked having to throw it all away if it showed severe regressions. However, sometimes to advance a system you have to go forward and try to improve it in some bold way, even if what is in place looks reasonable.

What was such technical debt, or, to use a different wording, what were the shortcomings of the expire algorithm and the old representation of the key space?

The old algorithm used random sampling of the hash table representing the keys inside a Redis database, in order to find already expired ones. So the algorithm was not precise: keys could be removed in an order different from their logical expires. This was not a huge deal, but it became very noticeable once we implemented, years ago, a notification (via Pub/Sub) for each expired key. Another problem was that it was very hard to expire certain subsets of keys quickly enough, especially when the key space contains a mix of keys that are going to expire very soon and a much larger set of keys that will expire after a long time. In such a case random sampling is very costly: we keep hitting keys that are not expired, and a lot of sampling is needed to find the few that instead need to be evicted because they are expired.

Moreover, my suspicion was that the random sampling process was also very costly from the point of view of the CPU. Apparently this was not so true, because the new implementation does not appear to be much more efficient so far. But this idea was definitely one of the motivations.

On the other hand, the new implementation of the expire algorithm described in the next section also required the key object itself to store the expire time, which was previously stored in a separate hash table. This was required because the new implementation uses a data structure sorted by expire time to represent the expires. In order to look up a key in the tree of expires, we need to know its exact expire time. But with the old representation of keys there was no place to store such an expire: the key in the hash table was just an “SDS” string. So it was also necessary to substitute the keys in the hash table with “key objects”, C structures with a few fields.

And, in turn, this was another long desired thing: finally we could move things like the LRU/LFU metadata where they belong, in the key, and not in the associated value object. When maxmemory with an eviction policy is enabled, Redis cannot even use the small shared integers as values (a memory optimization), because otherwise all the shared objects would have the same LRU/LFU fields in common among many keys. Not just that: finally we also had a few spare flags, both in the key objects and in the value objects, that could be used in many future improvements of Redis.

Implementation details

The new algorithm is conceptually very simple. The key object stored in the hash table representing the keys is now the following C structure:

#define KEY_FLAG_EXPIRE (1<<0)
typedef struct redisKey {
    uint32_t len;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    unsigned flags:KEY_FLAGS_BITS;
    uint64_t expire;
    char name[];
} rkey;

So when a key has an expire, it is flagged with KEY_FLAG_EXPIRE, and the expire time, as milliseconds unix time, is set in the “expire” field. At the same time, the same key is also inserted into a radix tree, in the following format:

<expire time in big endian byte order> + <key object pointer>

Because the prefix is the expire time in big endian byte order, the radix tree, which is lexicographically sorted, also serves as a list of keys sorted by expire time. To expire keys there is no longer any need to perform random sampling: just seek the radix tree at its start, and, while the current time is greater than the expire time of the first key, expire the key and continue to the next.

Such an implementation is conceptually a big step forward compared to the old one, but software is the land of compromises, and unfortunately this vanilla implementation also tends to use more memory, has regressions in the loading times of RDB files, and potentially other speed regressions yet to be discovered. See below for more details.

Status of the implementation

The implementation is definitely just alpha quality: given the possibility that the code would have to be thrown away, I had no interest in finishing the last 5% of details. However what is implemented is written in a very clean way, like final production code (but of course it has bugs). I didn’t even try to run the Redis test suite against this implementation; it’s definitely not at that level of quality. It is the MVP needed to evaluate the approach.

For this reason, before validating my findings below, I will check more carefully that the implementation is sane enough for the results to be considered valid. For instance, it should not have memory leaks, and valgrind should not complain when performing obvious operations.

Evaluation of the implementation

I ran a few benchmarks comparing the new and old implementation, finding a few very concerning regressions:

Memory usage:

5M keys without expires: before: 444 MB after: 673 MB (+50% regression)

5M keys with expires: before: 548 MB after: 806 MB (+50% regression)

As you can see the memory usage regression is unacceptable; however, there are potentially simple fixes for this problem. See the next sections.

RDB loading and saving performance:

RDB saving of 5000000 keys without expires: before: 3.93 seconds, after: 3.70 seconds (very small improvement)

RDB loading of 5000000 keys with expires: before: 6.5 seconds, after: 10.5 seconds (huge regression)

RDB saving of 5000000 keys with expires: before: 4.9 seconds, after: 4.0 seconds (~20% improvement)

RDB loading of 5000000 keys without expires: before: 4.7 seconds, after: 4.7 seconds (no change)

The RDB loading time regression with expires is the most worrying: profiling the code shows that the additional time is spent populating the radix tree, in the raxInsert() function. It is probably more costly than populating the hash table of expires for a few reasons, even if the radix tree is normally actually faster than the hash table:

  1. During the loading step we resize the expires hash table immediately to the right size, so in this specific task the hash table is faster than usual.
  2. The old implementation shares the key string between the main dictionary and the expires dictionary; this probably saves time as well: no allocation/deallocation is needed in the expires hash table, and there are fewer cache misses.
  3. The radix tree speed may depend greatly on the exact workload: unix times in milliseconds, having many common prefixes, will continuously split nodes, while the performance characteristics of the hash table are a lot more stable. To be honest this is probably also very dependent on the exact RDB file I was loading, where key expire times were just a few milliseconds away from each other, likely causing even more node splitting. Yet the regression is quite severe, so it is likely to be there in general.

Key access and expire performance

The new version is apparently a bit faster; for instance, the following benchmark:

redis-benchmark -P 32 -r 5000000 -n 10000000 get __rand_int__

reports 571k ops/sec, versus the old 520k ops/sec.

This difference is more marked if we access the key TTL, which now lives inside the key and does not require an external hash table lookup: substituting the TTL command for the GET command in the above benchmark moves performance from 500k ops/sec in the old version to 600k ops/sec in the new one.

Initial benchmarks don’t show any improvement or regression in the overall Redis speed in the presence of many expires that are both created and evicted. I ran the following test, which should show an obvious difference if there were any, but found none:

redis-benchmark -P 32 -n 1000000000 -r 10000000 SET __rand_int__ foo EX 1

This benchmark sets a number of keys with an expire of 1 second, so there is constantly a flow of expire events happening while we populate the database. However, both implementations report the same IOPS for this test; I guess that the time we lose in setExpire() is compensated by the faster expire cycle.

Techniques to reduce the memory impact

The speed regression may not be trivially solvable, unless we spend some effort to speed up the radix tree implementation itself, which may indeed be possible. However, the memory regression could be strongly mitigated in the following way. Let’s recall the current redisKey structure for a moment:

#define KEY_FLAG_EXPIRE (1<<0)
typedef struct redisKey {
    uint32_t len;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    unsigned flags:KEY_FLAGS_BITS;
    uint64_t expire;
    char name[];
} rkey;

When no expire is set, we are wasting the “expire” field, which is 8 bytes. We could instead allocate a modified version of the structure when there is no expire. Moreover, the 32-bit key length field is often a big waste. So the structure could be redefined as follows:

#define KEY_FLAG_EXPIRE (1<<0)
typedef struct redisKey {
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    unsigned flags:KEY_FLAGS_BITS;
    char sds_name[];
} rkey;

Basically the key name is now stored as an SDS string, which has a variable length header (or, if you prefer, could be stored in an “SDS alike” way). If we really want to have SDS, we can just use an accessor function to compute the offset of the SDS pointer inside the key allocation. Moreover, the expire could be stored at the end of everything, as an additional 8 bytes, only when KEY_FLAG_EXPIRE is set. This however will make the implementation more complex, and will require reallocating the key object when the expire is removed from, or set on, an existing key.

An alternative could be to accept the memory regression, but win memory back in some other way. For instance, when the key has no expire set, the 8 bytes of the expire field could be used to store the value of the key directly, which may translate to a huge memory reduction, not having to allocate a robj for the value itself. In this case we could flag the key with two special flags:

  • KEY_FLAG_INTVAL
  • KEY_FLAG_STRVAL

to signal that our key has no associated value object, since the expire field itself holds an integer or a short string.

Alternative implementations

An alternative to all this is to retain the old implementation, and instead use this exact radix tree technique just to augment the old expire cycle. Basically we could take a capped radix tree with, for instance, 1% of the size of the current number of expires.

Every time we run the expire cycle, if a sampled key has an expire that is nearer in time than the oldest expire currently in the tree, we drop the old one and insert the new key into the radix tree. We don’t bother at all trying to update the radix tree when keys are modified; we use the tree just as an optimization. Later, when we call the expire cycle, instead of doing just random sampling, we also scan the tree. Such an optimization may do two things:

  1. Improve the precision, but without any guarantee about the expire order.
  2. Avoid the problem we have right now, where up to 25% of logically expired keys can continue to use memory, because we are able to expire keys that are near to expire much more efficiently. The exit condition of the loop would no longer be to stop when we can’t find at least 3 expired keys every 4 sampled, but to stop only when that condition is true and, at the same time, the tree of keys to expire is empty.
@tuananh

tuananh commented Jun 20, 2019

Even if you use it just for optimization, the RDB loading is still slower, because it still has to do raxInsert(), right?

@d4nshields

d4nshields commented Jun 21, 2019

On the one hand, random expire seems wasteful of CPU time, since it considers keys that may not expire soon, and yet it only costs 8 bytes on the central redisKey struct to keep track. On the other hand, one can envision several potential strategies that would prioritise the expirations on insert and allow them to be addressed in order, but the net cost in bytes per key is huge by comparison.

I have run tests on a priority queue expire strategy, and there are indeed interesting results, but definitely at the cost of memory. At what cost in memory and CPU? And what formula expresses the tradeoff? Perhaps this decision should be left to application developers or even dev-ops. If interested I can provide more details and numbers. Let the debate begin!

@JesperTreetop

JesperTreetop commented Jul 6, 2019

The key structure size could be shaved down by combining the key length and the expiration. When I say "value" below, I mean this combination, not the value of the Redis key. All this is provided only as a thought experiment, and without deep knowledge of Redis's internals.

To begin with, expiration timestamps when living in Redis memory could be relative to the initial timestamp Redis was launched, under the assumption that expiration timestamps in the range 0..[millisecond before Redis launch] are not meaningful. (If they would have expired before now, why even return them?) This is also under the assumption that these short forms are only used by Redis when in use, and that the RDB form still contains an absolute timestamp, as it always must have done.

Similarly, although they are not impossible or precluded, it is possible to assume that most keys are significantly shorter than the max allowed size (2^29 bytes).

As an example, these combinations could be used:

 A:  1LLL LLLL  LLLL LLLL  LLLL LLLL  LLLL LLLL
        L = 31-bit key length, no expiration
 B:  01XX XXXX  XXXX XXXX  XXXX XXXX  XXXX XXXX   XXXX XXXX  XLLL LLLL  LLLL LLLL  LLLL LLLL
        L = 23-bit key length [0 .. 8 388 608 bytes]
        X = 39-bit expiration [0 .. 17.42 years]
 C:  001L LLLL  LLLL LLLL  LLLL LLLL  LLLL LLLL   ZZZZ ZZZZ  ZZZZ ZZZZ  ZZZZ ZZZZ  ZZZZ ZZZZ
        L = 29-bit key length
        Z = 32-bit pointer to expiration in side table

In essence, this would:

A: let the length for keys with no expiration take up a 32-bit integer

B: let the length+expiration for keys of reasonable length and reasonable expiration be packed into a 64-bit integer (the split could be changed a few bits in either direction depending on which capacity is more likely)

C: let all other keys have the expiration be stored in a side table (the 29-bit key length is enough to store all valid keys since the max key size is 512 mebibytes)

This would complicate reading the key length and expiration somewhat (especially requiring everyone to know about the base timestamp), and it would sacrifice performance in the extreme cases handled by C (where the key length is above 8 388 608 bytes, or the expiration is more than 17.42 years after Redis launch). It would likely not affect many real-world, non-contrived scenarios, except where the expiration is set to a comically large value. It would be possible to artificially cap expiration dates later than 17.42 years to 2^39 to make this less likely to happen, at the cost of observable changes in behavior (aside from affecting very patient users with very reliable servers, it would also change the expiration date used by TTL).

B would be significantly less effective on a machine where Redis has been running for more than 17 years. Redis 1.0 was released just over 10 years ago and users would be encouraged to upgrade by then, if nothing else.

It's possible that there could be one union case for A and one for B/C, where in A the whole value could use just a 32-bit unsigned integer, but in B/C, a 64-bit unsigned integer would be used. It's also possible that for C, the pointer to expiration could be the key pointer itself, and a 32-bit unsigned integer could be used for the whole value instead.

@josiahcarlson

So you've got a timestamp and an sds. What if you didn't store the sds? What if instead, you stored the lowest 32 bits of the hash of the sds? Then, as long as your number of slots is less than 2**32, you at least know exactly which slot to look in. You can then check each key for the specific expiration, and have what you're looking for.

At this point, you don't even really need the radix tree and its complexity, maybe just a sorted set, or you could just store the pair in a 64 bit int, and toss it into a sorted intset.
