@antirez
Last active September 9, 2019 09:00
New keyspace representation and expire algorithm of Redis: preliminary analysis

Preliminary analysis of an alternative implementation of Redis expires

Background

A couple of months ago I started experimenting with replacing two key parts of the Redis internals:

  1. The in memory representation of the keyspace.
  2. The algorithm used by Redis in order to expire keys.

The first task is actually needed in order to perform the second, but in general the goal of the experiment was to see whether it was possible to pay off some technical debt related to the way the expire algorithm worked, or whether what we had in place was already optimal: I started the work knowing in advance that I risked having to throw it all away if it showed severe regressions. However, sometimes to advance a system you have to go forward and try to improve it in some bold way, even if what is in place looks reasonable.

What was such technical debt, or, to use a different wording, what were the shortcomings of the expire algorithm and the old representation of the key space?

The old algorithm used random sampling of the hash table representing the keys inside a Redis database, in order to find already expired ones. So the algorithm was not precise: keys could be removed in an order different from their logical expires. This was not a huge deal, but it became very noticeable once we implemented, years ago, a notification (via Pub/Sub) for each expired key. Another problem was that it was very hard to expire certain subsets of keys quickly enough, especially when the key space contains a mix of keys that are going to expire very soon and a much larger set of keys that will expire after a long time. In such a case random sampling is very costly: we keep hitting keys that are not expired, and a lot of sampling is needed to find the few that instead need to be evicted because they are expired.

Moreover, my suspicion was that the random sampling process was also very costly from the point of view of the CPU. Apparently this was not so true, because the new implementation does not appear to be much more efficient so far. But this idea was definitely one of the motivations.

On the other hand, the new implementation of the expire algorithm described in the next section also required the key object itself to store the expire time, which was previously stored in a separate hash table. This was required because the new implementation uses a data structure sorted by expire time to represent the expires. In order to look up a key in the tree of expires, we need to know its exact expire time. But with the old representation of keys there was no place to store such an expire: the key in the hash table was just an “SDS” string. So it was also necessary to substitute the keys in the hash table with “key objects”, C structures with a few fields.

And, in turn, this was another long desired thing: finally we could move things like the LRU/LFU metadata where they belong, in the key, and not in the associated value object. When maxmemory with an eviction policy is enabled, Redis cannot even use the small shared integers as values (a memory optimization), because otherwise all the shared objects would have the same LRU/LFU fields in common among many keys. Not just that: finally we also had a few spare flags, both in the key objects and in the value objects, that could be used in many future improvements of Redis.

Implementation details

The new algorithm is conceptually very simple. The key object stored in the hash table representing the keys is now the following C structure:

#define KEY_FLAG_EXPIRE (1<<0)
typedef struct redisKey {
    uint32_t len;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    unsigned flags:KEY_FLAGS_BITS;
    uint64_t expire;
    char name[];
} rkey;

So when a key has an expire, it is flagged with KEY_FLAG_EXPIRE, and the expire time, as milliseconds unix time, is set in the “expire” field. At the same time, the same key is also inserted into a radix tree, in the following format:

<expire time in big endian byte order> + <key object pointer>

Because the prefix is the expire time in big endian byte order, the radix tree, which is lexicographically sorted, also serves as a list of keys sorted by expire time. To expire keys there is no longer any need to perform random sampling: just seek the radix tree at its start, and, while the current time is greater than the expire time of the first key, expire the key and continue to the next.

Such an implementation is conceptually a big step forward compared to the old one, but software is the land of compromises, and unfortunately this vanilla implementation also tends to use more memory, has regressions in the loading times of RDB files, and potentially other speed regressions yet to be discovered. See below for more details.

Status of the implementation

The implementation is definitely just alpha quality: given the possibility that the code would have to be thrown away, I had no interest in finishing the last 5% of details. However what is implemented is written in a very clean way, like final production code (but of course it has bugs). I didn’t even try to run the Redis test suite against this implementation; it’s definitely not at that level of quality. It is the MVP needed to evaluate the approach.

For this reason, before validating my findings below, I will check more carefully that the implementation is sane enough for the results to be considered valid. For instance, it should not have memory leaks, and valgrind should not complain when performing obvious operations.

Evaluation of the implementation

I ran a few benchmarks comparing the new and old implementation, finding a few very concerning regressions:

Memory usage:

5M keys without expires: before: 444 MB after: 673 MB (+50% regression)

5M keys with expires: before: 548 MB after: 806 MB (+50% regression)

As you can see the memory usage regression is unacceptable; however, there are potentially simple fixes for this problem. See the next sections.

RDB loading and saving performance:

RDB saving of 5000000 keys without expires: before: 3.93 seconds, after: 3.70 seconds (very small improvement)

RDB loading of 5000000 keys with expires: before: 6.5 seconds, after: 10.5 seconds (huge regression)

RDB saving of 5000000 keys with expires: before: 4.9 seconds, after: 4.0 seconds (~20% improvement)

RDB loading of 5000000 keys without expires: before: 4.7 seconds, after: 4.7 seconds (no change)

The RDB loading time regression with expires is the most worrying: profiling the code shows that the additional time is spent populating the radix tree, in the raxInsert() function. It is probably more costly than populating the hash table of expires for a few reasons, even if the radix tree is normally actually faster than the hash table:

  1. During the loading step we resize the expires hash table immediately to the right size, so in this specific task the hash table is faster than usual.
  2. The old implementation shares the key string between the main dictionary and the expires dictionary; this probably saves time as well: no allocation/deallocation is needed in the expires hash table, and there are fewer cache misses.
  3. The radix tree speed may depend greatly on the exact workload: unix times in milliseconds, having many common prefixes, will continuously split nodes, while the performance characteristics of the hash table are a lot more stable. To be honest this is probably also very dependent on the exact RDB file I was loading, where key expire times were just a few milliseconds away from each other, likely causing even more node splitting. Yet the regression is quite severe, so it is likely to be there in general.

Key access and expire performance

The new version is apparently a bit faster; for instance, the following benchmark:

redis-benchmark -P 32 -r 5000000 -n 10000000 get __rand_int__

reports 571k ops/sec, versus the old 520k ops/sec.

This difference is more marked if we access the key TTL, which now lives inside the key and does not require an external hash table lookup: substituting the TTL command for the GET command in the above benchmark moves performance from 500k ops/sec in the old version to 600k ops/sec in the new one.

Initial benchmarks don’t show any improvement or regression in the overall Redis speed in the presence of many expires that are both created and evicted. I ran the following test, which should show an obvious difference if there were any, but found none:

redis-benchmark -P 32 -n 1000000000 -r 10000000 SET __rand_int__ foo EX 1

This benchmark sets a number of keys with an expire of 1 second, so there is constantly a flow of expire events happening while we populate the database. However, both implementations report the same IOPS for this test; I guess that the time we lose in setExpire() is compensated by the faster expire cycle.

Techniques to reduce the memory impact

The speed regression may not be trivially solvable, unless we spend some effort to speed up the radix tree implementation itself, which may indeed be possible. However, the memory regression could be strongly mitigated in the following way. Let’s recall the current redisKey structure for a moment:

#define KEY_FLAG_EXPIRE (1<<0)
typedef struct redisKey {
    uint32_t len;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    unsigned flags:KEY_FLAGS_BITS;
    uint64_t expire;
    char name[];
} rkey;

When no expire is set, we are wasting the “expire” field, which is 8 bytes. We could instead allocate a modified version of the structure when there is no expire. Moreover, the 32-bit key length field is often a big waste. So the structure could be redefined as follows:

#define KEY_FLAG_EXPIRE (1<<0)
typedef struct redisKey {
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    unsigned flags:KEY_FLAGS_BITS;
    char sds_name[];
} rkey;

Basically the key name is now stored as an SDS string, which has a variable length header (or, if you prefer, could be stored in an “SDS alike” way). If we really want to have SDS, we can just use an accessor function to compute the offset of the SDS pointer inside the key allocation. Moreover, the expire could be stored at the end of everything, as an additional 8 bytes, only when KEY_FLAG_EXPIRE is set. This however will make the implementation more complex, and will require reallocating the key object when the expire is removed from, or set on, an existing key.

An alternative could be to accept the memory regression, but win memory back in some other way. For instance, when the key has no expire set, the 8 bytes of the expire field could be used to store the value of the key directly, which may translate to a huge memory reduction, not having to allocate a robj for the value itself. In this case we could flag the key with two special flags:

  • KEY_FLAG_INTVAL
  • KEY_FLAG_STRVAL

to signal that our key has no associated value object, since the expire field itself holds an integer or a short string.

Alternative implementations

An alternative to all this is to retain the old implementation, and instead use this exact radix tree technique just to augment the old expire cycle. Basically we could take a capped radix tree with, for instance, 1% of the size of the current number of expires.

Every time we run the expire cycle, if a sampled key has an expire that is nearer in time than the oldest expire currently in the tree, we drop the old one and insert the new key into the radix tree. We don’t bother at all trying to update the radix tree when keys are modified; we use the tree just as an optimization. Later, when we call the expire cycle, instead of doing just random sampling, we also scan the tree. Such an optimization may do two things:

  1. Improve the precision, but without any guarantee about the expire order.
  2. Avoid the problem we have right now, where up to 25% of logically expired keys can continue to use memory, because we are able to expire keys that are near to expire much more efficiently. The exit condition of the loop would no longer be to stop when we can’t find at least 3 expired keys every 4 sampled, but to stop only when that condition is true and, at the same time, the tree of keys to expire is empty.
@tuananh

tuananh commented Jun 20, 2019

Even if you use it just for optimization, the RDB loading is still slower, because it still has to do raxInsert(), right?

@d4nshields

d4nshields commented Jun 21, 2019

On the one hand, random expire seems wasteful of CPU time, since it considers keys that may not expire soon, and yet it only costs 8 bytes on the central redisKey struct to keep track. On the other hand, one can envision several potential strategies that would prioritise the expirations on insert and allow them to be addressed in order, but the net cost in bytes per key is huge by comparison.

I have run tests on a priority queue expire strategy, and there are indeed interesting results, but definitely at the cost of memory. At what cost in memory and CPU? And what formula expresses the tradeoff? Perhaps this decision should be left to application developers or even dev-ops. If interested I can provide more details and numbers. Let the debate begin!

@JesperTreetop

JesperTreetop commented Jul 6, 2019

The key structure size could be shaved down by combining the key length and the expiration. When I say "value" below, I mean this combination, not the value of the Redis key. All this is provided only as a thought experiment, and without deep knowledge of Redis's internals.

To begin with, expiration timestamps when living in Redis memory could be relative to the initial timestamp Redis was launched, under the assumption that expiration timestamps in the range 0..[millisecond before Redis launch] are not meaningful. (If they would have expired before now, why even return them?) This is also under the assumption that these short forms are only used by Redis when in use, and that the RDB form still contains an absolute timestamp, as it always must have done.

Similarly, although they are not impossible or precluded, it is possible to assume that most keys are significantly shorter than the max allowed size (2^29 bytes).

As an example, these combinations could be used:

 A:  1LLL LLLL  LLLL LLLL  LLLL LLLL  LLLL LLLL
        L = 31-bit key length, no expiration
 B:  01XX XXXX  XXXX XXXX  XXXX XXXX  XXXX XXXX   XXXX XXXX  XLLL LLLL  LLLL LLLL  LLLL LLLL
        L = 23-bit key length [0 .. 8 388 608 bytes]
        X = 39-bit expiration [0 .. 17.42 years]
 C:  001L LLLL  LLLL LLLL  LLLL LLLL  LLLL LLLL   ZZZZ ZZZZ  ZZZZ ZZZZ  ZZZZ ZZZZ  ZZZZ ZZZZ
        L = 29-bit key length
        Z = 32-bit pointer to expiration in side table

In essence, this would:

A: let the length for keys with no expiration take up a 32-bit integer

B: let the length+expiration for keys of reasonable length and reasonable expiration be packed into a 64-bit integer (the split could be changed a few bits in either direction depending on which capacity is more likely)

C: let all other keys have the expiration be stored in a side table (the 29-bit key length is enough to store all valid keys since the max key size is 512 mebibytes)

This would complicate reading the key length and expiration somewhat (especially requiring everyone to know about the base timestamp), and it would sacrifice performance in the extreme cases handled by C (where the key length is above 8 388 608 bytes, or the expiration is more than 17.42 years after Redis launch). It would likely not affect many real-world, non-contrived scenarios, except where the expiration is set to a comically large value. It would be possible to artificially cap expiration dates later than 17.42 years to 2^39 to make this less likely to happen, at the cost of observable changes in behavior (aside from affecting very patient users with very reliable servers, it would also change the expiration date used by TTL).

B would be significantly less effective on a machine where Redis has been running for more than 17 years. Redis 1.0 was released just over 10 years ago and users would be encouraged to upgrade by then, if nothing else.

It's possible that there could be one union case for A and one for B/C, where in A the whole value could use just a 32-bit unsigned integer, but in B/C, a 64-bit unsigned integer would be used. It's also possible that for C, the pointer to expiration could be the key pointer itself, and a 32-bit unsigned integer could be used for the whole value instead.

@josiahcarlson

So you've got a timestamp and an sds. What if you didn't store the sds? What if instead, you stored the lowest 32 bits of the hash of the sds? Then, as long as your number of slots is less than 2**32, you at least know exactly which slot to look in. You can then check each key for the specific expiration, and have what you're looking for.

At this point, you don't even really need the radix tree and its complexity, maybe just a sorted set, or you could just store the pair in a 64 bit int, and toss it into a sorted intset.
