Skipping indexes. Part 2. Deeper dive: bloom filter parameters

In the previous article, we talked about how ClickHouse skipping indexes work and what types of skipping indexes are available.

This time let's take a closer look at skipping indexes based on bloom filters.

Picking parameters for bloom filters

As we remember from the first part of the article, to add an element to a bloom filter we need to calculate one or more hash functions from it and turn on the corresponding bits in the bloom filter's bit vector.

Obviously, if the bit vector (m) is too short, then after adding a small number of elements all the bits in the vector will be turned on, and the bloom filter will not help to skip any data.

If, on the contrary, you make the bloom filter too long, then the cost of reading and processing it becomes very high, so the benefit of using it becomes insignificant, or it may even slow queries down.

Similarly, if the cardinality of the set of elements added to one bloom filter (n) is very high, then again all or most of the bits will be turned on, and the bloom filter will work poorly.

It is also possible to calculate several (k) different hash functions from one element, so every element added will turn on several bits in the bloom filter. This allows you to significantly reduce the probability of false-positive responses of the bloom filter (for example, if 2 hash functions are used, then the probability that they BOTH return the same value for DIFFERENT elements is much lower than in the case of a single hash function).

On the other hand, calculating hash functions is not free, and the more hash functions we use, the more expensive the bloom filter becomes. Also, with too big a value of k, even a single element added to the bloom filter turns on many bits, so a very high k may again lead to the situation where all bits in the bloom filter are turned on and it stops working.
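
To get a feeling for that oversaturation effect, here is a small illustration (a sketch, using n = 8192 elements and an 8192-byte vector, the values used in the experiments later in this article): the expected fraction of bits set to 1 after adding n elements with k hash functions to an m-bit vector is approximately 1 - e^(-kn/m).

SELECT
    number + 1 AS k,
    round(1 - exp(-(k * 8192) / (8192 * 8)), 3) AS expected_fill_ratio
FROM numbers(10)

With k = 1 only about 12% of the bits end up set, while with k = 10 more than 70% do, which is why a very high k degrades the filter.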

Changing these 3 parameters (m,k,n) changes the main characteristic of the bloom filter: the probability of false positives (p). The closer this probability is to zero, the better the bloom filter works, but usually the more expensive it is to use.

Thus, we are faced with the task of selecting optimal parameters (k, m) of the bloom filter, knowing the cardinality of the set (n), to obtain a satisfactory probability of false positives (p).

I will skip the math here (you can check the great Wikipedia article on bloom filters) and just say that it comes down to a few fairly simple formulas connecting these 4 parameters for optimal performance.
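
For reference, these are the standard relationships (as given in the Wikipedia article on bloom filters; here m is the vector length in bits):

$$
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2,
\qquad
m_{\mathrm{opt}} = -\,\frac{n \ln p}{(\ln 2)^{2}}
$$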

You can calculate these parameters in a very convenient and visual calculator: https://hur.st/bloomfilter/

The formulas work great, but there is one disadvantage: in real life there are costs of calculating hashes, reading / writing the bloom filters, and some other costs. So practical results may differ from the theoretical expectations.

Let's take a deeper look at how this works out in practice in ClickHouse.

Dataset

Because real datasets tend to have uneven and complex data distributions, interpreting the results can be difficult. Therefore, for simplicity and in order to obtain understandable and predictable results, we will conduct the study on a synthetic dataset with 'ideal' and very simple characteristics.

Let's create a table with a single column containing 32-character long random strings:

CREATE TABLE source
ENGINE = MergeTree
ORDER BY tuple()
AS SELECT
    substring(replaceRegexpAll(randomPrintableASCII(100), '[^a-zA-Z0-9]', ''), 1, 32) AS s
FROM numbers_mt(20500000)
WHERE length(s) = 32
LIMIT 20000000;

Let's check its properties:

SELECT
    min(length(s)),
    max(length(s)),
    count(),
    uniqExact(s),
    any(s)
FROM source

| min(length(s)) | max(length(s)) | count()  | uniqExact(s) | any(s)                           |
|----------------|----------------|----------|--------------|----------------------------------|
| 32             | 32             | 20000000 | 20000000     | j4M5pMncRhnuOnaQeBS4nICs0BUU4GFu |

Our table has exactly 20 million unique random strings, each exactly 32 characters long and consisting only of alphanumeric characters.

Because each value is unique, each granule of 8192 rows will contain exactly 8192 unique elements.

We will be using the tokenbf_v1 index, because it allows us to tune all parameters of the bloom filter. It actually tokenizes the string, but since our strings contain only alphanumeric characters, every row / string will produce exactly 1 token.

In our tests we will be creating a skipping index with GRANULARITY 1, i.e. covering a single granule. Our table uses the default index_granularity (8192), so there will be 8192 rows in each granule, with each row being unique. Therefore, the number of elements added to each bloom filter (n) will be exactly 8192.
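
As a back-of-the-envelope check (a sketch; the constants simply mirror the setup described above, and "1 token per row" holds only for our synthetic strings), n for one bloom filter can be estimated as index_granularity × GRANULARITY × tokens per row:

SELECT
    8192 AS index_granularity,     -- rows per granule (table setting)
    1 AS granularity,              -- GRANULARITY from the index definition
    1 AS tokens_per_row,           -- exactly 1 for our alphanumeric-only strings
    index_granularity * granularity * tokens_per_row AS n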

Using a formula relating the probability of false positives to the optimal bloom filter size and the number of hash functions, let's display a table for several different p:

WITH
    8192 AS n,
    number + 1 AS k,
    [0.0001, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1] AS p_all,
    arrayMap(p -> ((-k) / ln(1 - exp(ln(p) / k))), p_all) AS bits_per_elem,
    arrayMap(r -> ceil(ceil(r * n) / 8), bits_per_elem) AS m_all
SELECT
    k,
    m_all[1] AS `m for p=0.0001`,
    m_all[2] AS `m for p=0.001`,
    m_all[3] AS `m for p=0.005`,
    m_all[4] AS `m for p=0.01`,
    m_all[5] AS `m for p=0.025`,
    m_all[6] AS `m for p=0.05`,
    m_all[7] AS `m for p=0.1`
FROM numbers(10)

| k  | m for p=0.0001 | m for p=0.001 | m for p=0.005 | m for p=0.01 | m for p=0.025 | m for p=0.05 | m for p=0.1 |
|----|----------------|---------------|---------------|--------------|---------------|--------------|-------------|
| 1  | 10239488       | 1023488       | 204288        | 101888       | 40446         | 19964        | 9720        |
| 2  | 203775         | 63734         | 27927         | 19439        | 11900         | 8092         | 5388        |
| 3  | 64637          | 29158         | 16382         | 12661        | 8882          | 6686         | 4924        |
| 4  | 38877          | 20919         | 13251         | 10776        | 8081          | 6397         | 4957        |
| 5  | 29672          | 17700         | 12033         | 10086        | 7872          | 6425         | 5137        |
| 6  | 25322          | 16163         | 11514         | 9848         | 7896          | 6580         | 5374        |
| 7  | 22950          | 15368         | 11321         | 9824         | 8032          | 6794         | 5636        |
| 8  | 21551          | 14959         | 11300         | 9914         | 8227          | 7040         | 5912        |
| 9  | 20696          | 14772         | 11381         | 10073        | 8457          | 7304         | 6192        |
| 10 | 20171          | 14723         | 11526         | 10273        | 8708          | 7578         | 6475        |

Here the rows are the number of hash functions, and the columns are the probability of a false positive. The number in each cell is the optimal size of the vector (in bytes) required to obtain the desired probability with that number of hash functions.

It is clearly seen that when using a single hash function, much longer vectors must be used. And the lower the probability of false positives that is needed, the longer the bloom filter must be.

It seems that a filter of about 8 KB with 2-5 hash functions should give good results.
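
For instance, for an 8192-byte filter and n = 8192 elements, the theoretically optimal k = (m/n)·ln 2 (with m converted to bits) comes out to roughly 5.5, which is in line with the table above; a sketch of that calculation:

SELECT
    8192 AS m_bytes,
    8192 AS n,
    (m_bytes * 8) / n AS bits_per_element,
    round(bits_per_element * ln(2), 2) AS k_optimal    -- ≈ 5.55

In practice, as Trial 1 below shows, slightly smaller values of k can win, because calculating the hash functions is not free.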

Let's now look at it from the opposite direction: what the probabilities (p) look like for different sizes of bloom filters (m) and different numbers of hash functions (k):

SET output_format_decimal_trailing_zeros=1;

WITH
    8192 AS n,
    number + 1 AS k,
    [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536] AS m_all,
    arrayMap(m -> ((m * 8) / n), m_all) AS bits_per_elem,
    arrayMap(r -> toDecimal64(power(1 - exp((-k) / r), k), 3), bits_per_elem) AS p_all
SELECT
    k,
    p_all[1] AS `p for m=128`,
    p_all[2] AS `p for m=256`,
    p_all[3] AS `p for m=512`,
    p_all[4] AS `p for m=1024`,
    p_all[5] AS `p for m=2048`,
    p_all[6] AS `p for m=4096`,
    p_all[7] AS `p for m=8192`,
    p_all[8] AS `p for m=16384`,
    p_all[9] AS `p for m=32768`,
    p_all[10] AS `p for m=65536`
FROM numbers(10);

| k  | p for m=128 | p for m=256 | p for m=512 | p for m=1024 | p for m=2048 | p for m=4096 | p for m=8192 | p for m=16384 | p for m=32768 | p for m=65536 |
|----|-------------|-------------|-------------|--------------|--------------|--------------|--------------|---------------|---------------|---------------|
| 1  | 0.9996      | 0.9816      | 0.8646      | 0.6321       | 0.3934       | 0.2211       | 0.1175       | 0.0605        | 0.0307        | 0.0155        |
| 2  | 0.9999      | 0.9993      | 0.9637      | 0.7476       | 0.3995       | 0.1548       | 0.0489       | 0.0138        | 0.0036        | 0.0009        |
| 3  | 0.9999      | 0.9999      | 0.9925      | 0.8579       | 0.4688       | 0.1468       | 0.0305       | 0.0049        | 0.0007        | 0.0000        |
| 4  | 0.9999      | 0.9999      | 0.9986      | 0.9287       | 0.5589       | 0.1596       | 0.0239       | 0.0023        | 0.0001        | 0.0000        |
| 5  | 1.0000      | 0.9999      | 0.9997      | 0.9667       | 0.6516       | 0.1849       | 0.0216       | 0.0013        | 0.0000        | 0.0000        |
| 6  | 1.0000      | 0.9999      | 0.9999      | 0.9852       | 0.7360       | 0.2198       | 0.0215       | 0.0009        | 0.0000        | 0.0000        |
| 7  | 1.0000      | 0.9999      | 0.9999      | 0.9936       | 0.8068       | 0.2628       | 0.0229       | 0.0007        | 0.0000        | 0.0000        |
| 8  | 1.0000      | 0.9999      | 0.9999      | 0.9973       | 0.8625       | 0.3124       | 0.0254       | 0.0005        | 0.0000        | 0.0000        |
| 9  | 1.0000      | 0.9999      | 0.9999      | 0.9988       | 0.9043       | 0.3669       | 0.0292       | 0.0005        | 0.0000        | 0.0000        |
| 10 | 1.0000      | 1.0000      | 0.9999      | 0.9995       | 0.9346       | 0.4246       | 0.0341       | 0.0004        | 0.0000        | 0.0000        |

As you can see, the expectation is that vectors shorter than 2048 bytes will not help at all (the false-positive rate is very close to 100%).

Now let's check those expected results.

Trial 1. Impact of number of hashes

OK, now let's check in practice how such a bloom filter behaves. Let's create tables with the same bloom filter size but different numbers of hash functions.

Note: As mentioned above, I use the tokenbf_v1 index here, because it allows you to flexibly configure all the bloom filter parameters. Since the rows in our synthetic table contain only alphanumeric values, each row = exactly 1 token. Therefore, all the calculations above apply to the tokenbf_v1 filter in our synthetic test. Read more about filter types below.
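
To illustrate the "one row = one token" point (a rough sketch only: splitByRegexp merely approximates the tokenizer that tokenbf_v1 uses internally):

SELECT
    'j4M5pMncRhnuOnaQeBS4nICs0BUU4GFu' AS s,
    splitByRegexp('[^a-zA-Z0-9]+', s) AS tokens,
    length(tokens) AS token_count    -- 1 for any purely alphanumeric string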

CREATE TABLE bf_no_index ( `s` String ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_1 ( `s` String, index bf s TYPE tokenbf_v1(8192, 1, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_2 ( `s` String, index bf s TYPE tokenbf_v1(8192, 2, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_3 ( `s` String, index bf s TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_4 ( `s` String, index bf s TYPE tokenbf_v1(8192, 4, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_5 ( `s` String, index bf s TYPE tokenbf_v1(8192, 5, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_6 ( `s` String, index bf s TYPE tokenbf_v1(8192, 6, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_7 ( `s` String, index bf s TYPE tokenbf_v1(8192, 7, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_8 ( `s` String, index bf s TYPE tokenbf_v1(8192, 8, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();

Let's check how these indexes are displayed in the system.data_skipping_indices table:

SELECT table, data_compressed_bytes, data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table LIKE 'bf%'

| table     | data_compressed_bytes | data_uncompressed_bytes |
|-----------|-----------------------|-------------------------|
| bf_8192_1 | 17474135              | 20013056                |
| bf_8192_2 | 20082538              | 20013056                |
| bf_8192_3 | 20097632              | 20013056                |
| bf_8192_4 | 20098229              | 20013056                |
| bf_8192_5 | 20098727              | 20013056                |
| bf_8192_6 | 20099313              | 20013056                |
| bf_8192_7 | 20099477              | 20013056                |
| bf_8192_8 | 20099494              | 20013056                |

You can see that the compression ratio for them is quite poor. That is expected: an optimally sized bloom filter vector looks essentially like random bits, and random data compresses poorly.
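
For example, you can compute the compression ratio directly from the same system table (a small sketch):

SELECT
    table,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS compression_ratio
FROM system.data_skipping_indices
WHERE table LIKE 'bf_8192_%'
ORDER BY table ASC

The ratio stays close to 1 for all of these indexes.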

Let's check what the insert speed looks like for these tables:

INSERT INTO bf_no_index SELECT * FROM source; -- Elapsed: 1.998 sec.
INSERT INTO bf_8192_1 SELECT * FROM source;   -- Elapsed: 2.888 sec.
INSERT INTO bf_8192_2 SELECT * FROM source;   -- Elapsed: 3.051 sec.
INSERT INTO bf_8192_3 SELECT * FROM source;   -- Elapsed: 3.234 sec.
INSERT INTO bf_8192_4 SELECT * FROM source;   -- Elapsed: 3.413 sec.
INSERT INTO bf_8192_5 SELECT * FROM source;   -- Elapsed: 3.567 sec.
INSERT INTO bf_8192_6 SELECT * FROM source;   -- Elapsed: 3.800 sec.
INSERT INTO bf_8192_7 SELECT * FROM source;   -- Elapsed: 4.035 sec.
INSERT INTO bf_8192_8 SELECT * FROM source;   -- Elapsed: 4.221 sec.

As you can see, the insert speed predictably goes down with the addition of a bloom filter (we need to do more work when inserting) and with an increase in the number of hash functions (more hash functions need to be calculated). It is noticeable that adding the index itself slows down inserts by almost 45%, and each additional hash function adds roughly another 8% slowdown.

Note: in this case, only one column is indexed, so the difference is more clearly visible here. In non-synthetic tests with more columns, it will be less noticeable.

Now let's look at the read speed:

Reading speed was tested with the following script:

echo "select count() from bf_tests.bf_no_index where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-benchmark -i 100 -c 1

The number of skipped granules was taken from the query execution log.
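
Alternatively (assuming a reasonably recent ClickHouse version), EXPLAIN with indexes = 1 shows how many granules the skipping index drops without having to parse the trace log:

EXPLAIN indexes = 1
SELECT count()
FROM bf_tests.bf_8192_3
WHERE s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH';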

| table       | granules skipped (out of 2443) | best query time (sec) | rows read (thousands) | MB read | QPS    |
|-------------|--------------------------------|-----------------------|-----------------------|---------|--------|
| bf_no_index | 0                              | 0.110                 | 20000                 | 820.00  | 8.001  |
| bf_8192_1   | 2158                           | 0.031                 | 2330                  | 95.72   | 23.873 |
| bf_8192_2   | 2319                           | 0.016                 | 1020                  | 41.65   | 54.591 |
| bf_8192_3   | 2370                           | 0.012                 | 598.02                | 24.52   | 73.524 |
| bf_8192_4   | 2373                           | 0.010                 | 573.44                | 23.51   | 80.132 |
| bf_8192_5   | 2380                           | 0.011                 | 516.10                | 21.16   | 80.615 |
| bf_8192_6   | 2383                           | 0.011                 | 491.52                | 20.15   | 83.641 |
| bf_8192_7   | 2377                           | 0.010                 | 540.67                | 22.17   | 86.980 |
| bf_8192_8   | 2371                           | 0.011                 | 589.82                | 24.18   | 74.192 |

As you can see, the reading speed directly depends on the number of "skipped" granules and the number of skipped granules behaves in accordance with the theoretical predictions obtained from the formulas.
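
A quick sanity check of the numbers above (a sketch): the number of rows read should be roughly the number of granules that were not skipped multiplied by index_granularity.

SELECT (2443 - 2370) * 8192 AS approx_rows_read    -- 598016 ≈ 598 thousand, matching bf_8192_3 above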

Now we understand how the number of hash functions affects the speed of selects and inserts. Next, let's try to figure out the effect of changing the size of the bloom filter vector.

Trial 2. Impact of vector length

This time let's create a set of tables that differ only in the length of the bloom filter vector:

CREATE TABLE bf_no_index ( `s` String ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_2048_3 ( `s` String, index bf s TYPE tokenbf_v1(2048, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_4096_3 ( `s` String, index bf s TYPE tokenbf_v1(4096, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_3 ( `s` String, index bf s TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_16384_3 ( `s` String, index bf s TYPE tokenbf_v1(16384, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_32768_3 ( `s` String, index bf s TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_65536_3 ( `s` String, index bf s TYPE tokenbf_v1(65536, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_131072_3 ( `s` String, index bf s TYPE tokenbf_v1(131072, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_262144_3 ( `s` String, index bf s TYPE tokenbf_v1(262144, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();

Let's check how it impacts inserts:

INSERT INTO bf_no_index SELECT * FROM source; -- Elapsed: 1.998 sec.
INSERT INTO bf_2048_3 SELECT * FROM source;   -- Elapsed: 2.864 sec.
INSERT INTO bf_4096_3 SELECT * FROM source;   -- Elapsed: 2.866 sec.
INSERT INTO bf_8192_3 SELECT * FROM source;   -- Elapsed: 2.904 sec.
INSERT INTO bf_16384_3 SELECT * FROM source;  -- Elapsed: 2.969 sec.
INSERT INTO bf_32768_3 SELECT * FROM source;  -- Elapsed: 3.244 sec.
INSERT INTO bf_65536_3 SELECT * FROM source;  -- Elapsed: 3.438 sec.
INSERT INTO bf_131072_3 SELECT * FROM source; -- Elapsed: 3.798 sec.
INSERT INTO bf_262144_3 SELECT * FROM source; -- Elapsed: 4.299 sec.

Obviously a larger bloom filter means "more data needs to be written", but the slowdown for each doubling of the bloom filter size is not very significant.

Let's look at the actual size of these bloom filters inside data_skipping_indices:

SELECT table, data_compressed_bytes, data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table LIKE 'bf%3'
ORDER BY data_uncompressed_bytes ASC

| table       | data_compressed_bytes | data_uncompressed_bytes |
|-------------|-----------------------|-------------------------|
| bf_2048_3   | 5022270               | 5003264                 |
| bf_4096_3   | 10049670              | 10006528                |
| bf_8192_3   | 20097632              | 20013056                |
| bf_16384_3  | 39241113              | 40026112                |
| bf_32768_3  | 63488764              | 80052224                |
| bf_65536_3  | 101119994             | 160104448               |
| bf_131072_3 | 133740272             | 320208896               |
| bf_262144_3 | 183368525             | 640417792               |

You can see here that for very long (and far from optimal) bloom filter vector sizes the compression gets better, because in that case the vector contains a lot of 'empty' segments.

Just for reference, here is the size of the column itself:

SELECT
    database,
    table,
    name,
    data_compressed_bytes,
    data_uncompressed_bytes,
    marks_bytes
FROM system.columns
WHERE table = 'bf_262144_3'

| database | table       | name | data_compressed_bytes | data_uncompressed_bytes | marks_bytes |
|----------|-------------|------|-----------------------|-------------------------|-------------|
| default  | bf_262144_3 | s    | 662653819             | 660000000               | 58776       |

As you can see, even for the largest bloom filter selected, the size is still smaller than the column size.
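
You can compare the two directly by joining the system tables (a sketch):

SELECT
    i.table,
    i.data_compressed_bytes AS index_bytes,
    c.data_compressed_bytes AS column_bytes,
    round(c.data_compressed_bytes / i.data_compressed_bytes, 1) AS times_smaller
FROM system.data_skipping_indices AS i
INNER JOIN system.columns AS c ON (i.database = c.database) AND (i.table = c.table)
WHERE (i.table = 'bf_262144_3') AND (c.name = 's')

Even this far-from-optimal 262144-byte filter is still roughly 3-4 times smaller than the compressed column.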

Now let's look at the speed of requests. We can check it this way:

echo "select count() from bf_tests.bf_2048_3 where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-client --send_logs=trace   2>&1 | grep dropped
echo "select count() from bf_tests.bf_2048_3 where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'"  | clickhouse-benchmark -i 100 -d 0 -c 1

The results:

| table       | dropped granules (out of 2443) | QPS    | fastest query (sec) |
|-------------|--------------------------------|--------|---------------------|
| bf_no_index | 0                              | 8.001  | 0.110               |
| bf_2048_3   | 1293                           | 12.588 | 0.068               |
| bf_4096_3   | 2074                           | 28.467 | 0.029               |
| bf_8192_3   | 2370                           | 78.234 | 0.011               |
| bf_16384_3  | 2434                           | 59.685 | 0.014               |
| bf_32768_3  | 2442                           | 22.844 | 0.033               |
| bf_65536_3  | 2442                           | 10.880 | 0.076               |
| bf_131072_3 | 2442                           | 7.635  | 0.116               |
| bf_262144_3 | 2442                           | 4.543  | 0.205               |

As expected, a longer bloom filter means fewer false positives, which is clearly seen from the increase in the number of "skipped" granules in accordance with the expectations from the formulas.

But at the same time, it is obvious that longer bloom filters are not 'free'. When the length of the vector grows, the overhead of reading and processing the bloom filter itself begins to significantly exceed the benefit it gives. This is clearly visible as the degradation of QPS as the vector length grows.

              poor quality of bf (lot of false positives), but bf is small
                    │                                     and cheap to use
                    │
                    │        ┌──────── optimal value
             ▲      │        │
  query speed│      │        │
             │      │        │         high quality of bf (less of false positives)
             │      │        │               │  but bf itself is huge and expensive
             │      │        ▼               │     to use
             │      ▼      xxxxxxxxxxx       │
             │          xxxx         xxxx    │
             │        xxx                xxx ▼
             │      xxx                    xxxx
             │     xx                         xxx
             │   xx                             xxx
             │  xx                                xx
             │  x
          ───┼───────────────────────────────────────────►
             │                               bf vector size

Summary

Bloom filter indexes can be tricky to tune but our experiments have revealed a number of lessons that help set parameters correctly in production systems.

  • Knowing the number of elements in the set (n), you can accurately predict the behavior of the bloom filter and choose the optimal parameters using the formulas from the article above. The next article will provide practical guidelines for estimating n in real data.

  • When changing the index_granularity of a table or the GRANULARITY of an index, you need to adjust the bloom filter parameters, because both directly impact the n parameter.

  • The more hash functions we use, the more expensive the bloom filter calculation becomes. Therefore, it is undesirable to use values greater than 5, since this significantly slows down inserts and also leads to "oversaturation" of the vector with set bits, and thus to performance degradation.

That being said, a single hash function would require a fairly long vector to get a satisfactory false-positive rate. Therefore, it is preferable to use from 2 to 5 hash functions, while increasing the size of the vector proportionally.

  • Vectors that are too long are very expensive to use, despite their low probability of false positives. Therefore, it is preferable to use bloom filters that are orders of magnitude smaller than the column size.

  • Internally, the formulas for calculating bloom filter parameters use a value called bits_per_element. Because of that, one of the simplest 'rough' ways of picking good bloom filter parameters is: take 3 hash functions and a bloom filter whose size in bytes is equal to the cardinality of the set (see the sketch below). This can be used as a simple rule of thumb, but of course, in real life you may need to do more checks and estimations.
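
For example, here is what that rule of thumb could look like for a column with roughly 32768 distinct tokens per index block (a sketch; the table name is made up, and the numbers assume the default index_granularity of 8192 with GRANULARITY 4 and about one unique token per row):

CREATE TABLE bf_rule_of_thumb ( `s` String, index bf s TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4 ) ENGINE = MergeTree ORDER BY tuple();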

In the next part of the article, we will look at how bloom filters work and can be used and tuned with real data in a non-synthetic example. Stay tuned!
