Skip to content

Instantly share code, notes, and snippets.

@zsfelfoldi
Last active June 24, 2024 03:20
Show Gist options
  • Save zsfelfoldi/1ea898e5ad242c70f5b800cda7233adf to your computer and use it in GitHub Desktop.
Save zsfelfoldi/1ea898e5ad242c70f5b800cda7233adf to your computer and use it in GitHub Desktop.
title description author discussions-to status type category created
<New log filter data structure>
<An efficient and light client friendly replacement for block header bloom filters>
<Zsolt Felföldi (@zsfelfoldi)>
<URL>
Draft
<Standards Track>
<Core>
<2024-06-20>

Abstract

Replace the fixed 2048 bit log event bloom filters in block headers with a new data structure that can adapt to the changing number of events per block and consistently guarantee a sufficiently low false positive ratio. Unlike the per-block bloom filters, the proposed structure allows searching for specific events by accessing only a small portion of the entire dataset which can also be proven with a Merkle proof, making the search both efficient and light client friendly.

As an additional benefit, the new structure provides more precise position information on the potential hits (block number and transaction index) allowing the client to only look up a single receipt for each potential hit.

Motivation

Bloom filters are only useful as long as they are sufficiently sparse. False positive ratio rises rapidly with the density of 1 bits. In the currently existing filter each log address and topic sets 3 out of a fixed length of 2048 bits which resulted in sufficiently sparse filters in the beginning but with the increase of the block gas limits the false positive ratio soon became practically useless.

Another issue with per-block filters in headers is that searching for an event requires access to the entire header chain of the searched section (5-600 bytes per block). The main purpose of making logs a part of consensus was to be able to prove these events with a low amount of data and making trustless interaction with smart contracts possible without operating a full node. While a receipt with a Merkle proof can prove the existence of such events (also worth noting here that since the merge the historical state roots structure of the beacon chain also allows proving the canonicalness of the host blocks without possessing the entire header chain), a log filter structure can prove the non-existence of more such events. The proposed structure using the proposed constants allows a trustless proof of all potential hits within a block range with at least two orders of magnitude less data than the bloom filters currently embedded in headers.

Specification

Terms and definitions

  • log value: either an address value or a topic value. Values are globally mapped to a linear index space, with each LOG opcode generating one address value and 0..4 topic values, each assigned to the consecutive log value index.
  • filter map: a sparse map_width by map_height map on which a fixed values_per_map number of log values are marked with a single mark (see details below). Filter maps are also indexed, the log value index range covered by each map index is map_index * values_per_map to (map_index+1) * values_per_map - 1.
  • filter epoch: a group of filter maps stored in the hash tree in a way so that they can be efficiently retrieved in a single Merkle multiproof if required. The map index range covered by each epoch index is epoch_index * maps_per_epoch to (epoch_index+1) * maps_per_epoch - 1.

Log value mapping

The log value mapping functions row_index and column_index take a log value and its position in the linear index space and return a position of the filter map where a mark for the given log value should be placed.

def row_index(log_value_index, log_value):
    map_index = log_value_index // values_per_map
    epoch_index = map_index // maps_per_epoch
    return SHA2(log_value + uint64_littleEndian(epoch_index)) % map_height

def column_index(log_value_index, log_value):
    value_subindex = log_value_index % values_per_map
    map_index = log_value_index // values_per_map
    transform_hash = SHA2(log_value + uint64_littleEndian(map_index))
    return column_transform(value_subindex, transform_hash)

The function column_transform performs a reversible quasi-random transformation on the range 0 .. map_width-1. It can be realized with a series of reversible operations (modulo multiplication with odd numbers, modulo addition, binary XOR, bit rotation) where the second arguments of these operations are sections of the column_hash.

def column_transform(value_subindex, transform_hash):
    //TODO

def reverse_transform(column_index, transform_hash):
    //TODO

Note that with a non-reversible column_transform the searcher would need to iterate through the entire values_per_map range for each map row and searched log value and check whether there is a mark in the resulting column_index, indicating a potential hit. Having a matching reverse_transform function makes it possible to just iterate through the actual marks found in the map row and see whether they can be transformed back into a potentially valid value_subindex:

def potential_match_index(map_index, column_index, log_value):
    transform_hash = SHA2(log_value + uint64_littleEndian(map_index))
    potential_value_subindex = reverse_transform(column_index, transform_hash)
    if potential_value_subindex < values_per_map:
        return map_index * values_per_map + potential_value_subindex
    return -1

Consensus data format

Block headers

Beginning at the execution timestamp FORK_TIMESTAMP, execution clients MUST replace the logs_bloom field of the header schema with two new field: the log_filter_root (the root hash of the filter map hash tree structure) and log_value_pointer (the first unoccupied position in the log value index space after processing the block).

The resulting RLP encoding of the header is therefore:

rlp([
    parent_hash,
    0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, # ommers hash
    coinbase,
    state_root,
    txs_root,
    receipts_root,
    log_filter_root,
    log_value_pointer,
    0, # difficulty
    number,
    gas_limit,
    gas_used,
    timestamp,
    extradata,
    prev_randao,
    0x0000000000000000, # nonce
    base_fee_per_gas,
    withdrawals_root,
    blob_gas_used,
    excess_blob_gas,
    parent_beacon_block_root,
])

Receipts

Beginning at the execution timestamp FORK_TIMESTAMP, execution clients MUST replace the bloom field of the transaction receipt schema with a new field, the log_value_pointer (the first unoccupied position in the log value index space after executing the transaction).

The resulting RLP encoding of the receipt is therefore:

rlp([
    type,
    post_state,
    status,
    cumulative_gas_used,
    log_value_pointer,
    logs,
])

Log filter map

Each row of the filter map is encoded as a series of little endian binary encoded column indices. With the proposed value of map_width = 2**32 this results in a simple and efficient encoding as a series of uint32 values. Since the order of entries in this list should not affect the filter algorithm, for simplicity the indices are not sorted but simply appended to the list in the original order of the log events. In the rare case of column index collision (two events generating the same column index in the same row) the index is also added twice. This simplifies the in-memory maintenance of the tree and also allows clients to maintain only a single version of the tree and easily roll back events if necessary.

Note that the number of indices in a row of a single map may vary, the global average number of entries per row is values_per_map / map_height while the upper limit is values_per_map.

The SHA2 hashes of encoded rows are stored in an SSZ binary tree:

    log_filter_tree: Vector[Vector[Vector[Bytes32, maps_per_epoch], map_height], max_epoch_history]

Proposed constants

  • map_width: 2**32
  • map_height: 2**12
  • values_per_map: 2**16
  • maps_per_epoch: 2**6
  • max_epoch_history: 2**24

Rationale

Two dimensional mapping

The currently existing bloom filter maps the log events of each block onto a one dimensional bit vector. Searching in such vectors requires accessing and processing the entire vectors from each block in the search range while only a few bits of each of those vectors are required in order to search for a specific value. To achieve higher bandwidth efficiency, filter data is mapped in two dimensions where the same log value is mapped onto the same row during an entire filter epoch, making it possible to access only the interesting rows and making a typical search for a small number of values in a long range of blocks more efficient by multiple orders of magnitude.

Log value index space

Mapping log values on their own linear index space and adding pointers to that space in headers and receipts has the advantage that it produces a filter structure with uniform density regardless of the varying number of log values per block. It also allows mapping each address and topic value separately to consecutive indices and implementing specific address/topic pattern filters.

An alternative approach would be to use an indexing based primarily on block number and transaction index, which would result in a variable number of log values per map. Keeping map density constant would require adjusting map width which would add a significant extra complexity.

Sparse horizontal mapping

False positive rate depends on map density and number of marks placed on the map per log value. It can be estimated as false_positive_rate = (values_per_map * marks_per_value / map_width / map_height) ** marks_per_value. The proposed data structure achieves a low false positive rate by choosing a high map_width while only adding one mark per value. An alternative would be using multiple marks per log value and a lower map_width. With a fixed target false_positive_rate the necessary map_width can be calculated as follows:

avg_marks_per_row = values_per_map * marks_per_value / map_height
map_width = avg_marks_per_row / (false_positive_rate ** (1 / marks_per_value))

One important factor that depends on map size and marks per value is data efficiency. A wider map requires more bits per mark to encode but also allows less marks per value. A simple way of encoding rows is a binary encoding of each mark column index, requiring ceil(log2(map_width)) bits per mark. A close to optimal compression of sparse bit vectors is also considered for comparison (formula not detailed here).

avg_bits_per_row = ceil(log2(map_width)) * avg_marks_per_row
compressed_avg_bits_per_row = (log2(map_width / avg_marks_per_row) + 1.5) * avg_marks_per_row

With the following values considered constant, data efficiency depends on marks_per_value as follows:

values_per_map = 2**16
map_height = 2**12
false_positive_rate = 1 / 2**28
marks_per_value map_width avg_bits_per_row compressed_avg_bits_per_row
1 2**32 512 472
2 2**19 608 496
3 30964 720 520
4 2**13 832 544

It shows that the smaller encoding of individual mark indices can almost offset the larger number of marks but still, marks_per_value = 1 results in the best data efficiency when the total size of the filter data set is considered. The advantage of very sparse rows is even greater if we consider the data amount to be accessed when searching for a specific value which requires marks_per_value rows to be accessed and matched to each other.

Filter epochs

The filter map tree is structured as a three-level nested vector log_filter_tree[epoch_index][row_index][map_subindex] where epoch_index = map_index // maps_per_epoch and map_subindex = map_index % maps_per_epoch. This structure balances out two costs: the bandwidth cost of Merkle proofs selecting a certain row index and the processing/memory cost of updating filter map tree nodes during block processing.

Selecting a certain row index with a binary Merkle proof costs log2(map_height) hashes. A typical search needs to retrieve the relevant rows of a large section of consecutive filter maps. Since the row mapping function row_index maps a log value to the same row index in each map of an epoch and these rows are also adjacent in log_filter_tree, the prover and receiver only need to pay the bandwidth cost of row index selection once per epoch. Epoch length should be high enough so that the average encoded row length of maps_per_epoch rows is significantly higher than the size of log2(map_height) hashes.

On the other hand, changing the row mapping per epoch helps to even out row density fluctuations over the long term and therefore maps_per_epoch should not be unnecessarily high. Another reason to not make epochs very long is that the processing cost of updating log_filter_tree during block processing is increased if epoch length is high and the filter tree is split by row index close to the root. Processing a single block typically updates lots of rows but only touches the latest map index and/or appends a new one. Updating each row generates log2(maps_per_epoch) new tree nodes.

Backwards Compatibility

The existing log filter API (eth_getLogs, eth_newFilter, eth_getFilterLogs, eth_getFilterChanges) can be implemented with the new filter data structure. Applications relying on this API can operate without any change, with a higher performance. The EVM is not affected in any way.

Test Cases

TODO

Reference Implementation

TODO

Security Considerations

Safe access with a remote prover

In order to trustlessly access relevant data with the help of a remote prover, the prover needs to

  • prove canonical execution headers by number in order to determine the log value index range belonging to the searched block number range
  • prove the relevant rows of filter maps based on map index and row index
  • prove an individual transaction receipt with a Merkle proof from a canonical execution header, based on the log value index of potential matches

The relevant parts of filter maps can be proved with SSZ multiproofs from the log_filter_root of the latest execution header. Proving canonical execution headers requires a canonical beacon header from which it is possible to prove any beacon state root through the state_roots and historical_summary structures, from which it is possible to prove the belonging execution block hash. In conclusion, any search with a remove prover is as safe as the client's knowledge about the latest beacon header.

False positives

From the filter maps a set of potential matches can be derived for any block range and log value or pattern of log values. A canonical transaction receipt that covers the log value index for the potential match can prove whether it was actual match or a false positive. The design guarantees that the set of potential matches includes all actual matches but it should also be ensured that an excessive amount false positives will not make the bandwidth and processing costs of the search prohibitively high.

False positives can happen when reverse_transform(column_index, transform_hash) gives a valid potential_value_subindex that is less than values_per_map while the column_index was actually generated by column_transform(value_subindex, another_transform_hash) where another_transform_hash belongs to a different log value. By assuming a random uniform distribution of column indices the chance of a false positive can be estimated as values_per_map / map_width = 1 / 2**16 for each mark in the map row. Since each map contains a maximum of values_per_map marks, by assuming a random uniform distribution of row indices the rate of false positives can be estimated as values_per_map^2 / map_width / map_height = 1 / 2**12 per map or values_per_map / map_width / map_height = 1 / 2**28 per individual log value. Assuming 2**11 log values per block this gives a rate of one false positive per 2**17 blocks.

Though certain log values might be used a lot more than others and therefore the row index distribution might not be entirely uniform, changing the row mapping per filter epoch ensures that even very often used log values will not raise the false positive rate of certain other log values over the long term. A deliberate attack on a certain important log value (address value in this case, since topic values can be freely used by anyone) in order to raise its false positive rate can not be ruled out entirely since with a low amount of filter data generated per log value it is always possible to "mine" another value that generates colliding filter data. The sparse quasi-randomized horizontal mapping makes this attack a lot harder though, since the column index depends on both the log value and the exact log value index, making this attack only possible for block creators who are probably offered MEV rewards for much more lucrative manipulations of the transaction set.

Copyright

Copyright and related rights waived via CC0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment