Skip to content

Instantly share code, notes, and snippets.

@dietmarkuehl
Last active April 19, 2023 12:17
Show Gist options
  • Save dietmarkuehl/b0767e76bc85e29e74f61e994f105ad9 to your computer and use it in GitHub Desktop.
Save dietmarkuehl/b0767e76bc85e29e74f61e994f105ad9 to your computer and use it in GitHub Desktop.
Create a combined hash proposal.

Let's Hash Things Out

At the Albuquerque meeting P0814 "hash_combine() Again" was discussed and accepted for further work by LEWG. Unfortunately, this proposal throws out multiple improvements on hashing supportapproach and disregards a substantial amount of work leading to superior hashing support. This proposal is an attempt to bring the various bits together to avoid accepting an inferior approach into the standard C++ library.

Overview of Hashing-Related Papers

This section merely rehashes other prior papers on hashing and their respective discussions. It is merely present to provide more background and it can be skipped.

There were plenty of hashing-related papers. Some of these try to build on prior work while others don't. Also, some of the relevant dimensions are not necssarily discussed in the various papers. This section tries to yield an overview of relevant papers and their respective discussions. The respective summary sections try to extract the key points of the papers. The summary will invariably lack some details and isn't meant as a replacements for reading the papers but hopefully does truthfully capture the gist of the papers. The links to the committee discussions require a password as the internal discussions cannot be available publicly. Again, the summary should capture the gist of the discussions. Note that the discussion summary mentions relevant points brought up and not all of these represent a consensus view.

Number Date Title Discussion
N3333 2012-01-13 Hashing User-Defined Type in C++1y N3333 Discussion
N3573 2013-03-10 Heterogenous extensions to unordered containers
N3730 2013-08-28 Specializations and namespaces
N3876 2014-01-19 Convenience Functions to Combine Hash Values N3876 and N3898 Discussion
N3898 2014-01-20 Hashing and Fingerprinting N3898 Discussion
N3983 2014-05-07 Hashing tuple-like types
N3980 2014-05-24 Types Don't Know #
N3339 2015-04-09 Message Digest Library for C++
P0029r0 2015-09-21 A Unified Proposal for Composable Hashing P0029 Discussion
P0199r0 2016-02-11 Default Hash
P0513r0 2016-11-10 Poisoning the Hash D0513 Discussion
P0599r1 2017-03-02 noexcept for Hash Functions D0599 Discussion
P0809r0 2017-10-12 Comparing Unordered Containers
P0814r0 2017-10-13 hash_combine() Again P0814 Discussion
P0549r3 2018-02-03 Adjuncts to std::hash

N3333: Hashing User-Defined Type in C++1y

The standard C++ libraries defines various containers based on hashing but no support to create good hashes. Users are bound to create bad hash functions.

Discussion of multiple hashing uses (in-process containers, distributed containers, storing data on disk, string comparison via a hash, security digests). std::hash<> can't fill all these needs and suggests to address in-process unordered contains with std::hash only. To allow updates to the hashing function while avoiding suprises it should yield different results in different processes (non-determinism of hash values). The paper mentions a few approaches how to initialize a seed to produce different results from different processes.

Specializing functionality in namespace std is confusing and leads programmers to also specialize things they are not allowed to specialize. Thus, std::hash should delegate to hash_value() which locates the specialized functionality using ADL. Library code already using std::hash needs to continue using it but user code should ideally use hash_value() directly and/or adl_hash_value() (which readily implements the using dance).

To aid users with their implementation of hash_value() a variadic function template hash_combine() is proposed. A number of properties for hash_combine() are listed. It is also mentioned that hashing functions do require some initialization/finalization which have some costs but that it may not pose a significatn problem in practice. Also, a hash_combine_range() function template is proposed as well as hash_as_bytes().

The functions for computing hashes return a hash_code to make it explicit that the values are intentionally unstable [between different processes]. This use can also help reducing the collision rates and performance but these aspects were not consider sufficient to warrant the extra complexity.

The paper does not contain concrete wording.

  1. Generally there seems to be an argreement of hashes being non-deterministic. However, there are also concerns raised about non-determinism.
  2. Request to create an updated version including hashing and fingerprinting.

N3876: Convenience Functions to Combine Hash Values

The current standard C++ library doesn't provide functionality to create hashes for composite values. Users need to create hashes for composite values, e.g., to make them viable keys in unordered containers. However, it is non-trivial to create a good hash combination and users are prone to use bad approaches for combining hashes.

There are papers on implementing decent hash combining functions and Boost hash_combine() provides corresponding functionality. The Boost functionality is used since years while the approach from N3333 is not.

In quoted comments it is pointed out that hash_combine() should take advantage of the variadic nature of the function. Without doing so the result is harder to optimize well.

The paper proposed the addition of two function templates to <functional>, both returning a std::size_t:

  1. hash_combine(seed, values...) taking a seed to be modified by combining the hashes of the values into it.
  2. hash_val(values...) taking a sequence of values whose hash is to be combined. hash_val() delegates to hash_combine() for the actual work.
  1. The discussions for N3876 and N3898 overlapped and are combined into one discussion.
  2. There maybe a more powerful approach where the hashing/fingerprint algorithm is a parameter but something like N3878 is a first step.
  3. Both a variadic and a range-based interface should be provided.
  4. std::size_t as an intermediate state is bad. Using bigger state results in faster and better hashes.
  5. The interface for hash_val() isn't right.
  6. The conversion to a std::size_t should happen as the last thing.
  7. It was suggested that hash-combing tuple values might work but it is also stated that this doesn't work in all cases.
  8. There will be one chance to tell people how to write hash functions.
  9. The current practice is broken and we need to provide and teach a better approach.
  10. Various comments point to N3333.
  11. LWG2291: std::hash is vulnerable to DoS attacks
  12. No support fo N3876 but support for N3333 and hashing algorithms which are unstable across executions as well as cryptographic algorithms.

N3898: Hashing and Fingerprinting

Going further than N3333 the paper proposes to separate the parts traversing the elements of a struct from the computation of a hash: code computing a hash should nearly read like serializing an object. A concept is defined for a hasher object : the functions from N3333 become member functions of the hasher objects. Also, the standard would provide multiple implementations of hasher objects with different properties with respect to the quality of the hash, whether it is deterministic and the size of the result.

The paper doesn't propose wording.

  1. Pre-meeting some comments were added about the optimizer's ability to do a good job: Passing the full set of arguments to the function by value makes the optimizer's job easier.
  2. Discussion at the meeting was joined with the discussion of N3876 (see above).

P0029r0: A Unified Proposal for Composable Hashing

The paper starts by stating the goals (support efficient and convenient composition; avoid users writing bad hashing functions; dealing with wrapper types; avoid the need of specializing in namespace std). The objective is to unify the properties of proposals N3333 and N3980.

The primary goal of the proposal is to decouple the hashing algorithm (provided by libraries) and the hashing function provided by users. This is done by exposing a type's hashing represention using bytes (as unsigned char). The paper also defines a hash expansion when composing values and states the restriction that a hash representations of a value shouldn't be suffix of a hash representation of any other value of the same type leading to a set of basic operations to combine hashes. The actual hash value is computed based on the hash expansion provided by a type.

The fundamental approach consists of uses of a few functions and traits:

  • hash_value(h, v) obtaining a hash_code for v based on a hash_code passed in.
  • hash_combine(std::move(h), m0, ..., mi) for a hash based on the hash expansion.
  • hash_combin_range(std::move(h), begin, end) for a hash based on a range.
  • is_uniquely_represented for types which can be readily treated as byte arrays.

The used hash_code is move-only. hash_code is used for the default hashing algorithm which is expected to be of good quality and randomized at program start-up. It can be used for user-provided hashing functions and is the hashing function used by hash<T>, i.e., users can provide a hashing function using a concrete type. For some use cases it is useful to provide other hashing function. In that case the provided hashing functions should be templated on the hashing algorithm. All overloads of hash_value() in the standard library should be templatized that way.

Based on experience with users hash_value() will have to be integrated with hash<T>. It is proposed to extend the primary template to implement operator()(T const& t) in terms of hash_value(h, t) (with a suitable rvalue h).

The paper does not propose support for heterogeneous lookup but the proposed infrastructure should be able to cope with that case.

The paper compares various aspect to other proposals and provides a rationale for the respective choices:

  1. Passing the algorithm by-value vs. by-reference: aside from performance considerations, by-value subsumes by-reference. Hence by-value is preferred.
  2. Multiple hashing algorithms: The proposal doesn't require use of a template to support multiple algorithm but the infrastructure supports this use and using a template is required for standard types
  3. Single vs. Multiple initialization: The proposal adds all bytes to the representation based on one initialization, thereby avoiding leakage of implementation details on how the bytes are produced. Doing so should help with supporting heteregeneous lookup.
  4. Hash algorithm API: Providing more arguments in individual calls to hash_combine() provide a broader interface to provide but also more optimization options for algorithm implementers. Thus, the interface is richer than that proposed in N3980.
  5. Contiguous layout: Using a trait to assert that all bytes can be used for hashing yields for an easier to use interface and better integrates with arrays.

The paper contains wording.

  1. Support of heterogenous look-up matters mostly for string and string_view.
  2. No consensus for general heterogenous look-up.
  3. Discussion on hash<T> vs. replacing the hasher: replacing the hasher for hash<T> does require some migration effort.
  4. Discussion on ease of creating a hashing function. The general assessment is seems to be that it isn't unduly hard to create a hashing function based on this proposal.
  5. Discussion on the difference on how the hashing functions are defined. There is concern about needing using std::hash_combine but also mention of the remedy (putting that into a standard function.

Polls:

  1. Should the name be distinguished from existing names: apparently in favor
  2. Should std::hash be changed [to use the hashing infrastructure]: mildly in favor
  3. Should the hashing functions be noexcept: in favor

P0199r0: Default Hash

The paper proposes automatic implementation of hash_value(v) and is_uniquely_reprsented (from P0029r0 by the language. Users would be allowed to override the default implementation were necessary or desired. It explains how and when hash_value(v) is generated and the rationale behind the decisions. It also provides some detail on why an implementation based on [static] reflection isn't viable (based on the reflection papers at the time, at least).

[P0199 Discussion]

Apparently this paper was not discussed at a meeting.

P0513r0: Poisoning the Hash

The paper clarifies the wording on when hash<T> is enabled or disabled for a type T. There isn't much rationale as this paper is mostly addressing a reported defect in the Working Draft.

According the paper hash<T> is disable if neither the user nor the library provides an explicit or partial specialization of the class template hash for T.

The changes from this paper were applied to the Working Draft at the Issaquah meeting.

####D0513 Discussion

Lengthy discussion in LWG on various details. P0513r0 reflects the changes asked for during the discussion.

P0599r1: noexcept for Hash Functions

The paper addresses US140 and changes the wording of some hash<T> specializations.

  1. Some minor concerns with specific decisions.
  2. General consensus with the direction.
  1. Various discussion about noexcept in different contexts.
  2. Consensus that the hash<T> structure operations (ctors, dtor, swap, etc.) should be noexecpt.
  3. Apparently no consensus to make operator()() also always noexcept.
  4. General agreement to take the paper as written.

P0814r0: hash_combine() Again

hash_combine() was proposed in N3876 which, however, was rejected because something better based on N3333 was promised to emerge. Despite the promise nothing better came and C++17 shipped without facilities to combine hashes. Even though possibly inferior hash_combine() is proposed to be added to the standard C++ library.

The core of the proposal is a variadic function template hash_combine() taking a sequence of values and returns a hash value resulting from a combination of the respective value's hash values. The hash values of the respective values are obtained using hash<T>()(arg). The hash_combine() function template supports a first, non-deduced argument for the aggregation type which is defaulted to size_t.

P0814 Discussion in Albuquerque

The discussion brought up a few points:

  1. Channeling hashes through std::size_t for combining hashes (the result of std::hash<T>()(v)) is inefficient and leads to bad hashes.
  2. Baking a hash value into programs creates an ABI dependency.
  3. Nicolai isn't insisting on having hash_combine() in the standard C++ library as long as there is something available.
  4. Combining hashes should not be required to be associative.
  5. Both variadic and range versions should be provided with the potential of leveraging a Ranges library.
  6. The most likely location for the functionality is <functional>.
  7. There was consent that hash_combine() should be accepted.

P0549r3: Adjuncts to std::hash

The paper proposes the addition of a few tools around the ability to producing hashes for different types:

  1. is_enabled_hash<T>
  2. hash_for<T>, is_hashable_<T> and is_hashable_v<T>
  3. hash_value(v)
  4. is_nothrow_hashable<T> and is_nothrow_hashable_v<T>

The function hash_value(v) delegates computation of a hash to hash_for<T>{}(std::forward<T>(v)). hash_for<T> in turn is a using alias for hash<remove_cvref_t<T>>

The paper includes proposed wording.

P0549 Discussion

P0549r3 or its earlier revisision were not discussed at any of the meetings.

@HowardHinnant
Copy link

HowardHinnant commented Oct 12, 2021

It looks like this feature was added after N3980 was written.

In all but a few HashAlgorithms, this switch will be set to native so that is_contiguously_hashable is equivalent to is_uniquely_represented. But in the example sha256 HashAlgorithm it is set to big. This is so that different computers across a network can each create a SHA-256 hash of an object, and compare that hash across the network. If the hash is equal, then the hashed contents are equal, else they are not. One node in the network can be a little endian machine and another a big endian machine, and the SHA-256 comparison still works.

And yes, this has worked well. std::endian is now part of C++ with the exact same semantics and syntax as xstd::endian in the repository. We use a SHA-512 hash_append adaptor in the XRP blockchain network modeled after this example code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment