dietmarkuehl/lets-hash-things-over.md

## lets-hash-things-over.md

      
    Raw
  

              lets-hash-things-over.md
            
          
    Let's Hash Things Out

At the Albuquerque meeting P0814
"hash_combine() Again" was discussed and accepted for further
work by LEWG. Unfortunately, this proposal throws out multiple
improvements on hashing supportapproach and disregards a substantial
amount of work leading to superior hashing support. This proposal
is an attempt to bring the various bits together to avoid accepting
an inferior approach into the standard C++ library.
Overview of Hashing-Related Papers

This section merely rehashes other prior papers on hashing and their
respective discussions. It is merely present to provide more
background and it can be skipped.
There were plenty of hashing-related papers. Some of these try to
build on prior work while others don't. Also, some of the relevant
dimensions are not necssarily discussed in the various papers. This
section tries to yield an overview of relevant papers and their
respective discussions. The respective summary sections try to
extract the key points of the papers. The summary will invariably
lack some details and isn't meant as a replacements for reading the
papers but hopefully does truthfully capture the gist of the papers.
The links to the committee discussions require a password as the
internal discussions cannot be available publicly. Again, the summary
should capture the gist of the discussions. Note that the discussion
summary mentions relevant points brought up and not all of these
represent a consensus view.


Number
Date
Title
Discussion


N3333
2012-01-13
Hashing User-Defined Type in C++1y
N3333 Discussion


N3573
2013-03-10
Heterogenous extensions to unordered containers


N3730
2013-08-28
Specializations and namespaces


N3876
2014-01-19
Convenience Functions to Combine Hash Values
N3876 and N3898 Discussion


N3898
2014-01-20
Hashing and Fingerprinting
N3898 Discussion


N3983
2014-05-07
Hashing tuple-like types


N3980
2014-05-24
Types Don't Know #


N3339
2015-04-09
Message Digest Library for C++


P0029r0
2015-09-21
A Unified Proposal for Composable Hashing
P0029 Discussion


P0199r0
2016-02-11
Default Hash


P0513r0
2016-11-10
Poisoning the Hash
D0513 Discussion


P0599r1
2017-03-02
noexcept for Hash Functions
D0599 Discussion


P0809r0
2017-10-12
Comparing Unordered Containers


P0814r0
2017-10-13
hash_combine() Again
P0814 Discussion


P0549r3
2018-02-03
Adjuncts to std::hash


N3333: Hashing User-Defined Type in C++1y

The standard C++ libraries defines various containers based on
hashing but no support to create good hashes. Users are bound to
create bad hash functions.
Discussion of multiple hashing uses (in-process containers, distributed
containers, storing data on disk, string comparison via a hash,
security digests). std::hash<> can't fill all these needs and
suggests to address in-process unordered contains with std::hash
only. To allow updates to the hashing function while avoiding
suprises it should yield different results in different processes
(non-determinism of hash values). The paper mentions a few
approaches how to initialize a seed to produce different results
from different processes.
Specializing functionality in namespace std is confusing and leads
programmers to also specialize things they are not allowed to
specialize. Thus, std::hash should delegate to hash_value()
which locates the specialized functionality using ADL. Library code
already using std::hash needs to continue using it but user code
should ideally use hash_value() directly and/or adl_hash_value()
(which readily implements the using dance).
To aid users with their implementation of hash_value() a variadic
function template hash_combine() is proposed. A number of properties
for hash_combine() are listed. It is also mentioned that hashing
functions do require some initialization/finalization which have
some costs but that it may not pose a significatn problem in practice.
Also, a hash_combine_range() function template is proposed as well
as hash_as_bytes().
The functions for computing hashes return a hash_code to make it
explicit that the values are intentionally unstable [between different
processes]. This use can also help reducing the collision rates and
performance but these aspects were not consider sufficient to warrant
the extra complexity.
The paper does not contain concrete wording.
N3333 Discussion |in Kona


Generally there seems to be an argreement of hashes being
non-deterministic. However, there are also concerns raised
about non-determinism.
Request to create an updated version including hashing and
fingerprinting.

N3876: Convenience Functions to Combine Hash Values

The current standard C++ library doesn't provide functionality to
create hashes for composite values. Users need to create hashes for
composite values, e.g., to make them viable keys in unordered
containers. However, it is non-trivial to create a good hash
combination and users are prone to use bad approaches for combining
hashes.
There are papers on implementing decent hash combining functions and
Boost hash_combine()
provides corresponding functionality. The Boost functionality is used
since years while the approach from N3333 is not.
In quoted comments it is pointed out that hash_combine() should
take advantage of the variadic nature of the function. Without doing
so the result is harder to optimize well.
The paper proposed the addition of two function templates to
<functional>, both returning a std::size_t:

hash_combine(seed, values...) taking a seed to be modified
by combining the hashes of the values into it.
hash_val(values...) taking a sequence of values whose hash
is to be combined. hash_val() delegates to hash_combine()
for the actual work.

N3876 and N3898 Discussion in Issaquah


The discussions for N3876 and
N3898 overlapped and are combined into
one discussion.
There maybe a more powerful approach where the hashing/fingerprint
algorithm is a parameter but something like N3878
is a first step.
Both a variadic and a range-based interface should be provided.
std::size_t as an intermediate state is bad. Using bigger state
results in faster and better hashes.
The interface for hash_val() isn't right.
The conversion to a std::size_t should happen as the last thing.
It was suggested that hash-combing tuple values might work but
it is also stated that this doesn't work in all cases.
There will be one chance to tell people how to write hash functions.
The current practice is broken and we need to provide and teach a
better approach.
Various comments point to N3333.
LWG2291: std::hash is vulnerable to DoS attacks
No support fo N3876 but support for
N3333 and hashing algorithms which
are unstable across executions as well as cryptographic algorithms.

N3898: Hashing and Fingerprinting

Going further than N3333 the paper proposes
to separate the parts traversing the elements of a struct from the
computation of a hash: code computing a hash should nearly read
like serializing an object. A concept is defined for a hasher
object : the functions from N3333 become
member functions of the hasher objects. Also, the standard would
provide multiple implementations of hasher objects with different
properties with respect to the quality of the hash, whether it is
deterministic and the size of the result.
The paper doesn't propose wording.
N3898 Discussion


Pre-meeting some comments were added about the optimizer's
ability to do a good job: Passing the full set of arguments
to the function by value makes the optimizer's job easier.
Discussion at the meeting
was joined with the discussion of N3876 (see above).

P0029r0: A Unified Proposal for Composable Hashing

The paper starts by stating the goals (support efficient and
convenient composition; avoid users writing bad hashing functions;
dealing with wrapper types; avoid the need of specializing in
namespace std). The objective is to unify the properties of
proposals N3333 and
N3980.
The primary goal of the proposal is to decouple the hashing algorithm
(provided by libraries) and the hashing function provided by users.
This is done by exposing a type's hashing represention using bytes
(as unsigned char). The paper also defines a hash expansion
when composing values and states the restriction that a hash
representations of a value shouldn't be suffix of a hash representation
of any other value of the same type leading to a set of basic
operations to combine hashes. The actual hash value is computed
based on the hash expansion provided by a type.
The fundamental approach consists of uses of a few functions and traits:

hash_value(h, v) obtaining a hash_code for v based on a
hash_code passed in.
hash_combine(std::move(h), m0, ..., mi) for a hash based on the
hash expansion.
hash_combin_range(std::move(h), begin, end) for a hash based on a range.
is_uniquely_represented for types which can be readily treated as byte arrays.

The used hash_code is move-only. hash_code is used for the
default hashing algorithm which is expected to be of good quality
and randomized at program start-up. It can be used for user-provided
hashing functions and is the hashing function used by hash<T>,
i.e., users can provide a hashing function using a concrete type.
For some use cases it is useful to provide other hashing function.
In that case the provided hashing functions should be templated on
the hashing algorithm. All overloads of hash_value() in the standard
library should be templatized that way.
Based on experience with users hash_value() will have to be
integrated with hash<T>. It is proposed to extend the primary
template to implement operator()(T const& t) in terms of
hash_value(h, t) (with a suitable rvalue h).
The paper does not propose support for heterogeneous lookup but the
proposed infrastructure should be able to cope with that case.
The paper compares various aspect to other proposals and provides
a rationale for the respective choices:

Passing the algorithm by-value vs. by-reference: aside from
performance considerations, by-value subsumes by-reference. Hence
by-value is preferred.
Multiple hashing algorithms: The proposal doesn't require use
of a template to support multiple algorithm but the infrastructure
supports this use and using a template is required for standard
types
Single vs. Multiple initialization: The proposal adds all bytes
to the representation based on one initialization, thereby
avoiding leakage of implementation details on how the bytes are
produced.  Doing so should help with supporting heteregeneous
lookup.
Hash algorithm API: Providing more arguments in individual calls
to hash_combine() provide a broader interface to provide but
also more optimization options for algorithm implementers. Thus,
the interface is richer than that proposed in
N3980.
Contiguous layout: Using a trait to assert that all bytes can
be used for hashing yields for an easier to use interface and
better integrates with arrays.

The paper contains wording.
P0029 Discussion | at Kona


Support of heterogenous look-up matters mostly for string and string_view.
No consensus for general heterogenous look-up.
Discussion on hash<T> vs. replacing the hasher: replacing the hasher for
hash<T> does require some migration effort.
Discussion on ease of creating a hashing function. The general assessment is
seems to be that it isn't unduly hard to create a hashing function based on this
proposal.
Discussion on the difference on how the hashing functions are
defined. There is concern about needing using std::hash_combine
but also mention of the remedy (putting that into a standard function.

Polls:

Should the name be distinguished from existing names: apparently in favor
Should std::hash be changed [to use the hashing infrastructure]: mildly in favor
Should the hashing functions be noexcept: in favor

P0199r0: Default Hash

The paper proposes automatic implementation of hash_value(v) and
is_uniquely_reprsented (from P0029r0
by the language. Users would be allowed to override the default
implementation were necessary or desired. It explains how and when
hash_value(v) is generated and the rationale behind the decisions.
It also provides some detail on why an implementation based on
[static] reflection isn't viable (based on the reflection papers
at the time, at least).
[P0199 Discussion]

Apparently this paper was not discussed at a meeting.
P0513r0: Poisoning the Hash

The paper clarifies the wording on when hash<T> is enabled or
disabled for a type T. There isn't much rationale as this paper is
mostly addressing a reported defect in the Working Draft.
According the paper hash<T> is disable if neither the user nor
the library provides an explicit or partial specialization of the
class template hash for T.
The changes from this paper were applied to the Working Draft at
the Issaquah meeting.
####D0513 Discussion
Lengthy discussion in LWG on various details. P0513r0
reflects the changes asked for during the discussion.
P0599r1: noexcept for Hash Functions

The paper addresses US140 and changes the wording of some hash<T>
specializations.
D0599 Discussion at Kona


Some minor concerns with specific decisions.
General consensus with the direction.

D0599 Discussion at Kona


Various discussion about noexcept in different contexts.
Consensus that the hash<T> structure operations (ctors, dtor,
swap, etc.) should be noexecpt.
Apparently no consensus to make operator()() also always noexcept.
General agreement to take the paper as written.

P0814r0: hash_combine() Again

hash_combine() was proposed in N3876
which, however, was rejected because something better based on
N3333 was promised to emerge. Despite the
promise nothing better came and C++17 shipped without facilities
to combine hashes. Even though possibly inferior hash_combine()
is proposed to be added to the standard C++ library.
The core of the proposal is a variadic function template hash_combine()
taking a sequence of values and returns a hash value resulting from
a combination of the respective value's hash values. The hash values
of the respective values are obtained using hash<T>()(arg). The
hash_combine() function template supports a first, non-deduced
argument for the aggregation type which is defaulted to size_t.
P0814 Discussion in Albuquerque

The discussion brought up a few points:

Channeling hashes through std::size_t for combining hashes
(the result of std::hash<T>()(v)) is inefficient and leads to bad
hashes.
Baking a hash value into programs creates an ABI dependency.
Nicolai isn't insisting on having hash_combine() in the
standard C++ library as long as there is something available.
Combining hashes should not be required to be associative.
Both variadic and range versions should be provided with the
potential of leveraging a Ranges library.
The most likely location for the functionality is <functional>.
There was consent that hash_combine() should be accepted.

P0549r3: Adjuncts to std::hash

The paper proposes the addition of a few tools around the ability to
producing hashes for different types:

is_enabled_hash<T>
hash_for<T>, is_hashable_<T> and is_hashable_v<T>
hash_value(v)
is_nothrow_hashable<T> and is_nothrow_hashable_v<T>

The function hash_value(v) delegates computation of a hash to
hash_for<T>{}(std::forward<T>(v)). hash_for<T> in turn is a
using alias for hash<remove_cvref_t<T>>
The paper includes proposed wording.
P0549 Discussion

P0549r3 or its earlier revisision were
not discussed at any of the meetings.
Number	Date	Title	Discussion
N3333	2012-01-13	Hashing User-Defined Type in C++1y	N3333 Discussion
N3573	2013-03-10	Heterogenous extensions to unordered containers
N3730	2013-08-28	Specializations and namespaces
N3876	2014-01-19	Convenience Functions to Combine Hash Values	N3876 and N3898 Discussion
N3898	2014-01-20	Hashing and Fingerprinting	N3898 Discussion
N3983	2014-05-07	Hashing tuple-like types
N3980	2014-05-24	Types Don't Know #
N3339	2015-04-09	Message Digest Library for C++
P0029r0	2015-09-21	A Unified Proposal for Composable Hashing	P0029 Discussion
P0199r0	2016-02-11	Default Hash
P0513r0	2016-11-10	Poisoning the Hash	D0513 Discussion
P0599r1	2017-03-02	`noexcept` for Hash Functions	D0599 Discussion
P0809r0	2017-10-12	Comparing Unordered Containers
P0814r0	2017-10-13	`hash_combine()` Again	P0814 Discussion
P0549r3	2018-02-03	Adjuncts to `std::hash`