@dietmarkuehl
Last active April 19, 2023 12:17
Create a combined hash proposal.

Let's Hash Things Out

At the Albuquerque meeting P0814 "hash_combine() Again" was discussed and accepted for further work by LEWG. Unfortunately, this proposal throws out multiple improvements to hashing support and disregards a substantial amount of prior work leading to superior hashing support. This proposal is an attempt to bring the various pieces together to avoid accepting an inferior approach into the standard C++ library.

Overview of Hashing-Related Papers

This section rehashes prior papers on hashing and their respective discussions. It is present only to provide background and can be skipped.

There have been plenty of hashing-related papers. Some of them build on prior work while others don't. Also, some of the relevant dimensions are not necessarily discussed in the various papers. This section tries to give an overview of the relevant papers and their respective discussions. The summary sections try to extract the key points of the papers. The summaries will invariably lack some details and aren't meant as a replacement for reading the papers, but hopefully they truthfully capture the gist of each paper. The links to the committee discussions require a password, as the internal discussions cannot be made available publicly. Again, the summaries should capture the gist of the discussions. Note that a discussion summary mentions the relevant points brought up; not all of them represent a consensus view.

| Number  | Date       | Title                                           | Discussion                 |
|---------|------------|-------------------------------------------------|----------------------------|
| N3333   | 2012-01-13 | Hashing User-Defined Type in C++1y              | N3333 Discussion           |
| N3573   | 2013-03-10 | Heterogenous extensions to unordered containers |                            |
| N3730   | 2013-08-28 | Specializations and namespaces                  |                            |
| N3876   | 2014-01-19 | Convenience Functions to Combine Hash Values    | N3876 and N3898 Discussion |
| N3898   | 2014-01-20 | Hashing and Fingerprinting                      | N3898 Discussion           |
| N3983   | 2014-05-07 | Hashing tuple-like types                        |                            |
| N3980   | 2014-05-24 | Types Don't Know #                              |                            |
| N3339   | 2015-04-09 | Message Digest Library for C++                  |                            |
| P0029r0 | 2015-09-21 | A Unified Proposal for Composable Hashing       | P0029 Discussion           |
| P0199r0 | 2016-02-11 | Default Hash                                    |                            |
| P0513r0 | 2016-11-10 | Poisoning the Hash                              | D0513 Discussion           |
| P0599r1 | 2017-03-02 | noexcept for Hash Functions                     | D0599 Discussion           |
| P0809r0 | 2017-10-12 | Comparing Unordered Containers                  |                            |
| P0814r0 | 2017-10-13 | hash_combine() Again                            | P0814 Discussion           |
| P0549r3 | 2018-02-03 | Adjuncts to std::hash                           |                            |

N3333: Hashing User-Defined Type in C++1y

The standard C++ library defines various hash-based containers but no support for creating good hashes. Users are bound to create bad hash functions.

The paper discusses multiple uses of hashing (in-process containers, distributed containers, storing data on disk, string comparison via a hash, security digests). std::hash<> can't fill all these needs, and the paper suggests addressing only in-process unordered containers with std::hash. To allow updates to the hashing function while avoiding surprises, it should yield different results in different processes (non-determinism of hash values). The paper mentions a few approaches for initializing a seed to produce different results in different processes.

Specializing functionality in namespace std is confusing and leads programmers to also specialize things they are not allowed to specialize. Thus, std::hash should delegate to hash_value(), which locates the specialized functionality using ADL. Library code already using std::hash needs to continue using it, but user code should ideally use hash_value() directly and/or adl_hash_value() (which readily implements the using dance).

To aid users with their implementation of hash_value(), a variadic function template hash_combine() is proposed. A number of properties for hash_combine() are listed. It is also mentioned that hashing functions do require some initialization/finalization which has some cost, but that this may not pose a significant problem in practice. A hash_combine_range() function template is proposed as well as hash_as_bytes().
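The proposed helpers might look roughly as follows. The names follow the paper, but the mixing step (Boost's hash_combine constant) and the std::hash fallback are assumptions for this sketch; N3333 does not pin down a concrete algorithm:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for the ADL-dispatching hash_value(); a real
// implementation would perform the "using dance" and find per-type
// overloads via ADL. std::hash is used here purely for illustration.
template <class T>
std::size_t adl_hash_value(const T& v) {
    return std::hash<T>{}(v);
}

// Variadic hash_combine(): folds each value's hash into a seed using a
// Boost-style mixing step (an assumption, not the paper's algorithm).
template <class... Types>
std::size_t hash_combine(const Types&... values) {
    std::size_t seed = 0;
    ((seed ^= adl_hash_value(values) + 0x9e3779b9 + (seed << 6) + (seed >> 2)), ...);
    return seed;
}

// Range variant, folding the same way over an iterator range.
template <class Iterator>
std::size_t hash_combine_range(Iterator first, Iterator last) {
    std::size_t seed = 0;
    for (; first != last; ++first)
        seed ^= adl_hash_value(*first) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}
```

By construction, the variadic and range forms agree when fed the same values in the same order.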

The functions for computing hashes return a hash_code to make it explicit that the values are intentionally unstable [between different processes]. This could also help reduce collision rates and improve performance, but these aspects were not considered sufficient to warrant the extra complexity.

The paper does not contain concrete wording.

N3333 Discussion

  1. Generally there seems to be agreement on hashes being non-deterministic. However, concerns about non-determinism were also raised.
  2. Request to create an updated version including hashing and fingerprinting.

N3876: Convenience Functions to Combine Hash Values

The current standard C++ library doesn't provide functionality to create hashes for composite values. Users need to create hashes for composite values, e.g., to make them viable keys in unordered containers. However, it is non-trivial to create a good hash combination and users are prone to use bad approaches for combining hashes.

There are papers on implementing decent hash-combining functions, and Boost's hash_combine() provides corresponding functionality. The Boost functionality has been in use for years while the approach from N3333 is not.

In quoted comments it is pointed out that hash_combine() should take advantage of the variadic nature of the function. Without doing so the result is harder to optimize well.

The paper proposes the addition of two function templates to <functional>, both returning a std::size_t:

  1. hash_combine(seed, values...) taking a seed to be modified by combining the hashes of the values into it.
  2. hash_val(values...) taking a sequence of values whose hash is to be combined. hash_val() delegates to hash_combine() for the actual work.
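A minimal sketch of the two proposed function templates. The interface follows the paper; the mixing step is Boost's hash_combine constant, used here as an assumption since the paper leaves the exact combination algorithm open:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// hash_combine(seed, v): mixes v's hash into the seed (Boost-style step).
template <class T>
void hash_combine(std::size_t& seed, const T& v) {
    seed ^= std::hash<T>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Variadic form: combine each value into the seed, left to right.
template <class T, class... Rest>
void hash_combine(std::size_t& seed, const T& v, const Rest&... rest) {
    hash_combine(seed, v);
    hash_combine(seed, rest...);
}

// hash_val(values...): delegates to hash_combine() for the actual work.
template <class... Types>
std::size_t hash_val(const Types&... args) {
    std::size_t seed = 0;
    if constexpr (sizeof...(Types) > 0)
        hash_combine(seed, args...);
    return seed;
}
```

As the paper describes, hash_val() is just a convenience wrapper over hash_combine() with a zero-initialized seed.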
N3876 and N3898 Discussion

  1. The discussions for N3876 and N3898 overlapped and are combined into one discussion.
  2. There may be a more powerful approach where the hashing/fingerprint algorithm is a parameter, but something like N3876 is a first step.
  3. Both a variadic and a range-based interface should be provided.
  4. std::size_t as an intermediate state is bad. Using bigger state results in faster and better hashes.
  5. The interface for hash_val() isn't right.
  6. The conversion to a std::size_t should happen as the last thing.
  7. It was suggested that hash-combining tuple values might work, but it is also stated that this doesn't work in all cases.
  8. There will be one chance to tell people how to write hash functions.
  9. The current practice is broken and we need to provide and teach a better approach.
  10. Various comments point to N3333.
  11. LWG2291: std::hash is vulnerable to DoS attacks
  12. No support for N3876, but support for N3333 and hashing algorithms which are unstable across executions, as well as cryptographic algorithms.

N3898: Hashing and Fingerprinting

Going further than N3333, the paper proposes to separate the part traversing the elements of a struct from the computation of a hash: code computing a hash should read nearly like serializing an object. A concept is defined for a hasher object: the functions from N3333 become member functions of the hasher object. Also, the standard would provide multiple implementations of hasher objects with different properties with respect to the quality of the hash, whether it is deterministic, and the size of the result.
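The separation described above might look like this. The FNV-1a algorithm and all names here are assumptions chosen for the sketch; the paper defines a concept, not a concrete hasher:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// A hasher object: byte consumption and finalization are member
// functions, so the traversal code is independent of the algorithm.
class fnv1a_hasher {
    std::uint64_t state_ = 14695981039346656037ull;  // FNV offset basis
public:
    void operator()(const void* data, std::size_t n) {
        auto p = static_cast<const unsigned char*>(data);
        for (std::size_t i = 0; i != n; ++i)
            state_ = (state_ ^ p[i]) * 1099511628211ull;  // FNV prime
    }
    std::uint64_t finish() const { return state_; }
};

struct person {
    std::string name;
    int age;
};

// Hashing a struct reads nearly like serializing it: feed each member
// to the hasher, in order. Any conforming hasher can be plugged in.
template <class Hasher>
void hash_append(Hasher& h, const person& p) {
    h(p.name.data(), p.name.size());
    h(&p.age, sizeof(p.age));
}
```

Swapping fnv1a_hasher for a deterministic, seeded, or wider-output hasher requires no change to the per-type code.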

The paper doesn't propose wording.

N3898 Discussion

  1. Pre-meeting, some comments were added about the optimizer's ability to do a good job: passing the full set of arguments to the function by value makes the optimizer's job easier.
  2. Discussion at the meeting was joined with the discussion of N3876 (see above).

P0029r0: A Unified Proposal for Composable Hashing

The paper starts by stating its goals (support efficient and convenient composition; avoid users writing bad hashing functions; deal with wrapper types; avoid the need to specialize in namespace std). The objective is to unify the properties of proposals N3333 and N3980.

The primary goal of the proposal is to decouple the hashing algorithm (provided by libraries) from the hashing function provided by users. This is done by exposing a type's hash representation as bytes (as unsigned char). The paper also defines a hash expansion for composing values and states the restriction that a hash representation of a value shouldn't be a suffix of the hash representation of any other value of the same type, leading to a set of basic operations to combine hashes. The actual hash value is computed based on the hash expansion provided by a type.

The fundamental approach consists of uses of a few functions and traits:

  • hash_value(h, v) obtaining a hash_code for v based on a hash_code passed in.
  • hash_combine(std::move(h), m0, ..., mi) for a hash based on the hash expansion.
  • hash_combine_range(std::move(h), begin, end) for a hash based on a range.
  • is_uniquely_represented for types which can be readily treated as byte arrays.

The hash_code used is move-only. hash_code is used for the default hashing algorithm, which is expected to be of good quality and randomized at program start-up. It can be used for user-provided hashing functions and is the hashing function used by hash<T>, i.e., users can provide a hashing function using a concrete type. For some use cases it is useful to provide other hashing functions. In that case the provided hashing functions should be templated on the hashing algorithm. All overloads of hash_value() in the standard library should be templatized that way.
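Under heavy assumptions, the shape described above might be sketched as follows. The move-only hash_code and the hash_value()/hash_combine() interplay follow the paper; the mixing step (a splitmix64-style round) and member names are stand-ins:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Move-only hash_code, as P0029r0 proposes; copying is deleted so that
// combination steps have to thread the state through explicitly.
class hash_code {
    std::uint64_t state_ = 0;
public:
    hash_code() = default;
    hash_code(hash_code&&) = default;
    hash_code(const hash_code&) = delete;
    void add(std::uint64_t v) {  // splitmix64-style round; an assumption
        std::uint64_t x = state_ ^ (v + 0x9e3779b97f4a7c15ull);
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
        state_ = x ^ (x >> 27);
    }
    std::uint64_t result() const { return state_; }
};

inline hash_code hash_combine(hash_code h) { return h; }

// hash_combine() folds each member through hash_value(), which user
// types overload (templated on the algorithm, per the paper's advice).
template <class T, class... Ts>
hash_code hash_combine(hash_code h, const T& v, const Ts&... vs) {
    return hash_combine(hash_value(std::move(h), v), vs...);
}

// A primitive overload; real code would cover all arithmetic types.
inline hash_code hash_value(hash_code h, int v) {
    h.add(static_cast<std::uint64_t>(v));
    return h;
}

// A user type exposes its hash expansion by delegating to hash_combine().
struct point { int x, y; };
inline hash_code hash_value(hash_code h, const point& p) {
    return hash_combine(std::move(h), p.x, p.y);
}
```

Hashing a point is then, by construction, the same as hashing its members in order.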

Based on experience with users, hash_value() will have to be integrated with hash<T>. It is proposed to extend the primary template to implement operator()(T const& t) in terms of hash_value(h, t) (with a suitable rvalue h).

The paper does not propose support for heterogeneous lookup but the proposed infrastructure should be able to cope with that case.

The paper compares various aspects to other proposals and provides a rationale for the respective choices:

  1. Passing the algorithm by-value vs. by-reference: aside from performance considerations, by-value subsumes by-reference. Hence by-value is preferred.
  2. Multiple hashing algorithms: The proposal doesn't require use of a template to support multiple algorithms, but the infrastructure supports this use, and using a template is required for standard types.
  3. Single vs. multiple initialization: The proposal adds all bytes to the representation based on one initialization, thereby avoiding leakage of implementation details on how the bytes are produced. Doing so should help with supporting heterogeneous lookup.
  4. Hash algorithm API: Providing more arguments in individual calls to hash_combine() makes the interface broader but gives algorithm implementers more optimization options. Thus, the interface is richer than that proposed in N3980.
  5. Contiguous layout: Using a trait to assert that all bytes can be used for hashing yields an easier-to-use interface and integrates better with arrays.

The paper contains wording.

P0029 Discussion

  1. Support of heterogeneous look-up matters mostly for string and string_view.
  2. No consensus for general heterogeneous look-up.
  3. Discussion on hash<T> vs. replacing the hasher: replacing the hasher for hash<T> does require some migration effort.
  4. Discussion on the ease of creating a hashing function. The general assessment seems to be that it isn't unduly hard to create a hashing function based on this proposal.
  5. Discussion on differences in how the hashing functions are defined. There is concern about needing using std::hash_combine, but also mention of the remedy (putting that into a standard function).

Polls:

  1. Should the name be distinguished from existing names: apparently in favor
  2. Should std::hash be changed [to use the hashing infrastructure]: mildly in favor
  3. Should the hashing functions be noexcept: in favor

P0199r0: Default Hash

The paper proposes automatic implementation of hash_value(v) and is_uniquely_represented (from P0029r0) by the language. Users would be allowed to override the default implementation where necessary or desired. It explains how and when hash_value(v) is generated and the rationale behind the decisions. It also provides some detail on why an implementation based on [static] reflection isn't viable (based on the reflection papers at the time, at least).

P0199 Discussion

Apparently this paper was not discussed at a meeting.

P0513r0: Poisoning the Hash

The paper clarifies the wording on when hash<T> is enabled or disabled for a type T. There isn't much rationale as this paper mostly addresses a reported defect in the Working Draft.

According to the paper, hash<T> is disabled if neither the user nor the library provides an explicit or partial specialization of the class template hash for T.

The changes from this paper were applied to the Working Draft at the Issaquah meeting.

D0513 Discussion

Lengthy discussion in LWG on various details. P0513r0 reflects the changes asked for during the discussion.

P0599r1: noexcept for Hash Functions

The paper addresses US140 and changes the wording of some hash<T> specializations.

D0599 Discussion

  1. Some minor concerns with specific decisions.
  2. General consensus with the direction.
  3. Various discussions about noexcept in different contexts.
  4. Consensus that the hash<T> structure operations (ctors, dtor, swap, etc.) should be noexcept.
  5. Apparently no consensus to make operator()() also always noexcept.
  6. General agreement to take the paper as written.

P0814r0: hash_combine() Again

hash_combine() was proposed in N3876 which, however, was rejected because something better based on N3333 was promised to emerge. Despite the promise nothing better came, and C++17 shipped without facilities to combine hashes. Even though possibly inferior, hash_combine() is proposed to be added to the standard C++ library.

The core of the proposal is a variadic function template hash_combine() taking a sequence of values and returning a hash value resulting from a combination of the respective values' hash values. The hash values of the respective values are obtained using hash<T>()(arg). The hash_combine() function template supports a first, non-deduced template argument for the aggregation type, which defaults to size_t.
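The described shape might be sketched like this. The non-deduced aggregation-type parameter and the use of std::hash<T>()(arg) follow the paper; the concrete mixing step is an assumption, since the paper leaves the combination algorithm open:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// P0814-style hash_combine(): the first, non-deduced template parameter
// selects the aggregation type and defaults to std::size_t. Each value's
// hash comes from std::hash<T>()(arg); the fold uses a Boost-style mix.
template <class HashT = std::size_t, class... Types>
HashT hash_combine(const Types&... args) {
    HashT seed = 0;
    ((seed ^= static_cast<HashT>(std::hash<Types>{}(args))
              + 0x9e3779b9 + (seed << 6) + (seed >> 2)),
     ...);
    return seed;
}
```

Usage would be, e.g., `auto h = hash_combine(key.id, key.name);` for a two-member key type. Note how the std::size_t bottleneck criticized in the discussion below is built into this interface.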

P0814 Discussion in Albuquerque

The discussion brought up a few points:

  1. Channeling hashes through std::size_t for combining hashes (the result of std::hash<T>()(v)) is inefficient and leads to bad hashes.
  2. Baking a hash value into programs creates an ABI dependency.
  3. Nicolai isn't insisting on having hash_combine() in the standard C++ library as long as there is something available.
  4. Combining hashes should not be required to be associative.
  5. Both variadic and range versions should be provided with the potential of leveraging a Ranges library.
  6. The most likely location for the functionality is <functional>.
  7. There was consensus that hash_combine() should be accepted.

P0549r3: Adjuncts to std::hash

The paper proposes the addition of a few tools around the ability to produce hashes for different types:

  1. is_enabled_hash<T>
  2. hash_for<T>, is_hashable<T> and is_hashable_v<T>
  3. hash_value(v)
  4. is_nothrow_hashable<T> and is_nothrow_hashable_v<T>

The function hash_value(v) delegates computation of a hash to hash_for<T>{}(std::forward<T>(v)). hash_for<T> in turn is a using alias for hash<remove_cvref_t<T>>.
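A sketch of these two adjuncts, abbreviated to their core. The paper predates std::remove_cvref_t, so the cv/reference stripping is spelled out here, and the proposed traits are omitted:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <type_traits>
#include <utility>

// hash_for<T>: alias resolving to the std::hash specialization for the
// decayed type (remove_cvref_t in C++20 terms).
template <class T>
using hash_for = std::hash<std::remove_cv_t<std::remove_reference_t<T>>>;

// hash_value(v): delegates to hash_for<T>{}(...), so callers never have
// to name the std::hash specialization themselves.
template <class T>
std::size_t hash_value(T&& v) {
    return hash_for<T>{}(std::forward<T>(v));
}
```

This makes, e.g., hash_value(s) work uniformly for const, reference, and value arguments.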

The paper includes proposed wording.

P0549 Discussion

Neither P0549r3 nor its earlier revisions were discussed at any of the meetings.

@mavam

mavam commented Oct 5, 2021

Great summary!

Has anything happened about the topic of hashing in the past years?

@dietmarkuehl

dietmarkuehl commented Oct 5, 2021 via email

@mavam

mavam commented Oct 6, 2021

Too bad. Howard did such great work getting N3333 off the ground. I guess we'll be stuck with hand-rolled solutions for a while.

We've actually been using N3333 for a while now, and it works really well. Only a few things bothered me:

  1. The explicit cast to get to the result is not obvious at the call site
  2. ADL as customization point for hash_append sometimes gets in the way
  3. Some hash algorithms have an optimized one-shot path (e.g., XXHash) that is at odds with the standard construct-add-finish-state cycle

@dietmarkuehl

dietmarkuehl commented Oct 6, 2021 via email

@mavam

mavam commented Oct 7, 2021

I wish I had the bandwidth to join you. We should definitely sync up with @HowardHinnant before starting something. I'm curious to hear what he'd do differently after all the years.

That said, I'm more than happy to "co-invest", bounce off ideas, but pushing the effort forward would require a dedicated lead other than me.

@HowardHinnant

HowardHinnant commented Oct 7, 2021

I'm curious to hear what he'd do differently after all the years.

I would still go with N3980. We've been using it in the XRP blockchain network for 7 years now. My understanding is that a variant of it is in use in Bloomberg's BDT (though I haven't confirmed that). I see a lot of uses of it in a github search. My feeling is that it has the right balance of functionality and ease of use between all interested parties:

  1. The casual end user.
  2. The more demanding end user.
  3. The hash algorithm adopter.
  4. The hash algorithm designer.
  5. The C++ std::lib vendor.

I found the unified proposal (P0029r0) overly complicated. Complication brings its own risk of unintended consequences. And I find std::hash, hash_combine, etc., fundamentally broken by design because the method of composition compromises a meticulously crafted hash algorithm (cryptographic or not).

The one thing that I would definitely do differently (if I had a time machine) is simplify and shorten the presentation in N3980. I didn't have time to do that in 2014. N3980 makes the idea look far more complicated than it is.

Perhaps this video presentation of the idea is simpler.

@dietmarkuehl

@HowardHinnant Bloomberg’s BDE, indeed, defines a hashAppend in the bslh_hash component. It is based on N4381.

I believe there were some people who made loud noises that the hashing strategy needs to be based on value types. In any case, we should probably resurrect that effort at some point. … and I seem to gather that we is actually I.

@HowardHinnant

and I seem to gather that we is actually I.

Dang it, I just had to file a bug with GitHub that they are missing the laughing out loud emoji. :-)

@mavam

mavam commented Oct 7, 2021

Oh, for some reason I thought N3333 was Types Don't Know #. When I said earlier I like N3333, I meant N3980. We've been using that for several years now without an issue.

Only ADL feels in the way and I would go with struct specializations instead of using free functions. @HowardHinnant, would you still go with hash_append in 2021, as opposed to, say, struct hasher<T>?

I also thought about using std::span<const std::byte> as input to operator() in the HashAlgorithm, which feels a bit more modern.

[..] hashing strategy needs to be based on value types [..]

In the sense of N3980, a HashAlgorithm is usually a value type, since the state is mostly a concoction of primitive types.

@HowardHinnant

would you still go with hash_append in 2021, as opposed to, say, struct hasher<T>?

I haven't given it a lot of thought, but right now I'm about 85% sure of staying with hash_append. My first thought on that is that this function will be commonly written by every type author. And it is much more convenient to define a friend function than it is to break your namespace, specialize std::hasher<T>, then restart your namespace.
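The convenience argument above can be made concrete with a hidden friend, as in this sketch. The generic Hasher parameter follows the N3980 style; the example type, the FNV-1a hasher, and all names are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace app {

// Hash support lives inside the type as a hidden friend: no closing and
// reopening of namespaces, no specialization of a std template.
struct employee {
    std::string name;
    int id;

    template <class Hasher>
    friend void hash_append(Hasher& h, const employee& e) {
        h(e.name.data(), e.name.size());
        h(&e.id, sizeof(e.id));
    }
};

} // namespace app

// A trivial FNV-1a hasher, used only to exercise the friend via ADL.
struct fnv1a {
    std::uint64_t state = 14695981039346656037ull;
    void operator()(const void* p, std::size_t n) {
        auto b = static_cast<const unsigned char*>(p);
        for (std::size_t i = 0; i != n; ++i)
            state = (state ^ b[i]) * 1099511628211ull;
    }
};
```

A call like `hash_append(h, e)` finds the friend through ADL on app::employee, which is exactly the convenience being argued for.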

@dietmarkuehl

@mavam The normal approach for customization is use of ADL, e.g., for swap(). There is a push towards use of tag_invoke() for customization (see, e.g., P2300) although that would add to the complication Howard wants to avoid. tag_invoke() is an interesting variation on customization still using ADL. I think the approach is reasonable for infrastructure developers who are used to jumping through hoops to cover maximum generality and performance.

As an aside: the "value type" comment refers to the objection to passing the hash algorithm by reference and rather wanting to pass by value. The hash state may actually be pointed to from within the hash algorithm. The implication is, however, that the hash algorithm is carefully maintained: passed using move() and returned from the hash_append() function. That also adds complexity. The argument in favour of this approach is that it can be more efficient, although I don't have corresponding evidence.

hash_append() is meant to be really easy to use to support developers in doing the Right Thing. @HowardHinnant hints at using hash_append() in the context of Ripple's blockchain which seems to imply that with correct use it can apply to cryptographic hashes - that is normally something where people get really worried (and I'm not sure if it would be advertised as such).

@HowardHinnant

Example use: https://github.com/ripple/rippled/blob/develop/src/ripple/protocol/digest.h (SHA-256, SHA-512). Note the result_type of the hashers. But we also use it for non-cryptographic hashes. Here's a seeded example: https://github.com/ripple/rippled/blob/9d89d4c1885e13e3c0aa7cdb4a389c6fbfd3790a/src/ripple/basics/hardened_hash.h

@mavam

mavam commented Oct 8, 2021

@dietmarkuehl Thanks for filling me in! Now I understand what you mean by value type. I can see the appeal, but would shy away from it intuitively in C++ (unlike Rust) where this is often a footgun.

I found this article also helpful for forming an opinion on tag_invoke: https://brevzin.github.io/c++/2020/12/01/tag-invoke/.

And it is much more convenient to define a friend function than it is to break your namespace, specialize std::hasher, then restart your namespace.

@HowardHinnant Agreed on the ease-of-use part. I think this works if you have exactly one function to provide. But there's also is_contiguously_hashable that users may want to specialize.

I am trying to find an answer on how to make it possible to support faster one-shot hashing. uhash currently is a three-step function. Some algorithms, like xxHash, offer a faster path by going through a single function, as opposed to constructing state, appending, and finalizing. For example:

/* XXH3_64bits():
 * default 64-bit variant, using default secret and default seed of 0.
 * It's the fastest variant. */
XXH_PUBLIC_API XXH64_hash_t XXH3_64bits(const void* data, size_t len);

A later comment says that the incremental construction is slower:

/*******   Streaming   *******/
/*
 * Streaming requires state maintenance.
 * This operation costs memory and CPU.
 * As a consequence, streaming is slower than one-shot hashing.
 * For better performance, prefer one-shot functions whenever applicable.
 */
typedef struct XXH3_state_s XXH3_state_t;
XXH_PUBLIC_API XXH3_state_t* XXH3_createState(void);
XXH_PUBLIC_API XXH_errorcode XXH3_freeState(XXH3_state_t* statePtr);
XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, const XXH3_state_t* src_state);

Let's assume HashAlgorithm has an additional static member function that allows for fast one-shot computation. Then we could dispatch to the optimal implementation in uhash, leveraging is_contiguously_hashable.

@mavam

mavam commented Oct 8, 2021

Here's something that I found works as an extension of hash_append to enable a "fast path": basically a small utility hash<Algorithm>(...) that dispatches to the best implementation available. It requires an additional function in the HashAlgorithm concept, though: a new function make(const void*, size_t) that returns a one-shot computation of the digest.

Here is my experiment:

namespace detail {

/// Wraps a one-shot hash in the interface of an incremental hash algorithm.
template <one_shot_hash HashAlgorithm>
class one_shot_wrapper {
public:
  static constexpr detail::endian endian = HashAlgorithm::endian;
  
  using result_type = typename HashAlgorithm::result_type;

  template <class... Args>
  explicit one_shot_wrapper(Args&&... args)
    : h_{std::forward<Args>(args)...} {
  }

  void operator()(const void* data, size_t size) noexcept {
    // Guaranteed to be called exactly once, via contiguously_hashable.
    result_ = h_.make(data, size);
  }

  explicit operator result_type() noexcept {
    return result_;
  }

private:
  HashAlgorithm h_;
  result_type result_;
};

} // namespace detail

/// Generic function to compute a hash over a byte sequence.
/// @param x The value to hash.
/// @param args Optional arguments to seed `HashAlgorithm`.
/// @returns A hash digest of *bytes* using `HashAlgorithm`.
template <incremental_hash HashAlgorithm = default_hash, class T, class... Args>
requires(!contiguously_hashable<T, HashAlgorithm> || !one_shot_hash<HashAlgorithm>)
[[nodiscard]] auto hash(T&& x, Args&&... args) noexcept {
  HashAlgorithm h{std::forward<Args>(args)...};
  hash_append(h, x);
  return static_cast<typename HashAlgorithm::result_type>(h);
}

template <one_shot_hash HashAlgorithm = default_hash, class T, class... Args>
requires(contiguously_hashable<T, HashAlgorithm>)
[[nodiscard]] auto hash(T&& x, Args&&... args) noexcept {
  detail::one_shot_wrapper<HashAlgorithm> h{std::forward<Args>(args)...};
  hash_append(h, x);
  return static_cast<typename HashAlgorithm::result_type>(h);
}

This relies on two concepts:

/// A hash algorithm that supports incremental computation of a hash digest
/// in a construct-add-finish manner.
template <class HashAlgorithm>
concept incremental_hash = requires(HashAlgorithm h) {
  typename HashAlgorithm::result_type;
  h(std::declval<const void*>(), std::declval<size_t>());
  static_cast<typename HashAlgorithm::result_type>(h);
};

/// A hash algorithm that exposes a one-shot computation of a hash digest over
/// a byte sequence.
template <class HashAlgorithm>
concept one_shot_hash = requires(HashAlgorithm h) {
  // clang-format off
  typename HashAlgorithm::result_type;
  { h.make(std::declval<const void*>(), std::declval<size_t>()) }
    -> concepts::same_as<typename HashAlgorithm::result_type>;
  // clang-format on
};

With that, I can write:

namespace {

struct algorithm_base {
  static constexpr detail::endian endian = detail::endian::native;
  using result_type = bool;
};

struct oneshot : algorithm_base {
  result_type make(const void*, size_t) noexcept {
    return true;
  }
};

struct incremental : algorithm_base {
  void operator()(const void*, size_t) noexcept {
  }

  explicit operator result_type() noexcept {
    return false;
  }
};

struct oneshot_and_incremental : algorithm_base {
  result_type make(const void*, size_t) noexcept {
    return true;
  }

  void operator()(const void*, size_t) noexcept {
  }

  explicit operator result_type() noexcept {
    return false;
  }
};

} // namespace

TEST(check the fast path) {
  CHECK(hash<oneshot>(0));
  CHECK(!hash<incremental>(std::byte{42}));
  CHECK(hash<oneshot_and_incremental>(std::byte{42}));
  CHECK(!hash<oneshot_and_incremental>(double{1.0}));
}

@HowardHinnant

I don't see how oneshot composes. And if it isn't meant to compose, why does it have to interact with hash_append at all? Can't oneshot just be a custom hash function, or a specialization of std::hash?

@mavam

mavam commented Oct 8, 2021

I don't see how oneshot composes.

It doesn't need to compose.

why does it have to interact with hash_append at all? Can't oneshot just be a custom hash function, or a specialization of std::hash?

std::hash requires a size_t return type, which would be too inflexible. The use case for this is as follows:

  • The user wants to compute the hash of an object in the fastest way possible
  • The user doesn't want to distinguish between hash_append (incremental) and potentially faster native function (one-shot)
  • The hash algorithm implementer decides whether a fast path is available and can provide an implementation through the member function make.

In this example, I used a free function hash (admittedly a bit overloaded name) to dispatch on contiguously_hashable, with the goal to choose the fast path when possible, but fall back to hash_append when the hash algorithm only operates incrementally.

@HowardHinnant

Does this run the risk of accidentally calling the oneshot version when composing a contiguously_hashable sub-type with another sub-type? Say for example when trying to hash two things, x and y with an initialization seed, where x is contiguously_hashable:

auto h = hash(x, y, seed);

It just seems a little error prone to put these two different techniques under the same name.

@mavam

mavam commented Oct 10, 2021

Maybe there is too much freedom when passing variadic arguments down to the one-shot hash or hash function: hash(x, y, seed) would be interpreted as "hash x, and pass y and seed to the constructor of the hash function (or one-shot function)".

It just seems a little error prone to put these two different techniques under the same name.

Do you think users would be better off with two different APIs, one for incremental computation and one for one-shot computation? Something like this:

auto incremental_digest = uhash<xxh3>{}(x); // works for all types supporting hash_append
auto oneshot_digest = ohash<xxh3>{}(x); // works only for contiguously_hashable types

I'm curious how you would design the API for this problem statement. My goal was to reuse hash_append for user-defined types that build on contiguously hashable types. What are other options that are more robust in your opinion? Another ADL customization point, e.g., hash_oneshot?

@HowardHinnant

I think ohash is a good idea. It will fit just as well into the unordered_ containers. And to the best of my current knowledge, it doesn't require the type author to duplicate anything.

The top level hash function is purposefully exceedingly easy to write so that the user can write whatever kind he needs.

  • Seeded per object.
  • Seeded per application run.
  • Not seeded.

I think all we need is to standardize the "oneshot" API in the HashAlgorithm concept so that people can easily and portably write ohash themselves. Then we don't need to worry about how it is seeded.

@TheThief

TheThief commented Oct 11, 2021

I did some investigation into hash combination based on the murmur3 hash function, which can be broken down roughly as follows:

hash the next 32-bit value

    k1 *= c1;
    k1 = ROTL32(k1,15);
    k1 *= c2;

combine it with the working hash

    h1 ^= k1;
    h1 = ROTL32(h1,13); 
    h1 = h1*5+0xe6546b64;

(repeat)

finalise the hash

    h1 ^= len;
    h1 = fmix32(h1);
    *(uint32_t*)out = h1;

Most of the hash_combine proposals I've seen in C++ don't seem to separate out the finalisation step, doing it either on every combine (inefficient) or not at all (weak).

So I would be more in favour of an algorithm that uses a hasher object rather than a simple function combining separately generated hashes, which may be suboptimal in either speed or strength of the resulting hash.

TLDR: I like the newer proposals like hash_append, but greatly dislike hash_combine.
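The combine/finalise separation argued for above can be packaged in a hasher object. The mixing and fmix32 constants are Murmur3's published ones; the class wrapper and method names are assumptions for the sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

inline std::uint32_t rotl32(std::uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
}

// Murmur3-style combiner: the cheap per-value step and the one-time
// finalisation are separate operations, so finalisation runs exactly once.
class murmur3_combiner {
    std::uint32_t h1_ = 0;
    std::uint32_t len_ = 0;
public:
    void combine(std::uint32_t k1) {     // per-value step
        k1 *= 0xcc9e2d51u;
        k1 = rotl32(k1, 15);
        k1 *= 0x1b873593u;
        h1_ ^= k1;
        h1_ = rotl32(h1_, 13);
        h1_ = h1_ * 5 + 0xe6546b64u;
        len_ += 4;
    }
    std::uint32_t finalize() const {     // run once, at the end
        std::uint32_t h = h1_ ^ len_;
        h ^= h >> 16;                    // fmix32
        h *= 0x85ebca6bu;
        h ^= h >> 13;
        h *= 0xc2b2ae35u;
        h ^= h >> 16;
        return h;
    }
};
```

With a simple hash_combine() function there is no place to hang the fmix32 step; with the object it falls out naturally.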

@mavam

mavam commented Oct 11, 2021

I think all we need is to standardize the "oneshot" API in the HashAlgorithm concept so that people can easily and portably write ohash themselves. Then we don't need to worry about how it is seeded.

Agreed. ohash would then be a drop-in replacement for uhash as well as std::hash, potentially looking like this:

template <class HashAlgorithm>
struct ohash {
  using result_type = typename HashAlgorithm::result_type;

  template <class T>
  result_type operator()(T const& x) const noexcept {
    HashAlgorithm h;
    return h.make(std::addressof(x), sizeof(T));
  }
};

One thing that I found helpful in the past was runtime selection of seeds, as part of the constructor of HashAlgorithm. This could be achieved through a seeded_hash type that either keeps a HashAlgorithm as a member, or keeps the seeds as a tuple and uses std::make_from_tuple. Basically the generic version of what you illustrated with randomly_seeded_hash.

The reason why I'm bringing this up in this context is that ohash (or perhaps just the general version of one-shot hashing) needs a way to pass arguments that would typically go into the constructor of an incremental hash algorithm. I proposed a member function make that just takes the bytes, with everything else taken care of up to that point.

Regarding the actual name, make might be too generic, but reminiscent of a "factory of one-shot digests", which is why I proposed it. More refined names, like compute, calculate, etc. might also be worth considering. But if all that's left is naming, we're pretty far along :-).

@mavam

mavam commented Oct 11, 2021

@TheThief I'm not sure what the context of your comment is. But it's clear that there's no one-size-fits-all mechanism to combine hashes. It's the job of the (incremental) hash algorithm to do this and the point of HashAlgorithm::operator().

@HowardHinnant

Fwiw, I'm not married to the name HashAlgorithm::operator() or other names in N3980.

@TheThief

@TheThief I'm not sure what the context of your comment is. But it's clear that there's no one-size-fits-all mechanism to combine hashes. It's the job of the (incremental) hash algorithm to do this and the point of HashAlgorithm::operator().

Pretty much it's a vote of support for this work.

@mavam

mavam commented Oct 12, 2021

Pretty much it's a vote of support for this work.

Great, glad to hear! We are currently implementing what we discussed in this gist in our C++20 code base: tenzir/tenzir#1900.

@lava

lava commented Oct 12, 2021

@HowardHinnant , looking at https://github.com/HowardHinnant/hash_append I noticed that you extended the implementation to include an endianness parameter for each hash algorithm. The property of being contiguously hashable is then no longer a type-level property but depends on the combination of type and hash function. Roughly speaking, contiguously_hashable == uniquely_represented + the memory layout corresponds to the semantic memory layout expected by the hash function.

Did you write somewhere about the motivations for this extension? With the benefit of hindsight, do you still think that's the best way to implement endian-independent hashing results in the hash_append framework?

@dietmarkuehl , I'd be happy to join you in reviving the proposal if you're still interested.

@HowardHinnant

HowardHinnant commented Oct 12, 2021

It looks like this feature was added after N3980 was written.

In all but a few HashAlgorithms, this switch will be set to native so that is_contiguously_hashable is equivalent to is_uniquely_represented. But in the example sha256 HashAlgorithm it is set to big. This is so that different computers across a network can each create a SHA-256 hash of an object, and compare that hash across the network. If the hash is equal, then the hashed contents are equal, else they are not. One node in the network can be a little endian machine and another a big endian machine, and the SHA-256 comparison still works.

And yes, this has worked well. std::endian is now part of C++ with the exact same semantics and syntax as xstd::endian in the repository. We use a SHA-512 hash_append adaptor in the XRP blockchain network modeled after this example code.
