corollari/comparable-hash.md

## comparable-hash.md

      
    Raw
  

              comparable-hash.md
            
          
    Background

In Onramper, the company I am currently working in, we are trying to minimize the amount of personal data that is stored on our systems, and, as part of that effort, I am currently building a system that will allow us to avoid storing any directly identifiable data while also allowing us to track repeated customers.
A simple way of achieving that would be to hash some unique per-customer data points such as email and call it a day, but the problem with that approach is the lack of a unique salt, which would make the creation of site-wide rainbow tables possible thus severely diminishing security.
Now you might think that adding salt on that hash would be simple, but consider that here we are generating the customer identifier, so storing per-customer salt on the database becomes infeasible because you have no way to index it (you need the salt to generate the identifier but you need the identifier to retrieve the salt). It should also be possible to take an approach where salt can be retrieved with a simple email hash, but then you're back to square one as attackers can just crack that hash with rainbow tables.
So, in order to solve this problem I've started looking at cryptographic alternatives, concretely I'm looking at hash functions that, given different random salt values, would provide outputs that are easily comparable on the main input value, that is, functions for which hash(email, random_1)==hash(email, random_2) (== only refers to a general equality comparison, hashes don't need to be equal, just comparable), as that would allow me to solve the problem by creating a different salt on each transaction and hashing it with that. See the requirements section for more concrete terms.
With that in mind, the first thing I did was draft an impossibility result on the existence of a hash function that would be directly comparable (hashes are equal on the binary level) while also providing extra security against rainbow tables (the main idea is that then you can create a rainbow table with a constant nonce and just directly compare any hashes against that). So, with that out of the way, I started looking into other mechanisms that would enable that.
Currently, the idea that seems to be the most promising is using an associative hash function (h(h(a, b), c)==h(h(a, c), b)) along with the following protocol:

When an initial transaction is created salt1=random() is created and stored along hash1 = hash(email, salt1)
A second transaction would then create and store salt2 and hash2 = hash(email, salt2)
The hashes of these two transactions can be compared with hash(hash1, salt2) == hash(hash2, salt1)

What's left now is finding a hash function that fits these requirements, something which is not as easy as it may seem, as trivial associative hash functions such as hash(a, b) = sha(a)*sha(b) would make it possible for an attacker to isolate hash(email) and thus crack it using rainbow tables. Other trivial functions for which it is not possible to isolate one input parameter if the other is known, such as hash(a, b) = sha(a) AND sha(b) lead to a huge rate of collisions while also leaking a lot of information on the other hash parameter.
So far the best function I've found for this problem uses scalar multiplication in elliptic curves:
Given email, nonce \in N and a generator point G, a hash is defined in the following way:
hash = G*email*nonce

As long as elliptic curve cryptography is not broken it should be impossible to achieve pre-image attacks on this scheme, and comparisons can be done easily by following the protocol outlined before. The main problem with this hashing function is that, if nonce can be inverted, it would be possible for an attacker to obtain G*email by multiplying the inverted nonce by the hash point, and once the nonce has been removed from the hash we are again back to square one (rainbow table attacks are viable).
This is solved by working on a Z/n ring where n is not a prime number and nonce is deliberately forced not to be co-prime with n, thus making it so nonce is not invertable in this ring.
This last solution is making me quite uneasy tho, as I think I don't have a knowledge of elliptic curves that is deep enough to understand all the things that could break if I change that detail. Furthermore, I haven't studied in depth the distribution of numbers that could be inversors of another number in non-prime fields so I am also not sure if the security provided here would be good enough.
Finally, the protocol proposed here has the problem that comparing hashes is quite expensive algorithmically, as all comparisons need to be 1-to-1, so if you were to compare the hashes of n transactions to see which ones come from the same client you'd need to do O(n^2) work. In practical terms I believe that this problem could be heavily mitigated by grouping transactions into groups using low-output-length hashes which wouldn't leak almost any information, but it would be great if a different method that enabled more efficient comparisons could be envisioned.
Requirements

Pretty much the same requirements that a KDF has along with an extra one:

Pre-image resistance: It shouldn't be possible to easily guess the password from the stored data
If cracking a password requires O(n) work then cracking m passwords must require O(n*m) work
Hashes should be identity-comparable between them based on the non-salt input value

Previous work

Monahan et al present an associative and non-commutative hash function based on matrix multiplication in HashFusion – a method for combining cryptographic hash values. Sadly, for hashes built using this scheme it's possible to easily isolate hash(email), thus it is not fit for the problem at hand.
Rabi et al prove some theoretical results on [Associative One-Way Functions][http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=0BBAFFFD16BC747B04B485CB1398049D?doi=10.1.1.118.6837&rep=rep1&type=pdf], mainly that associative hash functions exist if and only if P=/=NP. They also provide some examples of these functions but these are really trivial (logical OR and integer/matrix multiplication) so not useful here.
There's been some dicussions on the topic of associative hash functions, but none of them ended up generating any special results nor something that would solve the problem at hand:

https://groups.google.com/forum/#!topic/sci.crypt/X9iqy2lwFsE
https://crypto.stackexchange.com/questions/17935/associative-standard-cryptographic-hash-function
https://www.quora.com/Are-there-any-strong-associative-hash-mixing-functions

Extra: Memory-hardness

I've thought of several schemes that would make the hash functions at hand have some extra memory-hardness to them by embedding a scheme based on the same principles that scrypt uses, but these solutions are not really innovative or special in any way so I'll omit them here.
Discussion

Reddit discussions:

https://www.reddit.com/r/crypto/comments/hupxpb/im_building_an_associative_oneway_function_but/
https://www.reddit.com/r/cryptography/comments/hunjk1/comparable_kdf_hashes/