@alex
Created December 30, 2011 04:02

Paul McMillan (Django security guru and all-around smart guy) and I just spent quite a while discussing this issue; here's what we came up with:

First, this is an issue which, if at all possible, should be solved at the language level, because it's often hard to tell which data comes from a user and which data is safe.

We're also in agreement that a language-level solution shouldn't be configurable: it would be far too easy to end up deployed in an insecure environment if you got the configuration wrong.
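To make the stakes concrete, here's a toy demonstration of the underlying attack (my own illustration, not part of the discussion): an attacker who can supply many keys with identical hashes degrades each dict insert from amortized O(1) to O(n), so building the table becomes quadratic. The constant __hash__ below stands in for attacker-generated colliding strings.

```python
import time

class Collider(object):
    # Stand-in for attacker-chosen keys that all collide under a
    # predictable, unsalted hash function.
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 0  # every key lands in the same hash bucket
    def __eq__(self, other):
        return self.n == other.n

for size in (1000, 2000, 4000):
    start = time.time()
    d = {}
    for i in range(size):
        d[Collider(i)] = i  # each insert walks the whole collision chain
    print("%d inserts: %.3fs" % (size, time.time() - start))  # ~4x per doubling
```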

On to a description of what an appropriate solution therefore needs to look like:

Any and all __hash__ (tp_hash) methods must use some sort of randomized salt, or delegate their implementation to some other hash method (for the builtins/stdlib this means int, long, float, str, unicode, and decimal.Decimal). The salt cannot be applied to the resulting hash after the fact: the attack works by generating different data which hash to the same value, and equal hashes stay equal under any post-processing, so the salt has to be mixed into the hash computation itself. There are a few levels this can be done at:

Per dictionary: This is the most secure option, however it has two unsolvable problems: it requires an API change, since __hash__ would need to take the salt as a parameter, and it kills performance on CPython, because it effectively means the cached ob_shash must be removed. I did this on the 2.7 branch and it was a 14% slowdown on pystone.

Per process: This is less secure (in theory it is vulnerable to a timing attack; Paul is doing the math on how difficult that would be to exploit), but it's easier and requires no API change. Basically you just generate a salt at process startup time and then use that global value in all the hash methods. This defeats a small optimization in PyPy, but that's the only downside (besides the non-optimal security). A sketch of this approach follows the three options below.

Per binary: This is the least safe and the least useful: 90% of people use their distro's Python, which makes it easy to generate datasets that work against a large number of people.
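Here's a minimal sketch of the per-process option; the FNV-1a-style loop and the names are my assumptions for illustration, not CPython's actual string hash:

```python
import os
import struct

# One salt drawn at interpreter startup, shared by every hash call
# in the process.
_PROCESS_SALT = struct.unpack("<Q", os.urandom(8))[0]

def salted_str_hash(s, salt=_PROCESS_SALT):
    # Seed the hash state with the salt instead of a fixed offset
    # basis (FNV-1a style), so the salt influences every step.
    h = salt
    for byte in bytearray(s.encode("utf-8")):
        h = ((h ^ byte) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
    return h

# XORing the salt into the *result* would not help: if
# hash(a) == hash(b), then hash(a) ^ salt == hash(b) ^ salt,
# so precomputed collisions survive intact.
```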

All of these assume that there's no need for hash stability between Python versions. If that's considered a requirement, then the only solution is an alternative mapping (a RandomizedDict, a tree, or something similar) which should be used in these situations; a sketch of what that might look like follows.
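Here's one possible shape for such a mapping (entirely hypothetical; the name RandomizedDict is taken from above, and limiting it to str keys keeps the byte-level salting simple). Only the core operations are shown:

```python
import os
import struct

class RandomizedDict(dict):
    # Per-instance salt; note the salt must be mixed into a hash over
    # the key's bytes.  Wrapping keys as (salt, key) tuples would not
    # work, because tuple hashing combines the elements' own hashes,
    # so keys that already collide would still collide.
    def __init__(self):
        super(RandomizedDict, self).__init__()
        self._salt = struct.unpack("<Q", os.urandom(8))[0]

    def _mix(self, key):
        # Store and look up under (salted_hash, key): the salted hash
        # breaks precomputed collisions, the key keeps equality exact.
        h = self._salt
        for byte in bytearray(key.encode("utf-8")):
            h = ((h ^ byte) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
        return (h, key)

    def __setitem__(self, key, value):
        super(RandomizedDict, self).__setitem__(self._mix(key), value)

    def __getitem__(self, key):
        return super(RandomizedDict, self).__getitem__(self._mix(key))

    def __contains__(self, key):
        return super(RandomizedDict, self).__contains__(self._mix(key))
```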

Our recommendation is therefore a per-process salt, since it's sufficient, security-wise, for 99% of cases and doesn't have any strong negatives.

Alex
