@alex
Created December 30, 2011 04:02

Paul McMillan (Django security guru and all-around smart guy) and I just spent quite a while discussing this issue; here's what we came up with:

First, this is an issue which, if at all possible, should be solved at the language level, because it's often hard to tell which data comes from a user and which data is safe.

We're also in agreement that a language-level solution shouldn't be configurable: it would be far too easy to end up deployed in an insecure environment if you got the configuration wrong.
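To make the stakes concrete, here's a toy demonstration of the underlying attack (my own illustration, not part of the discussion): an attacker who can supply many keys with identical hashes degrades each dict insert from amortized O(1) to O(n), so building the table becomes quadratic. The constant __hash__ below stands in for attacker-generated colliding strings.

```python
import time

class Collider(object):
    # Stand-in for attacker-chosen keys that all collide under a
    # predictable, unsalted hash function.
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 0  # every key lands in the same hash bucket
    def __eq__(self, other):
        return self.n == other.n

for size in (1000, 2000, 4000):
    start = time.time()
    d = {}
    for i in range(size):
        d[Collider(i)] = i  # each insert walks the whole collision chain
    print("%d inserts: %.3fs" % (size, time.time() - start))  # ~4x per doubling
```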

On to a description of what an appropriate solution therefore needs to look like:

Any and all __hash__ (tp_hash) methods must use some sort of randomized salt, or delegate their implementation to some other hash method (for the builtins/stdlib this means int, long, float, str, unicode, and decimal.Decimal). The salt cannot be applied to the resulting hash after the fact: the attack works by generating different data which hash to the same value, and equal hashes stay equal under any post-processing, so the salt has to be mixed into the hash computation itself. There are a few levels this can be done at:

Per dictionary: This is the most secure option, however it has two unsolvable problems: it requires an API change, since __hash__ would need to take the salt as a parameter, and it kills performance on CPython, because it effectively means the cached ob_shash must be removed. I did this on the 2.7 branch and it was a 14% slowdown on pystone.

Per process: This is less secure (in theory it is vulnerable to a timing attack; Paul is doing the math on how difficult that would be to exploit), but it's easier and requires no API change. Basically you just generate a salt at process startup time and then use that global value in all the hash methods. This defeats a small optimization in PyPy, but that's the only downside (besides the non-optimal security). A sketch of this approach follows the three options below.

Per binary: This is the least safe and the least useful: 90% of people use their distro's Python, which makes it easy to generate datasets that work against a large number of people.
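Here's a minimal sketch of the per-process option; the FNV-1a-style loop and the names are my assumptions for illustration, not CPython's actual string hash:

```python
import os
import struct

# One salt drawn at interpreter startup, shared by every hash call
# in the process.
_PROCESS_SALT = struct.unpack("<Q", os.urandom(8))[0]

def salted_str_hash(s, salt=_PROCESS_SALT):
    # Seed the hash state with the salt instead of a fixed offset
    # basis (FNV-1a style), so the salt influences every step.
    h = salt
    for byte in bytearray(s.encode("utf-8")):
        h = ((h ^ byte) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
    return h

# XORing the salt into the *result* would not help: if
# hash(a) == hash(b), then hash(a) ^ salt == hash(b) ^ salt,
# so precomputed collisions survive intact.
```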

All of these assume that there's no need for hash stability between Python versions. If that's considered a requirement, then the only solution is an alternative mapping (a RandomizedDict, a tree, or something similar) which should be used in these situations; a sketch of what that might look like follows.
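Here's one possible shape for such a mapping (entirely hypothetical; the name RandomizedDict is taken from above, and limiting it to str keys keeps the byte-level salting simple). Only the core operations are shown:

```python
import os
import struct

class RandomizedDict(dict):
    # Per-instance salt; note the salt must be mixed into a hash over
    # the key's bytes.  Wrapping keys as (salt, key) tuples would not
    # work, because tuple hashing combines the elements' own hashes,
    # so keys that already collide would still collide.
    def __init__(self):
        super(RandomizedDict, self).__init__()
        self._salt = struct.unpack("<Q", os.urandom(8))[0]

    def _mix(self, key):
        # Store and look up under (salted_hash, key): the salted hash
        # breaks precomputed collisions, the key keeps equality exact.
        h = self._salt
        for byte in bytearray(key.encode("utf-8")):
            h = ((h ^ byte) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
        return (h, key)

    def __setitem__(self, key, value):
        super(RandomizedDict, self).__setitem__(self._mix(key), value)

    def __getitem__(self, key):
        return super(RandomizedDict, self).__getitem__(self._mix(key))

    def __contains__(self, key):
        return super(RandomizedDict, self).__contains__(self._mix(key))
```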

Our recommendation is therefore a per-process salt, since it's sufficient, security-wise, for 99% of cases and doesn't have any strong negatives.

Alex
