Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 5 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save asmeurer/6046766 to your computer and use it in GitHub Desktop.
Save asmeurer/6046766 to your computer and use it in GitHub Desktop.
What happens when you mess with hashing in Python
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"What happens when you mess with hashing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By Aaron Meurer\n",
"\n",
"This notebook is based off a [wiki page](https://github.com/sympy/sympy/wiki/What-happens-when-you-mess-with-hashing) on the SymPy wiki, which in turn was based on [a message](https://groups.google.com/forum/#!msg/sympy/pJ2jg2csKgU/0nn21xqZEmwJ) to the SymPy mailing list. "
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"What is hashing?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we start, let's have a brief introduction to hashing. A [*hash function*](http://en.wikipedia.org/wiki/Hash_function) is a function that maps a set of objects to a set of integers. There are many kinds of hash functions, which satisfy many different properties, but the most important property that must be satisfied by any hash function is that it be a function (in the mathematical sense), that is, if two objects are equal, then their hash should also be equal.\n",
"\n",
"Usually, the set if integers that the hash function maps to is much smaller than the set of objects, so that there will be multiple objects that hash to the same value. However, generally for a hash function to be useful, the set of integers should be large enough, and the hash function well distributed enough that if two objects hash to the same value, then they are very likely to be equal. \n",
"\n",
"To summarize, a hash function must satisfy the property:\n",
"\n",
"- If two objects are equal, then their hashes should be equal.\n",
"\n",
"And a good hash function should satisfy the property:\n",
"\n",
"- If two objects have the same hash, then they are not likely to be the same object.\n",
"\n",
"Note that, as noted, since there are generally more possible objects than hash values, two objects may hash to the same value. This is called a [hash collision](http://en.wikipedia.org/wiki/Hash_collision), and anything that deals with hashes should be able to deal with them.\n",
"\n",
"This won't be discussed here, but an additional property that a good hash function should satisfy to be useful is this:\n",
"\n",
"- The hash of an object should be cheap to compute."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"What is it used for?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we have a hash function that satisfies the above properties, then we can use it to create from a set of object something called a *hash table*. Suppose we have a collection of objects, and given any object, we want to be able to compute very quickly if that object belongs to our set. We could store these objects in an ordered array, but then to determine if it is in the set, we would have to search potentially through every element of the array (in other words, an $O(n)$) algorithm. \n",
"\n",
"With hashing, we can do better. We create what is known as a [*hash table*](http://en.wikipedia.org/wiki/Hash_table). Instead of storing the objects in an ordered array, we create an array of buckets, each corresponding to some hash values. We then hash each object, and store it into the array corresponding to its hash value (if there are more hash values than buckets, we distribute them using a second hash function, which can be as simple as taking the modulus with respect to the number of buckets, `% n`). \n",
"\n",
"This image from [Wikipedia](http://en.wikipedia.org/wiki/File:Hash_table_3_1_1_0_1_0_0_SP.svg) shows an example\n",
"\n",
"![img](http://upload.wikimedia.org/wikipedia/commons/7/7d/Hash_table_3_1_1_0_1_0_0_SP.svg)\n",
"\n",
"To determine if an object is in a hash table, we only have to hash the object, and look in the bucket corresponding to that hash. This is an $O(1)$ algorithm, assuming we have a good hash function, because each bucket will generally hold very few objects, possibly even none."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Hashing in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python has a built in function that performs a hash called `hash()`. For many objects, the hash is not very surprising. Note, the hashes you see below may not be the same ones you see if you re-run this notebook, because Python hashing depends on the architecture of the machine you are running on, and, in newer versions of Python, hashes are randomized for security purposes."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(10)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
"10"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(()) # An empty tuple"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"3527539"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash('a')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"12416037344"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Python, not all objects are hashable. For example"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash([]) # An empty list"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "unhashable type: 'list'",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-4-173238e7b4db>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mhash\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# An empty list\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m: unhashable type: 'list'"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is because Python has an additional restriction on hashing\n",
"\n",
"- In order for an object to be hashable, it must be immutable\n",
"\n",
"This is important basically because we want the hash of an object to remain the same across the object's lifetime. But if we have a mutable object, then that object itself can change over its lifetime. But then according to our first bullet point above, that object's hash has to change too. \n",
"\n",
"This restriction simplifies hash tables. If we allowed an object's hash to change while it is in a hash table, we would have to move it to a different bucket. Not only is this costly, but the hash table would have to *notice* that this happened; the object itself doesn't know that it is sitting in a hash table, at least not in the Python implementation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Python, there are two objects that correspond to hash tables, `dict` and `set`. A `dict` is a special kind of hash table called an [*associative array*](http://en.wikipedia.org/wiki/Associative_array). An associative array is a hash table where each element of the hash table points to another object. The other object itself is not hashed. \n",
"\n",
"Think of an associative array as a generalization of a regular array (like a `list`). In a `list`, the objects are associated to nonnegative integer indices, like"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l = ['a', 'b', 7]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"'a'"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l[2]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"7"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In an associative array (i.e., a `dict`) we can index objects by anything, so long as the key is hashable."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d = {0: 'a', 'hello': ['world']}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"'a'"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d['hello']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"['world']"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that only the keys need to be hashable. The values can be anything, even unhashable objects like `list`. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The uses for associative arrays are boundless. `dict` is one of the most useful data types in the Python language. Some example uses are\n",
"\n",
"- Extension of `list` with \"missing values\". For example, `{0: 'a', 2: 7}` would correspond to the above list `l` with the value `'b'` corresponding to the key `1` removed.\n",
"- Representation of a mathematical function with a finite domain. \n",
"- A poor-man's database. \n",
"- Implementing a [Pythonic version](http://stackoverflow.com/q/60208/161801) of the switch-case statement. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other type of hash table, `set`, more closely matches the definition I gave above for a hash table. A `set` is just a container of hashable objects. `set`s are unordered, and can only contain one of each object (this is why they are called \"sets,\" because this matches the mathematical definition of a [set](http://en.wikipedia.org/wiki/Set_(mathematics))).\n",
"\n",
"In Python 2.7 or later, you can create a set with `{` and `}`, like `{a, b, c}`. Otherwise, use `set([a, b, c])`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s = {0, (), '2'}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"{0, '2', ()}"
]
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s.add(1)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"{0, 1, '2', ()}"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s.add(0)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": [
"{0, 1, '2', ()}"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A final note: `set` and `dict` are themselves mutable, and hence not hashable! There is an immutable version of set called `frozenset`. There are no immutable dictionaries. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"f = frozenset([0, (), '2'])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"f"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"frozenset({0, '2', ()})"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(f)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"-7776452922777075760"
]
}
],
"prompt_number": 19
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# A frozenset, unlike a set, can be used as a dictionary key\n",
"d[f] = 'a set'"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
"{0: 'a', frozenset({0, '2', ()}): 'a set', 'hello': ['world']}"
]
}
],
"prompt_number": 21
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Creating your own hashable objects"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we move on, there is one final thing we need to know about hashing in Python, which is how to create hashes for custom objects. By default, if we create an object, it will be hashable."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class Nothing(object):\n",
" pass"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"N = Nothing()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(N)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
"270498113"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Implementation-wise, the hash is just the object's `id`, which corresponds to its position in memory. This satisfies the above conditions: it is (extremely) cheap to compute, and since by default objects in Python compare unequal to one another, objects with different hashes will be unequal."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"M = Nothing()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"M == N"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
"False"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(M)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 27,
"text": [
"270498117"
]
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(M) == hash(N)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
"False"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To define a hash function for an object, define the `__hash__` method."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class HashToOne(object):\n",
" def __hash__(self):\n",
" return 1"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"HTO = HashToOne()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(HTO)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
"1"
]
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To set an object as not hashable, set `__hash__` to `None`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class NotHashable(object):\n",
" __hash__ = None"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 32
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"NH = NotHashable()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 33
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(NH)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "unhashable type: 'NotHashable'",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-34-1e95f5367377>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mhash\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mNH\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m: unhashable type: 'NotHashable'"
]
}
],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, to override the equality operator `==`, define `__eq__`. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class AlwaysEqual(object):\n",
" def __eq__(self, other):\n",
" if isinstance(other, AlwaysEqual):\n",
" return True\n",
" return False"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"AE1 = AlwaysEqual()\n",
"AE2 = AlwaysEqual()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"AE1 == AE2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 37,
"text": [
"True"
]
}
],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the key points that I hope you will take away from this notebook is that if you override `__eq__`, you **must** also override `__hash__` to agree. Note that Python 3 will actually require this: in Python 3, you cannot override `__eq__` and not override `__hash__`. But that's as far as Python goes in enforcing these rules, as we will see below. In particular, Python will never actually check that your `__hash__` actually agrees with your `__eq__`. "
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Messing with hashing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now to the fun stuff. What happens if we break some of the invariants that Python expects of hashing. Python expects two key invariants to hold\n",
"\n",
"1. The hash of an object does not change across the object's lifetime (in other words, a hashable object should be immutable).\n",
"\n",
"2. `a == b` implies `hash(a) == hash(b)` (note that the reverse might not hold in the case of a hash collision).\n",
"\n",
"As we shall see, Python expects, but does not enforce either of these."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Example 1: Mutating a hash"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's break rule 1 first. Let's create an object with a hash, and then change that object's hash over its lifetime, and see what sorts of things can happen."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class Bad(object):\n",
" def __init__(self, hash): # The object's hash will be hash\n",
" self.hash = hash\n",
" def __hash__(self):\n",
" return self.hash"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 38
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"b = Bad(1)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 39
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(b)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 40,
"text": [
"1"
]
}
],
"prompt_number": 40
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d = {b:42}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d[b]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 42,
"text": [
"42"
]
}
],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"b.hash = 2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 43
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(b)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 44,
"text": [
"2"
]
}
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d[b]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "<__main__.Bad object at 0x101f79d90>",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-45-f23df7077ea7>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0md\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mb\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m: <__main__.Bad object at 0x101f79d90>"
]
}
],
"prompt_number": 45
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we implicitly changed the hash of `b` by mutating the attribute of `b` that is used to compute the hash. As a result, the object is no longer found in a dictionary, which uses the hash to find the object.\n",
"\n",
"Note that Python doesn't prevent me from doing this. I could make it if I want, by making `__setattr__` raise `AttributeError`, but even then I could forcibly change it by modifying the object's `__dict__`. I could try some more fancy things using descriptors, metaclasses, and/or `__getattribute__`, but even then, if I knew what was happening, I could probably find a way to change it. This is what is meant when we say that Python is a \"consenting adults\" language."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Example 2: More mutation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try something even more crazy. Let's make an object that hashes to a different value each time we look at the hash."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class DifferentHash(object):\n",
" def __init__(self):\n",
" self.hashcounter = 0\n",
" def __hash__(self):\n",
" self.hashcounter += 1\n",
" return self.hashcounter"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DH = DifferentHash()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 47
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(DH)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 48,
"text": [
"1"
]
}
],
"prompt_number": 48
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(DH)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 49,
"text": [
"2"
]
}
],
"prompt_number": 49
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(DH)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 50,
"text": [
"3"
]
}
],
"prompt_number": 50
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obviously, if we use `DH` as a key to a dictionary, then it will not work, because we will run into the same issue we had with `Bad`. But what about putting `DH` in a `set`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DHset = {DH, DH, DH}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 51
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DHset"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 52,
"text": [
"{<__main__.DifferentHash at 0x101f79f50>,\n",
" <__main__.DifferentHash at 0x101f79f50>,\n",
" <__main__.DifferentHash at 0x101f79f50>}"
]
}
],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Woah! We put the exact same object in a `set` three times, and it appeared all three times. This is not what is supposed to happen with a set."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{1, 1, 1}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 53,
"text": [
"{1}"
]
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What happens when we do stuff with `DHset`?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DHset.remove(DH)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "<__main__.DifferentHash object at 0x101f79f50>",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-54-4405e5352cc2>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mDHset\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mremove\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mDH\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m: <__main__.DifferentHash object at 0x101f79f50>"
]
}
],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That didn't work, because `set.discard` searches for an object by its hash, which is different by this point."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's make a copy of `DHset`. The `set.copy` method will create a shallow copy (meaning that the set container itself will be different, according to `is` comparison, but the objects themselves will the same, according to `is` comparison). "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DHset2 = DHset.copy()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 55
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DHset2 == DHset"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 56,
"text": [
"True"
]
}
],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Everything is fine so far. This object is only going to cause trouble if something recomputes its hash. But remember that the whole reason that we had trouble with something like `Bad` above is that Python *doesn't* recompute that hash of an object, unless it has to. So let's do something that will force it to do so: let's pop an object from one of the sets and add it back in."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"D = DHset.pop()\n",
"DHset.add(D)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 57
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DHset"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 58,
"text": [
"{<__main__.DifferentHash at 0x101f79f50>,\n",
" <__main__.DifferentHash at 0x101f79f50>,\n",
" <__main__.DifferentHash at 0x101f79f50>}"
]
}
],
"prompt_number": 58
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DHset2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 59,
"text": [
"{<__main__.DifferentHash at 0x101f79f50>,\n",
" <__main__.DifferentHash at 0x101f79f50>,\n",
" <__main__.DifferentHash at 0x101f79f50>}"
]
}
],
"prompt_number": 59
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"DHset == DHset2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 60,
"text": [
"False"
]
}
],
"prompt_number": 60
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There we go. By removing it from the set, we made the set forget about its hash, so it had to be recomputed when we added it again. This version of `DHset` now has a `DH` with a different hash than it had before. Thinking back to `set` being a hash table, in this `DHset`, the three `DH` objects are in different \"buckets\" than they were in before. `DHset.__eq__(DHset2)` notices that the bucket structure is different right away and returns `False`.\n",
"\n",
"By the way, what hash value are we up to these days?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(DH)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 61,
"text": [
"9"
]
}
],
"prompt_number": 61
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Example 3: When `a == b` does not imply `hash(a) == hash(b)`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look at point 2. What happens if we create an object with `__eq__` that disagrees with `__hash__`. We actually already have made a class like this, the `AlwaysEqual` object above. Instances of `AlwaysEqual` will always compare equal to one another, but they will not have the same hash, because they will use `object`'s default `__hash__` of `id`. Let's take a closer look at the `AE1` and `AE2` objects we created above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(AE1)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 62,
"text": [
"270498221"
]
}
],
"prompt_number": 62
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(AE2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 63,
"text": [
"270498197"
]
}
],
"prompt_number": 63
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(AE1) == hash(AE2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 64,
"text": [
"False"
]
}
],
"prompt_number": 64
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{AE1, AE2}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 65,
"text": [
"{<__main__.AlwaysEqual at 0x101f79950>, <__main__.AlwaysEqual at 0x101f79ad0>}"
]
}
],
"prompt_number": 65
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can already see that we have broken one of the key properties of a `set`, which is that it does not contain the same object twice. This can lead to subtle issues. For example, suppose we had a list and we wanted to remove all the duplicate items from it. An easy way to do this is to convert the list to a set and then convert it back to a list. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l = ['a', 'a', 'c', 'a', 'c', 'b']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 66
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"list(set(l))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 67,
"text": [
"['a', 'c', 'b']"
]
}
],
"prompt_number": 67
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, this method is obviously not going to work for a list of `AlwaysEqual` objects."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"AE3 = AlwaysEqual()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 68
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l = [AE1, AE1, AE3, AE2, AE3]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 69
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"list(set(l))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 70,
"text": [
"[<__main__.AlwaysEqual at 0x102c1d590>,\n",
" <__main__.AlwaysEqual at 0x101f79ad0>,\n",
" <__main__.AlwaysEqual at 0x101f79950>]"
]
}
],
"prompt_number": 70
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Actually, what happened here is that the equality that we defined on `AlwaysEqual` was essentially ignored. We got a list of unique items by `id`, instead of by `__eq__`. You can imagine that if `__eq__` were something a little less trivial, where some, but not all, objects are considered equal, that this could lead to very subtle issues. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But there is an issue with the above algorithm. It isn't stable, that is, it removes the ordering that we had on the list. We could do this better by making a new list, and looping through the old one, adding elements to the new list if they aren't already there."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def uniq(l):\n",
" newl = []\n",
" for i in l:\n",
" if i not in newl:\n",
" newl.append(i)\n",
" return newl"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 71
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"uniq(['a', 'a', 'c', 'a', 'c', 'b'])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 72,
"text": [
"['a', 'c', 'b']"
]
}
],
"prompt_number": 72
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"uniq([AE1, AE1, AE3, AE2, AE3])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 73,
"text": [
"[<__main__.AlwaysEqual at 0x101f79ad0>]"
]
}
],
"prompt_number": 73
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time, we used `in`, which uses `==`, so we got only one unique element of the list of `AlwaysEqual` objects.\n",
"\n",
"But there is an issue with this algorithm. Checking if something is in a list is $O(n)$, but we have an object that allows checking in $O(1)$ time, namely, a set. So a more efficient version might be:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def uniq2(l):\n",
" newl = []\n",
" newlset = set()\n",
" for i in l:\n",
" if i not in newlset:\n",
" newl.append(i)\n",
" newlset.add(i)\n",
" return newl"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 74
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"uniq2(['a', 'a', 'c', 'a', 'c', 'b'])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 75,
"text": [
"['a', 'c', 'b']"
]
}
],
"prompt_number": 75
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"uniq2([AE1, AE1, AE3, AE2, AE3])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 76,
"text": [
"[<__main__.AlwaysEqual at 0x101f79ad0>,\n",
" <__main__.AlwaysEqual at 0x102c1d590>,\n",
" <__main__.AlwaysEqual at 0x101f79950>]"
]
}
],
"prompt_number": 76
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bah! But now, since we used a set, we compared by hashing, not equality, so we are left with three objects again. Notice the extremely subtle difference here. Basically, it is this:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"AE1 in {AE2}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 77,
"text": [
"False"
]
}
],
"prompt_number": 77
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"AE1 in [AE2]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 78,
"text": [
"True"
]
}
],
"prompt_number": 78
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set containment uses hashing; list containment uses equality. If the two don't agree, then the result of your algorithm will depend on which one you use! \n",
"\n",
"By the way, as you might expect, dictionary containment also uses hashing, and tuple containment uses equality:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"AE1 in {AE2: 42}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 79,
"text": [
"False"
]
}
],
"prompt_number": 79
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"AE1 in (AE2,)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 80,
"text": [
"True"
]
}
],
"prompt_number": 80
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Example 4: Caching hashing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you ever want to add subtle bizarreness to a system, add some sort of caching, and then do it wrong.\n",
"\n",
"As we noted in the beginning, one important property of a hash function is that it is quick to compute. A nice way to achieve this for heavily cached objects is to cache the value of the cache on the object, so that it only needs to be computed once. The pattern (which is modeled after SymPy's `Basic`) is something like this:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class HashCache(object):\n",
" def __init__(self, arg): \n",
" self.arg = arg \n",
" self.hash_cache = None \n",
" def __hash__(self): \n",
" if self.hash_cache is None: \n",
" self.hash_cache = hash(self.arg) \n",
" return self.hash_cache \n",
" def __eq__(self, other): \n",
" if not isinstance(other, HashCache): \n",
" return False \n",
" return self.arg == other.arg "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 81
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`HashCache` is nothing more than a small wrapper around a hashable argument, which caches its hash. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash('a')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 82,
"text": [
"12416037344"
]
}
],
"prompt_number": 82
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = HashCache('a')\n",
"hash(a)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 83,
"text": [
"12416037344"
]
}
],
"prompt_number": 83
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For ordinary Python builtins, simply recomputing the hash will be faster than the attribute lookup used by `HashCache`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit hash('a')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"10000000 loops, best of 3: 69.9 ns per loop\n"
]
}
],
"prompt_number": 84
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit hash(a)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000000 loops, best of 3: 328 ns per loop\n"
]
}
],
"prompt_number": 85
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But for a custom object, computing the hash may be more computationally expensive. As hashing is supposed to agree with equality (as I hope you've realized by now!), if computing equality is expensive, computing a hash function that agrees with it might be expensive as well. \n",
"\n",
"As a simple example of where this might be useful, consider a highly nested tuple, an object whose hash that is relatively expensive to compute."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = ()\n",
"for i in xrange(1000):\n",
" a = (a,)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 86
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"A = HashCache(a)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 87
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit hash(a)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"100000 loops, best of 3: 9.61 \u00b5s per loop\n"
]
}
],
"prompt_number": 88
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit hash(A)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000000 loops, best of 3: 325 ns per loop\n"
]
}
],
"prompt_number": 89
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far, we haven't done anything wrong. `HashCache`, as you may have noticed, has `__eq__` defined correctly:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"HashCache(1) == HashCache(2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 90,
"text": [
"False"
]
}
],
"prompt_number": 90
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"HashCache(1) == HashCache(1)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 91,
"text": [
"True"
]
}
],
"prompt_number": 91
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But what happens if we mutate HashCache. This is different from examples 1 and 2 above, because we will be mutating what happens with equality testing, but not the hash (because of the cache). \n",
"\n",
"(In the below example, recall that small integers hash to themselves, so `hash(1) == 1` and `hash(2) == 2`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = HashCache(1)\n",
"d = {a: 42}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 92
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a.arg = 2\n",
"hash(a)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 93,
"text": [
"1"
]
}
],
"prompt_number": 93
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d[a]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 94,
"text": [
"42"
]
}
],
"prompt_number": 94
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we cached the hash of `a`, which was computed as soon as we created the dictionary `d`, it remained unchanged when modified the arg to be `2`. Thus, we can still find the key of the dictionary. But since we have mutated `a`, the equality testing on it has changed. This means that, as with the previous example, we are going to have issues with dicts and sets keeping unique keys and entries (respectively). "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = HashCache(1)\n",
"b = HashCache(2)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 95
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(a)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 96,
"text": [
"1"
]
}
],
"prompt_number": 96
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(b)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 97,
"text": [
"2"
]
}
],
"prompt_number": 97
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"b.arg = 1"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 98
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a == b"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 99,
"text": [
"True"
]
}
],
"prompt_number": 99
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(a) == hash(b)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 100,
"text": [
"False"
]
}
],
"prompt_number": 100
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{a, b}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 101,
"text": [
"{<__main__.HashCache at 0x102c32050>, <__main__.HashCache at 0x102c32450>}"
]
}
],
"prompt_number": 101
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"uniq([a, b])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 102,
"text": [
"[<__main__.HashCache at 0x102c32050>]"
]
}
],
"prompt_number": 102
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"uniq2([a, b])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 103,
"text": [
"[<__main__.HashCache at 0x102c32050>, <__main__.HashCache at 0x102c32450>]"
]
}
],
"prompt_number": 103
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we mutate `b` so that it compares equal to `a`, we start to have the same sort of issues that we had in example 3 with `AlwaysEqual`. Let's look at an instant replay."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = HashCache(1)\n",
"b = HashCache(2)\n",
"b.arg = 1\n",
"print a == b\n",
"print hash(a) == hash(b)\n",
"print {a, b}\n",
"print uniq([a, b])\n",
"print uniq2([a, b])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"True\n",
"True\n",
"set([<__main__.HashCache object at 0x102c32a10>])\n",
"[<__main__.HashCache object at 0x102c32a50>]\n",
"[<__main__.HashCache object at 0x102c32a50>]\n"
]
}
],
"prompt_number": 104
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait a minute, this time it's different! Comparing it to above, it's pretty easy to see what was different this time. I left out the part where I showed the hash of `a` and `b`. When I did that the first time, it cached the hash of `b`, making it forever be `2`, but when I didn't do it the second time, the hash had not been cached yet, so the first time it is computed (in the `print hash(a) == hash(b)` line), `b.arg` has already been changed to `1`.\n",
"\n",
"And herein lies the extreme subtlety: if you mutate an object with that hashes its cache like this, you will run into issues **only if** you had already called some function that hashed the object somewhere. Now just about anything might compute the hash of an object. Or it might not. For example, our `uniq2` function computes the hash of the objects in its input list, because it stores them in a set, but `uniq` does not:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = HashCache(1)\n",
"b = HashCache(2)\n",
"uniq2([a, b])\n",
"b.arg = 1\n",
"print a == b\n",
"print hash(a) == hash(b)\n",
"print {a, b}\n",
"print uniq([a, b])\n",
"print uniq2([a, b])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"True\n",
"False\n",
"set([<__main__.HashCache object at 0x102c32c50>, <__main__.HashCache object at 0x102c32c10>])\n",
"[<__main__.HashCache object at 0x102c32c50>]\n",
"[<__main__.HashCache object at 0x102c32c50>, <__main__.HashCache object at 0x102c32c10>]\n"
]
}
],
"prompt_number": 105
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = HashCache(1)\n",
"b = HashCache(2)\n",
"uniq([a, b])\n",
"b.arg = 1\n",
"print a == b\n",
"print hash(a) == hash(b)\n",
"print {a, b}\n",
"print uniq([a, b])\n",
"print uniq2([a, b])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"True\n",
"True\n",
"set([<__main__.HashCache object at 0x102c32c90>])\n",
"[<__main__.HashCache object at 0x102c32bd0>]\n",
"[<__main__.HashCache object at 0x102c32bd0>]\n"
]
}
],
"prompt_number": 106
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The conclusion is this: don't mess with hashing. The two invariants above are important. Let's restate them here, in bold\n",
"\n",
"1. **The hash of an object must not change across the object's lifetime (in other words, a hashable object should be immutable).**\n",
"\n",
"2. **`a == b` implies `hash(a) == hash(b)` (note that the reverse might not hold in the case of a hash collision).**\n",
"\n",
"If you don't follow these rules, you will run into very subtle issues, because very basic Python operations expect these invariants."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to be able to mutate an object's properties, you have two options. First, make the object unhashable (set `__hash__ = None`). You won't be able to use it in sets or as keys to a dictionary, but you will be free to change the object in place however you want.\n",
"\n",
"A second option is to make all mutable properties non-dependent on hashing or equality testing. This option works well if you just want to cache some internal state that doesn't inherently change the object. Both `__eq__` and `__hash__` should remain unchanged by changes to this state. You may also want to make sure you use proper getters and setters to prevent modification of internal state that equality testing and hashing does depend on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, to keep invariant 2, here are some tips:\n",
"\n",
"- Make sure that the parts of the object that you use to compare equality are not themselves mutable. If they are, then your object cannot itself be mutable. This means that if `a == b` depends on `a.attr == b.attr`, and `a.attr` is a list, then you will need to use a tuple instead (if you want `a` to be hashable). \n",
"\n",
"- You don't have to invent a hash function. If you find yourself doing bitshifts and XORs, you're doing it wrong. Reuse Python's builtin hashable objects. If the hash of your object should depend on the hash of `a` and `b`, define `__hash__` to return `hash((a, b))`. If the order of `a` and `b` does not matter, use `hash(frozenset([a, b]))`. \n",
"\n",
"- As with any kind of cache, don't cache something unless you know that the entire cached state will note be changed over the lifetime of the cache. Hashable objects are actually great for hashes. If they properly satisfy invariant 1, and all the state that should be cached is part of the hash, then you will not need to worry. And the best part is that you can just use `dict` for your cache. \n",
"\n",
"- Unless you really need the performance or memory gains, don't make your objects mutable. This makes programs much harder to reason about. Some functional programming languages take this idea so far that they don't allow any mutable objects.\n",
"\n",
"- Don't worry about the situation where `hash(a) == hash(b)` but `a != b`. This is a hash collision. Unlike the issues we looked at here, hash collisions are expected and checked for in Python. For example, our `HashToOne` object from the beginning will always hash to 1, but different instances will compare unequal. We can see that the right thing is done in every case with them."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = HashToOne()\n",
"b = HashToOne()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 107
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a == b"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 108,
"text": [
"False"
]
}
],
"prompt_number": 108
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hash(a) == hash(b)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 109,
"text": [
"True"
]
}
],
"prompt_number": 109
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"{a, b}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 110,
"text": [
"{<__main__.HashToOne at 0x102c32a10>, <__main__.HashToOne at 0x102c32cd0>}"
]
}
],
"prompt_number": 110
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"uniq([a, b])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 111,
"text": [
"[<__main__.HashToOne at 0x102c32cd0>, <__main__.HashToOne at 0x102c32a10>]"
]
}
],
"prompt_number": 111
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"uniq2([a, b])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 112,
"text": [
"[<__main__.HashToOne at 0x102c32cd0>, <__main__.HashToOne at 0x102c32a10>]"
]
}
],
"prompt_number": 112
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The only concern with hash collisions is that too many of them can remove the performance gains of `dict` and `set`. \n",
"\n",
"- Finally, we didn't discuss this much here, but don't assume that the hash of your object will be the same across Python sessions. In Python, the `-R` option will enable hash randomization, which randomizes the hash of strings, and any object whose hash is computed from the hash of strings."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%script python -R\n",
"print hash('a')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"-7750608935454338104\n"
]
}
],
"prompt_number": 113
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%script python -R\n",
"print hash('a')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"8897161376854729812\n"
]
}
],
"prompt_number": 114
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment