Skip to content

Instantly share code, notes, and snippets.

@ewjoachim
Last active May 6, 2016 22:05
Show Gist options
  • Save ewjoachim/19d737a808924e8f69a95bcf1cb88a5f to your computer and use it in GitHub Desktop.
Save ewjoachim/19d737a808924e8f69a95bcf1cb88a5f to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Of Hashes and Pythons"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I posted the following code on a [tweet](https://twitter.com/Ewjoachim/status/728610568886718466) today\n",
"and subsequently had a very interesting talk answering [questions](https://twitter.com/codingrixx/status/728616489318789120) about this, which I'll sumarize here. There might also be new material, too."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class A(object):\n",
" def __hash__(self):\n",
" return id(self)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class B(object):\n",
" def __hash__(self):\n",
" return 1"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100 loops, best of 3: 4.63 ms per loop\n"
]
}
],
"source": [
"%%timeit\n",
"{A() for a in range(10000)}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 loop, best of 3: 1.42 s per loop\n"
]
}
],
"source": [
"%%timeit\n",
"{B() for a in range(10000)}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## So what's happening here ?\n",
"\n",
"Here's a textual version of the code above.\n",
"\n",
"First I create a class `A` whose ``__hash__`` method returns the result of `id(self)` which mean more or less \"the value of a pointer on `self`)\n",
"\n",
"Then I create a class `B` whose ``__hash__`` method returns something quite simpler : the constant 1.\n",
"\n",
"Putting ten thousand instances of `A` in a set takes 4 ms, putting 10 thousand instances of `B` should *logically* take less, because the hash is easier to compute and you don't have to store ten thousand different hashes, right ? Nope. It take longer. 317 times longer.\n",
"\n",
"**Note:** The hash of an object is just an integer with the following property : if 2 objects are equal, then their hashes are equal too. (The opposite is not always true, see the remarks at the end of the post)\n",
"\n",
"**Note 2:** In this post, I'm talking a lot about sets. I could say exactly the same thing about dicts. Sets are easier to manipulate in this example."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What happens when I put something in a set ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's play a bit. Introducing Verbose, a class that talks all the time.\n",
"\n",
"Instances have a name, and we do 2 things:\n",
"\n",
" 1. If 2 instances share the same name, we say they're equal, through `__eq__` (and talk a bit).\n",
" 2. Through `__hash__`, we set the hash of instances to be solely based on their name (and talk a bit)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class Verbose(object):\n",
" \n",
" def __init__(self, name):\n",
" self.name = name\n",
" \n",
" def __eq__(self, other):\n",
" \"\"\"\n",
" __eq__(for \"equal\") is used to determine\n",
" the result of:\n",
" Verbose() == other\n",
" \"\"\"\n",
" equal = self.name == other.name\n",
" print(\"Is {} equal to {} ? {}.\".format(self, other, equal))\n",
" return equal\n",
" \n",
" def __hash__(self):\n",
" \"\"\"\n",
" Hashable objects need to implement __hash__,\n",
" that should return integers, with this\n",
" property: 2 equal objects should return\n",
" the same hash.\n",
" \"\"\"\n",
" hash_value = hash(self.name)\n",
" print(\"Asking the hash for {} which is {}.\".format(self, hash_value))\n",
" return hash_value\n",
" \n",
" def __repr__(self):\n",
" \"\"\"\n",
" This is just so that our prints looks clean\n",
" \"\"\"\n",
" return self.name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Play time !"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Is a equal to b ? False.\n"
]
},
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Verbose(\"a\") == Verbose(\"b\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Asking the hash for a which is -1212919011480583274.\n"
]
},
{
"data": {
"text/plain": [
"{a}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{Verbose(\"a\")}"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Asking the hash for b which is 5265123888727380584.\n",
"Asking the hash for a which is -1212919011480583274.\n"
]
},
{
"data": {
"text/plain": [
"{b, a}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{Verbose(\"a\"), Verbose(\"b\")}"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Asking the hash for a which is -1212919011480583274.\n",
"Asking the hash for a which is -1212919011480583274.\n",
"Is a equal to a ? True.\n"
]
},
{
"data": {
"text/plain": [
"{a}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{Verbose(\"a\"), Verbose(\"a\")}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When I put an object in a set, these things happen :\n",
"\n",
" 1. We compute the hash of the object\n",
" 2. We see if the hash is already in our set\n",
" 3. If so, we get the objects that share the same hash, they're potential matches\n",
" 4. We wheck if our object is equal to any of that list.\n",
" 5. If so, it's already in our set, so no need to add it again.\n",
" \n",
"What would happen if 2 objects had the same hash, but were not equal ?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class Verbose2(Verbose):\n",
" def __hash__(self):\n",
" return 1"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Is f equal to e ? False.\n",
"Is f equal to d ? False.\n",
"Is e equal to d ? False.\n",
"Is f equal to c ? False.\n",
"Is e equal to c ? False.\n",
"Is d equal to c ? False.\n",
"Is f equal to b ? False.\n",
"Is e equal to b ? False.\n",
"Is d equal to b ? False.\n",
"Is c equal to b ? False.\n",
"Is f equal to a ? False.\n",
"Is e equal to a ? False.\n",
"Is d equal to a ? False.\n",
"Is c equal to a ? False.\n",
"Is b equal to a ? False.\n"
]
},
{
"data": {
"text/plain": [
"{f, a, c, b, e, d}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{Verbose2(\"a\"), Verbose2(\"b\"), Verbose2(\"c\"), Verbose2(\"d\"), Verbose2(\"e\"), Verbose2(\"f\")}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thats 15 comparisons for 6 objects. Each object gets compared to all the others. The number of comparison is `6 * 5 / 2`"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"n = 6"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"15"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def number_of_comparisons(n):\n",
" return n * (n - 1) // 2\n",
"\n",
"number_of_comparisons(6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So this is what is happening for our initial bad example : there are 10 000 objects that all get compared to each other."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"49995000"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"number_of_comparisons(10000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's ~50 millions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## At this point, you might ask :\n",
"\n",
"*\"But you said elusively in your earliest list that 'We see if the hash is already in our set'. Isn't it the same problem ? We have to check all the values !*\"\n",
"\n",
"And yes, if a hash could be anything, that would be true. Fortunately, we know something about hashes : they're integers. A comparison between any objects can give 2 results : equal or not equal. A comparison between integers can give 3 results : less, equal, greater.\n",
"\n",
"What happens behind the scene may depend on the python implemention. Better than telling you things that Python does that we cannot **test** here, let's give an example of what à Python implementation **might** do.\n",
"\n",
"Python could place all our values in a [Binary Tree](https://en.wikipedia.org/wiki/Binary_tree) (`btree`). A binary tree is a way to organize sortable elements that look a lot like the game where you have to guess someone's number when they only tell you plus or minus. This allows for very efficient fetching of your data (`log2(n)`)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## So that's it, then\n",
"\n",
"Now we know why the `A` class is fast and the `B` class is so very very slow.\n",
"\n",
"Thank you very much. The show's finished. What you want more ? Ok, let's get you more !"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What happens when the hash is not really reliable ?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import random\n",
"class C(object):\n",
" def __hash__(self):\n",
" return random.randint(0, 10)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Is c in the set of s that contains only c ? False\n",
"And what if I ask you 20 times the question ? True\n"
]
}
],
"source": [
"c = C()\n",
"s = {c}\n",
"print (\"Is c in the set of s that contains only c ?\", c in s)\n",
"print (\"And what if I ask you 20 times the question ?\", any(c in s for i in range(20)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why do they say mutable objects should not be hashable ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Mutable objects mean objects that you can modify after they've been created). Hashes are computed only when the object is first put in a set."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Asking the hash for a which is -1212919011480583274.\n",
"Asking the hash for b which is 5265123888727380584.\n"
]
}
],
"source": [
"a = Verbose(\"a\")\n",
"s = {a}\n",
"a.name = \"b\"\n",
"s.add(Verbose(\"b\"))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{b, b}"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've mutated our object `a` to a `b` after we put it in a set, and now the set contains 2 equal `b` objects. We've broken our set ! So that's why `list`s and `dict`s (which are mutables) cannot be put in sets, but `tuple`s and `frozenset`s (which are immutable) can. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Do we really need hash AND eq to compare objects ? isn't hash alone enough ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given that there's a lower and upper limit to the hashes integer values, there are not lots of possible hashes compared to the number of different objects that might exist, so there are not enough integers to give a unique one to all the possible objects in universe. Here are a few examples :"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hash(1.000000000000001) == hash(2561)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hash(object) == hash(hash(object))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, would you use Python if you knew that **sometimes**, you get a random object disappearing from a set or a dict just because it has the same hash as an existing object ? You'd probably run away, and I'd do too. But it's ok, Python has our back. \\o/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"That's all, folks. Thanks a lot to [@rixx](https://twitter.com/codingrixx) for the inspiration of writing this as a blog post (which became a notebook, but well...).\n",
"\n",
"I'm all ears for suggestions and improvements regarding both the content and the format of this. I might do it again later.\n",
"\n",
"See ya !"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Joachim Jablon*\n",
"\n",
"(Creative Commons : BY-NC)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment