This is an outline of how I'd explain hash
to a novice programmer. I've given versions of this explanation hundreds of times to thousands of students, beginners and experts alike.
Students tell me they find it compelling because it helps them connect w/ what's really going on by emphasizing both the technical and human/design elements.
I'm assuming the student is comfortable with:
- Basic language constructs and concepts like variables, conditionals, functions, etc.
- Using collection types with built-in types
- Defining and using custom classes/objects
- Using some magic methods with custom objects to override default behavior
The way I explain magic methods is that Python knows what built-in operations like ==
, +
, *
, etc. mean for built-in types because a human programmed the Python interpreter to handle all those cases. +
means something different for numbers vs. strings vs. lists, for example, but it means something different because a human told it.
However, with objects we define ourselves, Python has no way of knowing what +
is supposed to mean unless we tell it. In this instance, we're the human who has to tell Python how to handle our case.
Magic methods are Python's way of letting us tell it how to use built-in operations with our custom objects. In other words, students are conditioned to continuously ask and reflect on the question: "How could Python know, how does it know, and who told it?"
Python lets you use custom objects as keys in a dictionary, but if we redefine __eq__
we see something surprising:
# Insert relevant example
To understand what's going on and how to fix it we need to understand Python's hash
and __hash__
.
If all you want is the "how to fix it", this is the rule. For dict
to work as you expect for custom objects you define, you must ensure the following is always true:
- If
a == b
thenhash(a) == hash(b)
In the same way that Python determines how ==
works with our custom object by calling our magic method __eq__
, it determines how hash
works with our custom object by calling our magic method __hash__
.
(Note: I usually have a table of magic methods + semantic effects I've been building up, which I'd include here with __hash__
and hash
added)
In short:
dict
somehow useshash
to do its jobdict
requires thathash(a) == hash(b)
whenevera == b
- For our custom objects, changing
__eq__
changes howa == b
works - For our custom objects, changing
__hash__
changes howhash(a)
works - If we change
__eq__
, we become responsible for changing__hash__
so thathash
continues to work asdict
expects when we use our custom objects as keys
This might seem annoying and arbitrary, but features of Python you know and love mean it has to work this way. You'll know you'll understand what's going on when you have a hard time imagining it working any other way.
However, if the above explanation is satisfactory then it's not essential you read on. We think it's useful and enlightening, but it's not essential. Just remember that dict
needs hash
to play nice with ==
and we become responsible for ensuring that remains the case whenever we define a custom __eq__
.
If the explanation unsatisfactory or you're curious, continue reading to learn more about...
- Why
dict
needs something likehash
to do its job (in ANY language, not just Python) - How exactly
dict
useshash
Note Here's where I'd outline the reasoning for hash
, which I modulate as appropriate for the audience.
To see why something like hash
is necessary, let's think about how we might build dict
ourselves if all we had were Python arrays/lists. What problem is hash
a solution for?
-
It wouldn't be too hard to make something like
dict
that was slow, e.g., a giant array of(key, value)
pairs that we always searched through linearly.Note: IME, if you ask an absolute novice to make something that works like
dict
using only arrays/lists, they almost all propose this. Most also immediately see that it's not great, so offer up an answer like "Well, you could do X, but, uhhh....not sure that's what you're looking for." -
Getting information out of an array with a numerical index is fast because we have special hardware (RAM) designed to make that fast. Is there some way we can use this? Yes!
Note: Again, emphasizing the human element: arrays are fast because HUMANS MADE THEM FAST.
In fact, early computer memory, e.g., the delay-line memory in the EDVAC did not have constant-time random access. Accessing values in an array by index was not "fast" on those computers.
I often bring in the history.
-
With a
dict
, we want to retrieve information with a non-numerical key. -
One strategy would be to convert the key into a number and then use an array (which is fast). This is what we do.
-
hash
is that conversion function -
It's possible for
hash(a)
andhash(b)
to assign the same index to different valuesa
andb
. This is called a "collision". -
Do you see why collisions would be a problem? Good! That means you're well on your way to understanding how
dict
-like collections work. -
There are a variety of ways to solve the "collision problem". There's a whole wikipedia page on it: hash collisions
Like before, it's not essential you understand how we solve the collision problem, though you might be able to think of a few ways you could. Different languages and implementations use different methods. The exact method a language runtime uses might change from version to version, even.
The key thing is that there's no mystery here. There's only a sequence of problems and solutions, each solved one after the other by other people like you.