jfarmer/python-hash.md

## python-hash.md

      
    Raw
  

              python-hash.md
            
          
    Python's hash()

Authors's Note

This is an outline of how I'd explain hash to a novice programmer. I've given versions of this explanation hundreds of times to thousands of students, beginners and experts alike.
Students tell me they find it compelling because it helps them connect w/ what's really going on by emphasizing both the technical and human/design elements.
I'm assuming the student is comfortable with:

Basic language constructs and concepts like variables, conditionals, functions, etc.
Using collection types with built-in types
Defining and using custom classes/objects
Using some magic methods with custom objects to override default behavior

The way I explain magic methods is that Python knows what built-in operations like ==, +, *, etc. mean for built-in types because a human programmed the Python interpreter to handle all those cases. + means something different for numbers vs. strings vs. lists, for example, but it means something different because a human told it.
However, with objects we define ourselves, Python has no way of knowing what + is supposed to mean unless we tell it. In this instance, we're the human who has to tell Python how to handle our case.
Magic methods are Python's way of letting us tell it how to use built-in operations with our custom objects. In other words, students are conditioned to continuously ask and reflect on the question: "How could Python know, how does it know, and who told it?"
Explanation

Python lets you use custom objects as keys in a dictionary, but if we redefine __eq__ we see something surprising:
# Insert relevant example
To understand what's going on and how to fix it we need to understand Python's hash and __hash__.
If all you want is the "how to fix it", this is the rule. For dict to work as you expect for custom objects you define, you must ensure the following is always true:

If a == b then hash(a) == hash(b)

In the same way that Python determines how == works with our custom object by calling our magic method __eq__, it determines how hash works with our custom object by calling our magic method __hash__.
(Note: I usually have a table of magic methods + semantic effects I've been building up, which I'd include here with __hash__ and hash added)
In short:

dict somehow uses hash to do its job
dict requires that hash(a) == hash(b) whenever a == b
For our custom objects, changing __eq__ changes how a == b works
For our custom objects, changing __hash__ changes how hash(a) works
If we change __eq__, we become responsible for changing __hash__ so that hash continues to work as dict expects when we use our custom objects as keys

This might seem annoying and arbitrary, but features of Python you know and love mean it has to work this way. You'll know you'll understand what's going on when you have a hard time imagining it working any other way.
However, if the above explanation is satisfactory then it's not essential you read on. We think it's useful and enlightening, but it's not essential. Just remember that dict needs hash to play nice with == and we become responsible for ensuring that remains the case whenever we define a custom __eq__.
If the explanation unsatisfactory or you're curious, continue reading to learn more about...

Why dict needs something like hash to do its job (in ANY language, not just Python)
How exactly dict uses hash

Further Explanation

Note Here's where I'd outline the reasoning for hash, which I modulate as appropriate for the audience.
To see why something like hash is necessary, let's think about how we might build dict ourselves if all we had were Python arrays/lists. What problem is hash a solution for?


It wouldn't be too hard to make something like dict that was slow, e.g., a giant array of (key, value) pairs that we always searched through linearly.
Note: IME, if you ask an absolute novice to make something that works like dict using only arrays/lists, they almost all propose this. Most also immediately see that it's not great, so offer up an answer like "Well, you could do X, but, uhhh....not sure that's what you're looking for."


Getting information out of an array with a numerical index is fast because we have special hardware (RAM) designed to make that fast. Is there some way we can use this? Yes!
Note: Again, emphasizing the human element: arrays are fast because HUMANS MADE THEM FAST.
In fact, early computer memory, e.g., the delay-line memory in the EDVAC did not have constant-time random access. Accessing values in an array by index was not "fast" on those computers.
I often bring in the history.


With a dict, we want to retrieve information with a non-numerical key.


One strategy would be to convert the key into a number and then use an array (which is fast). This is what we do.


hash is that conversion function


It's possible for hash(a) and hash(b) to assign the same index to different values a and b. This is called a "collision".


Do you see why collisions would be a problem? Good! That means you're well on your way to understanding how dict-like collections work.


There are a variety of ways to solve the "collision problem". There's a whole wikipedia page on it: hash collisions


Like before, it's not essential you understand how we solve the collision problem, though you might be able to think of a few ways you could. Different languages and implementations use different methods. The exact method a language runtime uses might change from version to version, even.
The key thing is that there's no mystery here. There's only a sequence of problems and solutions, each solved one after the other by other people like you.