What is a hash and how does it work

Let us start from a practical example. Imagine we are writing a traffic monitoring application. Its purpose is to count the number of bytes sent by each IP address. To do that, let's create a structure that keeps the per-address accounting:

struct ipstat {
  uint32_t ip;      /* Source IP address */
  unsigned bytes;   /* Bytes sent        */
};

So the application keeps a set of such structures, one for every IP address seen. When an IP packet arrives, we must:

  1. find the structure in the set
  2. if not found, allocate a new structure and add it to the set
  3. for the found structure, increment 'bytes' by the packet length.

The easiest way of doing that is to organize the structures in a linked list:

struct ipstat {
  struct ipstat	*next;  /* Next in list */
  uint32_t ip;          /* IP address   */
  unsigned	bytes;      /* Bytes sent   */
};

That would work fine if the number of structures in the list is reasonably small, because step (1) requires scanning the whole linked list:

struct ipstat *lookup(uint32_t ip) {
  struct ipstat	*p;
  for (p = head; p != NULL; p = p->next) {
    if (p->ip == ip) return p;
  }
  return NULL;
}
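
For completeness, steps (2) and (3) from the list above might look roughly like this. This is only a sketch: it assumes the same global 'head' pointer used by lookup(), 'len' is the length of the incoming packet, calloc() comes from <stdlib.h>, and the function name and error handling are illustrative.

void account(uint32_t ip, unsigned len) {
  struct ipstat *p = lookup(ip);
  if (p == NULL) {
    /* First packet from this address: allocate a node, link it in */
    p = calloc(1, sizeof(*p));
    if (p == NULL) return;    /* Out of memory: drop this sample */
    p->ip = ip;
    p->next = head;
    head = p;
  }
  p->bytes += len;            /* Step (3): account the bytes */
}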

Fast networks can generate many thousands of packets per second. If the traffic is diverse, our list may contain thousands of entries. That means we must do a lookup (i.e. call the lookup() function) many thousands of times a second. It should be fast. The linked list implementation shown above is not fast. It is actually almost the slowest possible, because it does a linear scan through all elements of the list.

What can be done to make this function faster?

Well, it could be optimized this way: on every successful lookup, we move the found element to the head of the list, so that next time it can be found faster. This way the linked list will self-organize: the 'rarely seen' nodes will be pushed to the end of the list, and the 'popular' ones will stay near the beginning.
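
A sketch of such a self-organizing lookup, assuming the same global 'head' as before (the function name is illustrative):

struct ipstat *lookup_mtf(uint32_t ip) {
  struct ipstat *p, *prev = NULL;
  for (p = head; p != NULL; prev = p, p = p->next) {
    if (p->ip == ip) {
      if (prev != NULL) {       /* Found, and not already at the head */
        prev->next = p->next;   /* Unlink from the current position   */
        p->next = head;         /* Re-insert at the head of the list  */
        head = p;
      }
      return p;
    }
  }
  return NULL;
}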

Even having done that, it is hard to predict the average lookup time: it depends on the nature of the traffic. In some cases it might be enough, for example when only a very small number of machines are active on the network while the others are idle.

In general, to provide fast lookups, the set should be organized in a different way. By lookup I mean 'find a struct ipstat by a given IP address'. What would be the fastest way? The fastest possible is an index lookup. Let's create an array:

#define	IPRANGE	4294967296
struct ipstat	*array[IPRANGE];

An IP address is a 32-bit integer, so the range of possible IP addresses is from 0 to 4294967295. We create an array of pointers to struct ipstat, and for every incoming packet do a lookup:

struct ipstat *lookup(uint32_t ip) {
  return array[ip];
}

Yes, this is lightning fast - the fastest possible way of finding the value struct ipstat * by the associated key uint32_t ip. The only problem is the amount of memory that struct ipstat *array[IPRANGE] would take.

On a 32-bit machine that is about 4 billion entries * sizeof(pointer), i.e. 16 gigabytes. Not nice. The other thing is that almost all of that array will never be used: it holds ~4 billion entries, but we are going to see only thousands of IP addresses, so the space would be wasted. And probably no modern operating system would let us allocate 16 GB of memory for the array anyway.
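
A quick back-of-the-envelope check of that number (a standalone sketch; it just restates the multiplication above):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint64_t entries  = 4294967296ULL;      /* 2^32 possible IPv4 addresses     */
  uint64_t ptr_size = 4;                  /* pointer size on a 32-bit machine */
  uint64_t total    = entries * ptr_size; /* memory needed for the array      */
  printf("%llu bytes, or %llu GB\n",
         (unsigned long long) total,
         (unsigned long long) (total >> 30));   /* prints 16 GB */
  return 0;
}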

What do we have to do? Well, shrink the array. Let's shrink it to, say, 1000:

#define	SIZE	1000
struct ipstat	*array[SIZE];

And we must 'map' every IP address to this range: from 0 to SIZE - 1. Let's write a function 'hash' that does that:

unsigned hash(uint32_t key) {
  /* Map the key (IP address) to an index in the array */
  return (key % SIZE);
}

Now, hash(ip) will always return a value between 0 and SIZE - 1. The lookup function will look like this:

struct ipstat *lookup(uint32_t ip) {
  unsigned	index;
  index = hash(ip);
  return (array[index]);
}

Bingo! We have done it. lookup() is still very fast, because the hash() function is very simple and fast. This method is called 'hashing', and the variable 'array' is called a 'hash table'.

The only problem left is when two or more IP addresses map to the same index. This is called a 'hash collision'. There are a number of ways to handle that situation. One of the simplest is to organize the collided nodes into a linked list. In this case the hash table does not hold the values themselves, but the heads of the linked lists. The lookup function will look like this:

struct ipstat *lookup(uint32_t ip) {
  unsigned index = hash(ip);
  struct ipstat *p, *head = array[index];
  for (p = head; p != NULL; p = p->next) {
    if (p->ip == ip) return p;
  }
  return NULL;
}
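
The per-packet accounting sketch from earlier changes only in one place: a newly allocated node is linked into its bucket's list instead of the single global list. Again a sketch; the function name and error handling are illustrative, and 'len' is the incoming packet length.

void account(uint32_t ip, unsigned len) {
  unsigned index = hash(ip);
  struct ipstat *p = lookup(ip);
  if (p == NULL) {
    p = calloc(1, sizeof(*p));
    if (p == NULL) return;      /* Out of memory: drop this sample */
    p->ip = ip;
    p->next = array[index];     /* Prepend to this bucket's list */
    array[index] = p;
  }
  p->bytes += len;
}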

This lookup looks very much like the first, linked-list based lookup function, which we said has very bad performance. True. The difference is that the linked-list implementation holds ALL elements in one single list, while the hash table is a set of much smaller linked lists, where each list holds only the collided elements. If the hash function is good and scatters the keys evenly, the average length of a list would be

TOTAL_NUMBER_OF_ELEMENTS / SIZE

For example, with 10000 tracked IP addresses and SIZE of 1000, the average list holds only about 10 elements.

How to choose a hash function, and how to make hash tables dynamically resizable, will be covered in the next review.

Author: Sergey Lyubka, 2005

@ArashPartow

A comprehensive list of general purpose hash functions and their implementations can be found here:

https://www.partow.net/programming/hashfunctions/index.html
