What is a hash and how does it work

Let us start from a practical example. Imagine we are writing a traffic monitoring application. Its purpose is to count the number of bytes sent by each IP address. To do that, let's create a structure that keeps the per-address accounting:

struct ipstat {
  uint32_t ip;      /* Source IP address */
  unsigned bytes;   /* Bytes sent        */
};

So the application keeps a set of such structures, one for every IP address seen. When an IP packet arrives, we must:

  1. find the structure in the set
  2. if not found, allocate a new structure and add it to the set
  3. for the found structure, increment 'bytes' by the packet length.

The easiest way of doing that is to organize the structures in a linked list:

struct ipstat {
  struct ipstat	*next;  /* Next in list */
  uint32_t ip;          /* IP address   */
  unsigned	bytes;      /* Bytes sent   */
};

That would work fine if the number of structures in the list is reasonably small, because step (1) requires scanning the whole linked list:

struct ipstat *lookup(uint32_t ip) {
  struct ipstat	*p;
  for (p = head; p != NULL; p = p->next) {
    if (p->ip == ip) return p;
  }
  return NULL;
}
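
For completeness, steps (2) and (3) from the list above might look roughly like this. This is only a sketch: it assumes the same global 'head' pointer used by lookup(), 'len' is the length of the incoming packet, calloc() comes from <stdlib.h>, and the function name and error handling are illustrative.

void account(uint32_t ip, unsigned len) {
  struct ipstat *p = lookup(ip);
  if (p == NULL) {
    /* First packet from this address: allocate a node, link it in */
    p = calloc(1, sizeof(*p));
    if (p == NULL) return;    /* Out of memory: drop this sample */
    p->ip = ip;
    p->next = head;
    head = p;
  }
  p->bytes += len;            /* Step (3): account the bytes */
}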

Fast networks can generate many thousands of packets per second. If the traffic is diverse, our list may contain thousands of entries. That means we must do a lookup (i.e. call the lookup() function) many thousands of times a second. It should be fast. The linked list implementation shown above is not fast. It is actually almost the slowest possible, because it does a linear scan through all elements of the list.

What can be done to make this function faster?

Well, it could be optimized this way: on every successful lookup, we move the found element to the head of the list, so that next time it can be found faster. This way the linked list will self-organize: the 'rarely seen' nodes will be pushed to the end of the list, and the 'popular' ones will stay near the beginning.
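
A sketch of such a self-organizing lookup, assuming the same global 'head' as before (the function name is illustrative):

struct ipstat *lookup_mtf(uint32_t ip) {
  struct ipstat *p, *prev = NULL;
  for (p = head; p != NULL; prev = p, p = p->next) {
    if (p->ip == ip) {
      if (prev != NULL) {       /* Found, and not already at the head */
        prev->next = p->next;   /* Unlink from the current position   */
        p->next = head;         /* Re-insert at the head of the list  */
        head = p;
      }
      return p;
    }
  }
  return NULL;
}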

Even having done that, it is hard to predict the average lookup time: it depends on the nature of the traffic. In some cases it might be enough, for example when only a very small number of machines are active on the network while the others are idle.

In general, to provide fast lookups, the set should be organized in a different way. By lookup I mean 'find a struct ipstat by a given IP address'. What would be the fastest way? The fastest possible is an index lookup. Let's create an array:

#define	IPRANGE	4294967296
struct ipstat	*array[IPRANGE];

An IP address is a 32-bit integer, so the range of possible IP addresses is from 0 to 4294967295. We create an array of pointers to struct ipstat, and for every incoming packet do a lookup:

struct ipstat *lookup(uint32_t ip) {
  return array[ip];
}

Yes, this is lightning fast - the fastest possible way of finding the value struct ipstat * by the associated key uint32_t ip. The only problem is the amount of memory that struct ipstat *array[IPRANGE] would take.

On a 32-bit machine that is about 4 billion entries * sizeof(pointer), i.e. 16 gigabytes. Not nice. The other thing is that almost all of that array will never be used: it holds ~4 billion entries, but we are going to see only thousands of IP addresses, so the space would be wasted. And probably no modern operating system would let us allocate 16 GB of memory for the array anyway.
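
A quick back-of-the-envelope check of that number (a standalone sketch; it just restates the multiplication above):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint64_t entries  = 4294967296ULL;      /* 2^32 possible IPv4 addresses     */
  uint64_t ptr_size = 4;                  /* pointer size on a 32-bit machine */
  uint64_t total    = entries * ptr_size; /* memory needed for the array      */
  printf("%llu bytes, or %llu GB\n",
         (unsigned long long) total,
         (unsigned long long) (total >> 30));   /* prints 16 GB */
  return 0;
}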

What do we have to do? Well, shrink the array. Let's shrink it to, say, 1000:

#define	SIZE	1000
struct ipstat	*array[SIZE];

And we must 'map' every IP address to this range: from 0 to SIZE - 1. Let's write a function 'hash' that does that:

unsigned hash(uint32_t key) {
  /* Map the key (IP address) to an index in the array */
  return (key % SIZE);
}

Now, hash(ip) will always return a value between 0 and SIZE - 1. The lookup function will look like this:

struct ipstat *lookup(uint32_t ip) {
  unsigned	index;
  index = hash(ip);
  return (array[index]);
}

Bingo! We have done it. lookup() is still very fast, because the hash() function is very simple and fast. This method is called 'hashing', and the variable 'array' is called a 'hash table'.

The only problem left is when two or more IP addresses map to the same index. This is called a 'hash collision'. There are a number of ways to handle that situation. One of the simplest is to organize the collided nodes into a linked list. In this case the hash table does not hold the values themselves, but the heads of the linked lists. The lookup function will look like this:

struct ipstat *lookup(uint32_t ip) {
  unsigned index = hash(ip);
  struct ipstat *p, *head = array[index];
  for (p = head; p != NULL; p = p->next) {
    if (p->ip == ip) return p;
  }
  return NULL;
}
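
The per-packet accounting sketch from earlier changes only in one place: a newly allocated node is linked into its bucket's list instead of the single global list. Again a sketch; the function name and error handling are illustrative, and 'len' is the incoming packet length.

void account(uint32_t ip, unsigned len) {
  unsigned index = hash(ip);
  struct ipstat *p = lookup(ip);
  if (p == NULL) {
    p = calloc(1, sizeof(*p));
    if (p == NULL) return;      /* Out of memory: drop this sample */
    p->ip = ip;
    p->next = array[index];     /* Prepend to this bucket's list */
    array[index] = p;
  }
  p->bytes += len;
}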

This lookup looks very much like the first, linked-list based lookup function, which we said has very bad performance. True. The difference is that the linked-list implementation holds ALL elements in one single list, while the hash table is a set of much smaller linked lists, where each list holds only the collided elements. If the hash function is good and scatters the keys evenly, the average length of a list would be

TOTAL_NUMBER_OF_ELEMENTS / SIZE

For example, with 10000 tracked IP addresses and SIZE of 1000, the average list holds only about 10 elements.

How to choose a hash function, and how to make hash tables dynamically resizable, will be covered in the next review.

Author: Sergey Lyubka, 2005

@ArashPartow

A comprehensive list of general purpose hash functions and their implementations can be found here:

https://www.partow.net/programming/hashfunctions/index.html
