alyssaq/data-structures.md

## data-structures.md

      
    Raw
  

              data-structures.md
            
          
    Data Structures

##Dictionaries
A set of n records, each identified by one or more key fields.
Problem: Efficiently locate, insert and delete the record associated with any query key q.
Data structures include hash tables, skip lists and balanced/unbalanced binary search trees.
###Representations

Unsorted arrays: for small data sets for easier maintanence. However, linear search will kill you with larger dictionaries.
Sorted arrays: Fast binary search. Appropriate iff there are not many insertions or deletions.
Hash tables: Great for applications with large keys (say 100-10mil).

A Hash function is used to map keys to integers between 0 and n - 1.
Maintain an array of n buckets, each typically implemented using an unsorted linked list.
Must deal with collisions. Open addressing can have better cache perfomance than bucketing.
Hash functions for strings include Horner's rule, computed in O(n).


Binary search trees: Supports fast inserations, deletions and queries at O(log n). Balanced BSTs use local rotation operations to restructure search trees - red-black trees, AVL, 2/3 trees. B-trees for large datasets.

##Priority Queues
A set of n records with ordered keys
Problem: Provide quick access to the smallest or largest key in the set.
Priority queues are great useful in simulations for maintaining a set of future events ordered by time. Enables you to retrive items not by insertion time (as in a stack or queue), nor by a key match (as in a dictionary), but by which itme has the highest priority of retrieval.
###Representations

Sorted array or list: Efficient to identify the smallest element and deleting it by decrementing the top index. Slow insertion to maintain the order.
Binary heaps: Supports insertion and extractMin in O(log n) time.
Binary search trees: Effective since the smallest element is always the leftmost leaf and largest is the rightmost leaf. The min and max is found in O(log n) by tracing down the left or right pointers until the next pointer is null.

##Graphs
Graph G with n nodes/vertices and m edges

Hypergraphs are generalized graphs where each edge may link subsets of more than two vertices.
###Representations

Adjacency matrices - n² space:

Best to use a bit vector for dense graphs.
Frequent lookups to answer "Is (i,j) in G?"


Adjacency lists - n + m space:

Suited for DFS-algorithms, sparse graphs, planar graphs (maps of countries).
For static graphs (no edge insertions or deletions), each edge list can be replaced by a packed array of vertex identifiers. This eliminates pointers and saves space.


###Graph Traversals: BFS, DFS

BFS: Queue - FIFO exploring the oldest unexplored vertices first. Thus, exploration radiates out slowly from the starting vertex.
DFS: Stack - LIFO exploring the vertices by going down a path, visiting a new neighbour if available and back-tracking when we are surrounded by previously discovered vertices. Thus, explorations quickly wander away from our starting point.

##Sets
A set is an unordered collection of unique items from a universal set.

Multi-sets permit items to have more than one occurrence.

A universe of items U = {u₁, ..., u_n} on which is defined a collection of subsets S = {s₁, ..., s_n}
Problem:

(1) Test whether item in universe is an element of the subset. u_i ∈ S_j

(2) Compute the union or intersection of S_i and S_j

(3) Insert or delete members of S
Keeping the set sorted turns the problem of finding the union or intersection to O(n) running-time instead of O(n²). Just sweep from left to right to see what you are missing.
Note the differences from dictionaries and strings.

A collection of items not drawn from a fixed-sized universal set is a dictionary. Strings are structures where order matters - i.e, {A, B} is not the same as {B, A}.
###Representations

Bit vectors - n space: value 1 or 0 to indicate whether element i is in S. Space efficient and fast insertion and deletion by flipping the bit. Poor performance for sparse subsets - O(n) to identify all members.
Linked list, array or dictionaries: Contain exactly the elements in the subset. For sparse subsets, dictionaries can be more space and time efficient than bit vectors. Keep the subsets sorted for efficient union and intersection, so a linear-time traversal through both subsets identifies all duplicates.
Boom filters: Emulate a bit vector in the absence of a fixed universal set. Hash each subset element to an integer from 0 to n and setting the corresponding bit. More space efficient for static subset applications that can torelate a small probability of error.

##Trees
Red-black trees are used internally by the Standard Library as a basis for the associative container classes.
A red-black tree is a balanced binary search tree with one extra attribute for each node: the colour, which is either red or black.
It has the following red-black properties:

Every node is either red or black.
Every leaf is black.
If a node is red, then both its children are black.
Every simple path from a node to a descendant leaf contains the same number of black nodes.
Guarantees O(log n) search times and can be re-balanced in O(log n) time.

AVL trees are "partially" balanced binary search trees, allowing for O(log n) time with find, insert, and delete operations.
The criteria that is used to determine the "level" of "balanced-ness" is the difference between the heights of subtrees of a root in the tree. In an AVL tree the difference between the height of the right and left subtrees (or the root node) is never more than one. Whenever an item is inserted or deleted, a check is made to see if the tree has become unbalanced. If it has, balance is restored by performing a set of manipulations (called "rotations") on the tree. These rotations come in two flavors: single rotations and double rotations (and each flavor has its corresponding "left" and "right" versions).
Kd-tree and related spatial data structures hierarchically decompose space into a small number of cells, each containing a few representatives from an input set of points. Fast access to any item by position.
Problem: Given a set S of n points in k dimensions, construct a tree that partitions space by half-planes such that each item is contained in its own box-shaped region.
Typical algorithms construct kd-trees by partioning point sets. A plane cuts through one of the dimensions to equally partition the subset of points into left/right subsets. These children are again partitioned into equal halves, using planes through another dimension. Partitioning stops after log n levels, with each points in its own leaf cell.

Each box-shaped region is defined by 2k planes, where k is the number of dimensions (hence kd-tree).

kd-trees are most useful for small dimensions, 2-20. They lose effectiveness as the dimesionality increases because the retio of the volumn unit sphere in k-dimensions shrinks exponentially compared to the unit cube.
Applications: point location, nearest-neighbour search, range search, partial key search