Summary of "Data Structures" course at Coursera.Org

A good algorithm usually comes together with a set of good data structures that allow the algorithm to manipulate the data efficiently. In this course, we consider the common data structures that are used in various computational problems. You will learn how these data structures are implemented in different programming languages and will practice implementing them in our programming assignments. This will help you to understand what is going on inside a particular built-in implementation of a data structure and what to expect from it. You will also learn typical use cases for these data structures.

A few examples of questions that we are going to cover in this class are the following:

What is a good strategy for resizing a dynamic array?
How are priority queues implemented in C++, Java, and Python?
How can a hash table be implemented so that the amortized running time of all operations is O(1) on average?
What are good strategies to keep a binary tree balanced?

Learn how services like Dropbox manage to upload some large files instantly and to save a lot of storage space!

Lecturers

Daniel Kane, Alexander S. Kulikov, Michael Levin, Neil Rhodes

Week 1: Basic Data Structures

In this module, you will learn about the basic data structures used throughout the rest of this course. We start this module by looking in detail at the fundamental building blocks: arrays and linked lists. From there, we build up two important data structures: stacks and queues. Next, we look at trees: examples of how they’re used in Computer Science, how they’re implemented, and the various ways they can be traversed. Once you’ve completed this module, you will be able to implement any of these data structures, and you will have a solid understanding of the costs of their operations and of the tradeoffs involved in using each one.

List the basic data structures
Analyze operations with data structures
Choose appropriate basic data structure for a task at hand
Apply basic data structures in programming challenges
Develop a program that simulates network packet processing
  • Arrays
  • Singly and Doubly Linked Lists
  • Stacks and Queues
  • Trees and Traversals
    • In-Order, Pre-Order and Post-Order
    • Depth-First and Breadth-First (see the traversal sketch below)
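
The following is a minimal Python sketch of these four traversals. The `Node` layout and function names are illustrative assumptions, not taken from the course materials:

```python
from collections import deque

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def in_order(node):          # left, root, right -- yields keys in sorted order for a BST
    if node:
        yield from in_order(node.left)
        yield node.key
        yield from in_order(node.right)

def pre_order(node):         # root, left, right
    if node:
        yield node.key
        yield from pre_order(node.left)
        yield from pre_order(node.right)

def post_order(node):        # left, right, root
    if node:
        yield from post_order(node.left)
        yield from post_order(node.right)
        yield node.key

def breadth_first(root):     # level by level, using a FIFO queue
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node:
            yield node.key
            queue.append(node.left)
            queue.append(node.right)
```

For example, for `Node(2, Node(1), Node(3))` the in-order traversal yields 1, 2, 3 and the breadth-first traversal yields 2, 1, 3. The three depth-first orders fall out of recursion; the breadth-first order needs an explicit queue.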

References

  • Chapters 10.1, 10.2, 10.4 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

Week 2: Dynamic Arrays and Amortized Analysis

In this module, we discuss Dynamic Arrays: a way of using arrays when it is unknown ahead of time how many elements will be needed. Here, we also discuss amortized analysis: a method of determining the amortized cost of an operation over a sequence of operations. Amortized analysis is often used when a straightforward analysis of an algorithm produces unsatisfactory results, but amortized analysis can show that the algorithm is actually efficient. It is used here for the analysis of Dynamic Arrays and will be used again at the end of this course to analyze Splay Trees.

Describe how dynamic arrays work
Calculate amortized running time of operations
List the methods for amortized analysis
  • Dynamic Arrays
    • C++ vector, Java ArrayList, Python list
  • Amortized analysis for Dynamic Arrays
    • Amortized cost is O(1) -- O(N) rarely and O(1) most often
    • Double the array size when it needs to expand; growing by a constant amount does not give amortized O(1) (see the sketch after this list)
    • Aggregate method -- sum the total cost of N operations and divide by N
    • Banker's method -- charge extra for each cheap operation: collect 3 tokens every time an element is added, spend them when increasing the array
    • Physicist's method -- define a potential function Φ on the state of the data structure; the amortized cost of an operation is its actual cost plus the change in Φ
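
A minimal sketch of a doubling dynamic array in Python, assuming an initial capacity of 1 (the class and method names are illustrative):

```python
class DynamicArray:
    """Dynamic array that doubles its capacity when full."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity

    def append(self, value):
        if self.size == self.capacity:       # rare O(n) step: reallocate and copy
            self.capacity *= 2               # doubling is what makes append amortized O(1)
            new_data = [None] * self.capacity
            new_data[:self.size] = self.data[:self.size]
            self.data = new_data
        self.data[self.size] = value         # common O(1) step
        self.size += 1

    def get(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return self.data[index]
```

By the aggregate method, n appends cost at most n + (1 + 2 + 4 + ... + n) < 3n elementary operations in total, i.e. O(1) amortized per append -- which matches the 3 tokens charged per insertion in the banker's method.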

References

  • Chapter 17 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

Week 3: Priority Queues and Disjoint Sets

We start this module by considering priority queues, which are used to efficiently schedule jobs, either in the context of a computer operating system or in real life; to sort huge files, which is the most important building block for any Big Data processing algorithm; and to efficiently compute shortest paths in graphs, which is a topic we will cover in our next course. For this reason, priority queues have built-in implementations in many programming languages, including C++, Java, and Python. We will see that these implementations are based on a beautiful idea of storing a complete binary tree in an array, which allows us to implement all priority queue methods in just a few lines of code. We will then switch to the disjoint sets data structure, which is used, for example, in dynamic graph connectivity and image processing. We will see again how simple and natural ideas lead to an implementation that is both easy to code and very efficient. By completing this module, you will be able to implement both of these data structures efficiently from scratch.

Describe how heaps and priority queues work
Describe how disjoint set union data structure works
Analyze the running time of operations with heaps
List the heuristics that speedup disjoint set union
Apply priority queues to schedule jobs on processors
Apply disjoint set union to merge tables in a database
  • Binary max-heap is a tree where the value of each node is at least as large as the values of its children
  • Basic operations: Top, Insert, SiftUp, SiftDown, ExtractMax (see the heap sketch after this list)
  • More operations:
    • Change Priority and let the changed element sift up or down
    • Remove by changing the priority to infinity and let it sift up then extract-max
  • Runtime complexity: get-max (top) is O(1), all other operations are O(tree height)
  • Complete Binary Tree -- all its levels are filled except possibly the last one, which is filled from left to right
    • Low height -- always O(log(n))
    • Easily stored in an array
    • Operations must preserve the complete-binary-tree property
  • Heap Sort is in-place, comparison-based, with running time O(n*log(n))
    • Build the heap -- the naive bound is O(N*log(N)), but careful analysis shows it takes O(N), practically at most about 2N sift-down steps
    • Repeatedly extract-max, N times in O(log(N))
  • Partial Sorting produces the last k elements of sorted array in O(N) time if k = O(N / log(N))
  • d-ary heap
  • Priority Queue
    • Supports 2 main operations -- Insert and ExtractMax
    • List and array implementations have one fast operation O(1) and one very slow operation O(N)
    • Binary heap implementation provides 2 fast operations O(log(N))
  • Disjoint Set basic operations: MakeSet(x), Find(x), Union(x,y) (see the sketch after this list)
  • Naive implementations: simple array, linked lists
  • Tree implementation; naive unions can degenerate the tree into a list
  • Union by Rank
    • Merge by hanging the shorter tree under the root of the other
    • Lemmas
      • The height of any tree in the forest is at most log2(N)
      • Any tree of height k in the forest has at least 2^k nodes
    • Union and Find work in O(log(n))
  • Path compression
    • Iterated logarithm of N -- log*(N) is the number of times the log function must be applied to N before the result is at most 1
    • The amortized time of a single operation is O(log*n)
    • Rank of node i is no longer equal to the height of the subtree rooted at i
    • The height of the subtree rooted at i is at most rank of i
    • Once an internal node always an internal node
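
A minimal array-based binary max-heap in Python to make these operations concrete (the class and method names are illustrative, not the course's reference code):

```python
class MaxHeap:
    """Binary max-heap stored in an array: children of index i are 2i+1 and 2i+2."""

    def __init__(self):
        self.a = []

    def top(self):                           # O(1)
        return self.a[0]

    def insert(self, value):                 # O(log n)
        self.a.append(value)
        self._sift_up(len(self.a) - 1)

    def extract_max(self):                   # O(log n)
        self.a[0], self.a[-1] = self.a[-1], self.a[0]
        result = self.a.pop()
        if self.a:
            self._sift_down(0)
        return result

    def _sift_up(self, i):
        while i > 0 and self.a[(i - 1) // 2] < self.a[i]:
            self.a[(i - 1) // 2], self.a[i] = self.a[i], self.a[(i - 1) // 2]
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.a)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.a[child] > self.a[largest]:
                    largest = child
            if largest == i:
                return
            self.a[i], self.a[largest] = self.a[largest], self.a[i]
            i = largest
```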
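
And a sketch of the disjoint set union structure with both heuristics, union by rank and path compression (names are illustrative):

```python
class DisjointSet:
    """Disjoint sets as a forest, with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))   # MakeSet for elements 0 .. n-1
        self.rank = [0] * n

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])   # path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:    # union by rank: hang the
            rx, ry = ry, rx                  # shorter tree under the taller
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```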

References

  • Chapter 6 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
  • Chapter 6.4 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
  • Chapters 21.1 and 21.2 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
  • Section 5.1.4 of Sanjoy Dasgupta, Christos Papadimitriou, and Umesh Vazirani. Algorithms (1st Edition). McGraw-Hill Higher Education. 2008.
  • Tutorial on Disjoint Sets data structures
  • Visualization of Disjoint Sets with and without Path Compression and Union by Rank heuristics.

Week 4: Hash Tables

In this module you will learn about a very powerful and widely used technique called hashing. Its applications include implementation of programming languages, file systems, pattern search, distributed key-value storage and many more. You will learn how to implement data structures to store and modify sets of objects and mappings from one type of object to another. You will see that naive implementations either consume a huge amount of memory or are slow, and then you will learn to implement hash tables that use linear memory and work in O(1) time on average! In the end, you will learn how hash functions are used in modern distributed systems and how they are used to optimize storage in services like Dropbox, Google Drive and Yandex Disk!

List applications of hashing
Apply direct addressing to retrieve names by phone numbers
Develop a hash table based on chaining scheme
Apply hashing to find patterns in text
Describe how Dropbox, Google Drive and Yandex Disk save space
Describe the principles on which distributed hash tables are built
  • Applications and use cases
  • Direct addressing
    • Use the key directly to locate the storage cell for the value
    • Fast access and update times -- O(1)
    • Needs lots of memory, may not fit into available memory
  • List-based mapping
    • Efficient memory use -- Theta(N)
    • Some operations -- Top, Pop, Append are fast -- O(1)
    • Find, Erase and Update are very slow -- O(N)
  • Hash function
    • Hash function h maps a set of objects S to a number in the range 0 .. m-1
    • m is cardinality of the hash function
    • Should be fast to compute, return different values for different objects, have small cardinality
    • Collisions occur when h(o1) == h(o2) and o1 != o2
  • Map is a data structure with HasKey(O), Get(O) and Set(O,v) operations
  • Chaining to resolve hash collisions -- an array of chains (lists of {O,v} pairs); see the sketch after this list
  • Lemma: the running time of the map operations is O(1+c) where c is the length of the longest chain
  • Lemma: the memory consumption of chaining is O(n+m) where n is the number of different keys and m is the cardinality of the hash function
  • Set is a data structure with Add(O), Remove(O) and Find(O) operations
    • Can be implemented as a map of objects to V = { true, false }, or a chaining without values
  • Hash Table is a map or set that uses hashing
  • Selecting a good hash function -- hashing must be uniform and deterministic, have few collisions and be fast to compute
  • Universal family of hash functions H
    • set of hash functions such that for x != y in universe U (set of all possible keys) the probability of collision is Pr[ h(x) == h(y) ] <= 1/m
    • All hash functions in H are deterministic
    • If h is chosen at random from H then the average length of the longest chain is O(1+alpha) where alpha = n / m is the load factor
  • Dynamic Hash Table -- start with some table size; choose a new hash function, resize and rehash when necessary (i.e. when the load factor becomes too high)
  • Hashing integers
    • Choose a prime number p greater than 10**7 and a hash table size m
    • H : { h(x) = ((ax + b) mod p) mod m } for 1 <= a <= p-1 and 0 <= b <= p-1 is a universal family
  • Hashing strings
    • Use every character in the string, else there will be more collisions
    • Polynomial hash function is Sum( s[i] * x**i ) mod p for i: 0 .. |s|-1, with a fixed prime number p and 1 <= x <= p-1
    • Running time is O(|s|)
    • Java's string hash function is similar; it uses x=31 and skips the mod operation for technical reasons
  • Pattern Search in Strings
    • Naive algorithm -- running time is O(|T| * |P|)
    • Rabin-Karp algorithm -- use hashing to quickly compare the pattern with substrings (see the sketch after this list)
      • If h(P) != h(S) then P != S
      • If h(P) == h(S) then call AreEqual(P, S), the probability of "false alarms" is |P| / p
      • Use the polynomial hash family Pp with prime p
      • Same running time as the naive algorithm, but can be improved -- polynomial hashes of 2 consecutive substrings are very similar
      • PrecomputeHash has running time O(|T| + |P|)
  • Online Storage
    • Upload large files to online storage quickly
    • Naive algorithm -- upload, compare, store or save link to the original
    • Hash comparison -- upload, use the Rabin-Karp algorithm to compare, store or save a link
    • If several hash functions are used, the probability of collisions is low but still exists
  • Distributed Hash Tables
    • Get 1000 nodes, create a hash function, compute which node owns each object
    • Computers sometimes break
    • Consistent hashing
    • When a node goes offline its neighbors take over its data; when a new node joins, it takes data from its neighbors
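
A minimal chaining-based hash table in Python; the fixed cardinality `m` and the use of Python's built-in `hash` in place of a function drawn from a universal family are simplifying assumptions:

```python
class ChainedHashTable:
    """Map implemented with chaining: an array of m chains of (key, value) pairs."""

    def __init__(self, m=1000):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def _hash(self, key):
        # Python's built-in hash stands in for a universal-family hash function
        return hash(key) % self.m

    def set(self, key, value):               # O(1 + c), c = chain length
        chain = self.chains[self._hash(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)      # update existing key
                return
        chain.append((key, value))           # or append a new pair

    def get(self, key):
        for k, v in self.chains[self._hash(key)]:
            if k == key:
                return v
        return None

    def has_key(self, key):
        return any(k == key for k, _ in self.chains[self._hash(key)])
```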
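
And a sketch of Rabin-Karp with precomputed rolling hashes; the prime `p` and base `x` are arbitrary illustrative choices:

```python
def rabin_karp(text, pattern, p=1000000007, x=263):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m > n:
        return []

    def poly_hash(s):                        # Sum( s[i] * x**i ) mod p
        h = 0
        for ch in reversed(s):
            h = (h * x + ord(ch)) % p
        return h

    pattern_hash = poly_hash(pattern)

    # Precompute hashes of all length-m substrings, right to left:
    # each hash is derived from the previous one in O(1).
    y = pow(x, m, p)
    h = [0] * (n - m + 1)
    h[n - m] = poly_hash(text[n - m:])
    for i in range(n - m - 1, -1, -1):
        h[i] = (x * h[i + 1] + ord(text[i]) - y * ord(text[i + m])) % p

    result = []
    for i in range(n - m + 1):
        # explicit comparison on hash match guards against "false alarms"
        if h[i] == pattern_hash and text[i:i + m] == pattern:
            result.append(i)
    return result
```

The precomputation touches each character a constant number of times, so the total running time is O(|T| + |P|) plus the cost of verifying candidate matches.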

Week 5: Binary Search Trees

In this module we study binary search trees, a data structure for performing searches on dynamically changing ordered sets. You will learn about many of the difficulties in accomplishing this task and the ways in which we can overcome them. In order to do this you will need to learn the basic structure of binary search trees, how to insert and delete without destroying this structure, and how to ensure that the tree remains balanced.

Describe how balanced binary search trees work
Analyze the running time of operations with binary search trees
List the capabilities of binary search trees
Compare balanced binary search trees with arrays and lists
  • A Local Search data structure stores elements, each with a key, and supports the following operations
    • RangeSearch(x, y) -- returns all elements with keys between x and y
    • NearestNeighbors(x) -- returns the elements with keys on either side of x
    • Modify -- Insert and Delete
  • Hash table, linked list, array and sorted array do not support or work well on some of the operations
  • Binary search tree
    • Has parts -- root node, left and right subtrees
    • Each node has a key, parent, left and right child
    • Search tree property -- a node's key is larger than the key of any descendant of its left child, and smaller than the key of any descendant of its right child
  • Operations -- Find, Next, Range Search, Insert, Delete
  • The runtime cost of the operations depends on how balanced the tree is, i.e. whether the left and right subtrees have approximately the same size
  • For a perfectly balanced tree operations run in O(log(n)) time, but insertions and deletions destroy the balance
  • Rearrange the tree to maintain the balance -- RotateRight, RotateLeft
  • Keep the height of each node
  • AVL Trees maintain the AVL Property -- for all nodes N, |N.Left.Height - N.Right.Height| <= 1 (see the sketch after this list)
  • Theorem -- For a binary tree with the AVL property, each node has a subtree with size at least the Fibonacci number F(h), where h is the height of that subtree
  • Where to correct AVL tree -- heights stay the same except on the insertion path
  • Operations -- Rebalance, RebalanceRight, RebalanceLeft, AdjustHeight
  • Another useful feature of binary search trees is the ability to recombine them in interesting ways
    • Merge -- inputs are roots R1 and R2 of trees with all keys in R1's subtree smaller than those in R2's subtree
    • Split -- output is 2 trees, one with elements <= x, the other with elements > x
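
A compact recursive AVL insertion sketch in Python showing the rotations and rebalancing; the function and field names are illustrative, and Delete, Find, Merge and Split are omitted:

```python
class AVLNode:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def height(n):
    return n.height if n else 0

def update(n):                               # AdjustHeight
    n.height = 1 + max(height(n.left), height(n.right))

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    update(n); update(l)
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    update(n); update(r)
    return r

def rebalance(n):
    update(n)
    balance = height(n.left) - height(n.right)
    if balance > 1:                          # left-heavy
        if height(n.left.left) < height(n.left.right):
            n.left = rotate_left(n.left)     # left-right case
        return rotate_right(n)
    if balance < -1:                         # right-heavy
        if height(n.right.right) < height(n.right.left):
            n.right = rotate_right(n.right)  # right-left case
        return rotate_left(n)
    return n

def insert(n, key):
    if n is None:
        return AVLNode(key)
    if key < n.key:
        n.left = insert(n.left, key)
    elif key > n.key:
        n.right = insert(n.right, key)
    return rebalance(n)                      # restore the AVL property on the way up
```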

References

  • Chapters 5.11.1, 5.11.2 here
  • AVL tree
  • See this visualization. Play with this AVL tree by adding and deleting elements to see how it manages to stay balanced.

Week 6: Binary Search Trees 2

In this module we continue studying binary search trees. We study a few non-trivial applications. We then study a new kind of balanced search tree, Splay Trees. They adapt to the queries dynamically and are optimal in many ways.

Describe how to implement advanced operations using balanced binary search trees
Describe how splay trees work
Analyze the running time of operations with splay trees
Apply amortized analysis to splay trees
Apply binary search trees in programming challenges
Develop a balanced binary search tree
  • Applications
    • Order Statistics -- find the k-th smallest element in T, requires new field Size
    • Color Flips -- flips the color of all squares of index > x
      • Use 2 trees, one with opposite colors
      • Nodes have extra field Color
  • Splay Tree puts more frequently searched nodes near the root
  • Simply rotating the accessed node to the top does not work
  • Instead use the following operations (see the splay sketch after this list):
    • Zig-Zig
    • Zig-Zag
    • Zig (if just below the root)
  • The Splay operation is sometimes slow, so its cost is amortized
    • Rank(N) = log2(size of the subtree rooted at N)
    • Potential function Φ = Sum of Rank(N) over all nodes N
  • Amortized cost of Find+Splay is O(log(n))
  • Operations - Find, Insert, Delete, Split, Merge
  • Cost of accessing a node is O(log(D+1)) where D is the distance between the last access of that node and the current access
  • Dynamic Optimality Conjecture - For any sequence of binary search tree operations a splay tree does at most a constant factor more work than the best search tree for that sequence
  • Conclusions
    • Easy to implement
    • O(log(n)) per operation (amortized)
    • Can be much better if queries have structure
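
A compact recursive splay sketch in Python showing the Zig, Zig-Zig and Zig-Zag cases; this is one common way to write the operation, not the course's reference implementation, and Insert, Delete, Split and Merge are omitted:

```python
class SplayNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r

def splay(n, key):
    """Bring the node with `key` (or the last node on the search path) to the root."""
    if n is None or n.key == key:
        return n
    if key < n.key:
        if n.left is None:
            return n                         # key absent: splay the last visited node
        if key < n.left.key:                 # Zig-Zig: two right rotations
            n.left.left = splay(n.left.left, key)
            n = rotate_right(n)
        elif key > n.left.key:               # Zig-Zag: left rotation, then right
            n.left.right = splay(n.left.right, key)
            if n.left.right:
                n.left = rotate_left(n.left)
        return rotate_right(n) if n.left else n    # Zig (just below the root)
    else:
        if n.right is None:
            return n
        if key > n.right.key:                # Zig-Zig (mirror case)
            n.right.right = splay(n.right.right, key)
            n = rotate_left(n)
        elif key < n.right.key:              # Zig-Zag (mirror case)
            n.right.left = splay(n.right.left, key)
            if n.right.left:
                n.right = rotate_right(n.right)
        return rotate_left(n) if n.right else n    # Zig (just below the root)
```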

References

  • Chapters 14.1, 14.2 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
  • Chapter 5.11.6 here
  • This visualization. Play with it by adding and erasing keys, and see how it can become unbalanced, in contrast with an AVL tree, but pulls the keys it works with to the top.
  • This answer about comparison of AVL trees and Splay trees.
  • Original paper on Splay trees.