Summary of "Data Structures" course at Coursera.Org

A good algorithm usually comes together with a set of good data structures that allow the algorithm to manipulate the data efficiently. In this course, we consider the common data structures that are used in various computational problems. You will learn how these data structures are implemented in different programming languages and will practice implementing them in our programming assignments. This will help you to understand what is going on inside a particular built-in implementation of a data structure and what to expect from it. You will also learn typical use cases for these data structures.

A few examples of questions that we are going to cover in this class are the following:

What is a good strategy for resizing a dynamic array?
How are priority queues implemented in C++, Java, and Python?
How can a hash table be implemented so that the amortized running time of all operations is O(1) on average?
What are good strategies to keep a binary tree balanced?

Learn how services like Dropbox manage to upload some large files instantly and to save a lot of storage space!

Lecturers

Daniel Kane, Alexander S. Kulikov, Michael Levin, Neil Rhodes

Week 1: Basic Data Structures

In this module, you will learn about the basic data structures used throughout the rest of this course. We start this module by looking in detail at the fundamental building blocks: arrays and linked lists. From there, we build up two important data structures: stacks and queues. Next, we look at trees: examples of how they’re used in Computer Science, how they’re implemented, and the various ways they can be traversed. Once you’ve completed this module, you will be able to implement any of these data structures, and you will have a solid understanding of the costs of their operations and of the tradeoffs involved in using each one.

List the basic data structures
Analyze operations with data structures
Choose appropriate basic data structure for a task at hand
Apply basic data structures in programming challenges
Develop a program that simulates network packet processing
  • Arrays
  • Singly and Doubly Linked Lists
  • Stacks and Queues
  • Trees and Traversals
    • In-Order, Pre-Order and Post-Order
    • Depth-First and Breadth-First (see the traversal sketch below)
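
The following is a minimal Python sketch of these four traversals. The `Node` layout and function names are illustrative assumptions, not taken from the course materials:

```python
from collections import deque

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def in_order(node):          # left, root, right -- yields keys in sorted order for a BST
    if node:
        yield from in_order(node.left)
        yield node.key
        yield from in_order(node.right)

def pre_order(node):         # root, left, right
    if node:
        yield node.key
        yield from pre_order(node.left)
        yield from pre_order(node.right)

def post_order(node):        # left, right, root
    if node:
        yield from post_order(node.left)
        yield from post_order(node.right)
        yield node.key

def breadth_first(root):     # level by level, using a FIFO queue
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node:
            yield node.key
            queue.append(node.left)
            queue.append(node.right)
```

For example, for `Node(2, Node(1), Node(3))` the in-order traversal yields 1, 2, 3 and the breadth-first traversal yields 2, 1, 3. The three depth-first orders fall out of recursion; the breadth-first order needs an explicit queue.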

References

  • Chapters 10.1, 10.2, 10.4 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

Week 2: Dynamic Arrays and Amortized Analysis

In this module, we discuss Dynamic Arrays: a way of using arrays when it is unknown ahead of time how many elements will be needed. Here, we also discuss amortized analysis: a method of determining the amortized cost of an operation over a sequence of operations. Amortized analysis is often used when a straightforward analysis of an algorithm produces unsatisfactory results, but amortized analysis can show that the algorithm is actually efficient. It is used here for the analysis of Dynamic Arrays and will be used again at the end of this course to analyze Splay Trees.

Describe how dynamic arrays work
Calculate amortized running time of operations
List the methods for amortized analysis
  • Dynamic Arrays
    • C++ vector, Java ArrayList, Python list
  • Amortized analysis for Dynamic Arrays
    • Amortized cost is O(1) -- O(N) rarely and O(1) most often
    • Double the array size when it needs to expand; growing by a constant amount does not give amortized O(1) (see the sketch after this list)
    • Aggregate method -- sum the total cost of N operations and divide by N
    • Banker's method -- charge extra for each cheap operation: collect 3 tokens every time an element is added, spend them when increasing the array
    • Physicist's method -- define a potential function Φ on the state of the data structure; the amortized cost of an operation is its actual cost plus the change in Φ
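
A minimal sketch of a doubling dynamic array in Python, assuming an initial capacity of 1 (the class and method names are illustrative):

```python
class DynamicArray:
    """Dynamic array that doubles its capacity when full."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity

    def append(self, value):
        if self.size == self.capacity:       # rare O(n) step: reallocate and copy
            self.capacity *= 2               # doubling is what makes append amortized O(1)
            new_data = [None] * self.capacity
            new_data[:self.size] = self.data[:self.size]
            self.data = new_data
        self.data[self.size] = value         # common O(1) step
        self.size += 1

    def get(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return self.data[index]
```

By the aggregate method, n appends cost at most n + (1 + 2 + 4 + ... + n) < 3n elementary operations in total, i.e. O(1) amortized per append -- which matches the 3 tokens charged per insertion in the banker's method.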

References

  • Chapter 17 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

Week 3: Priority Queues and Disjoint Sets

We start this module by considering priority queues, which are used to efficiently schedule jobs, either in the context of a computer operating system or in real life; to sort huge files, which is the most important building block for any Big Data processing algorithm; and to efficiently compute shortest paths in graphs, which is a topic we will cover in our next course. For this reason, priority queues have built-in implementations in many programming languages, including C++, Java, and Python. We will see that these implementations are based on a beautiful idea of storing a complete binary tree in an array, which allows us to implement all priority queue methods in just a few lines of code. We will then switch to the disjoint sets data structure, which is used, for example, in dynamic graph connectivity and image processing. We will see again how simple and natural ideas lead to an implementation that is both easy to code and very efficient. By completing this module, you will be able to implement both of these data structures efficiently from scratch.

Describe how heaps and priority queues work
Describe how disjoint set union data structure works
Analyze the running time of operations with heaps
List the heuristics that speedup disjoint set union
Apply priority queues to schedule jobs on processors
Apply disjoint set union to merge tables in a database
  • Binary max-heap is a tree where the value of each node is at least as large as the values of its children
  • Basic operations: Top, Insert, SiftUp, SiftDown, ExtractMax (see the heap sketch after this list)
  • More operations:
    • Change Priority and let the changed element sift up or down
    • Remove by changing the priority to infinity and let it sift up then extract-max
  • Runtime complexity: get-max (top) is O(1), all other operations are O(tree height)
  • Complete Binary Tree -- all its levels are filled except possibly the last one, which is filled from left to right
    • Low height -- always O(log(n))
    • Easily stored in an array
    • Operations must preserve the complete-binary-tree property
  • Heap Sort is in-place, comparison-based, with running time O(n*log(n))
    • Build the heap -- the naive bound is O(N*log(N)), but careful analysis shows it takes O(N), practically at most about 2N sift-down steps
    • Repeatedly extract-max, N times in O(log(N))
  • Partial Sorting produces the last k elements of sorted array in O(N) time if k = O(N / log(N))
  • d-ary heap
  • Priority Queue
    • Supports 2 main operations -- Insert and ExtractMax
    • List and array implementations have one fast operation O(1) and one very slow operation O(N)
    • Binary heap implementation provides 2 fast operations O(log(N))
  • Disjoint Set basic operations: MakeSet(x), Find(x), Union(x,y) (see the sketch after this list)
  • Naive implementations: simple array, linked lists
  • Tree implementation; naive unions can degenerate the tree into a list
  • Union by Rank
    • Merge by hanging the shorter tree under the root of the other
    • Lemmas
      • The height of any tree in the forest is at most log2(N)
      • Any tree of height k in the forest has at least 2^k nodes
    • Union and Find work in O(log(n))
  • Path compression
    • Iterated logarithm of N -- log*(N) is the number of times the log function must be applied to N before the result is at most 1
    • The amortized time of a single operation is O(log*n)
    • Rank of node i is no longer equal to the height of the subtree rooted at i
    • The height of the subtree rooted at i is at most rank of i
    • Once an internal node always an internal node
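
A minimal array-based binary max-heap in Python to make these operations concrete (the class and method names are illustrative, not the course's reference code):

```python
class MaxHeap:
    """Binary max-heap stored in an array: children of index i are 2i+1 and 2i+2."""

    def __init__(self):
        self.a = []

    def top(self):                           # O(1)
        return self.a[0]

    def insert(self, value):                 # O(log n)
        self.a.append(value)
        self._sift_up(len(self.a) - 1)

    def extract_max(self):                   # O(log n)
        self.a[0], self.a[-1] = self.a[-1], self.a[0]
        result = self.a.pop()
        if self.a:
            self._sift_down(0)
        return result

    def _sift_up(self, i):
        while i > 0 and self.a[(i - 1) // 2] < self.a[i]:
            self.a[(i - 1) // 2], self.a[i] = self.a[i], self.a[(i - 1) // 2]
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.a)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.a[child] > self.a[largest]:
                    largest = child
            if largest == i:
                return
            self.a[i], self.a[largest] = self.a[largest], self.a[i]
            i = largest
```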
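
And a sketch of the disjoint set union structure with both heuristics, union by rank and path compression (names are illustrative):

```python
class DisjointSet:
    """Disjoint sets as a forest, with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))   # MakeSet for elements 0 .. n-1
        self.rank = [0] * n

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])   # path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:    # union by rank: hang the
            rx, ry = ry, rx                  # shorter tree under the taller
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```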

References

  • Chapter 6 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
  • Chapter 6.4 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
  • Chapters 21.1 and 21.2 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
  • Section 5.1.4 of Sanjoy Dasgupta, Christos Papadimitriou, and Umesh Vazirani. Algorithms (1st Edition). McGraw-Hill Higher Education. 2008.
  • Tutorial on Disjoint Sets data structures
  • Visualization of Disjoint Sets with and without Path Compression and Union by Rank heuristics.

Week 4: Hash Tables

In this module you will learn about a very powerful and widely used technique called hashing. Its applications include implementation of programming languages, file systems, pattern search, distributed key-value storage and many more. You will learn how to implement data structures to store and modify sets of objects and mappings from one type of object to another. You will see that naive implementations either consume a huge amount of memory or are slow, and then you will learn to implement hash tables that use linear memory and work in O(1) time on average! In the end, you will learn how hash functions are used in modern distributed systems and how they are used to optimize storage in services like Dropbox, Google Drive and Yandex Disk!

List applications of hashing
Apply direct addressing to retrieve names by phone numbers
Develop a hash table based on chaining scheme
Apply hashing to find patterns in text
Describe how Dropbox, Google Drive and Yandex Disk save space
Describe the principles on which distributed hash tables are built
  • Applications and use cases
  • Direct addressing
    • Use the key directly to locate the storage cell for the value
    • Fast access and update times -- O(1)
    • Needs lots of memory, may not fit into available memory
  • List-based mapping
    • Efficient memory use -- Theta(N)
    • Some operations -- Top, Pop, Append are fast -- O(1)
    • Find, Erase and Update are very slow -- O(N)
  • Hash function
    • Hash function h maps a set of objects S to a number in the range 0 .. m-1
    • m is cardinality of the hash function
    • Should be fast to compute, return different values for different objects, have small cardinality
    • Collisions occur when h(o1) == h(o2) and o1 != o2
  • Map is a data structure with HasKey(O), Get(O) and Set(O,v) operations
  • Chaining to resolve hash collisions -- an array of chains (lists of {O,v} pairs); see the sketch after this list
  • Lemma: the running time of the map operations is O(1+c) where c is the length of the longest chain
  • Lemma: the memory consumption of chaining is O(n+m) where n is the number of different keys and m is the cardinality of the hash function
  • Set is a data structure with Add(O), Remove(O) and Find(O) operations
    • Can be implemented as a map of objects to V = { true, false }, or a chaining without values
  • Hash Table is a map or set that uses hashing
  • Selecting a good hash function -- hashing must be uniform and deterministic, have few collisions and be fast to compute
  • Universal family of hash functions H
    • set of hash functions such that for x != y in universe U (set of all possible keys) the probability of collision is Pr[ h(x) == h(y) ] <= 1/m
    • All hash functions in H are deterministic
    • If h is chosen at random from H then the average length of the longest chain is O(1+alpha) where alpha = n / m is the load factor
  • Dynamic Hash Table -- start with some table size; choose a new hash function, resize and rehash when necessary (i.e. when the load factor becomes too high)
  • Hashing integers
    • Choose a prime number p greater than 10**7 and a hash table size m
    • H : { h(x) = ((ax + b) mod p) mod m } for 1 <= a <= p-1 and 0 <= b <= p-1 is a universal family
  • Hashing strings
    • Use every character in the string, else there will be more collisions
    • Polynomial hash function is Sum( s[i] * x**i ) mod p for i: 0 .. |s|-1, with a fixed prime number p and 1 <= x <= p-1
    • Running time is O(|s|)
    • Java's string hash function is similar; it uses x=31 and skips the mod operation for technical reasons
  • Pattern Search in Strings
    • Naive algorithm -- running time is O(|T| * |P|)
    • Rabin-Karp algorithm -- use hashing to quickly compare the pattern with substrings (see the sketch after this list)
      • If h(P) != h(S) then P != S
      • If h(P) == h(S) then call AreEqual(P, S), the probability of "false alarms" is |P| / p
      • Use the polynomial hash family Pp with prime p
      • Same running time as the naive algorithm, but can be improved -- polynomial hashes of 2 consecutive substrings are very similar
      • PrecomputeHash has running time O(|T| + |P|)
  • Online Storage
    • Upload large files to online storage quickly
    • Naive algorithm -- upload, compare, store or save link to the original
    • Hash comparison -- upload, use the Rabin-Karp algorithm to compare, store or save a link
    • If several hash functions are used, the probability of collisions is low but still exists
  • Distributed Hash Tables
    • Get 1000 nodes, create a hash function, compute which node owns each object
    • Computers sometimes break
    • Consistent hashing
    • When a node goes offline its neighbors take over its data; when a new node joins, it takes data from its neighbors
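
A minimal chaining-based hash table in Python; the fixed cardinality `m` and the use of Python's built-in `hash` in place of a function drawn from a universal family are simplifying assumptions:

```python
class ChainedHashTable:
    """Map implemented with chaining: an array of m chains of (key, value) pairs."""

    def __init__(self, m=1000):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def _hash(self, key):
        # Python's built-in hash stands in for a universal-family hash function
        return hash(key) % self.m

    def set(self, key, value):               # O(1 + c), c = chain length
        chain = self.chains[self._hash(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)      # update existing key
                return
        chain.append((key, value))           # or append a new pair

    def get(self, key):
        for k, v in self.chains[self._hash(key)]:
            if k == key:
                return v
        return None

    def has_key(self, key):
        return any(k == key for k, _ in self.chains[self._hash(key)])
```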
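
And a sketch of Rabin-Karp with precomputed rolling hashes; the prime `p` and base `x` are arbitrary illustrative choices:

```python
def rabin_karp(text, pattern, p=1000000007, x=263):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m > n:
        return []

    def poly_hash(s):                        # Sum( s[i] * x**i ) mod p
        h = 0
        for ch in reversed(s):
            h = (h * x + ord(ch)) % p
        return h

    pattern_hash = poly_hash(pattern)

    # Precompute hashes of all length-m substrings, right to left:
    # each hash is derived from the previous one in O(1).
    y = pow(x, m, p)
    h = [0] * (n - m + 1)
    h[n - m] = poly_hash(text[n - m:])
    for i in range(n - m - 1, -1, -1):
        h[i] = (x * h[i + 1] + ord(text[i]) - y * ord(text[i + m])) % p

    result = []
    for i in range(n - m + 1):
        # explicit comparison on hash match guards against "false alarms"
        if h[i] == pattern_hash and text[i:i + m] == pattern:
            result.append(i)
    return result
```

The precomputation touches each character a constant number of times, so the total running time is O(|T| + |P|) plus the cost of verifying candidate matches.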

Week 5: Binary Search Trees

In this module we study binary search trees, a data structure for performing searches on dynamically changing ordered sets. You will learn about many of the difficulties in accomplishing this task and the ways in which we can overcome them. In order to do this you will need to learn the basic structure of binary search trees, how to insert and delete without destroying this structure, and how to ensure that the tree remains balanced.

Describe how balanced binary search trees work
Analyze the running time of operations with binary search trees
List the capabilities of binary search trees
Compare balanced binary search trees with arrays and lists
  • A Local Search data structure stores elements, each with a key, and supports the following operations
    • RangeSearch(x, y) -- returns all elements with keys between x and y
    • NearestNeighbors(x) -- returns the elements with keys on either side of x
    • Modify -- Insert and Delete
  • Hash table, linked list, array and sorted array do not support or work well on some of the operations
  • Binary search tree
    • Has parts -- root node, left and right subtrees
    • Each node has a key, parent, left and right child
    • Search tree property -- a node's key is larger than the key of any descendant of its left child, and smaller than the key of any descendant of its right child
  • Operations -- Find, Next, Range Search, Insert, Delete
  • The runtime cost of the operations depends on how balanced the tree is, i.e. whether the left and right subtrees have approximately the same size
  • For a perfectly balanced tree operations run in O(log(n)) time, but insertions and deletions destroy the balance
  • Rearrange the tree to maintain the balance -- RotateRight, RotateLeft
  • Keep the height of each node
  • AVL Trees maintain the AVL Property -- for all nodes N, |N.Left.Height - N.Right.Height| <= 1 (see the sketch after this list)
  • Theorem -- For a binary tree with the AVL property, each node has a subtree with size at least the Fibonacci number F(h), where h is the height of that subtree
  • Where to correct AVL tree -- heights stay the same except on the insertion path
  • Operations -- Rebalance, RebalanceRight, RebalanceLeft, AdjustHeight
  • Another useful feature of binary search trees is the ability to recombine them in interesting ways
    • Merge -- inputs are roots R1 and R2 of trees with all keys in R1's subtree smaller than those in R2's subtree
    • Split -- output is 2 trees, one with elements <= x, the other with elements > x
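
A compact recursive AVL insertion sketch in Python showing the rotations and rebalancing; the function and field names are illustrative, and Delete, Find, Merge and Split are omitted:

```python
class AVLNode:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def height(n):
    return n.height if n else 0

def update(n):                               # AdjustHeight
    n.height = 1 + max(height(n.left), height(n.right))

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    update(n); update(l)
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    update(n); update(r)
    return r

def rebalance(n):
    update(n)
    balance = height(n.left) - height(n.right)
    if balance > 1:                          # left-heavy
        if height(n.left.left) < height(n.left.right):
            n.left = rotate_left(n.left)     # left-right case
        return rotate_right(n)
    if balance < -1:                         # right-heavy
        if height(n.right.right) < height(n.right.left):
            n.right = rotate_right(n.right)  # right-left case
        return rotate_left(n)
    return n

def insert(n, key):
    if n is None:
        return AVLNode(key)
    if key < n.key:
        n.left = insert(n.left, key)
    elif key > n.key:
        n.right = insert(n.right, key)
    return rebalance(n)                      # restore the AVL property on the way up
```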

References

  • Chapters 5.11.1, 5.11.2 here
  • AVL tree
  • See this visualization. Play with this AVL tree by adding and deleting elements to see how it manages to stay balanced.

Week 6: Binary Search Trees 2

In this module we continue studying binary search trees. We study a few non-trivial applications. We then study a new kind of balanced search tree, Splay Trees. They adapt to the queries dynamically and are optimal in many ways.

Describe how to implement advanced operations using balanced binary search trees
Describe how splay trees work
Analyze the running time of operations with splay trees
Apply amortized analysis to splay trees
Apply binary search trees in programming challenges
Develop a balanced binary search tree
  • Applications
    • Order Statistics -- find the k-th smallest element in T, requires new field Size
    • Color Flips -- flips the color of all squares of index > x
      • Use 2 trees, one with opposite colors
      • Nodes have extra field Color
  • Splay Tree puts more frequently searched nodes near the root
  • Simply rotating the accessed node to the top does not work
  • Instead use the following operations (see the splay sketch after this list):
    • Zig-Zig
    • Zig-Zag
    • Zig (if just below the root)
  • The Splay operation is sometimes slow, so its cost is amortized
    • Rank(N) = log2(size of the subtree rooted at N)
    • Potential function Φ = Sum of Rank(N) over all nodes N
  • Amortized cost of Find+Splay is O(log(n))
  • Operations - Find, Insert, Delete, Split, Merge
  • Cost of accessing a node is O(log(D+1)) where D is the distance between the last access of that node and the current access
  • Dynamic Optimality Conjecture - For any sequence of binary search tree operations a splay tree does at most a constant factor more work than the best search tree for that sequence
  • Conclusions
    • Easy to implement
    • O(log(n)) per operation (amortized)
    • Can be much better if queries have structure
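
A compact recursive splay sketch in Python showing the Zig, Zig-Zig and Zig-Zag cases; this is one common way to write the operation, not the course's reference implementation, and Insert, Delete, Split and Merge are omitted:

```python
class SplayNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r

def splay(n, key):
    """Bring the node with `key` (or the last node on the search path) to the root."""
    if n is None or n.key == key:
        return n
    if key < n.key:
        if n.left is None:
            return n                         # key absent: splay the last visited node
        if key < n.left.key:                 # Zig-Zig: two right rotations
            n.left.left = splay(n.left.left, key)
            n = rotate_right(n)
        elif key > n.left.key:               # Zig-Zag: left rotation, then right
            n.left.right = splay(n.left.right, key)
            if n.left.right:
                n.left = rotate_left(n.left)
        return rotate_right(n) if n.left else n    # Zig (just below the root)
    else:
        if n.right is None:
            return n
        if key > n.right.key:                # Zig-Zig (mirror case)
            n.right.right = splay(n.right.right, key)
            n = rotate_left(n)
        elif key < n.right.key:              # Zig-Zag (mirror case)
            n.right.left = splay(n.right.left, key)
            if n.right.left:
                n.right = rotate_right(n.right)
        return rotate_left(n) if n.right else n    # Zig (just below the root)
```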

References

  • Chapters 14.1, 14.2 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
  • Chapter 5.11.6 here
  • This visualization. Play with it by adding and erasing keys, and see how it can become unbalanced, in contrast with an AVL tree, but pulls the keys it works with to the top.
  • This answer about comparison of AVL trees and Splay trees.
  • Original paper on Splay trees.