@misho-kr
Last active September 30, 2016 06:03
Summary of "Algorithms, Part I" course at Coursera.Org

The course is based on a variety of material that we have prepared over many years:

  • Our textbook Algorithms, 4th edition is the basic reference for the material we will be covering
  • Our booksite, which is open to everyone and contains a wealth of supplementary information, including synopses of the textbook and Java code that you will be using throughout the course

Week 1

Union-Find

We illustrate our basic approach to developing and analyzing algorithms by considering the dynamic connectivity problem. We introduce the union-find data type and consider several implementations (quick find, quick union, weighted quick union, and weighted quick union with path compression). Finally, we apply the union-find data type to the percolation problem from physical chemistry.

  • Dynamic connectivity -- union command to connect two objects, find/connected query to check for connection between two objects
  • "Is connected to" is an equivalence relation -- reflexive, symmetric, and transitive
  • Connected components -- maximal set of objects that are mutually connected
  • Union-find data type and API
  • Quick-find (eager approach) -- find is O(1), initialize and union is O(N)
  • Quick-union (lazy approach) -- trees can get tall, find could be O(N)
  • Improvements
  • Weighted quick union -- keep track of size of each tree, balance the tree by linking smaller tree to root of larger tree
  • Depth of any node is at most lg(N), so union and connected operations are O(lg(N))
  • Quick union with path compression -- after computing the root of p, set the id of each examined node to point to that root; keeps the tree almost completely flat
  • In theory WQUPC is not quite linear, but in practice it is effectively linear
  • Steps to developing a usable algorithm
  • Model the problem
  • Find an algorithm to solve it
  • Analyze runtime complexity and memory consumption
  • Find a way to improve, iterate until satisfied
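The improvements above can be sketched together. A minimal sketch of weighted quick-union with path compression, assuming integer sites 0..n-1 (class and method names are mine, not necessarily the course's exact API):

```java
public class UnionFind {
    private final int[] parent; // parent[i] = parent of i
    private final int[] size;   // size[i] = number of nodes in tree rooted at i

    public UnionFind(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    private int root(int p) {
        while (p != parent[p]) {
            parent[p] = parent[parent[p]]; // path compression: point to grandparent
            p = parent[p];
        }
        return p;
    }

    public boolean connected(int p, int q) { return root(p) == root(q); }

    public void union(int p, int q) {
        int rp = root(p), rq = root(q);
        if (rp == rq) return;
        // weighting: link root of smaller tree to root of larger tree
        if (size[rp] < size[rq]) { parent[rp] = rq; size[rq] += size[rp]; }
        else                     { parent[rq] = rp; size[rp] += size[rq]; }
    }
}
```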

Analysis of Algorithms

The basis of our approach for analyzing the performance of algorithms is the scientific method. We begin by performing computational experiments to measure the running times of our programs. We use these measurements to develop hypotheses about performance. Next, we create mathematical models to explain their behavior. Finally, we consider analyzing the memory usage of our Java programs.

  • Reasons to analyze algorithms -- predict performance, compare algorithms, provide guarantees, understand theoretical basis
  • Fast algorithms enable new technologies
  • Scientific method to analyze algorithms
  • Observe some feature of the natural world
  • Hypothesize a model that is consistent with the observations
  • Predict events using the hypothesis
  • Verify the predictions by making further observations
  • Validate by repeating until the hypothesis and observations agree
  • 3-Sum problem -- how many triples sum to exactly 0 (zero)
  • Log-log plot -- a power law a * ( N ** b ) plots as a straight line with slope b
  • Doubling hypothesis -- double the input size and take the ratio of consecutive running times
  • Running time is a * ( N ** b ) with b = lg ratio
  • a can be calculated from the above formula given b
  • Difficult to get precise measurements, but experiments are easier and cheaper than in other sciences
  • Mathematical model for running time
  • Cost of basic operations (cost model) and tilde notation
  • Simplifying the calculations -- estimate running time or memory as function of input size N, ignore lower order terms
  • Common order-of-growth classifications -- constant, logarithmic, linear, linearithmic, quadratic, cubic, exponential
  • Algorithm for 3-Sum problem that is O(N**2 log(N))
  • Step 1 -- sort the N distinct numbers, N**2 with insertion sort
  • Step 2 -- for each pair a[i] and a[j] do binary search for -(a[i]+a[j]), N**2 * log(N)
  • Types of analysis
  • Best case -- lower bound on cost, determined by "easiest" input, provides goal for all inputs
  • Worst case -- upper bound on cost, determined by "most difficult" input, provides guarantee for all inputs
  • Average case -- expected cost for random input, provides a way to predict performance
  • Optimal algorithm is one where no other algorithm can provide better guarantee
  • Estimation of memory usage
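The doubling hypothesis lends itself to a small calculation: measure T(N) and T(2N), take the lg of the ratio to estimate b, then solve T(N) = a * (N ** b) for a. A sketch with hypothetical timings (the numbers below are made up for illustration):

```java
public class DoublingEstimate {
    public static void main(String[] args) {
        // hypothetical measured running times (seconds) for N and 2N
        double n = 4000, t1 = 6.4;  // T(4000)
        double t2 = 51.1;           // T(8000)
        double b = Math.log(t2 / t1) / Math.log(2); // b = lg(ratio)
        double a = t1 / Math.pow(n, b);             // back out a from T(N) = a * N^b
        System.out.printf("b ~ %.1f, a ~ %.2e%n", b, a);
    }
}
```

With these timings the ratio is about 8, so b comes out close to 3, i.e. a cubic algorithm.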

Week 2

Bags, Queues and Stacks

We consider two fundamental data types for storing collections of objects: the stack and the queue. We implement each using either a singly-linked list or a resizing array. We introduce two advanced Java features—generics and iterators—that simplify client code. Finally, we consider various applications of stacks and queues ranging from parsing arithmetic expressions to simulating queueing systems.

  • Stacks -- LIFO, last in first out
  • Queues -- FIFO, first in first out
  • We are using interfaces for the basic structures to isolate the client from the details of the implementation
  • Two basic implementations
  • Linked list -- every operation takes constant time in the worst case, but extra time and space are needed for the links
  • Resizing array -- every operation takes amortized constant time, with less wasted space
  • Resizing arrays -- double the size when expanding, shrink by half when utilization drops below 25%
  • Avoid loitering by setting the slots to null when removing items
  • Generics to implement parameterized data structures and algorithms; no casting, and type errors are caught at compile time rather than at run time
  • Iterators, Iterable interface
  • Evaluate infix expressions (Dijkstra's two-stack algorithm), reverse Polish notation
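As a rough illustration of resizing arrays, generics, and iterators together, here is a minimal generic stack on a resizing array; a sketch of the lecture's approach (doubling when full, shrinking at 25% utilization), with names of my own choosing:

```java
import java.util.Iterator;

public class ResizingArrayStack<Item> implements Iterable<Item> {
    @SuppressWarnings("unchecked")
    private Item[] a = (Item[]) new Object[1]; // Java cannot create a generic array directly
    private int n = 0;

    public boolean isEmpty() { return n == 0; }

    public void push(Item item) {
        if (n == a.length) resize(2 * a.length); // double when full
        a[n++] = item;
    }

    public Item pop() {
        Item item = a[--n];
        a[n] = null; // avoid loitering
        if (n > 0 && n == a.length / 4) resize(a.length / 2); // shrink at 25% full
        return item;
    }

    @SuppressWarnings("unchecked")
    private void resize(int capacity) {
        Item[] copy = (Item[]) new Object[capacity];
        for (int i = 0; i < n; i++) copy[i] = a[i];
        a = copy;
    }

    public Iterator<Item> iterator() {
        return new Iterator<Item>() { // iterate LIFO, from top of stack down
            private int i = n;
            public boolean hasNext() { return i > 0; }
            public Item next() { return a[--i]; }
        };
    }
}
```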

Elementary Sorts

We introduce the sorting problem and Java's Comparable interface. We study two elementary sorting methods (selection sort and insertion sort) and a variation of one of them (shellsort). We also consider two algorithms for uniformly shuffling an array. We conclude with an application of sorting to computing the convex hull via the Graham scan algorithm.

  • Comparable interface
  • Selection sort
  • Invariants -- 1) entries to the left are sorted, 2) no entry to the right is smaller than any entry to the left
  • Complexity -- ~(N**2)/2 compares and N-1 exchanges, O(N**2) always
  • Insertion Sort
  • Invariants -- 1) entries to the left are sorted, 2) entries to the right have not been seen yet
  • Complexity -- on average ~(N**2)/4 compares and ~(N**2)/4 exchanges, best case is N-1 compares and 0 exchanges, worst case is ~(N**2)/2 compares and exchanges
  • For partially-sorted arrays it takes linear time
  • Number of exchanges equals the number of inversions
  • Shell sort -- move more than one position by h-sorting the array
  • A g-sorted array remains g-sorted after h-sorting
  • What sequence to use for the sorting stride?
  • The worst case for number of compares with 3*x+1 sequence is O(N**(3/2))
  • Shuffling
  • One idea -- generate a random number for each item, then sort by it
  • Knuth's shuffle algorithm has linear runtime complexity
  • Convex hull of a set of N points is the smallest perimeter fence enclosing all points
  • Mechanical algorithm -- hammer a nail at each point, then stretch a rubber band around the nails
  • Farthest pair problem -- given N points in the plane, find a pair of points with the largest Euclidean distance between them
  • The vertices of the convex hull appear in increasing order of polar angle with respect to point p with the lowest y coordinate
  • Given 3 points, how do we compute whether they form a counterclockwise (ccw) turn?
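One standard answer to the last question, as used in the Graham scan: take the sign of the 2x2 determinant (cross product), which is twice the signed area of the triangle. A sketch:

```java
public class CCW {
    // returns +1 if a->b->c is a counterclockwise turn, -1 if clockwise, 0 if collinear
    public static int ccw(double ax, double ay, double bx, double by, double cx, double cy) {
        double area2 = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax); // twice the signed area
        if (area2 > 0) return +1;
        if (area2 < 0) return -1;
        return 0;
    }
}
```

In production code one must also worry about floating-point degeneracy near area2 == 0.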

Week 3

Mergesort

We study the mergesort algorithm and show that it guarantees to sort any array of N items with at most N lg N compares. We also consider a nonrecursive, bottom-up version. We prove that any compare-based sorting algorithm must make at least ~ N lg N compares in the worst case. We discuss using different orderings for the objects that we are sorting and the related concept of stability.

  • Divide array into two halves, sort each half, merge the two halves
  • Uses at most N * log(N) compares and 6N * log(N) array accesses
  • Uses extra space proportional to N; a sorting algorithm is in-place if it uses less than c*N extra memory
  • Practical improvements
  • For small subarrays, fall back to insertion sort
  • Check if both halves are already sorted before merging
  • Save time but not space by switching the roles of aux and main array in each recursive call
  • Bottom-up mergesort -- non-recursive version of mergesort algorithm
  • Complexity of sorting
  • Model of computation, cost model, upper bound, lower bound, optimal algorithm
  • Any compare-based sorting algorithm must use lg(N!) ~ N*lg(N) compares in the worst case
  • Lower bound may not hold if algorithm has information about:
  • Initial order of the input
  • Distribution of the keys
  • Representation of the keys
  • Comparator interface -- must be a total order
  • A stable sort preserves the relative order of items with equal keys
  • Insertion sort is stable because it does not move items over each other
  • Selection sort is not stable because long-distance exchange may move an item past some equal item
  • Shellsort is not stable because of long-distance exchanges
  • Merge sort is stable. Yeah!
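The merge step underlying all of the above can be sketched as follows, assuming a[lo..mid] and a[mid+1..hi] are each already sorted (the recursive sort() driver is omitted for brevity):

```java
public class Merge {
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static void merge(Comparable[] a, Comparable[] aux, int lo, int mid, int hi) {
        for (int k = lo; k <= hi; k++) aux[k] = a[k]; // copy to auxiliary array
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if      (i > mid)                      a[k] = aux[j++]; // left half exhausted
            else if (j > hi)                       a[k] = aux[i++]; // right half exhausted
            else if (aux[j].compareTo(aux[i]) < 0) a[k] = aux[j++]; // take the smaller item...
            else                                   a[k] = aux[i++]; // ...left half on ties (stability)
        }
    }
}
```

Taking from the left half on ties is exactly what makes mergesort stable.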

Quicksort

We introduce and implement the randomized quicksort algorithm and analyze its performance. We also consider randomized quickselect, a quicksort variant which finds the kth smallest item in linear time. Finally, we consider 3-way quicksort, a variant of quicksort that works especially well in the presence of duplicate keys.

  • One of the top 10 algorithms of the 20th century
  • Basic plan -- shuffle the array, partition around a pivot so that all entries to the left are smaller than any entry to the right, then sort each half recursively
  • Implementation details
  • Partitioning in-place -- using an extra array makes partitioning easier and stable, but is not worth the cost
  • Testing whether the pointers cross is tricky
  • Shuffling is needed for the performance guarantee
  • When duplicate keys are present, it is (counter-intuitively) better to stop on keys equal to the partitioning item's key
  • Use insertion sort when the subarrays become small
  • Choosing the median value for pivot -- median of sample, median-of-3 random items
  • Best case number of compares is ~N*lg(N), worst case is ~(N**2)/2; average case for an array of N distinct keys is ~2*N*ln(N) compares and ~N*ln(N)/3 exchanges
  • Most textbook implementations go quadratic if the array:
  • Is sorted or reverse sorted
  • Has many duplicate keys
  • Quicksort is an in-place sorting algorithm that is not stable
  • Selection -- given an array of N items, find the k-th smallest
  • Sort the array, then return the k-th element
  • Quick-select uses the quicksort partitioning algorithm, but after partitioning it recurs into only one of the two halves
  • Takes linear time on average, but the worst case is ~(N**2)/2; shuffling protects against that
  • There is an algorithm that guarantees linear worst-case time, but the constants are too high
  • Quicksort goes quadratic unless partitioning stops on equal keys (as a C user discovered in the 1990s)
  • Solution is 3-way partitioning
  • In the presence of duplicate keys randomized quicksort with 3-way partitioning reduces running time from linearithmic to linear
  • Practical implementations of quicksort:
  • Cutoff to insertion sort for small subarrays
  • Bentley-McIlroy 3-way partitioning
  • Partitioning item:
  • Small arrays -- middle item
  • Medium arrays -- median of 3
  • Large arrays -- Tukey's ninther (median of the medians of 3 samples, each of 3 entries)
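Dijkstra's 3-way partitioning mentioned above can be sketched as follows (int keys for brevity; a real implementation would shuffle first and cut off to insertion sort for small subarrays). After the loop, a[lo..lt-1] < v, a[lt..gt] == v, and a[gt+1..hi] > v:

```java
public class Quick3way {
    public static void sort(int[] a, int lo, int hi) {
        if (hi <= lo) return;
        int lt = lo, gt = hi, i = lo + 1;
        int v = a[lo]; // partitioning item
        while (i <= gt) {
            if      (a[i] < v) swap(a, lt++, i++); // grow the "less than" region
            else if (a[i] > v) swap(a, i, gt--);   // grow the "greater than" region
            else               i++;                // equal keys stay in place
        }
        sort(a, lo, lt - 1); // recursively sort keys < v
        sort(a, gt + 1, hi); // recursively sort keys > v; keys == v are never touched again
    }

    private static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```

Because the equal-to-v block is excluded from both recursive calls, duplicate-heavy inputs finish in linear time.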

Week 4

This week we are going to introduce two fundamental data types, address the challenges of developing algorithms and data structures that can serve as the basis of efficient implementations, and try to convince you that such implementations enable solution of a broad range of applications problems that could not be solved without them.

Priority Queues

We introduce the priority queue data type and an efficient implementation using the binary heap data structure. This implementation also leads to an efficient sorting algorithm known as heapsort. We conclude with an application of priority queues in which we simulate the motion of N particles subject to the laws of elastic collision.

  • API -- insert, max, delMax; challenge is to implement these operations efficiently (logarithmic time)
  • Complete binary tree is perfectly balanced except for bottom level, has lg(N) height
  • Binary heap -- array representation of a heap-ordered complete binary tree; parent's key is no smaller than the children's keys; array indices can be used to move through the tree
  • Promotion -- swim() operation to exchange key in child with key in parent
  • Insert -- add node at end, then promote it, at most 1+ lg(N) compares
  • Demotion -- sink() operation to exchange key in parent with key in larger child
  • Delete max -- exchange root with node at end, remove the "old" root (max element) and sink the "new" root
  • Heapsort is an in-place sort that builds a max-heap from all items and then repeatedly removes the maximum key until the heap is empty
  • Heapsort makes one pass to put the unordered heap in order (at most 2N compares and exchanges), then makes second pass to remove the keys one-by-one, for a total of at most 2N*lg(N) compares and exchanges
  • Heapsort makes poor use of cache memory, is not stable, and its inner loop is longer than quicksort's
  • Applications of Priority Queues -- one example is event-driven simulation, many many more
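The swim/sink mechanics above can be sketched with a fixed-capacity, 1-indexed array max-heap (a simplification; the course version is generic and resizing):

```java
public class MaxPQ {
    private final int[] pq; // pq[1..n], pq[0] unused; parent of node k is k/2
    private int n = 0;

    public MaxPQ(int capacity) { pq = new int[capacity + 1]; }

    public void insert(int x) { pq[++n] = x; swim(n); } // add at end, then promote

    public int delMax() {
        int max = pq[1];
        swap(1, n--); // exchange root with last node, then demote the new root
        sink(1);
        return max;
    }

    private void swim(int k) { // promotion: child exchanges with parent until heap order holds
        while (k > 1 && pq[k / 2] < pq[k]) { swap(k / 2, k); k = k / 2; }
    }

    private void sink(int k) { // demotion: parent exchanges with larger child
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && pq[j] < pq[j + 1]) j++; // pick the larger child
            if (pq[k] >= pq[j]) break;
            swap(k, j);
            k = j;
        }
    }

    private void swap(int i, int j) { int t = pq[i]; pq[i] = pq[j]; pq[j] = t; }
}
```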

Symbol Tables

We define an API for symbol tables (also known as associative arrays) and describe two elementary implementations using a sorted array (binary search) and an unordered list (sequential search). When the keys are Comparable, we define an extended API that includes the additional methods min, max, floor, ceiling, rank, and select. To develop an efficient implementation of this API, we study the binary search tree data structure and analyze its performance.

  • Key-value pair abstraction, insert a value with specified key, given a key find the corresponding value
  • Basic API -- associative array abstraction, keys have an equivalence relation; additional operations rank, floor, ceiling
  • Elementary implementations -- unordered linked list, sorted array
  • Binary Search Tree (BST) is a binary tree in symmetric order -- each node has a key, and every node's key is:
    • Larger than all keys in its left subtree
    • Smaller than all keys in its right subtree
  • If N distinct keys are inserted into a BST in random order the expected number of compares for a search/insert is ~ 2*lg(N), but worst case is N
  • In-order traversal (with queue) -- traverse left subtree, enqueue key, traverse right subtree; reverse-order uses stack
  • Deletion
    • Using tombstones leads to memory overload
    • Deleting minimum key
    • Hibbard deletion
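The BST search and insert operations can be sketched as follows (int keys and String values for brevity; the course version is generic over Comparable keys):

```java
public class BST {
    private Node root;

    private static class Node {
        int key; String val; Node left, right;
        Node(int key, String val) { this.key = key; this.val = val; }
    }

    public String get(int key) {
        Node x = root;
        while (x != null) {
            if      (key < x.key) x = x.left;  // smaller keys live in the left subtree
            else if (key > x.key) x = x.right; // larger keys live in the right subtree
            else return x.val;                 // search hit
        }
        return null; // search miss
    }

    public void put(int key, String val) { root = put(root, key, val); }

    private Node put(Node x, int key, String val) {
        if (x == null) return new Node(key, val); // insert at the null link we fell off
        if      (key < x.key) x.left  = put(x.left, key, val);
        else if (key > x.key) x.right = put(x.right, key, val);
        else x.val = val; // duplicate key: overwrite the value
        return x;
    }
}
```

The number of compares for both operations is 1 + depth of the node, hence ~2*lg(N) on average for random insertion order.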

Week 5

Can we guarantee fast search, insert, delete, min, max, floor, ceiling, rank, and select in a symbol table with Comparable keys? This week, you will learn that the answer to this question is a resounding Yes! and that a relatively compact implementation can do the job. Then, we consider applications and generalizations of binary search trees to problems in computational geometry.

Balanced Search Trees

In this lecture, our goal is to develop a symbol table with guaranteed logarithmic performance for search and insert (and many other operations). We begin with 2-3 trees, which are easy to analyze but hard to implement. Next, we consider red-black binary search trees, which we view as a novel way to implement 2-3 trees as binary search trees. Finally, we introduce B-trees, a generalization of 2-3 trees that are widely used to implement file systems.

  • 2-3 Tree -- allow 1 or 2 keys per node, 2-node: 1 key 2 children, 3-node: 2 keys 3 children
    • Every path from root to null link has same length, guarantees logarithmic performance for search and insert
    • Each transformation maintains symmetric order and perfect balance
    • Tree height -- worst case is lg(N) (all 2-nodes), best case is log3(N) (all 3-nodes)
    • Direct manipulation is complicated
  • Red-Black Tree -- represent a 2-3 tree as a BST, using "internal" left-leaning red links as "glue" for 3-nodes
    • No node has 2 red links connected to it
    • Perfect black balance -- every path from root to null link has the same number of black links
    • Red links lean left
    • Most operations are same as BST
    • Maintain 1-1 correspondence with 2-3 trees by applying elementary red-black BST operations -- left rotation, right rotation, color flip
    • Height of tree is less than 2 * lg(N) and close to lg(N) in typical applications
  • B-tree -- generalize 2-3 tree by allowing up to M-1 key-link pairs per node
    • At least 2 key-link pairs at the root, at least M/2 key-link pairs in other nodes
    • A search or an insertion in a B-tree of order M with N keys requires between log_{M-1}(N) and log_{M/2}(N) probes, in practice at most 4
  • Geometric applications of BSTs
    • 1d range search
    • Line segment intersection
    • Kd trees
    • Interval search trees
    • Rectangle intersection
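The three elementary red-black BST operations can be sketched as follows (node structure trimmed to the essentials; field and class names are mine):

```java
public class RedBlackOps {
    static final boolean RED = true, BLACK = false;

    static class Node {
        int key; Node left, right;
        boolean color; // color of the link from this node's parent
        Node(int key, boolean color) { this.key = key; this.color = color; }
    }

    // orient a (temporarily) right-leaning red link to lean left
    static Node rotateLeft(Node h) {
        Node x = h.right;
        h.right = x.left;
        x.left = h;
        x.color = h.color; // x takes over h's link color
        h.color = RED;     // the rotated link is red
        return x;
    }

    // orient a left-leaning red link to (temporarily) lean right
    static Node rotateRight(Node h) {
        Node x = h.left;
        h.left = x.right;
        x.right = h;
        x.color = h.color;
        h.color = RED;
        return x;
    }

    // split a temporary 4-node: recolor both children black, pass red up to the parent
    static void flipColors(Node h) {
        h.color = RED;
        h.left.color = BLACK;
        h.right.color = BLACK;
    }
}
```

Each operation is a constant number of pointer and color changes, which is why insertion stays logarithmic.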

Geometric Applications of BSTs

We start with 1d and 2d range searching, where the goal is to find all points in a given 1d or 2d interval. To accomplish this, we consider kd-trees, a natural generalization of BSTs when the keys are points in the plane (or higher dimensions). We also consider intersection problems, where the goal is to find all intersections among a set of line segments or rectangles.

Week 6

We conclude the course by considering hash tables, a data structure that achieves constant-time performance for core symbol table operations, provided that search keys are standard data types or simply defined. Then we consider several fundamental (and useful) examples of symbol-table clients.

Hash Tables

We begin by describing the desirable properties of hash function and how to implement them in Java, including a fundamental tenet known as the uniform hashing assumption that underlies the potential success of a hashing application. Then, we consider two strategies for implementing hash tables—separate chaining and linear probing. Both strategies yield constant-time performance for search and insert under the uniform hashing assumption. We conclude with applications of symbol tables including sets, dictionary clients, indexing clients, and sparse vectors.

  • Better average performance than red-black BST (constant time) though weaker performance guarantee
  • Does not support range and ordered symbol table operations
  • How to choose a good hash function? It needs an equality test for keys but not comparison, and potential collisions have to be resolved
    • Default implementation -- returns the memory address of the object
    • Java provides hashCode() method for every class
    • Horner's method to hash a string of length L
    • Can be used for composite types -- start with 17 and combine each significant field using the 31*x + y rule
    • Uniform hashing assumption -- each key is equally likely to hash to an integer between 0 and M-1
    • Always use all bits (chars) in string, or all significant fields of composite key (or face denial-of-service attack)
  • Collision resolution
    • Separate Chaining with linked list to store all items that collide
      • Number of probes for search/insert is close to N/M, assuming uniform hashing
      • Easier to implement delete, performance degrades gracefully, clustering is less sensitive to poorly-designed hash functions
    • Linear Probing -- when a new key collides, find the next empty slot and put it there
      • Size of array M must be equal or (better) greater than number of key-value pairs N
      • Keys tend to cluster
      • Mean displacement of keys is ~3/2 with M/2 keys, but with M keys it jumps to ~sqrt(PI*M/8)
      • Less wasted space and better cache performance
    • Two-probe hashing
    • Double hashing (more difficult to implement delete)
    • Cuckoo hashing
  • Crypto-strength hash functions (MD5, SHA, RIPEMD-160) are great but too slow
  • Applications -- sets, dictionary clients, indexing clients, sparse vectors
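The "start with 17 and combine each significant field using the 31*x + y rule" recipe can be sketched on a hypothetical composite key (the PhoneNumber class and its fields are illustrative, not from the course code):

```java
public class PhoneNumber {
    private final int area, exch, ext;

    public PhoneNumber(int area, int exch, int ext) {
        this.area = area; this.exch = exch; this.ext = ext;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof PhoneNumber)) return false;
        PhoneNumber p = (PhoneNumber) o;
        return area == p.area && exch == p.exch && ext == p.ext;
    }

    @Override public int hashCode() {
        int hash = 17;                             // nonzero starting constant
        hash = 31 * hash + Integer.hashCode(area); // combine each significant field
        hash = 31 * hash + Integer.hashCode(exch); // 31 is an odd prime; 31*x = (x<<5)-x
        hash = 31 * hash + Integer.hashCode(ext);
        return hash;
    }

    // to index into a table of size m, strip the sign bit before taking the modulus
    public static int index(int hashCode, int m) {
        return (hashCode & 0x7fffffff) % m;
    }
}
```

Note that equals() and hashCode() must agree: equal keys are required to produce equal hash codes, or table lookups silently fail.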