@misho-kr
Last active September 30, 2016 06:03
Summary of "Algorithms, Part I" course at Coursera.Org

The course is based on a variety of material that we have prepared over many years:

  • Our textbook Algorithms, 4th edition is the basic reference for the material we will be covering
  • Our booksite, which is open to everyone and contains a wealth of supplementary information, including synopses of the textbook and Java code that you will be using throughout the course

Week 1

Union-Find

We illustrate our basic approach to developing and analyzing algorithms by considering the dynamic connectivity problem. We introduce the union-find data type and consider several implementations (quick find, quick union, weighted quick union, and weighted quick union with path compression). Finally, we apply the union-find data type to the percolation problem from physical chemistry.

  • Dynamic connectivity -- union command to connect two objects, find/connected query to check for connection between two objects
  • "Is connected to" is an equivalence relation -- reflexive, symmetric, and transitive
  • Connected components -- maximal set of objects that are mutually connected
  • Union-find data type and API
  • Quick-find (eager approach) -- find is O(1), initialize and union is O(N)
  • Quick-union (lazy approach) -- trees can get tall, find could be O(N)
  • Improvements
  • Weighted quick union -- keep track of size of each tree, balance the tree by linking smaller tree to root of larger tree
  • Depth of any node is at most lg(N), so union and connected operations are O(lg(N))
  • Quick union with path compression -- after computing the root of p, set the id of each examined node to point to that root; keeps the tree almost completely flat
  • In theory WQUPC is not quite linear, but in practice it is effectively linear
  • Steps to developing a usable algorithm
  • Model the problem
  • Find an algorithm to solve it
  • Analyze runtime complexity and memory consumption
  • Find a way to improve, iterate until satisfied
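The improvements above can be sketched together. A minimal sketch of weighted quick-union with path compression, assuming integer sites 0..n-1 (class and method names are mine, not necessarily the course's exact API):

```java
public class UnionFind {
    private final int[] parent; // parent[i] = parent of i
    private final int[] size;   // size[i] = number of nodes in tree rooted at i

    public UnionFind(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    private int root(int p) {
        while (p != parent[p]) {
            parent[p] = parent[parent[p]]; // path compression: point to grandparent
            p = parent[p];
        }
        return p;
    }

    public boolean connected(int p, int q) { return root(p) == root(q); }

    public void union(int p, int q) {
        int rp = root(p), rq = root(q);
        if (rp == rq) return;
        // weighting: link root of smaller tree to root of larger tree
        if (size[rp] < size[rq]) { parent[rp] = rq; size[rq] += size[rp]; }
        else                     { parent[rq] = rp; size[rp] += size[rq]; }
    }
}
```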

Analysis of Algorithms

The basis of our approach for analyzing the performance of algorithms is the scientific method. We begin by performing computational experiments to measure the running times of our programs. We use these measurements to develop hypotheses about performance. Next, we create mathematical models to explain their behavior. Finally, we consider analyzing the memory usage of our Java programs.

  • Reasons to analyze algorithms -- predict performance, compare algorithms, provide guarantees, understand theoretical basis
  • Fast algorithms enable new technologies
  • Scientific method to analyze algorithms
  • Observe some feature of the natural world
  • Hypothesize a model that is consistent with the observations
  • Predict events using the hypothesis
  • Verify the predictions by making further observations
  • Validate by repeating until the hypothesis and observations agree
  • 3-Sum problem -- how many triples sum to exactly 0 (zero)
  • Log-log plot -- a power law a * ( N ** b ) plots as a straight line with slope b
  • Doubling hypothesis -- double the input size and take the ratio of consecutive running times
  • Running time is a * ( N ** b ) with b = lg ratio
  • a can be calculated from the above formula given b
  • Difficult to get precise measurements, but experiments are easier and cheaper than in other sciences
  • Mathematical model for running time
  • Cost of basic operations (cost model) and tilde notation
  • Simplifying the calculations -- estimate running time or memory as function of input size N, ignore lower order terms
  • Common order-of-growth classifications -- constant, logarithmic, linear, linearithmic, quadratic, cubic, exponential
  • Algorithm for 3-Sum problem that is O(N**2 log(N))
  • Step 1 -- sort the N distinct numbers, N**2 with insertion sort
  • Step 2 -- for each pair a[i] and a[j] do binary search for -(a[i]+a[j]), N**2 * log(N)
  • Types of analysis
  • Best case -- lower bound on cost, determined by "easiest" input, provides goal for all inputs
  • Worst case -- upper bound on cost, determined by "most difficult" input, provides guarantee for all inputs
  • Average case -- expected cost for random input, provides a way to predict performance
  • Optimal algorithm is one where no other algorithm can provide better guarantee
  • Estimation of memory usage
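The doubling hypothesis lends itself to a small calculation: measure T(N) and T(2N), take the lg of the ratio to estimate b, then solve T(N) = a * (N ** b) for a. A sketch with hypothetical timings (the numbers below are made up for illustration):

```java
public class DoublingEstimate {
    public static void main(String[] args) {
        // hypothetical measured running times (seconds) for N and 2N
        double n = 4000, t1 = 6.4;  // T(4000)
        double t2 = 51.1;           // T(8000)
        double b = Math.log(t2 / t1) / Math.log(2); // b = lg(ratio)
        double a = t1 / Math.pow(n, b);             // back out a from T(N) = a * N^b
        System.out.printf("b ~ %.1f, a ~ %.2e%n", b, a);
    }
}
```

With these timings the ratio is about 8, so b comes out close to 3, i.e. a cubic algorithm.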

Week 2

Bags, Queues and Stacks

We consider two fundamental data types for storing collections of objects: the stack and the queue. We implement each using either a singly-linked list or a resizing array. We introduce two advanced Java features—generics and iterators—that simplify client code. Finally, we consider various applications of stacks and queues ranging from parsing arithmetic expressions to simulating queueing systems.

  • Stacks -- LIFO, last in first out
  • Queues -- FIFO, first in first out
  • We are using interfaces for the basic structures to isolate the client from the details of the implementation
  • Two basic implementations
  • Linked list -- every operation takes constant time in the worst case, but extra time and space are needed for the links
  • Resizing array -- every operation takes amortized constant time, with less wasted space
  • Resizing arrays -- double the size when expanding, shrink by half when utilization drops below 25%
  • Avoid loitering by setting the slots to null when removing items
  • Generics to implement parameterized data structures and algorithms; no casting, and type errors are caught at compile time rather than at run time
  • Iterators, Iterable interface
  • Evaluate infix expressions (Dijkstra's two-stack algorithm), reverse Polish notation
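As a rough illustration of resizing arrays, generics, and iterators together, here is a minimal generic stack on a resizing array; a sketch of the lecture's approach (doubling when full, shrinking at 25% utilization), with names of my own choosing:

```java
import java.util.Iterator;

public class ResizingArrayStack<Item> implements Iterable<Item> {
    @SuppressWarnings("unchecked")
    private Item[] a = (Item[]) new Object[1]; // Java cannot create a generic array directly
    private int n = 0;

    public boolean isEmpty() { return n == 0; }

    public void push(Item item) {
        if (n == a.length) resize(2 * a.length); // double when full
        a[n++] = item;
    }

    public Item pop() {
        Item item = a[--n];
        a[n] = null; // avoid loitering
        if (n > 0 && n == a.length / 4) resize(a.length / 2); // shrink at 25% full
        return item;
    }

    @SuppressWarnings("unchecked")
    private void resize(int capacity) {
        Item[] copy = (Item[]) new Object[capacity];
        for (int i = 0; i < n; i++) copy[i] = a[i];
        a = copy;
    }

    public Iterator<Item> iterator() {
        return new Iterator<Item>() { // iterate LIFO, from top of stack down
            private int i = n;
            public boolean hasNext() { return i > 0; }
            public Item next() { return a[--i]; }
        };
    }
}
```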

Elementary Sorts

We introduce the sorting problem and Java's Comparable interface. We study two elementary sorting methods (selection sort and insertion sort) and a variation of one of them (shellsort). We also consider two algorithms for uniformly shuffling an array. We conclude with an application of sorting to computing the convex hull via the Graham scan algorithm.

  • Comparable interface
  • Selection sort
  • Invariants -- 1) entries to the left are sorted, 2) no entry to the right is smaller than any entry to the left
  • Complexity -- ~(N**2)/2 compares and N-1 exchanges, O(N**2) always
  • Insertion Sort
  • Invariants -- 1) entries to the left are sorted, 2) entries to the right have not been seen yet
  • Complexity -- on average ~(N**2)/4 compares and ~(N**2)/4 exchanges, best case is N-1 compares and 0 exchanges, worst case is ~(N**2)/2 compares and exchanges
  • For partially-sorted arrays it takes linear time
  • Number of exchanges equals the number of inversions
  • Shell sort -- move more than one position by h-sorting the array
  • A g-sorted array remains g-sorted after h-sorting
  • What sequence to use for the sorting stride?
  • The worst case for number of compares with 3*x+1 sequence is O(N**(3/2))
  • Shuffling
  • One idea -- generate a random number for each item, then sort by it
  • Knuth's shuffle algorithm has linear runtime complexity
  • Convex hull of a set of N points is the smallest perimeter fence enclosing all points
  • Mechanical algorithm -- hammer a nail at each point, then stretch a rubber band around the nails
  • Farthest pair problem -- given N points in the plane, find a pair of points with the largest Euclidean distance between them
  • The vertices of the convex hull appear in increasing order of polar angle with respect to point p with the lowest y coordinate
  • Given 3 points, how do we compute whether they form a counterclockwise (ccw) turn?
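One standard answer to the last question, as used in the Graham scan: take the sign of the 2x2 determinant (cross product), which is twice the signed area of the triangle. A sketch:

```java
public class CCW {
    // returns +1 if a->b->c is a counterclockwise turn, -1 if clockwise, 0 if collinear
    public static int ccw(double ax, double ay, double bx, double by, double cx, double cy) {
        double area2 = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax); // twice the signed area
        if (area2 > 0) return +1;
        if (area2 < 0) return -1;
        return 0;
    }
}
```

In production code one must also worry about floating-point degeneracy near area2 == 0.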

Week 3

Mergesort

We study the mergesort algorithm and show that it guarantees to sort any array of N items with at most N lg N compares. We also consider a nonrecursive, bottom-up version. We prove that any compare-based sorting algorithm must make at least ~ N lg N compares in the worst case. We discuss using different orderings for the objects that we are sorting and the related concept of stability.

  • Divide array into two halves, sort each half, merge the two halves
  • Uses at most N * log(N) compares and 6N * log(N) array accesses
  • Uses extra space proportional to N; a sorting algorithm is in-place if it uses less than c*N extra memory
  • Practical improvements
  • For small subarrays, fall back to insertion sort
  • Check if both halves are already sorted before merging
  • Save time but not space by switching the roles of aux and main array in each recursive call
  • Bottom-up mergesort -- non-recursive version of mergesort algorithm
  • Complexity of sorting
  • Model of computation, cost model, upper bound, lower bound, optimal algorithm
  • Any compare-based sorting algorithm must use lg(N!) ~ N*lg(N) compares in the worst case
  • Lower bound may not hold if algorithm has information about:
  • Initial order of the input
  • Distribution of the keys
  • Representation of the keys
  • Comparator interface -- must be a total order
  • A stable sort preserves the relative order of items with equal keys
  • Insertion sort is stable because it does not move items over each other
  • Selection sort is not stable because long-distance exchange may move an item past some equal item
  • Shellsort is not stable because of long-distance exchanges
  • Merge sort is stable. Yeah!
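The merge step underlying all of the above can be sketched as follows, assuming a[lo..mid] and a[mid+1..hi] are each already sorted (the recursive sort() driver is omitted for brevity):

```java
public class Merge {
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static void merge(Comparable[] a, Comparable[] aux, int lo, int mid, int hi) {
        for (int k = lo; k <= hi; k++) aux[k] = a[k]; // copy to auxiliary array
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if      (i > mid)                      a[k] = aux[j++]; // left half exhausted
            else if (j > hi)                       a[k] = aux[i++]; // right half exhausted
            else if (aux[j].compareTo(aux[i]) < 0) a[k] = aux[j++]; // take the smaller item...
            else                                   a[k] = aux[i++]; // ...left half on ties (stability)
        }
    }
}
```

Taking from the left half on ties is exactly what makes mergesort stable.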

Quicksort

We introduce and implement the randomized quicksort algorithm and analyze its performance. We also consider randomized quickselect, a quicksort variant which finds the kth smallest item in linear time. Finally, we consider 3-way quicksort, a variant of quicksort that works especially well in the presence of duplicate keys.

  • One of the top 10 algorithms of the 20th century
  • Basic plan -- shuffle the array, partition around a pivot so that all entries to the left are smaller than any entry to the right, then sort each half recursively
  • Implementation details
  • Partitioning in-place -- using an extra array makes partitioning easier and stable, but is not worth the cost
  • Testing whether the pointers cross is tricky
  • Shuffling is needed for the performance guarantee
  • When duplicate keys are present, it is (counter-intuitively) better to stop on keys equal to the partitioning item's key
  • Use insertion sort when the subarrays become small
  • Choosing the median value for pivot -- median of sample, median-of-3 random items
  • Best case number of compares is ~N*lg(N), worst case is ~(N**2)/2; average case for an array of N distinct keys is ~2*N*ln(N) compares and ~N*ln(N)/3 exchanges
  • Most textbook implementations go quadratic if the array:
  • Is sorted or reverse sorted
  • Has many duplicate keys
  • Quicksort is an in-place sorting algorithm that is not stable
  • Selection -- given an array of N items, find the k-th smallest
  • Sort the array, then return the k-th element
  • Quick-select uses the quicksort partitioning algorithm, but after partitioning it recurs into only one of the two halves
  • Takes linear time on average, but the worst case is ~(N**2)/2; shuffling protects against that
  • There is an algorithm that guarantees linear worst-case time, but the constants are too high
  • Quicksort goes quadratic unless partitioning stops on equal keys (as a C user discovered in the 1990s)
  • Solution is 3-way partitioning
  • In the presence of duplicate keys randomized quicksort with 3-way partitioning reduces running time from linearithmic to linear
  • Practical implementations of quicksort:
  • Cutoff to insertion sort for small subarrays
  • Bentley-McIlroy 3-way partitioning
  • Partitioning item:
  • Small arrays -- middle item
  • Medium arrays -- median of 3
  • Large arrays -- Tukey's ninther (median of the medians of 3 samples, each of 3 entries)
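Dijkstra's 3-way partitioning mentioned above can be sketched as follows (int keys for brevity; a real implementation would shuffle first and cut off to insertion sort for small subarrays). After the loop, a[lo..lt-1] < v, a[lt..gt] == v, and a[gt+1..hi] > v:

```java
public class Quick3way {
    public static void sort(int[] a, int lo, int hi) {
        if (hi <= lo) return;
        int lt = lo, gt = hi, i = lo + 1;
        int v = a[lo]; // partitioning item
        while (i <= gt) {
            if      (a[i] < v) swap(a, lt++, i++); // grow the "less than" region
            else if (a[i] > v) swap(a, i, gt--);   // grow the "greater than" region
            else               i++;                // equal keys stay in place
        }
        sort(a, lo, lt - 1); // recursively sort keys < v
        sort(a, gt + 1, hi); // recursively sort keys > v; keys == v are never touched again
    }

    private static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```

Because the equal-to-v block is excluded from both recursive calls, duplicate-heavy inputs finish in linear time.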

Week 4

This week we are going to introduce two fundamental data types, address the challenges of developing algorithms and data structures that can serve as the basis of efficient implementations, and try to convince you that such implementations enable solution of a broad range of applications problems that could not be solved without them.

Priority Queues

We introduce the priority queue data type and an efficient implementation using the binary heap data structure. This implementation also leads to an efficient sorting algorithm known as heapsort. We conclude with an application of priority queues in which we simulate the motion of N particles subject to the laws of elastic collision.

  • API -- insert, max, delMax; challenge is to implement these operations efficiently (logarithmic time)
  • Complete binary tree is perfectly balanced except for bottom level, has lg(N) height
  • Binary heap -- array representation of a heap-ordered complete binary tree; parent's key is no smaller than the children's keys; array indices can be used to move through the tree
  • Promotion -- swim() operation to exchange key in child with key in parent
  • Insert -- add node at end, then promote it, at most 1+ lg(N) compares
  • Demotion -- sink() operation to exchange key in parent with key in larger child
  • Delete max -- exchange root with node at end, remove the "old" root (max element) and sink the "new" root
  • Heapsort is an in-place sort that builds a max-heap from all items and then repeatedly removes the maximum key until the heap is empty
  • Heapsort makes one pass to put the unordered heap in order (at most 2N compares and exchanges), then makes second pass to remove the keys one-by-one, for a total of at most 2N*lg(N) compares and exchanges
  • Heapsort makes poor use of cache memory, is not stable, and its inner loop is longer than quicksort's
  • Applications of Priority Queues -- one example is event-driven simulation, many many more
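The swim/sink mechanics above can be sketched with a fixed-capacity, 1-indexed array max-heap (a simplification; the course version is generic and resizing):

```java
public class MaxPQ {
    private final int[] pq; // pq[1..n], pq[0] unused; parent of node k is k/2
    private int n = 0;

    public MaxPQ(int capacity) { pq = new int[capacity + 1]; }

    public void insert(int x) { pq[++n] = x; swim(n); } // add at end, then promote

    public int delMax() {
        int max = pq[1];
        swap(1, n--); // exchange root with last node, then demote the new root
        sink(1);
        return max;
    }

    private void swim(int k) { // promotion: child exchanges with parent until heap order holds
        while (k > 1 && pq[k / 2] < pq[k]) { swap(k / 2, k); k = k / 2; }
    }

    private void sink(int k) { // demotion: parent exchanges with larger child
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && pq[j] < pq[j + 1]) j++; // pick the larger child
            if (pq[k] >= pq[j]) break;
            swap(k, j);
            k = j;
        }
    }

    private void swap(int i, int j) { int t = pq[i]; pq[i] = pq[j]; pq[j] = t; }
}
```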

Symbol Tables

We define an API for symbol tables (also known as associative arrays) and describe two elementary implementations using a sorted array (binary search) and an unordered list (sequential search). When the keys are Comparable, we define an extended API that includes the additional methods min, max, floor, ceiling, rank, and select. To develop an efficient implementation of this API, we study the binary search tree data structure and analyze its performance.

  • Key-value pair abstraction, insert a value with specified key, given a key find the corresponding value
  • Basic API -- associative array abstraction, keys have an equivalence relation; additional operations rank, floor, ceiling
  • Elementary implementations -- unordered linked list, sorted array
  • Binary Search Tree (BST) is a binary tree in symmetric order -- each node has a key, and every node's key is:
    • Larger than all keys in its left subtree
    • Smaller than all keys in its right subtree
  • If N distinct keys are inserted into a BST in random order the expected number of compares for a search/insert is ~ 2*lg(N), but worst case is N
  • In-order traversal (with queue) -- traverse left subtree, enqueue key, traverse right subtree; reverse-order uses stack
  • Deletion
    • Using tombstones leads to memory overload
    • Deleting minimum key
    • Hibbard deletion
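The BST search and insert operations can be sketched as follows (int keys and String values for brevity; the course version is generic over Comparable keys):

```java
public class BST {
    private Node root;

    private static class Node {
        int key; String val; Node left, right;
        Node(int key, String val) { this.key = key; this.val = val; }
    }

    public String get(int key) {
        Node x = root;
        while (x != null) {
            if      (key < x.key) x = x.left;  // smaller keys live in the left subtree
            else if (key > x.key) x = x.right; // larger keys live in the right subtree
            else return x.val;                 // search hit
        }
        return null; // search miss
    }

    public void put(int key, String val) { root = put(root, key, val); }

    private Node put(Node x, int key, String val) {
        if (x == null) return new Node(key, val); // insert at the null link we fell off
        if      (key < x.key) x.left  = put(x.left, key, val);
        else if (key > x.key) x.right = put(x.right, key, val);
        else x.val = val; // duplicate key: overwrite the value
        return x;
    }
}
```

The number of compares for both operations is 1 + depth of the node, hence ~2*lg(N) on average for random insertion order.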

Week 5

Can we guarantee fast search, insert, delete, min, max, floor, ceiling, rank, and select in a symbol table with Comparable keys? This week, you will learn that the answer to this question is a resounding Yes! and that a relatively compact implementation can do the job. Then, we consider applications and generalizations of binary search trees to problems in computational geometry.

Balanced Search Trees

In this lecture, our goal is to develop a symbol table with guaranteed logarithmic performance for search and insert (and many other operations). We begin with 2-3 trees, which are easy to analyze but hard to implement. Next, we consider red-black binary search trees, which we view as a novel way to implement 2-3 trees as binary search trees. Finally, we introduce B-trees, a generalization of 2-3 trees that are widely used to implement file systems.

  • 2-3 Tree -- allow 1 or 2 keys per node, 2-node: 1 key 2 children, 3-node: 2 keys 3 children
    • Every path from root to null link has same length, guarantees logarithmic performance for search and insert
    • Each transformation maintains symmetric order and perfect balance
    • Tree height -- worst case is lg(N) (all 2-nodes), best case is log3(N) (all 3-nodes)
    • Direct manipulation is complicated
  • Red-Black Tree -- represent a 2-3 tree as a BST, using "internal" left-leaning red links as "glue" for 3-nodes
    • No node has 2 red links connected to it
    • Perfect black balance -- every path from root to null link has the same number of black links
    • Red links lean left
    • Most operations are same as BST
    • Maintain 1-1 correspondence with 2-3 trees by applying elementary red-black BST operations -- left rotation, right rotation, color flip
    • Height of tree is less than 2 * lg(N) and close to lg(N) in typical applications
  • B-tree -- generalize 2-3 tree by allowing up to M-1 key-link pairs per node
    • At least 2 key-link pairs at the root, at least M/2 key-link pairs in other nodes
    • A search or an insertion in a B-tree of order M with N keys requires between log_{M-1}(N) and log_{M/2}(N) probes, in practice at most 4
  • Geometric applications of BSTs
    • 1d range search
    • Line segment intersection
    • Kd trees
    • Interval search trees
    • Rectangle intersection
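The three elementary red-black BST operations can be sketched as follows (node structure trimmed to the essentials; field and class names are mine):

```java
public class RedBlackOps {
    static final boolean RED = true, BLACK = false;

    static class Node {
        int key; Node left, right;
        boolean color; // color of the link from this node's parent
        Node(int key, boolean color) { this.key = key; this.color = color; }
    }

    // orient a (temporarily) right-leaning red link to lean left
    static Node rotateLeft(Node h) {
        Node x = h.right;
        h.right = x.left;
        x.left = h;
        x.color = h.color; // x takes over h's link color
        h.color = RED;     // the rotated link is red
        return x;
    }

    // orient a left-leaning red link to (temporarily) lean right
    static Node rotateRight(Node h) {
        Node x = h.left;
        h.left = x.right;
        x.right = h;
        x.color = h.color;
        h.color = RED;
        return x;
    }

    // split a temporary 4-node: recolor both children black, pass red up to the parent
    static void flipColors(Node h) {
        h.color = RED;
        h.left.color = BLACK;
        h.right.color = BLACK;
    }
}
```

Each operation is a constant number of pointer and color changes, which is why insertion stays logarithmic.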

Geometric Applications of BSTs

We start with 1d and 2d range searching, where the goal is to find all points in a given 1d or 2d interval. To accomplish this, we consider kd-trees, a natural generalization of BSTs when the keys are points in the plane (or higher dimensions). We also consider intersection problems, where the goal is to find all intersections among a set of line segments or rectangles.

Week 6

We conclude the course by considering hash tables, a data structure that achieves constant-time performance for core symbol table operations, provided that search keys are standard data types or simply defined. Then we consider several fundamental (and useful) examples of symbol-table clients.

Hash Tables

We begin by describing the desirable properties of hash function and how to implement them in Java, including a fundamental tenet known as the uniform hashing assumption that underlies the potential success of a hashing application. Then, we consider two strategies for implementing hash tables—separate chaining and linear probing. Both strategies yield constant-time performance for search and insert under the uniform hashing assumption. We conclude with applications of symbol tables including sets, dictionary clients, indexing clients, and sparse vectors.

  • Better average performance than red-black BST (constant time) though weaker performance guarantee
  • Does not support range and ordered symbol table operations
  • How to choose a good hash function? It needs an equality test for keys but not comparison, and potential collisions have to be resolved
    • Default implementation -- returns the memory address of the object
    • Java provides hashCode() method for every class
    • Horner's method to hash a string of length L
    • Can be used for composite types -- start with 17 and combine each significant field using the 31*x + y rule
    • Uniform hashing assumption -- each key is equally likely to hash to an integer between 0 and M-1
    • Always use all bits (chars) in string, or all significant fields of composite key (or face denial-of-service attack)
  • Collision resolution
    • Separate Chaining with linked list to store all items that collide
      • Number of probes for search/insert is close to N/M, assuming uniform hashing
      • Easier to implement delete, performance degrades gracefully, clustering is less sensitive to poorly-designed hash functions
    • Linear Probing -- when a new key collides, find the next empty slot and put it there
      • Size of array M must be equal or (better) greater than number of key-value pairs N
      • Keys tend to cluster
      • Mean displacement of keys is ~3/2 with M/2 keys, but with M keys it jumps to ~sqrt(PI*M/8)
      • Less wasted space and better cache performance
    • Two-probe hashing
    • Double hashing (more difficult to implement delete)
    • Cuckoo hashing
  • Crypto-strength hash functions (MD5, SHA, RIPEMD-160) are great but too slow
  • Applications -- sets, dictionary clients, indexing clients, sparse vectors
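The "start with 17 and combine each significant field using the 31*x + y rule" recipe can be sketched on a hypothetical composite key (the PhoneNumber class and its fields are illustrative, not from the course code):

```java
public class PhoneNumber {
    private final int area, exch, ext;

    public PhoneNumber(int area, int exch, int ext) {
        this.area = area; this.exch = exch; this.ext = ext;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof PhoneNumber)) return false;
        PhoneNumber p = (PhoneNumber) o;
        return area == p.area && exch == p.exch && ext == p.ext;
    }

    @Override public int hashCode() {
        int hash = 17;                             // nonzero starting constant
        hash = 31 * hash + Integer.hashCode(area); // combine each significant field
        hash = 31 * hash + Integer.hashCode(exch); // 31 is an odd prime; 31*x = (x<<5)-x
        hash = 31 * hash + Integer.hashCode(ext);
        return hash;
    }

    // to index into a table of size m, strip the sign bit before taking the modulus
    public static int index(int hashCode, int m) {
        return (hashCode & 0x7fffffff) % m;
    }
}
```

Note that equals() and hashCode() must agree: equal keys are required to produce equal hash codes, or table lookups silently fail.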