jianminchen/Learn KMP algorithm in 15 minutes

## Learn KMP algorithm in 15 minutes
Julia likes to study KMP algorithm again using lecture note. What she likes to do is to go over the lecture
notes word by word, and then focus on thinking process.

She started to review KMP using Chinese blog, but she found out that the blog is not clear, so she likes to
follow up with more detail lecture note.

She likes to write and also she likes to read good writing of the KMP algorithm.

https://www.ics.uci.edu/~eppstein/161/960227.html

Knuth-Morris-Pratt string matching
The problem: given a (short) pattern and a (long) text, both strings, determine whether the pattern appears
somewhere in the text. Last time we saw how to do this with finite automata. This time we'll go through the
Knuth-Morris-Pratt (KMP) algorithm, which can be thought of as an efficient way to build these automata. I
also have some working C++ source code which might help you understand the algorithm better.

First let's look at a naive solution.
suppose the text is in an array: char T[n]
and the pattern is in another array: char P[m].

One simple method is just to try each possible position the pattern could appear in the text.

Naive string matching:

    for (i=0; T[i] != '\0'; i++)
    {
      for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;
        if (P[j] == '\0')
          found a match
    }

There are two nested loops; the inner one takes O(m) iterations and the outer one takes O(n)
iterations so the total time is the product, O(mn). This is slow; we'd like to speed it up.

In practice this works pretty well -- not usually as bad as this O(mn) worst case analysis. This
is because the inner loop usually finds a mismatch quickly and move on to the next position without
going through all m steps. But this method still can take O(mn) for some inputs. In one bad
example, all characters in T[] are "a"s, and P[] is all "a"'s except for one "b" at the end.
Then it takes m comparisons each time to discover that you don't have a match, so mn overall.

Here's a more typical example. Each row represents an iteration of the outer loop, with each
character in the row representing the result of a comparison (X if the comparison was unequal).
Suppose we're looking for pattern "nano" in text "banananobano".

      0  1  2  3  4  5  6  7  8  9 10 11
      T: b  a  n  a  n  a  n  o  b  a  n  o

    i=0: X
    i=1:    X
    i=2:       n  a  n  X
    i=3:          X
    i=4:             n  a  n  o
    i=5:                X
    i=6:                   n  X
    i=7:                         X
    i=8:                            X
    i=9:                               n  X
    i=10:                                 X

Some of these comparisons are wasted work! For instance, after iteration i = 2, we know from the
comparisons we've done that T[3] = "a", so there is no point comparing it to "n" in iteration i = 3.
And we also know that T[4] = "n", so there is no point making the same comparison in iteration i = 4.

Skipping outer iterations

The Knuth-Morris-Pratt idea is, in this sort of situation, after you've invested a lot of work making
comparisons in the inner loop of the code, you know a lot about what's in the text.

Specifically, if you've found a partial match of j characters starting at position i, you know what's
in positions T[i]...T[i+j-1]. You can use this knowledge to save work in two ways. First, you can skip
some iterations for which no match is possible. Try overlapping the partial match you've found with
the new match you want to find:

    i=2: n  a  n
    i=3:    n  a  n  o
Here the two placements of the pattern conflict with each other -- we know from the i = 2 iteration
that T[3] and T[4] are "a" and "n", so they can't be the "n" and "a" that the i = 3 iteration is
looking for. We can keep skipping positions until we find one that doesn't conflict:
    i=2: n  a  n
    i=4:       n  a  n  o
Here the two "n"'s coincide. Define the overlap of two strings x and y to be the longest word that's
a suffix of x and a prefix of y. Here the overlap of "nan" and "nano" is just "n". (We don't allow
the overlap to be all of x or y, so it's not "nan"). In general the value of i we want to skip to
is the one corresponding to the largest overlap with the current partial match:

String matching with skipped iterations:

    i=0;
    while (i<n)
    {
      for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;

      if (P[j] == '\0') found a match;
        i = i + max(1, j-overlap(P[0..j-1],P[0..m]));
    }

Skipping inner iterations
The other optimization that can be done is to skip some iterations in the inner loop. Let's
look at the same example, in which we skipped from i = 2 to i = 4:
    i=2: n  a  n
    i=4:       n  a  n  o
In this example, the "n" that overlaps has already been tested by the i = 2 iteration. There's
no need to test it again in the i = 4 iteration. In general, if we have a nontrivial overlap
with the last partial match, we can avoid testing a number of characters equal to the length
of the overlap.

This change produces (a version of) the KMP algorithm:

KMP, version 1:

    i=0;
    o=0;
    while (i<n)
    {
      for (j=o; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;
      if (P[j] == '\0') found a match;
        o = overlap(P[0..j-1],P[0..m]);
        i = i + max(1, j-o);
    }

The only remaining detail is how to compute the overlap function. This is a function only of
j, and not of the characters in T[], so we can compute it once in a preprocessing stage before
we get to this part of the algorithm. First let's see how fast this algorithm is.

KMP time analysis
We still have an outer loop and an inner loop, so it looks like the time might still be O(mn).
But we can count it a different way to see that it's actually always less than that. The idea
is that every time through the inner loop, we do one comparison T[i+j]==P[j]. We can count the
total time of the algorithm by counting how many comparisons we perform.
We split the comparisons into two groups: those that return true, and those that return false.
If a comparison returns true, we've determined the value of T[i+j]. Then in future iterations,
as long as there is a nontrivial overlap involving T[i+j], we'll skip past that overlap and
not make a comparison with that position again. So each position of T[] is only involved in
one true comparison, and there can be n such comparisons total. On the other hand, there is at
most one false comparison per iteration of the outer loop, so there can also only be n of those.
As a result we see that this part of the KMP algorithm makes at most 2n comparisons and takes
time O(n).
	Julia likes to study KMP algorithm again using lecture note. What she likes to do is to go over the lecture
	notes word by word, and then focus on thinking process.

	She started to review KMP using Chinese blog, but she found out that the blog is not clear, so she likes to
	follow up with more detail lecture note.

	She likes to write and also she likes to read good writing of the KMP algorithm.

	https://www.ics.uci.edu/~eppstein/161/960227.html

	Knuth-Morris-Pratt string matching
	The problem: given a (short) pattern and a (long) text, both strings, determine whether the pattern appears
	somewhere in the text. Last time we saw how to do this with finite automata. This time we'll go through the
	Knuth-Morris-Pratt (KMP) algorithm, which can be thought of as an efficient way to build these automata. I
	also have some working C++ source code which might help you understand the algorithm better.

	First let's look at a naive solution.
	suppose the text is in an array: char T[n]
	and the pattern is in another array: char P[m].

	One simple method is just to try each possible position the pattern could appear in the text.

	Naive string matching:

	for (i=0; T[i] != '\0'; i++)
	{
	for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;
	if (P[j] == '\0')
	found a match
	}

	There are two nested loops; the inner one takes O(m) iterations and the outer one takes O(n)
	iterations so the total time is the product, O(mn). This is slow; we'd like to speed it up.

	In practice this works pretty well -- not usually as bad as this O(mn) worst case analysis. This
	is because the inner loop usually finds a mismatch quickly and move on to the next position without
	going through all m steps. But this method still can take O(mn) for some inputs. In one bad
	example, all characters in T[] are "a"s, and P[] is all "a"'s except for one "b" at the end.
	Then it takes m comparisons each time to discover that you don't have a match, so mn overall.

	Here's a more typical example. Each row represents an iteration of the outer loop, with each
	character in the row representing the result of a comparison (X if the comparison was unequal).
	Suppose we're looking for pattern "nano" in text "banananobano".

	0 1 2 3 4 5 6 7 8 9 10 11
	T: b a n a n a n o b a n o

	i=0: X
	i=1: X
	i=2: n a n X
	i=3: X
	i=4: n a n o
	i=5: X
	i=6: n X
	i=7: X
	i=8: X
	i=9: n X
	i=10: X

	Some of these comparisons are wasted work! For instance, after iteration i = 2, we know from the
	comparisons we've done that T[3] = "a", so there is no point comparing it to "n" in iteration i = 3.
	And we also know that T[4] = "n", so there is no point making the same comparison in iteration i = 4.

	Skipping outer iterations

	The Knuth-Morris-Pratt idea is, in this sort of situation, after you've invested a lot of work making
	comparisons in the inner loop of the code, you know a lot about what's in the text.

	Specifically, if you've found a partial match of j characters starting at position i, you know what's
	in positions T[i]...T[i+j-1]. You can use this knowledge to save work in two ways. First, you can skip
	some iterations for which no match is possible. Try overlapping the partial match you've found with
	the new match you want to find:

	i=2: n a n
	i=3: n a n o
	Here the two placements of the pattern conflict with each other -- we know from the i = 2 iteration
	that T[3] and T[4] are "a" and "n", so they can't be the "n" and "a" that the i = 3 iteration is
	looking for. We can keep skipping positions until we find one that doesn't conflict:
	i=2: n a n
	i=4: n a n o
	Here the two "n"'s coincide. Define the overlap of two strings x and y to be the longest word that's
	a suffix of x and a prefix of y. Here the overlap of "nan" and "nano" is just "n". (We don't allow
	the overlap to be all of x or y, so it's not "nan"). In general the value of i we want to skip to
	is the one corresponding to the largest overlap with the current partial match:

	String matching with skipped iterations:

	i=0;
	while (i<n)
	{
	for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;

	if (P[j] == '\0') found a match;
	i = i + max(1, j-overlap(P[0..j-1],P[0..m]));
	}

	Skipping inner iterations
	The other optimization that can be done is to skip some iterations in the inner loop. Let's
	look at the same example, in which we skipped from i = 2 to i = 4:
	i=2: n a n
	i=4: n a n o
	In this example, the "n" that overlaps has already been tested by the i = 2 iteration. There's
	no need to test it again in the i = 4 iteration. In general, if we have a nontrivial overlap
	with the last partial match, we can avoid testing a number of characters equal to the length
	of the overlap.

	This change produces (a version of) the KMP algorithm:

	KMP, version 1:

	i=0;
	o=0;
	while (i<n)
	{
	for (j=o; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;
	if (P[j] == '\0') found a match;
	o = overlap(P[0..j-1],P[0..m]);
	i = i + max(1, j-o);
	}

	The only remaining detail is how to compute the overlap function. This is a function only of
	j, and not of the characters in T[], so we can compute it once in a preprocessing stage before
	we get to this part of the algorithm. First let's see how fast this algorithm is.

	KMP time analysis
	We still have an outer loop and an inner loop, so it looks like the time might still be O(mn).
	But we can count it a different way to see that it's actually always less than that. The idea
	is that every time through the inner loop, we do one comparison T[i+j]==P[j]. We can count the
	total time of the algorithm by counting how many comparisons we perform.
	We split the comparisons into two groups: those that return true, and those that return false.
	If a comparison returns true, we've determined the value of T[i+j]. Then in future iterations,
	as long as there is a nontrivial overlap involving T[i+j], we'll skip past that overlap and
	not make a comparison with that position again. So each position of T[] is only involved in
	one true comparison, and there can be n such comparisons total. On the other hand, there is at
	most one false comparison per iteration of the outer loop, so there can also only be n of those.
	As a result we see that this part of the KMP algorithm makes at most 2n comparisons and takes
	time O(n).