Skip to content

Instantly share code, notes, and snippets.

@zacoppotamus
Created January 13, 2014 15:03
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save zacoppotamus/8401836 to your computer and use it in GitHub Desktop.
Save zacoppotamus/8401836 to your computer and use it in GitHub Desktop.
Boyer-Moore-Horspool video lesson transcript
// Video is found here: http://www.youtube.com/watch?v=J0eZJI7QRnI
Introduction:
Hello, today I'm going to teach you the Boyer-Moore-Horspool algorithm, a string-matching algorithm that is one of the most widely used ones in computer science.
---
String matching is an important application for many programs. Two obvious examples would be finding files in your computer or text in your editor when writing code. However other, very important applications also include DNA matching or field matching in databases.
One could say that substring search, that is, matching patterns in text is one of the more classic and widely researched areas in computer science.
---
Some of the well-known algorithms for substring searching are: The Brute-Force algorithm, Knuth-Morris-Pratt algorithm, Rabin-Karp Fingerprint Search and Boyer-Moore algorithm.
Today we will look at the Boyer-Moore algorithm. More specifically, we will learn a slight modification of this algorithm using the Horspool heuristic.
Preprocessing:
One of the areas where there is a big window of improvement for such algorithms performance-wise is when comparing our string pattern to the text and encountering a mismatch because of a certain character. The way each algorithm resolves this issue and decides what to do next is critical in terms of efficiency and performance.
---
To deal with this, the Boyer-Moore algorithm, as a preprocessing step before running the algorithm, creates a skip table. In essence, what this table does is dictate to the program what course of action it should assume if a certain character where to occur and cause a mismatch. Typically this is used to make our algorithm skip a set number of characters, which will make the process much faster.
In practice, we create a dictionary-like structure which gives, for every character in our alphabet, the index of its rightmost occurrence in the pattern. In case the character is not in the pattern, our structure will return -1.
Here, this part of the algorithm goes through our chosen alphabet. In this case, my alphabet is the extended ASCII table with 256 elements. What it does initially is go through that set of ASCII characters and initialize their corresponding positions on the table to 0. Then it goes through the characters in our pattern (the text we want to match) and for those characters updates their position in the skip table. So for the word 'tied' you will notice that 't' is in position 0, 'i' is in position 1 and so on... Notice however that if we have two same characters in our pattern, we add to the skip table the position of the rightmost one. You will see why later.
The Algorithm:
Let's look at the algorithm now. If I'm going too fast I encourage you to pause this video and take a look at the code for yourself or go back.
---
In the outer loop of the algorithm, i ranges from 0 to m+n. Ignore the +1, that's just Python specific syntax. We also keep track of a variable that will tell us how many positions we should shift the pattern in the next iteration of the outer loop.
---
What the inner loop does with the j there is start comparing the text with the pattern from the rightmost character. Again, in Python syntax what this range keyword signifies is that we go from m-1 to 0 with a step of -1, thus traversing our pattern from right to left.
---
If when comparing the two strings we find a character mismatch, we find the ASCII code of the character that caused the mismatch, and then see whether that character is in the pattern and if it is, its position.
---
To do that we consult the skip table. For example if a character is in position 2 in the pattern we skip j-2 positions. This is perhaps the trickiest part of the algorithm.
---
After we determine the number of skips in our inner loop we shift our pattern by adding that number to i, in our outer loop.
---
If after comparing the pattern to the text we do not find a character mismatch, that is, if the variable that indicates the number of positions to skip is still 0. We have matched the pattern at position i.
---
Here, I want to show you what happens when a character mismatch is found between the pattern and the text. There are three different cases:
In the first case 'y' and 'n' have been matched but there's a mismatch with 'f' in the text. Because 'f' exists in the pattern our num_skip variable is going to be bigger than or equal to 1. This means that the algorithm will essentially line up the 'f' on the text with that on the pattern.
---
In the second case the mismatch is because of 'y'. If we were to consult the skip table and subtract that value from j, as the algorithm states, we would get a negative num_skip value. That is because the algorithm would try to lineup the rightmost 'y'.
This would mean that our pattern would go left. We don't want that to happen so we make num_skip 1.
---
Finally the third case, which is the simplest. 'c' is the text character that causes the mismatch. 'c' is not in our pattern and we know because by consulting the skip table we get a value of -1. In this case, we shift the pattern by its length.
The Dude:
As with every good tutorial, I believe an algorithm is best demonstrated and understood with an example. In this case I think it is beneficial to provide a higher-level overview of the algorithm before diving into the specifics and the code.
For the text in this example, I'll borrow a line from one of my favorite movies, the Big Lebowski, starring Jeff Bridges. A cult line from the movie that I'll use here is this one, that set the whole tone for the movie.
The Example:
So let's look at our example. On the right you can see the skip table.
At the start, we line our pattern with the text in the 0th position. The algorithm states that everytime we will compare the rightmost character of the pattern with the one it is lined up to the text. I've put lines between the text and the patttern so you can easily see which characters are being compared.
So, starting off, we will compare the character 't' from the text with the character 'd' from the pattern. There is a mismatch here, which means we will have to shift the pattern to the right and compare the characters again. To see how many positions to shift right, we check to see whether the character that caused the mismatch is in our pattern. The character that caused the mismatch is 't' and it is in our pattern. It is the first letter.
---
What we do now is consult our skip table to see the position of the character that caused the mismatch in our pattern. Consequently, we will know by how many positions we should shift the pattern to the right. By looking at our skip table, we see that 't' is in position 0. To decide how many characters we should shift, we subtract the value of t in the skip table from the index of the currently examined letter. 3 minus 0 is 3, so we shift the pattern 3 places.
---
Here we compare 'd' with 'u'. 'u' is not in our pattern and we know that because if we search for 'u' in the skip table we will get -1. This means that we'll have to shift the pattern by its length, which is 4. When you think about it, this makes sense because since the letter 'u' is not in the pattern there is no need to shift our pattern 1,2, or 3 positions, because in each case 'u' will cause the mismatch. This is caused the bad character heuristic and is one of the main reasons for this algorithm's performance.
---
Now we compare 'e' with 'd'. Again a mismatch. We consult our skip table and see that e has the number 2, ie. is the 3rd letter of the pattern. To see how many positions to skip, we subtract 2 from the current character we are examining. That is the 3rd one, so 3-2 is 1 position by which we shift.
---
'a' is not in our pattern so we shift if by the length of the pattern.
---
And neither is space. Let's shift again.
---
And there we have it! With 4 more comparisons, we verify the result and calculate the number of shifts needed to match the pattern.
All in all we needed 6+4 comparisons.
Complexity:
As with every algorithm, it is a good thing to know the complexity, both the time and space, in Big O Notation, in order to gain a better understanding of its performance.
---
On typical inputs, substring search with the Boyer-Moore-Horspool algorithm uses approximately M over N character compares to search for a pattern of length M in a text of length N.
A full-blown proof would require to prove complexity for various string models. But for this video I'll just discuss the complexity of the algorithm in simpler terms.
In a lot of practical situations, it is safe to say that most alphabet characters do not appear anywhere in the search pattern. Simply put, this means that nearly all comparisons will result in M amount of characters being skipped.
The worst case running time for the algorithm concerning patterns in the text is O(mn). This happens when the comparison always fails at the leftmost character the pattern.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment