Created
March 12, 2021 13:41
-
-
Save Abusagit/001d3b862015028b080dc925e8789f06 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
def turbo_bmh(string, pattern): | |
""" | |
Bad character heuristic, good suffix heuristic, turbo-shift heuristic implemented on Python | |
""" | |
def _suffices_preprocessing(suffix): | |
suffix[m - 1] = m | |
g = m - 1 | |
for i in range(m - 2, -1, -1): | |
if i > g and suffix[i + m - f - 1] < i - g: | |
suffix[i] = suffix[i + m - 1 - f] | |
else: | |
if i < g: | |
g = i | |
f = i | |
while g >= 0 and pattern[g] == pattern[g + m - 1 - f]: | |
g -= 1 | |
suffix[i] = f - g | |
def good_suffix_preprocessing(): | |
# nonlocal pattern, good_suf, m | |
suffix = [0 for _ in range(m)] | |
_suffices_preprocessing(suffix) | |
for i in range(m): | |
good_suf[i] = m | |
for i in range(m - 1, -1, -1): | |
if suffix[i] == i + 1: | |
for j in range(m - 1 - i): | |
if good_suf[j] == m: | |
good_suf[j] = m - 1 - i | |
for i in range(0, m - 1): | |
good_suf[m - 1 - suffix[i]] = m - 1 - i | |
n = len(string) | |
m = len(pattern) | |
good_suf = [0 for i in range(m)] | |
bad_character = [m for _ in range(256)] | |
for k in range(m - 1): | |
bad_character[ord(pattern[k])] = m - k - 1 | |
bad_character = tuple(bad_character) | |
answers = [] | |
good_suffix_preprocessing() | |
j = 0 | |
while j <= n - m: | |
i = m - 1 | |
while i >= 0 and pattern[i] == string[i + j]: | |
i -= 1 | |
if i < 0: | |
answers.append(j) | |
j += good_suf[0] | |
else: | |
j += max(good_suf[i], bad_character[ord(string[i + j])] - m + 1 + i) | |
return answers |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Thanks, was looking for a quick implementation for some binary searches.
A few notes:
I had problems when testing with UTF-8 texts. Sometimes, a (unicode) character in
string
would have a unicode value way off the 0-0xFF range, so thatbad_character[ord(string[i + j])]
would get an index out of range error.I resolved it this way:
Make sure
string
andpattern
are UTF-8 encoded (.encode()
).(Also, for those on Windows reading from a file, the
open
method reads inCP1252
by default, one must then useopen(testfile, 'r', encoding="UTF-8")
)For this to work though, one must remove the
ord(...)
function, sincestring[i]
andbad_character[i]
will now beint
s essentially.For binary search (e.g. not on texts), as above, make sure to remove the
ord(...)
s, and that you have binarystring
andpattern
data (b'...'
).It seems a pure Horspool (BMH) algorithm implementation (without the "good-suffix shift" preprocessing, nor the "Turbo") was a bit quicker, but my tests were not extensive.
I did my tests on Project Gutenberg's Shakespeare's Complete Works and Montaigne's Essais (in french), both
UTF-8
docs, and somebin
files I was inspecting.