Skip to content

Instantly share code, notes, and snippets.

@Abusagit
Created March 12, 2021 13:41
Show Gist options
  • Save Abusagit/001d3b862015028b080dc925e8789f06 to your computer and use it in GitHub Desktop.
Save Abusagit/001d3b862015028b080dc925e8789f06 to your computer and use it in GitHub Desktop.
def turbo_bmh(string, pattern):
"""
Bad character heuristic, good suffix heuristic, turbo-shift heuristic implemented on Python
"""
def _suffices_preprocessing(suffix):
suffix[m - 1] = m
g = m - 1
for i in range(m - 2, -1, -1):
if i > g and suffix[i + m - f - 1] < i - g:
suffix[i] = suffix[i + m - 1 - f]
else:
if i < g:
g = i
f = i
while g >= 0 and pattern[g] == pattern[g + m - 1 - f]:
g -= 1
suffix[i] = f - g
def good_suffix_preprocessing():
# nonlocal pattern, good_suf, m
suffix = [0 for _ in range(m)]
_suffices_preprocessing(suffix)
for i in range(m):
good_suf[i] = m
for i in range(m - 1, -1, -1):
if suffix[i] == i + 1:
for j in range(m - 1 - i):
if good_suf[j] == m:
good_suf[j] = m - 1 - i
for i in range(0, m - 1):
good_suf[m - 1 - suffix[i]] = m - 1 - i
n = len(string)
m = len(pattern)
good_suf = [0 for i in range(m)]
bad_character = [m for _ in range(256)]
for k in range(m - 1):
bad_character[ord(pattern[k])] = m - k - 1
bad_character = tuple(bad_character)
answers = []
good_suffix_preprocessing()
j = 0
while j <= n - m:
i = m - 1
while i >= 0 and pattern[i] == string[i + j]:
i -= 1
if i < 0:
answers.append(j)
j += good_suf[0]
else:
j += max(good_suf[i], bad_character[ord(string[i + j])] - m + 1 + i)
return answers
@kysko
Copy link

kysko commented May 23, 2021

Thanks, was looking for a quick implementation for some binary searches.

A few notes:

  1. I had problems when testing with UTF-8 texts. Sometimes, a (unicode) character in string would have a unicode value way off the 0-0xFF range, so that bad_character[ord(string[i + j])] would get an index out of range error.

    I resolved it this way:

    • Make sure string and pattern are UTF-8 encoded (.encode()).

      (Also, for those on Windows reading from a file, the open method reads in CP1252 by default, one must then use open(testfile, 'r', encoding="UTF-8"))

    • For this to work though, one must remove the ord(...) function, since string[i] and bad_character[i] will now be ints essentially.

  2. For binary search (e.g. not on texts), as above, make sure to remove the ord(...)s, and that you have binary string and pattern data (b'...').

  3. It seems a pure Horspool (BMH) algorithm implementation (without the "good-suffix shift" preprocessing, nor the "Turbo") was a bit quicker, but my tests were not extensive.

I did my tests on Project Gutenberg's Shakespeare's Complete Works and Montaigne's Essais (in french), both UTF-8 docs, and some bin files I was inspecting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment