Last active
July 4, 2016 21:10
Tokenize text-field values work-around for Google App Engine Search API
def partialize(phrase, shortest=5):
    """Tokenize `phrase` into all substrings of each whitespace-separated
    word that are at least `shortest` characters long.

    Words no longer than `shortest` are returned whole.

    This is a work-around for Google App Engine's Search API not supporting
    partial full-text search (as of time of writing, April 2013).

    If the phrase is BBCode-formatted, strip all BBCode tags from it
    before passing the string to this function.
    """
    # See http://stackoverflow.com/questions/12899083/partial-matching-gae-search-api
    # for the original pattern (without the `shortest` keyword).
    if shortest < 1:
        shortest = 1
    if phrase is None:
        return [u'']
    tokens = []
    for w in phrase.split():
        j = shortest
        while True:
            # Once the window is as long as the word, emit the whole word.
            if len(w) <= j:
                tokens.append(w)
                break
            # Slide a window of length j across the word.
            for i in range(len(w) - j + 1):
                tokens.append(w[i:i + j])
            j += 1
    return tokens
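Because `partialize` emits every window of every word, repeated words and overlapping substrings produce duplicate tokens. The sketch below is one possible deduplicating variant, not part of the original gist: the name `unique_partials` is hypothetical, and unlike `partialize` it returns an empty list (rather than `[u'']`) when `phrase` is `None`. Tokens come out in the same order as before, shortest windows first.

```python
def unique_partials(phrase, shortest=5):
    """Hypothetical variant of partialize(): same substrings, no duplicates."""
    shortest = max(shortest, 1)
    seen = set()
    tokens = []
    # Assumption: treat None like an empty phrase and return [].
    for word in (phrase or u'').split():
        # Words no longer than `shortest` are kept whole.
        start_len = min(shortest, len(word))
        for size in range(start_len, len(word) + 1):
            # Slide a window of length `size` across the word.
            for i in range(len(word) - size + 1):
                piece = word[i:i + size]
                if piece not in seen:
                    seen.add(piece)
                    tokens.append(piece)
    return tokens


print(unique_partials(u'search searching', shortest=6))
```

For the phrase above this yields ten tokens, with `search` appearing only once even though it is both a whole word and a prefix of `searching`. The resulting list could then be joined with spaces and stored in a separate text field of the indexed document, so that a query for a fragment matches one of the stored tokens.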