@dobrokot dobrokot/fuzzy_search.py
Last active Dec 20, 2015

fuzzy search regexp generator
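# usage:
# LC_ALL=C grep "$(python fuzzy_search.py "HELLO")" input.file.utf8.txt
# (Python 2 script: relies on str.decode and xrange)
import sys

s = sys.argv[1]
su = s.decode('utf-8')

# one UTF-8 encoded character: a non-continuation byte plus any continuation bytes
utf8_any = '[^\x80-\xbf][\x80-\xbf]*'
#utf8_any = '.'

vars = []

def escape_char(c):
    # escape only the characters that are special in grep basic regexes
    if c in '.[]*\\$^':
        return '\\' + c
    return c

def esc(r):
    return ''.join(map(escape_char, r)).encode('UTF-8')

for i in xrange(len(su)):
    # character i replaced by any single character
    vars.append(esc(su[:i]) + utf8_any + esc(su[i+1:]))
    # character i deleted
    vars.append(esc(su[:i]) + esc(su[i+1:]))
    if i != 0:
        # any single character inserted before character i
        vars.append(esc(su[:i]) + utf8_any + esc(su[i:]))

# join the alternatives with '\|', which grep treats as alternation
sys.stdout.write('\\|'.join(vars))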
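
To see what the loop above generates, here is a minimal Python 3 sketch of the same idea. It is an illustration, not the gist's code: the names `bre_escape` and `fuzzy_variants` are hypothetical, and it uses a plain `.` as the wildcard instead of the byte-level `utf8_any` pattern so the output stays readable.

```python
# Characters escaped by escape_char() in the script above.
BRE_SPECIAL = '.[]*\\$^'

def bre_escape(text):
    """Backslash-escape the characters listed in BRE_SPECIAL."""
    return ''.join('\\' + c if c in BRE_SPECIAL else c for c in text)

def fuzzy_variants(word, any_char='.'):
    """Return patterns matching word with one substitution, deletion, or insertion."""
    variants = []
    for i in range(len(word)):
        head = bre_escape(word[:i])
        tail = bre_escape(word[i + 1:])
        variants.append(head + any_char + tail)  # character i substituted
        variants.append(head + tail)             # character i deleted
        if i != 0:
            # one extra character inserted before character i
            variants.append(head + any_char + bre_escape(word[i:]))
    return variants

if __name__ == '__main__':
    # prints: .b\|b\|a.\|a\|a.b
    print('\\|'.join(fuzzy_variants('ab')))
```

For a word of length n this yields 3n - 1 alternatives, so the pattern grows linearly with the search term.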
@zerkms

commented Aug 4, 2013

Why not use re.escape() instead of escape_char()?

@dobrokot

Owner Author

commented Aug 4, 2013

> Why not use re.escape() instead of escape_char()?

grep regexes have a different syntax from P(y)CRE and a different set of characters that are "special". The symbol '|' is not special on its own, but if it is escaped as '\|' it takes on a special meaning for grep.

See: sys.stdout.write('\\|'.join(vars)). There '|' is escaped so that it becomes "special".

grep is used instead of Python's re because grep is much faster on large regexes and large input files.
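
A small Python 3 sketch of that difference, assuming GNU grep's basic-regex behaviour where '\|' means alternation. The helper name `grep_bre_escape` is hypothetical and simply mirrors escape_char() from the script:

```python
import re

def grep_bre_escape(text):
    """Escape only the characters escape_char() treats as special for grep."""
    return ''.join('\\' + c if c in '.[]*\\$^' else c for c in text)

literal = 'a|b'

# re.escape() puts a backslash in front of '|'. Python's re reads that as a
# literal bar, but grep's basic syntax would read it as alternation.
print(re.escape(literal))        # a\|b

# The grep-oriented escaper leaves '|' alone, so grep matches it literally.
print(grep_bre_escape(literal))  # a|b
```

So passing re.escape() output to grep would turn a search for the literal text a|b into a search for "a" or "b".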

@zerkms

commented Aug 5, 2013

@dobrokot, makes sense, thanks
