@dobrokot
Last active December 20, 2015 14:59
fuzzy search regexp generator
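# Python 2 script (relies on str.decode and xrange).
# usage:
# LC_ALL=C grep "$(python fuzzy_search.py "HELLO")" input.file.utf8.txt
import sys

s = sys.argv[1]
su = s.decode('utf-8')

# One UTF-8 character as bytes: a non-continuation byte, then continuation bytes.
utf8_any = '[^\x80-\xbf][\x80-\xbf]*'
#utf8_any = '.'
vars = []

def escape_char(c):
    # Escape only the characters this script treats as special for grep.
    if c in '.[]*\\$^':
        return '\\' + c
    return c

def esc(r):
    return ''.join(map(escape_char, r)).encode('UTF-8')

for i in xrange(len(su)):
    vars.append(esc(su[:i]) + utf8_any + esc(su[i+1:]))  # character i replaced
    vars.append(esc(su[:i]) + esc(su[i+1:]))  # character i deleted
    if (i != 0):
        vars.append(esc(su[:i]) + utf8_any + esc(su[i:]))  # character inserted before i

# Join the alternatives with grep's escaped alternation operator.
sys.stdout.write('\\|'.join(vars))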
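
The script above targets Python 2. As a rough sketch only, the same idea could be written for Python 3 along the following lines. The names `UTF8_ANY`, `GREP_SPECIAL`, `fuzzy_variants` and `fuzzy_pattern` are hypothetical and not part of the gist; the escape set and the three kinds of variant are copied from the original.

```python
# One UTF-8 encoded character as bytes: a non-continuation byte
# followed by any number of continuation bytes (0x80-0xBF).
UTF8_ANY = b'[^\x80-\xbf][\x80-\xbf]*'

# Characters the original script escapes for grep (hypothetical constant name).
GREP_SPECIAL = '.[]*\\$^'


def esc(text):
    """Backslash-escape the grep-special characters, then encode as UTF-8."""
    escaped = ''.join('\\' + c if c in GREP_SPECIAL else c for c in text)
    return escaped.encode('utf-8')


def fuzzy_variants(word):
    """Return byte patterns for word with one substitution, deletion or insertion."""
    variants = []
    for i in range(len(word)):
        head, tail = word[:i], word[i + 1:]
        variants.append(esc(head) + UTF8_ANY + esc(tail))  # character i replaced
        variants.append(esc(head) + esc(tail))             # character i deleted
        if i != 0:
            # extra character inserted before position i
            variants.append(esc(head) + UTF8_ANY + esc(word[i:]))
    return variants


def fuzzy_pattern(word):
    """Join the variants with grep's escaped alternation operator."""
    return b'\\|'.join(fuzzy_variants(word))


# To use this from the shell, the bytes have to be written unmodified, e.g.:
#   import sys; sys.stdout.buffer.write(fuzzy_pattern(sys.argv[1]))
```

Working with `bytes` throughout keeps the 0x80-0xBF range bytes intact, which is what the `LC_ALL=C` invocation in the usage comment relies on. A word of n characters yields 3n - 1 alternatives, so the pattern grows linearly with the search term.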

zerkms commented Aug 4, 2013

Why not use re.escape() instead of escape_char()?

dobrokot commented Aug 4, 2013

> Why not use re.escape() instead of escape_char()?

grep regexes use a different syntax from P(y)CRE and treat a different set of characters as special. In grep's basic syntax a bare '|' is not special, but the escaped form '\|' is: it means alternation. re.escape() would turn a literal '|' into '\|' and so change what the pattern matches.

See: sys.stdout.write('\\|'.join(vars)). There the '|' is escaped on purpose, to make it special.

grep is used instead of Python's re because grep is much faster on large regexes and large input files.
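
To make the escaping difference concrete, here is a small sketch comparing `re.escape()` with the gist's own rule on a string containing '|'. The helper name `grep_escape` and the constant `GREP_SPECIAL` are hypothetical; the escape set is the one used by `escape_char()` in the script.

```python
import re

GREP_SPECIAL = '.[]*\\$^'  # the set escaped by escape_char() in the gist


def escape_char(c):
    # Same rule as the gist: backslash only the grep-special characters.
    return '\\' + c if c in GREP_SPECIAL else c


def grep_escape(text):
    # Hypothetical helper: apply escape_char() to every character.
    return ''.join(map(escape_char, text))


literal = 'a|b'
print(re.escape(literal))    # a\|b  (grep would read \| as alternation)
print(grep_escape(literal))  # a|b   (grep's basic syntax reads | literally)
```

With the `re.escape()` output, grep's basic syntax would match lines containing either "a" or "b" rather than the literal text "a|b".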

zerkms commented Aug 5, 2013

@dobrokot, makes sense, thanks
