Created October 15, 2014 20:29
Shorten a regex, allowing false positives
s = 'amsterdam|london|tokyo|indianapolis|new york|shanghai|toronto|san francisco'
rg = s.split('|')
# count: the length of the shortest word
count = len(min(rg, key=len))
# v: list of sets. v[0] is the set of all the first letters, etc.
v = []
for i in range(count):
    v.append(set(w[i] for w in rg))
# now generate a regex from v by combining the sets in v
def make_block(chars):
    # sort the characters: set iteration order is arbitrary, so this keeps the output stable
    return '[' + ''.join(sorted(chars)) + ']'
# print() with a single argument works in both Python 2 and Python 3
print(''.join(map(make_block, v)))  # '[ailnst][aehmno][adknrsw][ dinoty][aefgnoy]'
Use case: You have a bunch of IDs in your Google Analytics that you want to match on. The regex listing them all has grown very long and doesn't fit anymore.
If false positives are OK, you can try to make it shorter by using character classes. Some false positives from this example: words starting with `lenin`, `santa`, `toddy`, `nnnnn`, `thane`, and 12,335 other unintended prefixes (the five classes allow 6 × 6 × 7 × 7 × 7 = 12,348 prefixes, of which only 8 are intended). If that's acceptable, you can use this toy.
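
To judge whether the false-positive rate is acceptable for a given word list, the matching prefixes can be counted directly: multiply the sizes of the per-position character sets and subtract the prefixes that really belong to the list. The following is a rough sketch in Python 3.8+ (it uses `math.prod`) and is not part of the original snippet. The helper names `position_sets`, `shortened_regex`, and `count_false_positives` are made up for illustration, and `re.escape` is added on the assumption that real IDs may contain characters such as `]` or `^`.

```python
import re
from math import prod

WORDS = 'amsterdam|london|tokyo|indianapolis|new york|shanghai|toronto|san francisco'.split('|')

def position_sets(words):
    """Return one set of characters per position, up to the shortest word's length."""
    count = min(len(w) for w in words)
    return [set(w[i] for w in words) for i in range(count)]

def shortened_regex(words):
    """Build the character-class regex; re.escape guards special characters."""
    return ''.join(
        '[' + ''.join(re.escape(c) for c in sorted(chars)) + ']'
        for chars in position_sets(words)
    )

def count_false_positives(words):
    """Count matching prefixes that are not a prefix of any intended word."""
    sets = position_sets(words)
    total = prod(len(chars) for chars in sets)
    intended = {w[:len(sets)] for w in words}
    return total - len(intended)

pattern = re.compile(shortened_regex(WORDS))
print(count_false_positives(WORDS))        # 12340
print(bool(pattern.match('lenin')))        # True: an unintended match
print(bool(pattern.match('paris')))        # False: 'p' is not an allowed first letter
```

Because the count is a product, adding a single word that introduces a new character at each position can increase it sharply, so it is worth re-running the check whenever the list changes.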