We have a list of keywords that we want to match against a series of texts.
We use the CPAN module Regexp::Trie to compile the list into a regular expression.
For instance, if we have the keywords "mac", "apple" and "android", we'll get

    $re = qr/(?-xism:(?:a(?:ndroid|pple)|mac))/;

Since "mac" is a keyword and we don't want it to match inside "machines",
we wrap the expression in \b, the zero-width word boundary assertion, when we
actually match against a text. This works well, at least with ASCII word chars.

    @match = $text =~ /\b($re)\b/g;

So far, so good.
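
To make the effect of \b concrete, here is a small sketch. The pattern is written out by hand so the snippet runs without Regexp::Trie, and the sample text is made up for illustration.

```perl
use strict;
use warnings;

# Hand-written equivalent of the trie output for "mac", "apple", "android".
my $re = qr/(?:a(?:ndroid|pple)|mac)/;

# Sample text (an assumption, for illustration only).
my $text = "machines from apple, a mac and an android";

# Without \b, "mac" also matches inside "machines".
my @loose = $text =~ /($re)/g;

# With \b on both sides, only whole words match.
my @strict = $text =~ /\b($re)\b/g;

print "loose:  @loose\n";   # loose:  mac apple mac android
print "strict: @strict\n";  # strict: apple mac android
```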
However, when a keyword we add to the TRIE itself begins or ends with a
punctuation character, like "#android", this \b gets in the way, since '#'
is not a word character.

    my $rt = Regexp::Trie->new;
    $rt->add($_) for qw( #android #apple mac );
    $re = $rt->regexp; # qr/(?-xism:(?:\#a(?:ndroid|pple)|mac))/
    $text = "I love #android";
    my @tags = $text =~ /\b($re)\b/g; # @tags is empty

As a result, hashtags are not matched when they appear after whitespace
(which is most of the time) or at the beginning of the text.
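
The underlying reason is that \b only matches between a word character (\w) and a non-word character (or the edge of the string). Between a space and '#', or between the start of the text and '#', both sides are non-word, so there is no boundary there. The sketch below shows this with a hand-written pattern standing in for the Regexp::Trie output. The helper name `match_with_wb` and the sample texts are made up for illustration.

```perl
use strict;
use warnings;

# Hand-written equivalent of the trie output for "#android", "#apple", "mac".
my $re = qr/(?:\#a(?:ndroid|pple)|mac)/;

# Hypothetical helper: apply the \b-wrapped matcher and return the captures.
sub match_with_wb {
    my ($text) = @_;
    my @tags = $text =~ /\b($re)\b/g;
    return @tags;
}

# Illustrative inputs (assumptions, not real data).
for my $text ("I love #android", "#android rocks", "x#android", "I love mac") {
    my @tags = match_with_wb($text);
    printf "%-18s => [%s]\n", $text, join(",", @tags);
}
# I love #android    => []
# #android rocks     => []
# x#android          => [#android]
# I love mac         => [mac]
```

Only "x#android" yields the hashtag, because the word character "x" right before '#' is what creates a boundary.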
I'm trying to work around this by changing the matcher to this:

    @match = $text =~ /(?:^|\b|\s)($re)(?:$|\b|\s)/g;

As ugly as it looks, it seems to work. Any suggestions to make it look and work better?
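
One possible alternative, offered as a sketch and not a definitive fix, is to replace \b with the lookarounds (?<!\w) and (?!\w). They assert that the keyword is not adjacent to a word character without requiring the keyword itself to start or end with one, and they consume no input. The pattern below is hand-written to stand in for the Regexp::Trie output, and the helper name `find_keywords` and the sample texts are made up. Note one behavioral difference: "x#android" no longer matches here, whereas the \b-based alternation above would accept it.

```perl
use strict;
use warnings;

# Hand-written equivalent of the trie output for "#android", "#apple", "mac".
my $re = qr/(?:\#a(?:ndroid|pple)|mac)/;

# Hypothetical helper: a keyword matches only when it is not directly
# preceded or followed by a word character.
sub find_keywords {
    my ($text) = @_;
    my @found = $text =~ /(?<!\w)($re)(?!\w)/g;
    return @found;
}

# Illustrative inputs (assumptions, not real data).
for my $text ("I love #android", "#apple and mac", "machines", "x#android") {
    my @tags = find_keywords($text);
    printf "%-18s => [%s]\n", $text, join(",", @tags);
}
# I love #android    => [#android]
# #apple and mac     => [#apple,mac]
# machines           => []
# x#android          => []
```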
No, sorry, I guess the example was confusing. If the TRIE contains '#mac' (including the # itself), then \b will NOT match.
my $rt = Regexp::Trie->new;
$rt->add($_) for qw( #mac #win #unix );
my $re = $rt->regexp;
my $text = "I want #mac";
my @m = $text =~ /\b($re)\b/g;
# @m is empty
I'm not sure if I got your point right because /\b($re)\b/g DOES match against '#mac'.
Is it that you want to capture '#' as well as the word? If so,
/(?<=\s)([#]?$re)\b/
is the regexp you are looking for.
Dan the Regular Expressionist