@miyagawa
Created September 1, 2010 00:56
We have a list of keywords that we want to match against a series of texts.
We use the CPAN module Regexp::Trie to compile the list into a single regular expression.
For instance, if we have the keywords "mac", "apple" and "android", we'll get
$re = qr/(?-xism:(?:a(?:ndroid|pple)|mac))/;
When "mac" is one of the keywords, we don't want it to match inside
"machines", so when we actually match against a text we wrap the pattern in \b,
the zero-width word-boundary assertion, which works well with ASCII word characters at least.
@match = $text =~ /\b($re)\b/g;
So far, so good.
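A minimal sketch of the effect of \b, assuming a hand-written pattern equivalent to the one shown above rather than a real Regexp::Trie object (so it runs with core Perl only). The sample text is made up for illustration.

```perl
use strict;
use warnings;

# Hand-written stand-in for the pattern Regexp::Trie would generate.
my $re = qr/(?:a(?:ndroid|pple)|mac)/;

# Made-up sample text containing keywords both alone and inside longer words.
my $text = "mac machines apple pineapple android";

my @bare    = $text =~ /($re)/g;      # no boundaries: also hits "machines", "pineapple"
my @bounded = $text =~ /\b($re)\b/g;  # whole words only

print "bare:    @bare\n";     # bare:    mac mac apple apple android
print "bounded: @bounded\n";  # bounded: mac apple android
```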
However, when a keyword we add to the trie itself begins or ends with a
punctuation character, like "#android", \b gets in the way: '#' is not a word
character, so there is no word boundary between it and a preceding space or the start of the text.
my $rt = Regexp::Trie->new;
$rt->add($_) for qw( #android #apple mac );
$re = $rt->regexp; # qr/(?-xism:(?:\#a(?:ndroid|pple)|mac))/
$text = "I love #android";
my @tags = $text =~ /\b($re)\b/g; # @tags is empty
The result is that hash tags that appear after whitespace (which is most of
the time) or at the beginning of the text are never matched.
I'm trying to work around this by changing the matcher to this:
@match = $text =~ /(?:^|\b|\s)($re)(?:$|\b|\s)/g;
As ugly as it looks, it seems to work. Any suggestions to make it look and work better?
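One possible alternative, offered as a sketch rather than a definitive answer: replace the boundary alternations with negative lookarounds, (?<!\w) and (?!\w), which assert "no word character on this side" without consuming anything. The pattern below is a hand-written stand-in for Regexp::Trie output, and the "c++" keyword is a hypothetical addition used only to show a keyword that ends in punctuation.

```perl
use strict;
use warnings;

# Hand-written stand-in for a Regexp::Trie pattern; "c++" is a
# hypothetical keyword added to exercise trailing punctuation.
my $re = qr/(?:\#android|\#apple|c\+\+|mac)/;

my $text = "c++ #android";

# The workaround from the post: after "c++" only the trailing \s can
# match, and it consumes the space, so nothing is left to introduce
# "#android" on the next iteration.
my @workaround = $text =~ /(?:^|\b|\s)($re)(?:$|\b|\s)/g;

# Lookarounds check the neighbouring characters without consuming them.
my @lookaround = $text =~ /(?<!\w)($re)(?!\w)/g;

print "workaround: @workaround\n";  # workaround: c++
print "lookaround: @lookaround\n";  # lookaround: c++ #android
```

Note that the lookaround form also declines to match "#android" inside "x#android", where the \b alternative in the workaround would accept it; whether that is desirable depends on the data.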

dankogai commented Sep 1, 2010

I'm not sure if I got your point right because /\b($re)\b/g DOES match against '#mac'.
Is it that you want to capture '#' as well as the word? If so,

/(?<=\s)([#]?$re)\b/

is the regexp you are looking for.

Dan the Regular Expressionist

miyagawa commented Sep 1, 2010

No, sorry, I guess the example was confusing. If the trie contains '#mac' (with the # itself), then \b will NOT match.

my $rt = Regexp::Trie->new;
$rt->add($_) for qw( #mac #win #unix );
my $re = $rt->regexp;

my $text = "I want #mac";
my @m = $text =~ /\b($re)\b/g;
# @m is empty
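The two readings in this thread can be checked side by side. This is a sketch using hand-written patterns as stand-ins for the two tries (one built from bare keywords, one from keywords that include the '#'), not actual Regexp::Trie output.

```perl
use strict;
use warnings;

my $text = "I want #mac";

# Hand-written stand-ins for the two tries being discussed.
my $bare   = qr/(?:mac|unix|win)/;    # keywords without '#'
my $hashed = qr/\#(?:mac|unix|win)/;  # keywords with '#' included

# With bare keywords, \b sits between '#' and 'm', so "mac" is found.
my @from_bare = $text =~ /\b($bare)\b/g;

# With '#' in the keyword, there is no \b between ' ' and '#'.
my @from_hashed = $text =~ /\b($hashed)\b/g;

print "bare:   @from_bare\n";                     # bare:   mac
print "hashed: ", scalar(@from_hashed), " hit(s)\n";  # hashed: 0 hit(s)
```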
