This was my response to a question Explaining Parse Rules on RebolForum wrt. matching whole words in a string.
"The input string can contain any number of delimiters, followed by the word we are searching for, and then any delimiter or else the end of the string. If we find the word, return ??? what exactly? Or, if we don't find the word, then there can be any characters or delimiters to the end. (And then what does the "end skip" do?)
I should start by saying that I don't normally use return
in this way: I don't much for return
at all, rather let all the branches play out to their conclusion. I used it here for expediency.
The problem here is defined as 'whole words'--we need to discern what a whole word is. In Regex, you might write /\bWordToMatch\b/
with that handy little \b
shorthand which is a zero-width match between a stream of \w
(word) characters and \W
(non-word) characters.
In Rebol, there is no \b
, so we can't say, for instance: find "This String" "^(word-boundary)String^(Word-Boundary)"
, so we'll have to resort to Rebol's all-powerful string-splitting penknife: Parse. Indeed, there are no built-in character classes in Rebol (try help bitset!
at the console--compare that to help tuple!
in Rebol/View), and no magic \b
marker, so [to "^(word-boundary)String^(word-boundary)"]
is out of the question.
Our whole word match is based on this--our word is bound at the start either by the head of the string or some \W
and at the end by the tail of the string or some \W
. We now have to ponder \w
and \W
(specifically the latter) in order to see our string in binary terms: /(\W+|\w+)*/
. For the purposes of this exercise, I'm going to set this definitions as follows--our non-word
is simply space
and punctuation
, and our word
is anything that's not that:
word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~}
To flesh out our thinking, we'll say:
phrase: "Is neither fowl nor owl though."
term: "owl"
A positive match is simply:
positive-match: [term [non-word | end]] ; TARGET followed by any NON-WORD character or END (tail).
The tricky part is making sure that what precedes it is also non-word
. Matching head
is easy:
[any non-word positive-match]
That's great if it's at the beginning or only preceded by space/punctuation. What if there are other words beforehand? Well, if we have a word
sequence that isn't positive-match
, then we assume it will have a non-word
sequence before we can once again test for positive-match
:
negative-match: [some word some non-word]
We go looping through our string, skipping through those negative-match
es until we hit upon a positive-match
:
[
any non-word ; we want to start our loop either at HEAD or after a NON-WORD sequence
some [
positive-match ; always test this first
|
negative-match ; we're at the end of a NON-WORD sequence here
]
]
Ok, this is great, but we need some way to mark out our successful find when we get there. We'll introduce mark
.
mark: none
some [
mark: positive-match
|
negative-match
]
It's difficult to discern though whether mark
is at the beginning of a positive-match
or a negative-match
. If you run this rule now, we'll be at "though."
because we didn't stop the loop. Adding break
will ensure that the loop won't run again:
some [
mark: positive-match break
| ...
]
However if phrase
doesn't contain term
, then we're going to end up with a false positive. Instead of waiting for the Parse rule to play out, I've short circuited the whole thing by just return
ing the mark (breaks out of Parse back to the containing function) and adding end skip
to ensure Parse (and our function) will always return false
(if end
is true
, then skip
will always be false
) if there is no positive-match
. Our function thus behaves like find
.
Ok, so one final tweak to bypass the return
/end skip
hack would be:
some [
mark: positive-match break ; BREAK stops the loop on success
|
(mark: none) negative-match ; MARK is reset
]
And there we have it, mark
will either refer to the point before positive-match
, or be none
. The return value of parse
no longer matters.
find-word: func [
phrase [string!] term [string!]
/local word non-word mark
][
word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~}
mark: none
parse/all phrase [
any non-word
some [
mark: term [non-word | end] break
|
(mark: none) some word some non-word
]
]
mark
]
probe find-word "Is neither fowl nor owl though." "owl"
probe find-word "Is neither fowl nor owl though." "bowl"