Skip to content

Instantly share code, notes, and snippets.

@rgchris
Last active April 26, 2019 20:17
Show Gist options
  • Save rgchris/1cc1d44b1b2428258c23314ed4088f6c to your computer and use it in GitHub Desktop.
Save rgchris/1cc1d44b1b2428258c23314ed4088f6c to your computer and use it in GitHub Desktop.
Find whole words in a string

This was my response to a question Explaining Parse Rules on RebolForum wrt. matching whole words in a string.

"The input string can contain any number of delimiters, followed by the word we are searching for, and then any delimiter or else the end of the string. If we find the word, return ??? what exactly? Or, if we don't find the word, then there can be any characters or delimiters to the end. (And then what does the "end skip" do?)

I should start by saying that I don't normally use return in this way: I don't much for return at all, rather let all the branches play out to their conclusion. I used it here for expediency.

The problem here is defined as 'whole words'--we need to discern what a whole word is. In Regex, you might write /\bWordToMatch\b/ with that handy little \b shorthand which is a zero-width match between a stream of \w (word) characters and \W (non-word) characters.

In Rebol, there is no \b, so we can't say, for instance: find "This String" "^(word-boundary)String^(Word-Boundary)", so we'll have to resort to Rebol's all-powerful string-splitting penknife: Parse. Indeed, there are no built-in character classes in Rebol (try help bitset! at the console--compare that to help tuple! in Rebol/View), and no magic \b marker, so [to "^(word-boundary)String^(word-boundary)"] is out of the question.

Our whole word match is based on this--our word is bound at the start either by the head of the string or some \W and at the end by the tail of the string or some \W. We now have to ponder \w and \W (specifically the latter) in order to see our string in binary terms: /(\W+|\w+)*/. For the purposes of this exercise, I'm going to set this definitions as follows--our non-word is simply space and punctuation, and our word is anything that's not that:

 word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~} 

To flesh out our thinking, we'll say:

 phrase: "Is neither fowl nor owl though." 
 term: "owl" 

A positive match is simply:

 positive-match: [term [non-word | end]] ; TARGET followed by any NON-WORD character or END (tail). 

The tricky part is making sure that what precedes it is also non-word. Matching head is easy:

 [any non-word positive-match] 

That's great if it's at the beginning or only preceded by space/punctuation. What if there are other words beforehand? Well, if we have a word sequence that isn't positive-match, then we assume it will have a non-word sequence before we can once again test for positive-match:

 negative-match: [some word some non-word] 

We go looping through our string, skipping through those negative-matches until we hit upon a positive-match:

 [ 
     any non-word ; we want to start our loop either at HEAD or after a NON-WORD sequence 
     some [ 
         positive-match ; always test this first 
         | 
         negative-match ; we're at the end of a NON-WORD sequence here 
     ] 
 ] 

Ok, this is great, but we need some way to mark out our successful find when we get there. We'll introduce mark.

 mark: none 

 some [ 
     mark: positive-match 
     | 
     negative-match 
 ] 

It's difficult to discern though whether mark is at the beginning of a positive-match or a negative-match. If you run this rule now, we'll be at "though." because we didn't stop the loop. Adding break will ensure that the loop won't run again:

 some [ 
     mark: positive-match break 
     | ... 
 ] 

However if phrase doesn't contain term, then we're going to end up with a false positive. Instead of waiting for the Parse rule to play out, I've short circuited the whole thing by just returning the mark (breaks out of Parse back to the containing function) and adding end skip to ensure Parse (and our function) will always return false (if end is true, then skip will always be false) if there is no positive-match. Our function thus behaves like find.

Ok, so one final tweak to bypass the return/end skip hack would be:

 some [ 
     mark: positive-match break ; BREAK stops the loop on success 
     | 
     (mark: none) negative-match ; MARK is reset 
 ] 

And there we have it, mark will either refer to the point before positive-match, or be none. The return value of parse no longer matters.

 find-word: func [ 
     phrase [string!] term [string!] 
     /local word non-word mark 
 ][ 
     word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~} 
     mark: none 

     parse/all phrase [ 
         any non-word 
         some [ 
             mark: term [non-word | end] break 
             | 
             (mark: none) some word some non-word 
         ] 
     ] 

     mark 
 ] 

 probe find-word "Is neither fowl nor owl though." "owl" 
 probe find-word "Is neither fowl nor owl though." "bowl"
Rebol [
Title: "Find Word"
Date: 28-Nov-2018
Author: "Christopher Ross-Gill"
]
find-word: func [
phrase [string!] term [string!]
/local word non-word mark
][
word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~}
mark: none
parse/all phrase [
any non-word
some [
mark: term [non-word | end] break
|
(mark: none) some word some non-word
]
]
mark
]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment