A regular expression is a string that describes a text pattern occurring in other strings, m'kay.
A handful of building blocks will take you quite far:
* metacharacters
* character escapes \
* anchors \A\Z or ^$
* character classes [][^]
* quantifiers *?+{min,max}
* grouping ()
* substitutions $1 or \1
Now, to mix them up...
Characters that have special meaning.
. any character (almost*)
\ escape character
| or
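\A or ^ anchor: start of string (^ also matches the start of each line in multiline mode)
\Z or $ anchor: end of string ($ also matches the end of each line in multiline mode)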
[] list or range to be matched
[^] negative list or range (NOT these characters)
*The dot doesn't match newline characters unless in single-line mode.
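A minimal sketch of the metacharacters above, assuming the Java engine (the one OpenRefine uses). The class name MetacharDemo, the helper names, and the sample strings are made up for illustration. Note that inside a Java string literal every backslash has to be doubled.

```java
import java.util.regex.Pattern;

public class MetacharDemo {
    // True if the regex is found anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Same, but in single-line (DOTALL) mode, so the dot also matches newlines.
    public static boolean foundDotAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("a.c", "abc"));          // true: dot matches any character
        System.out.println(found("a.c", "a\nc"));         // false: but not a newline by default
        System.out.println(foundDotAll("a.c", "a\nc"));   // true: single-line mode is on
        System.out.println(found("a\\.c", "abc"));        // false: escaped dot is a literal period
        System.out.println(found("cat|dog", "hot dog"));  // true: | means "or"
        System.out.println(found("^dog", "hot dog"));     // false: ^ anchors to the start
        System.out.println(found("dog$", "hot dog"));     // true: $ anchors to the end
        System.out.println(found("[^a-z ]", "hot dog"));  // false: nothing outside a-z and space
    }
}
```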
\d any digit
\D any non-digit
\w any word character [a-zA-Z0-9_]
\W any non-word character [^a-zA-Z0-9_]
\s any whitespace character
\S any non-whitespace character
\t tab
\n newline
\r carriage return
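To see the escapes above in action, here is a small sketch, again assuming the Java engine. EscapeDemo, findAll, and the sample text are hypothetical names and data chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDemo {
    // Collect every match of the regex in the input, in order of appearance.
    public static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample containing two spaces, a tab, and a newline.
        String input = "Room 12\tBox_7\nshelf B";
        System.out.println(findAll("\\d+", input));        // [12, 7]
        System.out.println(findAll("\\w+", input));        // [Room, 12, Box_7, shelf, B]
        System.out.println(findAll("\\s", input).size());  // 4
        System.out.println(findAll("\\t", input).size());  // 1
        System.out.println(findAll("\\S+", input).size()); // 5
    }
}
```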
? zero or one
* zero or more
+ one or more
{n} exactly n of the preceding element.
{min,} min or more of the preceding element.
{min,max} between min and max of the preceding element.
Examples
\.? - zero or one period
\.{2,} - two or more periods
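The two quantifier examples above can be tried out with a sketch like this one, assuming the Java engine. QuantifierDemo and its helpers are illustrative names, and the sample strings are invented.

```java
public class QuantifierDemo {
    // String.matches succeeds only if the regex covers the entire input.
    public static boolean whole(String regex, String input) {
        return input.matches(regex);
    }

    // Collapse any run of two or more periods into a single period.
    public static String collapseDots(String input) {
        return input.replaceAll("\\.{2,}", ".");
    }

    public static void main(String[] args) {
        System.out.println(whole("etc\\.?", "etc"));        // true: zero periods
        System.out.println(whole("etc\\.?", "etc."));       // true: one period
        System.out.println(whole("etc\\.?", "etc.."));      // false: two is too many
        System.out.println(whole("\\d{4}", "2014"));        // true: exactly four digits
        System.out.println(whole("\\d{2,3}", "2014"));      // false: four exceeds the max
        System.out.println(whole("\\d{2,}", "2014"));       // true: two or more
        System.out.println(collapseDots("Wait... what..")); // Wait. what.
    }
}
```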
() group
$1,$2 or \1,\2 substitution
In Refine, capture groups are referred to by their index in an array, like [0] or [1].
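A rough sketch of grouping and substitution, assuming the Java engine. In Java, $1 and $2 are used in the replacement string, while \1 inside the pattern is a backreference to what group 1 matched. Java numbers groups from 1, so the hypothetical groups() helper below re-indexes them into a zero-based array to resemble the array indexing described above for Refine. All class, method, and sample names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Turn "Last, First" into "First Last" using $1 and $2 in the replacement.
    public static String swap(String name) {
        return name.replaceAll("^(\\w+), (\\w+)$", "$2 $1");
    }

    // \1 inside the pattern must repeat whatever group 1 captured.
    public static boolean hasDoubledWord(String text) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(text).find();
    }

    // Capture groups of a whole-string match as a zero-indexed array
    // (empty array when the input does not match).
    public static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1); // Java's group 1 becomes index 0
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(swap("Dickens, Charles"));      // Charles Dickens
        System.out.println(hasDoubledWord("the the cat")); // true
        System.out.println(hasDoubledWord("the cat"));     // false
        String[] g = groups("(Charl)(.{4,})", "Charlotte");
        System.out.println(g[0] + " | " + g[1]);           // Charl | otte
    }
}
```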
\p{property} POSIX or Unicode character class
\P{property} *not* in POSIX or Unicode character class
Examples
\p{L} - letter
\p{P} - punctuation
\p{InBasicLatin} - in basic Latin Unicode block
\p{InArabic} - in Arabic Unicode block
Character class intersection &&
[\p{InArabic}&&\p{P}] - a character in the Arabic Unicode block AND ALSO punctuation
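A small sketch of the property classes and the && intersection, assuming the Java engine (not every engine supports this syntax). UnicodeClassDemo, count, and the sample string are invented for illustration. Non-ASCII characters are written as Unicode escapes so the example does not depend on the file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeClassDemo {
    // Count how many times the regex matches in the input.
    public static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Five Arabic letters, an Arabic comma (U+060C), then Latin text.
        String mixed = "\u0645\u0631\u062d\u0628\u0627\u060c hello, world";
        System.out.println(count("\\p{L}", mixed));                   // 15 letters
        System.out.println(count("\\p{P}", mixed));                   // 2 punctuation marks
        System.out.println(count("\\p{InArabic}", mixed));            // 6 in the Arabic block
        System.out.println(count("[\\p{InArabic}&&\\p{P}]", mixed));  // 1: the Arabic comma
        System.out.println(count("\\P{InBasicLatin}", mixed));        // 6 outside Basic Latin
        System.out.println(count("\\p{InBasicLatin}", "caf\u00e9"));  // 3: c, a, f
    }
}
```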
m multiline
s single-line
i case-insensitive
Example
((?i)t) - matches t or T
Look for these options in oXygen find/replace, in MarcEdit functions, etc.
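The three flags can be sketched in Java as follows, assuming the Java engine, where they are available both as compile-time constants (Pattern.MULTILINE, Pattern.DOTALL, Pattern.CASE_INSENSITIVE) and as inline options such as (?i). FlagDemo, count, and the sample strings are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagDemo {
    // Count matches of the regex compiled with the given flags (0 = none).
    public static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\nthree";

        // i: case-insensitive, as a flag or inline as in the example above
        System.out.println(count("t", 0, "Tea time"));                        // 1
        System.out.println(count("t", Pattern.CASE_INSENSITIVE, "Tea time")); // 2
        System.out.println(count("((?i)t)", 0, "Tea time"));                  // 2

        // m: multiline, ^ matches at the start of every line
        System.out.println(count("^\\w+", 0, lines));                 // 1
        System.out.println(count("^\\w+", Pattern.MULTILINE, lines)); // 3

        // s: single-line, the dot also matches newlines
        System.out.println(count("one.two", 0, lines));              // 0
        System.out.println(count("one.two", Pattern.DOTALL, lines)); // 1
    }
}
```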
There are different regex engines and implementations...
- Java -- OpenRefine uses this one
- Perl 5
- XML Schema
- XPath/XQuery
- etc.
NOTE: oXygen uses both Java and XML regular expressions. MarcEdit uses .NET regular expressions.
Copy and paste this into RegExr...
Charly
Charlie
Charles
Charles.
charles
Charlene
Charlotte
Chuck
Regexes to try...
.
.+
Charles
Charles\.
Charl.
Charl.+
Charlie?
Charl(y|ie)
Charl(ie|ene)
Charl[eo]
Charl[^eo]
Charl.{2}
Charl.{2,4}
Charl.{4,}
(Charl)(.{4,})
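For checking results away from the browser, here is a sketch that runs a few of the regexes above against the same list of names. It assumes the Java engine, which may not be the engine the online tool runs, so treat it as an approximation. NameMatchDemo and matching are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class NameMatchDemo {
    // The sample names from the exercise above.
    public static final List<String> NAMES = Arrays.asList(
        "Charly", "Charlie", "Charles", "Charles.",
        "charles", "Charlene", "Charlotte", "Chuck");

    // Names in which the regex is found somewhere (not necessarily the whole name).
    public static List<String> matching(String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String name : NAMES) {
            if (p.matcher(name).find()) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] regexes = {
            "Charles\\.", "Charlie?", "Charl(y|ie)", "Charl(ie|ene)",
            "Charl[eo]", "Charl[^eo]", "Charl.{4,}"
        };
        for (String regex : regexes) {
            System.out.println(regex + " -> " + matching(regex));
        }
    }
}
```

For instance, Charl[eo] finds Charles, Charles., Charlene, and Charlotte, while Charl[^eo] finds only Charly and Charlie.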
In the Substitution section, at the bottom...
$1
$1_
$2
$2$1
$2r $1ie horser
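The substitutions above can also be tried in code. This sketch assumes the Java engine and applies the last regex from the list, (Charl)(.{4,}), to a single name; SubstitutionDemo and substitute are made-up names.

```java
public class SubstitutionDemo {
    // Replace every match of (Charl)(.{4,}) in the input with the given replacement.
    // $1 is the first group ("Charl"), $2 is the second (four or more characters).
    public static String substitute(String replacement, String input) {
        return input.replaceAll("(Charl)(.{4,})", replacement);
    }

    public static void main(String[] args) {
        String[] replacements = {"$1", "$1_", "$2", "$2$1", "$2r $1ie horser"};
        for (String replacement : replacements) {
            // Only "Charlotte" in the sample list is long enough to match.
            System.out.println(replacement + " -> " + substitute(replacement, "Charlotte"));
        }
        // A name that does not match is returned unchanged.
        System.out.println(substitute("$2$1", "Chuck"));
    }
}
```

With "Charlotte" as input, group 1 is "Charl" and group 2 is "otte", so the last replacement produces "otter Charlie horser".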
Now, on to Refine...
General Refine Expression Language
GREL functions that support Regex:
- replace
- match
- partition
- rpartition
- split
In GREL expressions, regexes are wrapped by forward slashes /
e.g. /(.*)/
We can make our own regexipes
https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions
http://docs.oracle.com/javase/tutorial/essential/regex/