@pmgreen
Last active November 21, 2022 21:49
Quick primer on using regular expressions in OpenRefine.

Using regular expressions in OpenRefine

A regular expression is a string that describes a text pattern occurring in other strings, m'kay.

Basic concepts

With which one can go quite far.

* metacharacters
* character escapes \
* anchors \A\Z or ^$
* character classes [][^]
* quantifiers *?+{min,max}
* grouping ()
* substitutions $1 or \1

Now, to mix them up...

Metacharacters

Characters that have special meaning.

. any character (almost*) 
\ escape character
| or
\A or ^ anchor: start of string
\Z or $ anchor: end of string
[] list or range to be matched
[^] negative list or range (NOT these characters)

*The dot doesn't match newline characters unless in single-line mode.

Shorthands

\d any digit
\D any non-digit
\w any word character [a-zA-Z0-9_]
\W any non-word char. [^a-zA-Z0-9_]
\s any whitespace char.
\S any non-whitespace char.
\t tab
\n newline
\r carriage return

Quantifiers

? zero or one
* zero or more
+ one or more
{n} exactly n of the preceding element.
{min,} min or more of the preceding element.
{min,max} between min and max of the preceding element.

Examples

\.? - zero or one period
\.{2,} - two or more periods
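The two examples above can be tried out with a minimal sketch in Python. Python's re module is only a stand-in here for the Java engine OpenRefine uses (an assumption of this sketch; the two agree on these quantifiers), and the sample strings are made up for illustration.

```python
import re

# Hypothetical sample strings, chosen only to exercise the quantifiers.
samples = ["Charles", "Charles.", "Charles..", "Charles..."]

# \.?  zero or one period after the name, nothing else allowed
optional_period = [s for s in samples if re.fullmatch(r"Charles\.?", s)]

# \.{2,}  two or more periods in a row, anywhere in the string
repeated_periods = [s for s in samples if re.search(r"\.{2,}", s)]

print(optional_period)
print(repeated_periods)
```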

Grouping and substitution

() group
$1,$2 or \1,\2 substitution

In Refine capture groups are referred to by their index in an array, like [0] or [1]
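As a rough illustration of groups being read by index, here is a small Python sketch. Python's re module stands in for Refine's Java engine (an assumption), and the variable names are hypothetical. It also shows \1 used inside a pattern to refer back to what the first group captured.

```python
import re

# Two capture groups: the stem and one of two endings.
m = re.search(r"(Charl)(ie|ene)", "Charlene")

# The captured groups come back as a sequence, first group at index 0.
groups = m.groups()
stem, ending = groups[0], groups[1]

# \1 inside the pattern means "the same text group 1 just matched",
# so (.)\1 finds a doubled character.
doubled = re.search(r"(.)\1", "Charlotte")

print(stem, ending, doubled.group(0))
```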

Character classes

\p{property} POSIX or Unicode character class
\P{property} *not* in POSIX or Unicode character class

Examples

\p{L} - letter
\p{P} - punctuation
\p{InBasicLatin} - in basic Latin Unicode block
\p{InArabic} - in Arabic Unicode block

Character class intersection &&

[\p{InArabic}&&\p{P}] - a character in the Arabic Unicode block AND ALSO punctuation

Modes

m multiline
s single-line
i case-insensitive

Example

((?i)t) - matches t or T

Look for these options in oXygen find/replace, in MarcEdit functions, etc.
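The three modes can be compared side by side in a short Python sketch. Python's re module is used as a stand-in for the Java engine (an assumption); it accepts the same inline (?i), (?m) and (?s) switches, but wants them at the very start of the pattern, so the example writes (?i)t rather than ((?i)t). The sample text is made up.

```python
import re

# Hypothetical two-line sample text.
text = "first line\nTitle"

# i: case-insensitive, so t matches both t and T
t_matches = re.findall(r"(?i)t", text)

# m: multiline, so ^ matches at the start of every line
line_starts = re.findall(r"(?m)^\w+", text)

# s: single-line, so the dot also matches the newline
with_s = re.search(r"(?s)line.Title", text)
without_s = re.search(r"line.Title", text)

print(t_matches, line_starts, with_s is not None, without_s is not None)
```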

Flavors

There are different regex engines and implementations...

  • Java -- OpenRefine uses this one
  • Perl 5
  • XML Schema
  • XPath/XQuery
  • etc.

NOTE: oXygen uses both Java and XML regular expressions. MarcEdit uses .NET regular expressions.


Exercise

Copy and paste this into Regexr...

Charly
Charlie
Charles
Charles.
charles
Charlene
Charlotte
Chuck

Regexes to try...

. 
.+ 
Charles
Charles 
Charles\.
Charl.
Charl.+
Charlie?
Charl(y|ie)
Charl(ie|ene)
Charl[eo]
Charl[^eo]
Charl.{2}
Charl.{2,4}
Charl.{4,}
(Charl)(.{4,})
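To check your answers away from Regexr, a few of the regexes above can be run against the name list with a minimal Python sketch. Python's re module is a stand-in for the Java flavor here (an assumption; these particular patterns behave the same in both), and the helper name matching is hypothetical.

```python
import re

names = ["Charly", "Charlie", "Charles", "Charles.",
         "charles", "Charlene", "Charlotte", "Chuck"]

def matching(pattern):
    """Return the names in which the pattern is found."""
    return [n for n in names if re.search(pattern, n)]

escaped_period = matching(r"Charles\.")    # literal period
optional_e = matching(r"Charlie?")         # the final e is optional
alternation = matching(r"Charl(y|ie)")     # y or ie
negated_class = matching(r"Charl[^eo]")    # next char is not e or o
four_or_more = matching(r"Charl.{4,}")     # at least 4 more characters

for result in (escaped_period, optional_e, alternation,
               negated_class, four_or_more):
    print(result)
```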

In the Substitution section, at the bottom...

$1
$1_
$2
$2$1
$2r $1ie horser
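The substitutions can also be sketched in Python, with the caveat that Python's re.sub writes group references as \1 and \2 where Regexr and Java write $1 and $2. This is a stand-in for illustration, not how Regexr or Refine run it. Only names with at least four characters after "Charl" are changed.

```python
import re

names = ["Charly", "Charlie", "Charles", "Charles.",
         "charles", "Charlene", "Charlotte", "Chuck"]

# Last regex from the list above: group 1 is the stem, group 2 the rest.
pattern = re.compile(r"(Charl)(.{4,})")

# $2$1 in Regexr becomes \2\1 here: swap the two groups.
swapped = [pattern.sub(r"\2\1", n) for n in names]

# $2r $1ie horser becomes \2r \1ie horser.
rebuilt = pattern.sub(r"\2r \1ie horser", "Charlotte")

print(swapped)
print(rebuilt)
```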

Now, on to Refine...


GREL

General Refine Expression Language

GREL functions that support Regex:

  • replace
  • match
  • partition
  • rpartition
  • split

In GREL expressions, regexes are wrapped by forward slashes / e.g. /(.*)/

Recipes

We can make our own regexipes


Resources

https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions

http://docs.oracle.com/javase/tutorial/essential/regex/

http://www.regular-expressions.info/

http://regexpal.com

http://www.regexr.com/

field1,field2,field3,field4
الناشر المكتبة الأهلية، بغداد،1381/1962-1962,http://arks.princeton.edu/ark:/88435/9s1616928,02493cem a2200505 a 4500,"Rural and agrarian issues in Brazil, 1971-2005"
الناشر دار الثقافة، بيروت :,http:/arks.princeton.edu/ark:/88435/h702q767p,03139ntm a22004817a 450,Lubābwa-baṣīrat dhawī al-albāb fī al-tawḥīd wa-al-ʻadl ʻalá madhhab ahl al-bayt ʻalayhim al-salām ... [etc.].
al-Maktabah al-AhlīyahBaghdād1381/1962-1962,http//dss.princeton.edu/cgi-bin/catalog/search.cgi?studyno=5077,05144nta a22005177a 500,"Trapping diaries of Donald Phillips,"

Regex Recipes for OpenRefine

Remember to wrap your expression in forward slashes, e.g. /\s{2,}/

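identify LDRs that are too short

/^.{24}$/

A leader is exactly 24 characters long, so values that fail to match this are too short (or too long).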

Much better, match a valid LDR from the schema (MARC21slim.xsd)

/[\d ]{5}[\dA-Za-z ]{1}[\dA-Za-z]{1}[\dA-Za-z ]{3}(2| )(2| )[\d ]{5}[\dA-Za-z ]{3}(4500|    )/
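As a sanity check, the schema pattern can be run against the three leaders from field3 of the sample data. This is a minimal Python sketch: Python's re module stands in for Refine's Java engine (an assumption), re.ASCII is added so \d means 0-9 only as it does in Java by default, and the variable names are hypothetical.

```python
import re

# Leader pattern from the recipe above, split over two lines for width.
LDR = re.compile(
    r"[\d ]{5}[\dA-Za-z ]{1}[\dA-Za-z]{1}[\dA-Za-z ]{3}(2| )(2| )"
    r"[\d ]{5}[\dA-Za-z ]{3}(4500|    )",
    re.ASCII,
)

# field3 values from the sample data.
leaders = [
    "02493cem a2200505 a 4500",
    "03139ntm a22004817a 450",
    "05144nta a22005177a 500",
]

# fullmatch requires the whole value to fit the pattern.
valid = [LDR.fullmatch(ldr) is not None for ldr in leaders]
lengths = [len(ldr) for ldr in leaders]

print(valid)    # [True, False, False]
print(lengths)  # [24, 23, 23]
```

The two rejected leaders are each one character short, which the length check alone would also catch; the schema pattern additionally rejects 24-character values with the wrong content.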

the last character of a line isn't a period

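At the end of the expression, either of these matches a final period, so the values that fail to match are the ones missing it. To match a last character that is not a period directly, use [^.]$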

\.\Z

or

\.$
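Since \.$ matches values that do end in a period, finding the ones that don't means keeping the non-matches. A minimal Python sketch of both approaches follows, with Python's re module standing in for the Java engine (an assumption). It sticks to $ because \Z does not mean quite the same thing in the two flavors.

```python
import re

# Names borrowed from the exercise list.
values = ["Charles", "Charles.", "Charlene"]

# Option 1: keep the values where \.$ does NOT match.
no_final_period = [v for v in values if not re.search(r"\.$", v)]

# Option 2: match a last character that is anything but a period.
negated = [v for v in values if re.search(r"[^.]$", v)]

print(no_final_period)
print(negated)
```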

the first character of a line isn't upper-case

\A[a-z]

or

\A\p{Lower}

May need to use ^ instead of \A for the beginning of the string
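A quick way to see the anchor at work is this Python sketch. Python's re module is a stand-in for the Java engine (an assumption), and it does not support \p{Lower}, so only the [a-z] form is shown.

```python
import re

# Names borrowed from the exercise list.
names = ["Charles", "charles", "Chuck"]

# \A pins the match to the very start of the string.
lowercase_start = [n for n in names if re.search(r"\A[a-z]", n)]

# ^ gives the same result on single-line values.
caret_start = [n for n in names if re.search(r"^[a-z]", n)]

print(lowercase_start, caret_start)
```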

double spaces

\s\s
\s{2}

Or more than one space...

\s{2,}
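Once runs of whitespace are found, the usual next step is collapsing them to a single space. Here is a minimal Python sketch of that, with Python's re module as a stand-in for the Java engine (an assumption) and a made-up sample string. Note that \s matches tabs and newlines as well as spaces.

```python
import re

# Hypothetical messy value: two spaces, then space-tab-space.
messy = "Charles  Charlene \t Charlotte"

# Detect: is there any run of two whitespace characters?
has_double = re.search(r"\s{2}", messy) is not None

# Repair: replace every run of two or more with one space.
cleaned = re.sub(r"\s{2,}", " ", messy)

print(has_double, cleaned)
```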

match non-Roman script

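[^\p{InBasicLatin}]

Note that this also matches Latin letters with diacritics (such as ā or ḥ), which fall outside the Basic Latin block.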

URLs that don't start with 'http://'

(This uses something we didn't talk about called 'negative lookahead' (?!myregex))

/^(?!http:\/\/)(.*)$/
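The lookahead can be tried on the three field2 values from the sample data. This is a minimal Python sketch, with Python's re module standing in for the Java engine (an assumption; both support negative lookahead with this syntax). The forward slashes need no escaping here because Python does not delimit the regex with them.

```python
import re

# field2 values from the sample data.
urls = [
    "http://arks.princeton.edu/ark:/88435/9s1616928",
    "http:/arks.princeton.edu/ark:/88435/h702q767p",
    "http//dss.princeton.edu/cgi-bin/catalog/search.cgi?studyno=5077",
]

# (?!http://) succeeds only where the text ahead is NOT http://
not_http = re.compile(r"^(?!http://)(.*)$")

bad_urls = [u for u in urls if not_http.search(u)]

print(bad_urls)
```

The first URL is well formed and is skipped; the other two (one missing a slash, one missing the colon) are caught.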