@arvindpdmn
Last active November 6, 2021 12:40

Regular Expressions
Devopedia, May 2019

0. Introduction

Read the basics of regular expressions on the Devopedia site.

The rest of this document shows examples of regex for the purpose of learning. We follow PCRE (PHP) flavour. You may use Regex101 to try out these examples online.

For the purpose of this tutorial, we use the following format:

  • Search: input --> /regex/modifier --> result
  • Search & Replace: input --> /regex/replace/modifier --> output

Where input, output or result strings end or begin with whitespace, we'll put them within a pair of double quotes for readability. These quotes are not part of the strings.

For most examples, we use the global modifier g.
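As a rough illustration of this notation, a global search amounts to collecting every non-overlapping match, and a search and replace amounts to a substitution. The sketch below assumes Python's `re` module in place of PCRE (the two agree on simple patterns like these); the helper name `search_all` and the replacement word are hypothetical.

```python
import re

def search_all(pattern, text, flags=0):
    """Return every non-overlapping match, like /pattern/g."""
    return [m.group() for m in re.finditer(pattern, text, flags)]

# Search: input --> /o/g --> result
matches = search_all(r"o", "Hello World")

# Search & Replace without the g modifier: only the first match is replaced
output = re.sub(r"o", "0", "Hello World", count=1)
```

Here `matches` holds the two `o` characters and `output` is `Hell0 World`.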

1. Basics

Characters and character classes:

1. Hello World --> /o/g --> 2 matches: o, o

2. Hello World --> /l/g --> 3 matches: l, l, l

3. Hello World --> /[A-Z][a-z]/g --> 2 matches: He, Wo

4. Hello World_Or_Planet --> /\W/g --> 1 match: 1 space

5. $22.50 --> /\d/g --> 4 matches: 2, 2, 5, 0

6. R G B --> /\S/g --> 3 matches: R, G, B

7. R G B --> /\s/g --> 2 matches: 2 spaces

8. Off 20%! --> /[^\d]/g --> 6 matches: O, f, f, " ", %, !

9. Off 20%! --> /[\D]/g --> 6 matches: O, f, f, " ", %, !

10. Mr. & Mrs. --> /Mr./g --> 2 matches: Mr., Mrs

11. Mr. & Mrs. --> /Mr\./g --> 1 match: Mr.

Anchors:

1. abc --> /^bc/g --> 0 match

2. abc --> /^.bc/g --> 1 match: abc

3. abcdef --> /.bcd$/g --> 0 match

4. abcdef --> /bcd\w\w$/g --> 1 match: bcdef

5. abcdef --> /\w\w$/g --> 1 match: ef

Boundaries:

1. This is a name --> /\bis\b/g --> 1 match: is

2. catfish concatenate kitty-catty --> /\w+\Bcat\w+/g --> 1 match: concatenate

3.
First line
Another line at the end --> /\A./mg --> 1 match: F

4.
First line
Another line at the end --> /.\Z/mg --> 1 match: d
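The boundary examples can be checked with a short script. This is a sketch that assumes Python's `re` module, where `\b`, `\B` and `\A` behave as in PCRE for these inputs; the variable names are illustrative only.

```python
import re

# \b requires a word boundary on both sides, so "This" is skipped
whole_word = re.findall(r"\bis\b", "This is a name")

# \B requires that "cat" is NOT at a word boundary, i.e. it sits inside a word
inner_cat = re.findall(r"\w+\Bcat\w+", "catfish concatenate kitty-catty")

# \A matches only at the very start of the input, even in multiline mode
text = "First line\nAnother line at the end"
first_char = re.findall(r"\A.", text, re.M)
```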

Alternation:

1. grey or gray --> /grey|gray/g --> 2 matches: grey, gray
                    /gr(e|a)y/g
                    /gr[ea]y/g

2. cats and dogs --> /^cat|dog/g --> 2 matches: cat, dog

3. cats and dogs --> /^(cat|dog)/g --> 1 match: cat
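The difference between examples 2 and 3 comes from precedence: without the group, `^` binds only to the first alternative. A minimal sketch, assuming Python's `re` module as a stand-in for PCRE:

```python
import re

text = "cats and dogs"

# ^cat|dog means (^cat) or (dog): the anchor applies to "cat" only
loose = [m.group() for m in re.finditer(r"^cat|dog", text)]

# ^(cat|dog) anchors both alternatives to the start of the input
anchored = [m.group() for m in re.finditer(r"^(cat|dog)", text)]
```

`loose` contains both `cat` and `dog`, while `anchored` contains only `cat`.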

Quantifiers:

1. Hello World --> /.*/ --> 1 match: Hello World

2. Hello World --> /.*/g --> 2 matches: Hello World, ""

3. Hello World --> /.+/g --> 1 match: Hello World

4. Hello World --> /\w+$/g --> 1 match: World

5. $22.50 --> /\d+/g --> 2 matches: 22, 50

6. colour or color --> /colou?r/g --> 2 matches: colour, color

7. bbc bcci mcc dcccx cccccd --> /c{2,3}./g --> 4 matches: cci, "cc ", cccx, cccc

8. bbc bcci mcc dcccx cccccd --> /c{2,}./g --> 4 matches: cci, "cc ", cccx, cccccd

9. bbc bcci mcc dcccx cccccd --> /c{,3}./g --> 0 matches, since this flavour reads {,3} as literal text rather than as a quantifier
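The bounded quantifiers can be tried out as follows. This sketch assumes Python's `re` module; it reproduces examples 5, 7 and 8, but not example 9, because Python treats `{,3}` as `{0,3}` instead of literal text.

```python
import re

text = "bbc bcci mcc dcccx cccccd"

# Two or three c's followed by any one character (greedy: takes three if it can)
two_to_three = re.findall(r"c{2,3}.", text)

# Two or more c's followed by any one character
two_or_more = re.findall(r"c{2,}.", text)

# One or more digits
numbers = re.findall(r"\d+", "$22.50")
```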

Match metacharacters:

1. -3.2 + 4.3 = 1.1 --> /[-\d\.]+/g --> 3 matches: -3.2, 4.3, 1.1

2. My IP address: 192.168.12.44 --> /\d+\.|\d+$/g --> 4 matches: 192., 168., 12., 44

3. Price is $22.50 --> /\$\d+/g --> 1 match: $22
                       /[$]\d+/g

4. /var/html/www --> /\/\w+/g --> 3 matches: /var, /html, /www

5. Cost (in Rupees) --> /\([^\)]+\)/g --> 1 match: (in Rupees)
                        /\([^)]+\)/g
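A short sketch of escaping metacharacters, assuming Python's `re` module. Since Python patterns are plain strings with no `/` delimiters, the forward slash needs no escaping there; the other escapes carry over unchanged.

```python
import re

# An escaped dot matches a literal ".", so each octet keeps its trailing dot
octets = re.findall(r"\d+\.|\d+$", "My IP address: 192.168.12.44")

# No delimiter in Python, so "/" is written as-is
path_parts = re.findall(r"/\w+", "/var/html/www")

# Escaped parentheses are literals; ")" needs no escape inside a class
bracketed = re.findall(r"\([^)]+\)", "Cost (in Rupees)")
```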

Modifiers or flags:

1. Hello hello HELLO --> /hello/ig --> 3 matches: Hello, hello, HELLO

2. Help me!! Quick!!! --> /\w(\w|\s)+ !+/gx --> 2 matches: Help me!!, Quick!!!
                          /\w[\w\s]+ !+/gx
                          /\w[\w ]+ !+/gx

3. 
First line
Second one --> /.*/g --> 4 matches: First line, "", Second one, ""

4. 
First line
Second one --> /.*/sg --> 2 matches: "First line\nSecond one", ""

5.
east to west
best is better
better than best --> /[a-z]est$/mg --> 2 matches: west, best
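For comparison, the modifiers map onto flag constants in Python's `re` module: `i` to `re.I`, `x` to `re.X`, `s` to `re.S` and `m` to `re.M`. The sketch below assumes that mapping; there is no `g` flag because `re.findall` is always global.

```python
import re

# i: case-insensitive matching
any_case = re.findall(r"hello", "Hello hello HELLO", re.I)

# x: whitespace in the pattern is ignored (but not inside a character class)
verbose = re.findall(r"\w[\w\s]+ !+", "Help me!! Quick!!!", re.X)

# m: $ matches at the end of every line, not only at the end of the input
lines = "east to west\nbest is better\nbetter than best"
line_endings = re.findall(r"[a-z]est$", lines, re.M)
```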

Groups (capturing and non-capturing):

1. My name is John Smith --> /My name is (\w+) (\w+)/ --> 1 match: (My name is John Smith, John, Smith)

2. Color is #12de87 --> /#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})/ig --> 1 match: (#12de87, 12, de, 87)

3. Color is #12de87 --> /#([0-9a-f]{2}){3}/ig --> 1 match: (#12de87, 87)

4. Color is #12de87 --> /#(?:[0-9a-f]{2}){3}/ig --> 1 match: #12de87

5. Color is #12de87 --> /[0-9a-f]{2}/ig --> 3 matches: 12, de, 87

6. abcd abcD aBc abcdefG --> /[a-z]{2}(?:[a-z]{2,3})?/ig --> 5 matches: abcd, abcD, aB, abcde, fG

7. bbc bcbi mkk deeex fffffd --> /([a-z]).\1+/g --> 3 matches: bcb, eee, fffff
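The tuples shown in the examples above correspond to the full match followed by the captured groups. A minimal sketch, assuming Python's `re` module, where `groups()` returns only the captured groups:

```python
import re

color = "Color is #12de87"

# Three separate groups capture three values
rgb = re.search(r"#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})", color, re.I).groups()

# One repeated group keeps only its last iteration
last_only = re.search(r"#([0-9a-f]{2}){3}", color, re.I).groups()

# \1 refers back to whatever group 1 captured
repeats = [m.group() for m in re.finditer(r"([a-z]).\1+", "bbc bcbi mkk deeex fffffd")]
```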

Search and replace with capturing groups:

1. My name is John Smith 
--> /My name is (\w+) (\w+)/First name: \1; Last name: \2/
--> First name: John; Last name: Smith

2. My name is John Smith 
--> /My name is (?P<first>\w+) (?P<last>\w+)/First name: \g<first>; Last name: \g<last>/
--> First name: John; Last name: Smith

3. "What", "when' and 'who' 
--> /(["'])\w+\1/g
--> 2 matches: "What", 'who'
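Numbered and named references in the replacement string can be sketched with `re.sub`, assuming Python's `re` module, which uses the same `(?P<name>...)` and `\g<name>` syntax. Both groups must be named for the named form to work.

```python
import re

text = "My name is John Smith"

# Numbered backreferences in the replacement
numbered = re.sub(r"My name is (\w+) (\w+)",
                  r"First name: \1; Last name: \2", text)

# Named groups, referenced by name in the replacement
named = re.sub(r"My name is (?P<first>\w+) (?P<last>\w+)",
               r"First name: \g<first>; Last name: \g<last>", text)
```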

2. Basic Exercises

Change date format from yyyy-mm-dd to dd-mm-yyyy:

/(\d{4})-(\d{1,2})-(\d{1,2})/\3-\2-\1/
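To reverse the date, the replacement has to emit the captured groups in the order day, month, year. A small sketch assuming Python's `re` module; the sample date is made up for illustration.

```python
import re

def to_dd_mm_yyyy(date):
    """Rewrite yyyy-mm-dd as dd-mm-yyyy by reordering the three groups."""
    return re.sub(r"(\d{4})-(\d{1,2})-(\d{1,2})", r"\3-\2-\1", date)

converted = to_dd_mm_yyyy("2019-05-21")
```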

Match an email address:

/[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,4}/

Match an IPv4 address:

# Without capture
/\b(?:\d{1,3}\.){3}\d{1,3}\b/

# Capture the parts
/\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b/

# Capture the parts with range checking
/\b(25[0-5]|2[0-4]\d|[01]?\d{1,2})\.
(25[0-5]|2[0-4]\d|[01]?\d{1,2})\.
(25[0-5]|2[0-4]\d|[01]?\d{1,2})\.
(25[0-5]|2[0-4]\d|[01]?\d{1,2})\b/x
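The range-checked version can be exercised as below. This is a sketch assuming Python's `re` module, with `re.X` playing the role of the x modifier; the sample inputs are illustrative.

```python
import re

# Verbose mode ignores whitespace and newlines in the pattern,
# so each octet can sit on its own line.
IPV4 = re.compile(r"""
    \b(25[0-5]|2[0-4]\d|[01]?\d{1,2})\.
      (25[0-5]|2[0-4]\d|[01]?\d{1,2})\.
      (25[0-5]|2[0-4]\d|[01]?\d{1,2})\.
      (25[0-5]|2[0-4]\d|[01]?\d{1,2})\b
""", re.X)

octets = IPV4.search("My IP address: 192.168.12.44").groups()

# 256 is outside 0-255, so no four-octet match can be found here
out_of_range = IPV4.search("256.1.1.1")
```

Note that the pattern searches within a longer string, so an input with five dotted numbers could still yield a match on four of them.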

Redirect a URL request:

https://techcrunch.com/2015/08/15/the-future-of-marketplace
--> /\/(\d{4})\/(\d{2})\/(\d{2})\//\/\3\/\2\/\1\//
--> https://techcrunch.com/15/08/2015/the-future-of-marketplace

Extract middle names, if any:

John Max Smith
Karthik Kumar
Jane McDonald
Salman al-Uzza Khan
Aparna Ranjan Roy
--> /^\S+ (\S+) \S+$/gm
--> 3 matches in group 1: Max, al-Uzza, Ranjan
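A quick check of the middle-name extraction, assuming Python's `re` module. With a single capturing group, `re.findall` returns the contents of group 1 for each match.

```python
import re

names = "\n".join([
    "John Max Smith",
    "Karthik Kumar",
    "Jane McDonald",
    "Salman al-Uzza Khan",
    "Aparna Ranjan Roy",
])

# re.M makes ^ and $ match at each line; only three-part names qualify
middle_names = re.findall(r"^\S+ (\S+) \S+$", names, re.M)
```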

3. Advanced

Lazy (non-greedy) match:

1. rupee (INR), dollar (USD), pound (GBP) --> /\(.+?\)/g --> 3 matches: (INR), (USD), (GBP)

2. Hello World, here we come, again! --> /.*?,/g --> 2 matches: "Hello World,", " here we come,"
                                         /[^,]*,/g

3. 12496 --> /\d{2,3}?/g --> 2 matches: 12, 49

4. 12496 --> /\d{2,3}?$/g --> 1 match: 496
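To see what laziness changes, it helps to run the greedy and lazy forms side by side. A minimal sketch assuming Python's `re` module, which supports the same `+?` and `{m,n}?` syntax:

```python
import re

text = "rupee (INR), dollar (USD), pound (GBP)"

# Lazy: stop at the first closing parenthesis
lazy = re.findall(r"\(.+?\)", text)

# Greedy: run to the last closing parenthesis
greedy = re.findall(r"\(.+\)", text)

# Lazy bounded quantifier: take the minimum (two digits) whenever possible
pairs = re.findall(r"\d{2,3}?", "12496")
```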

Possessive quantifiers:

1. 1249698 --> /\d++9/ --> no match

2. 1249698 --> /\d+9/ --> 1 match: 124969

Atomic grouping of the form (?>...):

1. 1 and 4 are integers. --> /\b(?>integer|insert|in)\b/g --> no match but fails faster

2. Let's insert 3 at the start. --> /\b(in|insert)\b/g --> 1 match: insert

3. Let's insert 3 at the start. --> /\b(?>in|insert)\b/g --> 0 match since we don't backtrack

Lookaround assertions of the forms (?=...), (?!...), (?<=...), (?<!...):

1. That Iraqi must be questioned. --> /q(?=u)\w+/ig --> 1 match: questioned

2. That Iraqi must be questioned. --> /q(?!u)\w+/ig --> 1 match: qi

3. _rabbit _dog _mouse DIC:cat:dog:mouse --> /_(\w+)\b(?=.*:\1\b)/g --> 2 matches: _dog, _mouse

4. He employs 1 cook, 5 waiters and 2 cleaners. --> /\b[a-z]+(?<!s)\b/ig --> 3 matches: He, cook, and
                                                    /\b[a-z]+[^s\s]\b/ig

5. He employs 1 cook, 5 waiters and 2 cleaners. --> /\b[a-z]+(?<=s)\b/ig --> 3 matches: employs, waiters, cleaners
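Lookarounds assert something about the neighbouring text without consuming it. The sketch below assumes Python's `re` module, which supports lookahead and fixed-width lookbehind with the same syntax.

```python
import re

sentence = "He employs 1 cook, 5 waiters and 2 cleaners."

# Negative lookbehind: the word must not end in "s"
not_plural = re.findall(r"\b[a-z]+(?<!s)\b", sentence, re.I)

# Positive lookbehind: the word must end in "s"
ends_in_s = re.findall(r"\b[a-z]+(?<=s)\b", sentence, re.I)

# Positive lookahead: "q" must be followed by "u"
q_then_u = re.findall(r"q(?=u)\w+", "That Iraqi must be questioned.", re.I)
```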

Conditionals of the form (?(if)then|else):

# https://www.regular-expressions.info/conditional.html
1. bd bc abc abd --> /(a)?b(?(1)c|d)/g --> 3 matches: bd, abc, bd

2. bd bc abc abd --> /(a?)b(?(1)c|d)/g --> 2 matches: bc, abc
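The two conditionals differ only in where the `?` sits. In `(a)?` the group may not participate at all, whereas `(a?)` always participates, possibly with an empty capture, so the "then" branch is always taken. A sketch assuming Python's `re` module, which supports the `(?(1)yes|no)` form:

```python
import re

text = "bd bc abc abd"

# Group 1 is optional: "c" is required only when "a" was captured
optional_group = [m.group() for m in re.finditer(r"(a)?b(?(1)c|d)", text)]

# Group 1 always participates, so "c" is always required
always_set = [m.group() for m in re.finditer(r"(a?)b(?(1)c|d)", text)]
```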

Recursion of the form (?R):

# https://www.regular-expressions.info/recurse.html
1. aaazz azz aaazzz --> /a(?R)?z/g --> 3 matches: aazz, az, aaazzz

2. 
(full || (half%3==0)) || (full && half)
--> /\((?>[^()]|(?R))*\)/g
--> 2 matches: (full || (half%3==0)), (full && half)

4. Advanced Examples

In an Apache server log, find all requests between 4 AM and 8 AM that resulted in an error:

127.0.0.1 - frank [10/Oct/2000:07:55:36 -0700] "GET /logo.png HTTP/1.0" 201 2326 
127.0.0.1 - john [10/Oct/2000:06:22:42 -0700] "GET /help HTTP/1.0" 404 -
127.0.0.1 - mary [10/Oct/2000:14:55:36 -0700] "GET /home HTTP/1.0" 500 120 
127.0.0.1 - arun [10/Oct/2000:12:55:36 -0700] "GET /about HTTP/1.0" 200 4377 

--> /:0[4-7](:\d{2}){2} -\d{4}] "([^"]+)" .*(?<=[45]\d{2}) (?:-|\d+)\s*$/gm

--> 1 match: the second line
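The log filter can be tried as follows. This is a sketch assuming Python's `re` module; the hour class `0[4-7]` covers 04:00 to 07:59, and the lookbehind `(?<=[45]\d{2})` requires a 4xx or 5xx status code. The names `LOG` and `ERRORS` are illustrative.

```python
import re

LOG = "\n".join([
    '127.0.0.1 - frank [10/Oct/2000:07:55:36 -0700] "GET /logo.png HTTP/1.0" 201 2326',
    '127.0.0.1 - john [10/Oct/2000:06:22:42 -0700] "GET /help HTTP/1.0" 404 -',
    '127.0.0.1 - mary [10/Oct/2000:14:55:36 -0700] "GET /home HTTP/1.0" 500 120',
    '127.0.0.1 - arun [10/Oct/2000:12:55:36 -0700] "GET /about HTTP/1.0" 200 4377',
])

ERRORS = re.compile(
    r':0[4-7](:\d{2}){2} -\d{4}] "([^"]+)" .*(?<=[45]\d{2}) (?:-|\d+)\s*$',
    re.M,
)

# Group 2 holds the request line of each matching entry
failed_requests = [m.group(2) for m in ERRORS.finditer(LOG)]
```

Only the 06:22 request qualifies: the 07:55 request succeeded with status 201, and the 500 error occurred at 14:55.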

Add commas to numbers (thousands, millions, billions):

/\d{1,3}(?=(\d{3})+(?!\d))/
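The pattern only locates the digit groups that need a comma after them; the comma itself comes from the replacement. A sketch assuming Python's `re` module, where `\g<0>` stands for the whole match; the helper name `add_commas` is hypothetical.

```python
import re

def add_commas(text):
    """Insert a comma after each digit run that is followed by whole groups of three."""
    return re.sub(r"\d{1,3}(?=(\d{3})+(?!\d))", r"\g<0>,", text)

millions = add_commas("1234567")
thousands = add_commas("1234")
unchanged = add_commas("999")
```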

Add commas to numbers (Indian convention of thousands, lacs, crores):

# Using alternation
/\d(?=(?:\d{2})+(\d{3})(?!\d)|(\d{3})(?!\d))/

# Simpler one
/\d{1,2}(?=(\d{2})*\d{3}(?!\d))/
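The simpler pattern can be applied the same way: each match is followed by zero or more pairs of digits and then a final group of three. A sketch assuming Python's `re` module; `add_indian_commas` is a hypothetical helper name.

```python
import re

def add_indian_commas(text):
    """Group the last three digits, then pairs of digits to the left."""
    return re.sub(r"\d{1,2}(?=(\d{2})*\d{3}(?!\d))", r"\g<0>,", text)

ten_lakhs = add_indian_commas("1234567")
one_lakh = add_indian_commas("123456")
small = add_indian_commas("999")
```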

More readable version of the alternation regex above, using the x modifier, which ignores whitespace in the regex:

/\d(?=
  (?:\d{2})+(\d{3})(?!\d) | # >=100000
  (\d{3})(?!\d) # >=1000 && <100000
)/x

Match content within nested HTML tags:

He said, "<span>I <strong>really don't</strong> like <em>ginger</em> tea or 
<b>black</b> coffee</span>", but he <span>was</span> lying.

--> /<(\w+)>(?=([^\1]+?)<\/\1>)/g

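--> 5 matches: group 2 contains the desired inner content of each tag

Note that \1 inside a character class is an octal escape rather than a backreference, so [^\1]+? effectively matches any character lazily, including newlines.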

Try the following and see why they aren't suitable:
1. /<(\w+)>.*<\/\1>/g
2. /<(\w+)>.*?<\/\1>/g
3. /<(\w+)>[^<]*<\/\1>/g

Match words (case insensitive) repeated within the same sentence:

Hello world, and hello again. I am a programmer, but I'm not a good at program-writing. I wish to learn it better.

--> /\b(\w+)\b(?=[^.?!]+\b\1\b)/ig

--> 3 matches: Hello, I, a
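The repeated-word search can be reproduced as below. This sketch assumes Python's `re` module, where backreferences also honour case-insensitive matching; `re.finditer` plays the role of a global search.

```python
import re

text = ("Hello world, and hello again. I am a programmer, but I'm not a good "
        "at program-writing. I wish to learn it better.")

# The lookahead scans forward without crossing ., ? or !,
# so the repeat must occur within the same sentence.
pattern = re.compile(r"\b(\w+)\b(?=[^.?!]+\b\1\b)", re.I)

repeated_words = [m.group(1) for m in pattern.finditer(text)]
```

Only the first occurrence of each repeated word is reported, since the lookahead checks what follows the word, not what precedes it.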