@CaseyDeriso
Last active February 21, 2023 04:13
REGEX tutorial | URL validator

Matching a URL with a Regular Expression

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

A regular expression (regex) is a definition for a search pattern. A regex combines literal characters and special tokens into a set of rules that a string is compared against.

In this case, we are trying to determine if the input is a URL.
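As a quick illustration of what "determining if the input is a URL" looks like in practice, here is a minimal JavaScript sketch that runs the expression against a few strings. The sample inputs and the variable names are assumptions made up for this example.

```javascript
// The URL pattern from this tutorial, stored as a regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = ['https://www.caseyderiso.com', 'www.caseyderiso.com', 'hello world'];

// RegExp.prototype.test returns true when the string matches the pattern.
const results = samples.map((input) => urlPattern.test(input));

console.log(results); // [ true, true, false ]
```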

Summary

In this episode of Who Wants to be a Regular Expression, we will break down each component of this regular expression to see how it works. We will look into the what, why, and how of each regex component to gain a deep understanding of regular expressions.

Regex Components

Anchors

^(https?:\/\/)?([\da-z\.-]+)

The first component of our regular expression looks quite complicated, but let's take it one piece at a time.

The first piece of our URL validator is a component called an Anchor which checks the string at a particular boundary point.

Anchors come in 4 types:

^ - Beginning

$ - End

\b  \B - Word Boundary and NOT Word Boundary

An anchor can be used, in this example, to check that the string begins with 'http'.

 ^http

Our URL validator is a little more complicated than that: its beginning anchor is followed by an optional group, so the string may start with 'http://', with 'https://', or with neither.
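The two most common anchors can be tried out on their own. This is a small sketch, assuming Node.js or a browser console; the inputs and names are hypothetical.

```javascript
// '^' pins the pattern to the start of the string.
const startsWithHttp = /^http/;
// '$' pins the pattern to the end of the string.
const endsWithCom = /com$/;

// Hypothetical inputs, chosen only for illustration.
const anchorChecks = {
  startMatch: startsWithHttp.test('http://caseyderiso.com'),       // 'http' is at the start
  startMiss: startsWithHttp.test('visit http://caseyderiso.com'),  // 'http' is not at the start
  endMatch: endsWithCom.test('www.caseyderiso.com'),               // 'com' is at the end
  endMiss: endsWithCom.test('www.caseyderiso.com/about'),          // 'com' is not at the end
};

console.log(anchorChecks);
// { startMatch: true, startMiss: false, endMatch: true, endMiss: false }
```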

Grouping and Capturing

(https?:\/\/)?([\da-z\.-]+)

Just like in plain ol' mathematics, it is sometimes important to perform operations in groups.

Looking at our regular expression again, we know from the anchor that this piece of the expression needs to be at the beginning of the string, but what is it checking?

In our case, we are checking a group to match the beginning of our test string.

https?:\/\/ and [\da-z\.-]+ are rules we want to bunch together to perform an operation on, or use as a back reference.
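To see what the two groups actually capture, here is a short sketch using String.prototype.match, which returns the full match at index 0 and each capture group after it. The input strings and variable names are assumptions for illustration.

```javascript
// The first two groups of the URL validator, anchored to the start.
const prefix = /^(https?:\/\/)?([\da-z\.-]+)/;

// With a scheme, group 1 captures it and group 2 captures the host.
const withScheme = 'https://caseyderiso.com'.match(prefix);
console.log(withScheme[1]); // 'https://'
console.log(withScheme[2]); // 'caseyderiso.com'

// Without a scheme, the optional group 1 is left undefined.
const withoutScheme = 'caseyderiso.com'.match(prefix);
console.log(withoutScheme[1]); // undefined
console.log(withoutScheme[2]); // 'caseyderiso.com'
```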

Quantifiers

https?

What if you're looking for a component of a string that is allowed to not exist?

In our case, we want 'http' and 'https' to be valid, so we use the '?' quantifier to say that there may be 1 or 0 's' after 'http'

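Quantifiers come in 6 types:

'?' -- Optional -- Matches 1 or 0
'+' ---- Plus ---- Matches 1 or more
'*' ---- Star ---- Matches 0 or more
'{n}' -- Exact --- Matches exactly n
'{n,}' - At least - Matches n or more
'{n,m}' - Range -- Matches between n and m, as in the {2,6} of our URL validator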
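The quantifiers can be compared side by side with a few small checks. This is a sketch with hypothetical patterns and inputs; the last pattern borrows the {2,6} range that appears in the full URL expression.

```javascript
const optionalS = /^https?$/;     // '?': the 's' may appear 0 or 1 times
const onePlus = /^o+$/;           // '+': one or more 'o'
const zeroPlus = /^o*$/;          // '*': zero or more 'o'
const twoToSix = /^[a-z]{2,6}$/;  // '{2,6}': between 2 and 6 lowercase letters

const quantifierChecks = [
  optionalS.test('http'),   // true
  optionalS.test('https'),  // true
  optionalS.test('httpss'), // false, only one 's' is allowed
  onePlus.test(''),         // false, '+' needs at least one character
  zeroPlus.test(''),        // true, '*' accepts zero characters
  twoToSix.test('com'),     // true
  twoToSix.test('c'),       // false, shorter than 2 letters
];

console.log(quantifierChecks);
```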

OR Operator

Another component of regex is the OR or Alternation operator

'|' - Alternation - Matches the expression before or after the operator (acts like a boolean OR)

(https)|(http) is a valid way to re-write the previous piece of our expression, https?.

By capturing 'https' and 'http', we were able to use the OR operator to choose one or the other with our test string.
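The equivalence between the two forms can be checked with a sketch like the one below. The alternatives are wrapped in one outer group here so that the '^' and '$' anchors apply to both of them; that wrapping, the inputs, and the names are choices made for this example. The last two lines show that JavaScript tries alternatives from left to right, which is why the longer option is usually listed first.

```javascript
const withOptional = /^https?$/;
// Alternation has the lowest precedence, so the outer group keeps
// the anchors attached to both alternatives.
const withAlternation = /^(https|http)$/;

// Hypothetical inputs, chosen only for illustration.
const inputs = ['http', 'https', 'ftp'];
const sameAnswers = inputs.every(
  (input) => withOptional.test(input) === withAlternation.test(input)
);
console.log(sameAnswers); // true

// Alternatives are tried left to right, so order matters when unanchored.
const shorterFirst = 'https'.match(/http|https/)[0];
const longerFirst = 'https'.match(/https|http/)[0];
console.log(shorterFirst); // 'http'
console.log(longerFirst);  // 'https'
```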

Greedy and Lazy Match

By default, '+' and '*' quantifiers are greedy, meaning that they will match as many characters as possible.

If we use a '?' after another quantifier, as in \w+? or \w*?, it becomes a lazy match and will match as few characters as possible.

A greedy match will match duplicate characters the first time

  'o+' will capture all characters of the string 'ooooooooo'

A lazy match tries the fewest characters first, then adds one character at a time until the rest of the pattern can match

  'o+?' will capture only the first 'o' in the string 'ooooooo'

  '^o+?$' will capture all characters in the string 'ooooooo'
  
  Only after trying:

    'o', 'oo', 'ooo', 'oooo', 'ooooo', and 'oooooo'

Depending on your input, it may be advantageous to use a lazy match to improve performance of your regex.
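The difference between greedy and lazy quantifiers is easiest to see by comparing what each one returns on the same input. A minimal sketch, with hypothetical inputs and variable names:

```javascript
const text = 'ooooooo';

// Greedy: takes as many 'o' characters as it can.
const greedy = text.match(/o+/)[0];
// Lazy: stops as soon as the pattern is satisfied.
const lazy = text.match(/o+?/)[0];
// Lazy, but the '$' anchor forces it to keep extending to the end.
const lazyAnchored = text.match(/^o+?$/)[0];

console.log(greedy);       // 'ooooooo'
console.log(lazy);         // 'o'
console.log(lazyAnchored); // 'ooooooo'

// The same contrast on a string with two bracketed tags.
const tags = '<a><b>';
const greedyTag = tags.match(/<.+>/)[0];
const lazyTag = tags.match(/<.+?>/)[0];
console.log(greedyTag); // '<a><b>'
console.log(lazyTag);   // '<a>'
```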

Character Sets and Classes

Character sets are groups of characters that we can define using square brackets.

'[abc]' will capture only the characters 'a', 'b', and 'c'

You can negate a set to choose everything except a chosen set of characters

'[^abc]' will capture all characters except 'a', 'b', and 'c' in this string:

  'abcD3fgh!'

'[a-z]' will capture all characters in a range from 'a' to 'z'

Character classes are like shortcuts we can use in regular expressions to capture groups of characters

In our URL matching regex, ([\da-z\.-]+), we are using the character class \d to find any digit.

Character classes include:

'.' --- dot  ---- matches any character EXCEPT line breaks

'\w' --- word --- matches any word character. '\w' captures caps and lowercase letters, numbers,
and underscores.

'\W' - NOT word - matches anything the word character does not. '\W' captures accented characters
such as 'è', commas, and white space

'\d and \D' ------ digit and NOT digit ------ matches any numeric or NON-numeric character

'\s and \S' - whitespace and NOT whitespace - matches any whitespace or NON-whitespace character. '\s' 
matches spaces, tabs and line breaks
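The sets and classes above can be tried against the sample string 'abcD3fgh!'. This sketch uses the 'g' flag (covered in the Flags section) so that String.prototype.match returns every matching character; the variable names are assumptions for illustration.

```javascript
const sample = 'abcD3fgh!';

// Each call returns an array of single-character matches, joined for display.
const inSet = sample.match(/[abc]/g).join('');      // only 'a', 'b', 'c'
const notInSet = sample.match(/[^abc]/g).join('');  // everything except 'a', 'b', 'c'
const lowercase = sample.match(/[a-z]/g).join('');  // the range 'a' to 'z'
const digits = sample.match(/\d/g).join('');        // digit characters
const nonWord = sample.match(/\W/g).join('');       // anything '\w' does not match

console.log(inSet);     // 'abc'
console.log(notInSet);  // 'D3fgh!'
console.log(lowercase); // 'abcfgh'
console.log(digits);    // '3'
console.log(nonWord);   // '!'
```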

Flags

Flags are another important component of regular expressions. They are placed at the end of the expression, after the closing slash.

'i' --- ignore case --- makes the whole expression match regardless of capitalization. 

'g' -- global search -- finds every match instead of stopping after the first. A global regex remembers
where the previous match ended, so the next search can continue from there. Global search is off by
default, so it needs to be turned on if you plan to find multiple matches in a single search.

'm' --- multi-line ---- The multi-line flag allows the '^' and '$' anchors to match the beginning and 
end of each line instead of the whole string when there are line breaks in your test string. 

In this example, we will see how the global flag can help us find multiple links in a test string.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

Will only capture a url as a string by itself:

  'www.caseyderiso.com'

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/gm

By using the global and multiline flags, we can capture multiple links that are line-broken

  `www.caseyderiso.com
   www.caseyderiso.com`
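The effect of the flags can be reproduced with a short sketch. It rebuilds the URL expression with different flags through the RegExp constructor; the variable names and the uppercase input are assumptions for illustration, and the two links are separated by a line break with no indentation.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;
// Same source pattern, with flags added.
const urlPatternGM = new RegExp(urlPattern.source, 'gm');
const urlPatternI = new RegExp(urlPattern.source, 'i');

// Two links separated by a line break.
const twoLinks = 'www.caseyderiso.com\nwww.caseyderiso.com';

// Without flags, '^' and '$' refer to the whole string, so nothing matches.
const noFlags = twoLinks.match(urlPattern);
// With 'g' and 'm', each line is matched on its own.
const withFlags = twoLinks.match(urlPatternGM);

console.log(noFlags);   // null
console.log(withFlags); // [ 'www.caseyderiso.com', 'www.caseyderiso.com' ]

// The pattern only lists lowercase letters, so 'i' is needed for uppercase input.
const upper = 'WWW.CASEYDERISO.COM';
const caseSensitive = urlPattern.test(upper);
const caseInsensitive = urlPatternI.test(upper);
console.log(caseSensitive);   // false
console.log(caseInsensitive); // true
```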

Boundaries

A word boundary is a zero-length position defined in regular expressions in relation to a word

Every word in a string has one invisible position on each side that defines a word boundary

  '*Look* *mom*' has 4 word boundaries, marked with '*'.

By adding the '\b' component before and after a search, we can make it a whole word search, and not catch unwanted characters. For example:

'\bhi\b' will only capture the whole word 'hi' in the following string:

  'height hi hit'
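Both boundary examples can be checked with a small sketch. The variable names are hypothetical; the last line replaces every zero-length boundary position with '*' to make the positions visible.

```javascript
const text = 'height hi hit';

// Without boundaries, 'hi' is also found inside 'hit'.
const loose = text.match(/hi/g);
// With boundaries, only the standalone word 'hi' is found.
const wholeWord = text.match(/\bhi\b/g);

console.log(loose.length);     // 2
console.log(wholeWord.length); // 1

// Mark every word boundary position with '*'.
const marked = 'Look mom'.replace(/\b/g, '*');
console.log(marked); // '*Look* *mom*'
```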

Back-references

Back references are another tool in regular expressions that allows us to keep the code clean and type less

'\1' - backreference - refers to a previously defined capture group, in this case, the first capture group

For example, when capturing a quote from a needy child:

'(mom) hey \1 \1 \1 \1'

will capture the entire string:

'mom hey mom mom mom mom'
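A short sketch of the back reference in action. The second pattern, which finds an accidentally doubled word, is a hypothetical extra example and not part of the URL validator; the names and inputs are made up for illustration.

```javascript
// '\1' must repeat exactly what the first group captured.
const needyChild = /(mom) hey \1 \1 \1 \1/;

const allMom = needyChild.test('mom hey mom mom mom mom');
const oneDad = needyChild.test('mom hey dad mom mom mom');
console.log(allMom); // true
console.log(oneDad); // false

// Hypothetical use: find a word that is immediately repeated.
const doubledWord = /\b(\w+) \1\b/;
const found = 'this is is a test'.match(doubledWord);
console.log(found[0]); // 'is is'
console.log(found[1]); // 'is'
```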


Look-ahead and Look-behind

Regular expressions allow us to look ahead of and behind the current position when searching through a string, without including the text we look for in the match.

'(?=\.txt)' -- look ahead -- will match any preceding rule only if it's followed by '.txt'

'(?<=\.)' -- look behind -- will match any subsequent rule only if it's preceded by a '.'

For example, if we wanted to find file names, but not show the file extension:

'.+(?=\.js)' will capture only the file's name 'server' in the following string

  'server.js'

and will not capture strings that are not JS files:

  'main.handlebars'
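The file name example can be run as the sketch below. The dot is escaped as '\.' so it only matches a literal period, and the look-behind pattern for the extension is a hypothetical companion example; look-behind assumes a reasonably recent JavaScript engine.

```javascript
// Look ahead: match characters only if '.js' follows them.
const jsFileName = /.+(?=\.js)/;

const serverName = 'server.js'.match(jsFileName)[0];
const notJs = 'main.handlebars'.match(jsFileName);
console.log(serverName); // 'server'
console.log(notJs);      // null

// Look behind: match the trailing word characters only if a '.' precedes them.
const extension = /(?<=\.)\w+$/;
const serverExt = 'server.js'.match(extension)[0];
console.log(serverExt); // 'js'
```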

Author

I'm Casey Deriso. I'm a dedicated lifelong learner who wants to share and expand my knowledge on all things programming related!

My GitHub
