Regex For The Uninitiated - URL Example

Regular expressions are not easy to decipher. The first time you see one, you may think, "Hey, someone's keyboard is broken, look at this alien code". As it turns out, it is a very clever code. A single regex line holds a complete set of instructions for checking whether a user's input follows a pattern. In this tutorial, we will explain, clearly and concisely, how a regular expression checks a URL pattern.  :-)
Summary

The first half of this tutorial explains the parts that make up a regex and how they work together. The second half walks through a real regular expression that checks whether a string looks like a URL. In this tutorial we will use the following regex literal:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

... and we will have fun with it.
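
To get a first feel for the pattern, here is a minimal JavaScript sketch that runs it against a few strings. The variable names and the sample strings are made up for illustration, and the pattern is written without any flag, as in the literal above.

```javascript
// The URL pattern from this tutorial, written as a JavaScript regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  "https://mail.google.com/user-1.jpg",
  "http://example.org",
  "example.org/",
  "not a url",
];

// test() returns true when the string follows the pattern.
const results = samples.map((sample) => urlPattern.test(sample));

samples.forEach((sample, i) => {
  console.log(`${sample} -> ${results[i]}`);
});
```

The first three strings follow the pattern and the last one does not. The rest of the tutorial explains why.
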
Table of Contents


Regex Components

Grouping Constructs
Bracket Expressions
Character Classes
Character Escapes
Quantifiers
Anchors
Flags


A Real Regex URL Example

Surrounded by slashes
The Backbone
Group One
Group Two
Group Three
Group Four
What Goes First, What Goes Last
Global Flag


Conclusion
External Resources
Author

Regex Components

Grouping Constructs

Grouping constructs split a pattern into groups. Each group defines a rule for a certain set of characters. Groups are written with parentheses, and a regex can have any number of them.
For example, in the regex that we are using today there are four groups:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  (           ) (           )  (            )(           )
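
In JavaScript, each pair of parentheses also captures the text it matched. The sketch below assumes a sample URL picked for illustration and reads the four groups back from the match result. The variable names are hypothetical.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample URL.
const match = "https://mail.google.com/user-1.jpg".match(urlPattern);

// match[0] is the whole match; match[1] to match[4] are the four groups.
const [whole, protocol, host, tld, path] = match;

console.log({ protocol, host, tld, path });
```

Each name in the destructuring lines up with one pair of parentheses, read from left to right.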

Bracket Expressions

The characters in those groups can be listed one by one or as ranges. It is easier to define a range of characters once than to write them all out every time we need them. The same is true for digits. A bracket expression can hold letters, digits, symbols, or any combination of them, and it matches a single character from that set. It is written inside a pair of square brackets. Here are some common examples:

[a-z] lowercase letters
[0-9] numbers
[-\/] some symbols

In our example, the bracket expressions (which contain mostly ranges) are:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                 [\da-z\.-]     [a-z\.]       [\/\w \.-]
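
As a quick check of the three example ranges listed above, this small sketch tests one character at a time. The variable names are made up for illustration.

```javascript
// Each bracket expression matches ONE character out of the listed set.
const lowercase = /[a-z]/;
const digit = /[0-9]/;
const dashOrSlash = /[-\/]/;

const bracketResults = {
  lowerQ: lowercase.test("q"),  // "q" is inside a-z
  upperQ: lowercase.test("Q"),  // uppercase is outside a-z
  seven: digit.test("7"),
  slash: dashOrSlash.test("/"),
  dash: dashOrSlash.test("-"),
  plus: dashOrSlash.test("+"),  // "+" is not in the set
};

console.log(bracketResults);
```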

Character Classes

Sometimes it is even easier to describe well-known ranges with shorter codes. These abbreviated forms are called Character Classes. The Character Classes used in our regex are:

[\d] any digit, the same as [0-9]
[\w] any alphanumeric character or the underscore, the same as [A-Za-z0-9_]

and you can see them in our regex here:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                  \d                             \w 
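
To see what the two shorthand codes cover, the sketch below compares \w with its long form on a handful of characters. The character list and the names are arbitrary choices for illustration.

```javascript
// Hypothetical characters to probe.
const chars = ["7", "x", "Z", "_", "-", " "];

const classTable = chars.map((ch) => ({
  ch,
  digit: /\d/.test(ch),              // shorthand for [0-9]
  word: /\w/.test(ch),               // shorthand for [A-Za-z0-9_]
  wordLong: /[A-Za-z0-9_]/.test(ch), // the long form, for comparison
}));

console.table(classTable);
```

Note that \w accepts the underscore but not the dash or the space, which is why our regex lists those separately inside its brackets.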

Character Escapes

As you can see in the previous section, we escape the letters d and w because, once escaped, they mean something other than a plain letter d or w. Escaping a character with a backslash changes its meaning: a plain letter can become a special code, and a special symbol can become a plain character. Examples:

\s s escaped means: match a single whitespace character
\d d escaped means: match a single digit
\/ slash escaped means just a slash. Without escape, it would be the character that wraps a regex literal
\. dot escaped means just a dot. Without escape, it would match any character except the newline character (\n)

Using our regex, let's see where the escaped characters are:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
          \/\/    \d   \.    \.                \/\w \.     \/
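
The difference between an escaped and an unescaped dot is easy to miss, so here is a small sketch that tries both. The patterns and names are invented for this illustration.

```javascript
const anyCharBetween = /^a.c$/;     // unescaped dot: any character except newline
const literalDotBetween = /^a\.c$/; // escaped dot: only a real dot

const escapeResults = {
  dotMatchesLetter: anyCharBetween.test("abc"),
  dotMatchesDot: anyCharBetween.test("a.c"),
  escapedDotMatchesLetter: literalDotBetween.test("abc"),
  escapedDotMatchesDot: literalDotBetween.test("a.c"),
  escapedSlash: /\//.test("a/b"), // \/ is just a slash
  whitespace: /\s/.test("a b"),   // \s finds the space
};

console.log(escapeResults);
```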

Quantifiers

Quantifiers indicate how many times the preceding character, bracket expression, or group may repeat. It is like setting a minimum and a maximum for that part of the string.

{2} matches exactly two times
{2,} matches at least two times
{2,6} matches at least two times and at most six
? matches zero or one time, so the preceding element is optional
+ matches at least one time, with no upper limit
* matches zero or more times, with no upper limit

Let's find the quantifiers in our example:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
        ?      ?           +           {2,6}            * *  ?   
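
The six quantifiers can be compared side by side. This sketch applies each one to the letter a and tests strings of zero, one, two, and seven letters. The pattern names and inputs are assumptions made for the example.

```javascript
// One tiny pattern per quantifier, anchored so the whole string must match.
const quantifierPatterns = {
  exactlyTwo: /^a{2}$/,
  atLeastTwo: /^a{2,}$/,
  twoToSix: /^a{2,6}$/,
  optional: /^a?$/,
  oneOrMore: /^a+$/,
  zeroOrMore: /^a*$/,
};

// Strings with 0, 1, 2 and 7 letters.
const inputs = ["", "a", "aa", "aaaaaaa"];

const quantifierTable = {};
for (const [name, pattern] of Object.entries(quantifierPatterns)) {
  quantifierTable[name] = inputs.map((input) => pattern.test(input));
}

console.log(quantifierTable);
```

Reading a row from left to right shows where each quantifier starts and stops accepting repetitions.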

Anchors

Anchors do not match characters. They match a position: the beginning or the end of a string. Our regex uses two anchor symbols:

^ indicates that the match must start at the beginning of the string
$ indicates that the match must finish at the end of the string

In our example of a regex matching a URL:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
 ^                                                            $  
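
A short sketch can show what the anchors change. It uses two small patterns, not the full URL regex, and a made-up sentence.

```javascript
const anywhere = /https?:\/\//;  // may match anywhere in the string
const atStart = /^https?:\/\//;  // must match at the very beginning
const slashAtEnd = /\/$/;        // must match at the very end

// Hypothetical input where the protocol is not at the start.
const sentence = "see http://example.org";

const anchorResults = {
  anywhereInSentence: anywhere.test(sentence),
  atStartInSentence: atStart.test(sentence),
  atStartInUrl: atStart.test("http://example.org"),
  endsWithSlash: slashAtEnd.test("example.org/"),
  slashInMiddle: slashAtEnd.test("example.org/a"),
};

console.log(anchorResults);
```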

Flags

Sometimes we need to change how the search is carried out. In this case, we can set these flags right after the closing slash of the regex literal:

g means global search: find all the matches instead of stopping at the first one
i means case insensitive, no distinction between upper and lowercase matches
u means match with full unicode

In this particular case, the only flag used is the global search:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                                                                g   
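
Here is a hedged sketch of the three flags in action, using a tiny pattern and a made-up sentence, not our URL regex. The emoji is written as an escape sequence so the code stays plain ASCII.

```javascript
const text = "Cat, cat, CAT";

const firstOnly = text.match(/cat/);     // no flag: stops at the first match
const allLowercase = text.match(/cat/g); // g: every match, case still matters
const allAnyCase = text.match(/cat/gi);  // g + i: every match, any case

// One emoji is a single character for people but two UTF-16 units for JavaScript.
const smiley = "\u{1F600}";
const dotWithoutU = /^.$/.test(smiley);  // the dot sees only half of it
const dotWithU = /^.$/u.test(smiley);    // with u, the dot sees the whole character

console.log(firstOnly[0], firstOnly.index);
console.log(allLowercase, allAnyCase);
console.log(dotWithoutU, dotWithU);
```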

A Real Regex URL Example

Surrounded by slashes

Let's use a real example. The regex we are working with today has slashes around it. And although it is full of alphanumeric characters, it is not surrounded by quotes. How is that possible?
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
/                                                              /

Let me explain a little further about a concept called literals. A literal is a value written directly in the source code, the actual content that a variable can hold. For example, to define a string literal, we surround it with quotes ("The quick brown fox jumps over the lazy dog"). For a binary literal, we precede the binary digits with 0b (a binary literal looks like this: 0b101). And last in this short list, in some languages a float literal is written with an F at the end of the decimal number (199.33F). The syntax to define a regex literal is to surround it, not with quotes, but with a pair of slashes (/). If you want to learn a little more about literals you can find a link at the end of this tutorial.
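
For comparison, JavaScript can also build a regex from a string with the RegExp constructor, which this tutorial does not otherwise use. The sketch below shows one small pattern written both ways. The pattern and the names are invented for the example.

```javascript
// Regex literal: slashes around the pattern, single backslashes.
const fromLiteral = /^\d+\.\d+$/;

// Same pattern built from a string literal: quotes around it,
// and every backslash doubled so it survives inside the string.
const fromString = new RegExp("^\\d+\\.\\d+$");

const literalResults = {
  isRegExp: fromLiteral instanceof RegExp,
  literalMatches: fromLiteral.test("199.33"),
  stringMatches: fromString.test("199.33"),
  literalRejects: fromLiteral.test("199x33"),
  stringRejects: fromString.test("199x33"),
};

console.log(literalResults);
```

Both forms produce the same kind of object. The literal is usually easier to read because the backslashes are not doubled.
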
The Backbone

One interesting thing is that a URL can be divided into certain groups of characters that always keep the same order. The regex literal is divided into groups using parentheses. In our URL regex example, we can see four groups, indicated by parentheses:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  (           ) (           )  (            )(           ) 

You can see that there are some characters that do not belong to any of those groups. This is very important because the groups and the characters outside those groups make up the backbone of our regex. Important note: if you see a backslash in front of a character, it is because the character needs to be escaped, since it already has a special meaning in a regex literal. You escape a character by preceding it with a backslash \. For example, a dot in a regex will be \. and a slash will be \/. In our example, the groups and the characters next to those groups are:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  (           )? (           ) \. (            )(           ) \/

Well, we now understand the \. and the \/, which are the escaped characters for a dot and a slash. Just to make it more readable, let's write the backbone again, replacing the escaped characters with the real ones:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  (           )? (           ) . (            )(           ) /

What about the question mark? What is that for? It is a special instruction for the regex. It means that the preceding group is optional. Really? What is the preceding group? Let's see. Let's think about the pattern we have at hand when we write a URL. What are usually the first characters in a URL? They are the characters 'http://'. But not always. Sometimes it will be 'https://', and sometimes the URL will begin right with the name of the domain, with no 'http://' preceding it. How can this kind of pattern be achieved by using regex? It is simple. Use a question mark to indicate that the preceding character or group of characters is optional. In this case, the ? following the first group indicates that the first group is optional.
Group One

Look inside the first group. Another question mark. It comes right after the character 's'. What does it mean? Easy. The character 's' is optional. Let's see how it would look:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  (  https:// )?(           ) .(            )(           ) 

or maybe this one, leaving the optional s out:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  ( http://   )?(           ) .(            )(           ) 

But that first group has its own question mark. It means that the whole group is optional. The whole expression may look like this and still be valid, with only three groups:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                (           ) .(            )(           ) 

It follows the pattern either way. So our first group is solved. We are here:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  ( https://  )?(           ) .(            )(           ) 
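
The three cases for the first group can be checked with a sketch that isolates just that group. The sub-pattern below is cut out of the full regex for illustration, and the candidate strings are assumptions.

```javascript
// Only the anchor and the first group of the tutorial's regex.
const protocolGroup = /^(https?:\/\/)?/;

// Hypothetical inputs: with https, with http, and with no protocol at all.
const candidates = [
  "https://mail.google.com",
  "http://mail.google.com",
  "mail.google.com",
];

// match()[1] holds what the first group captured, or undefined if it was skipped.
const captured = candidates.map((candidate) => candidate.match(protocolGroup)[1]);

console.log(captured);
```

The last entry is undefined because the whole group is optional and nothing was there to capture.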

Let's have a look at the second group.
Group Two

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                ([\da-z\.-]+) 

It contains a bracket expression followed by a plus sign. The brackets define a set of characters. This set contains the following, in no particular order:

all digits (the letter d escaped)
lowercase letters from a through z (letter a dash z)
the dot (the dot escaped)
a dash (character dash)

And we can see that this set must match at least one time because of the plus sign, so the group must exist. Unlike the question mark (which, as we already learned, makes an element optional), the plus sign requires at least one occurrence. Our second group is solved. We are here:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  ( https://  ) (mail.google) .(            )(           ) 
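
To try the second group on its own, this sketch wraps its bracket expression and plus sign in anchors, so the whole string has to be made of allowed characters. The anchored sub-pattern and the sample names are assumptions made for the example.

```javascript
// The bracket expression of group two, anchored for an isolated test.
const hostChars = /^[\da-z\.-]+$/;

// Hypothetical host names.
const hostCandidates = ["mail.google", "my-site01", "", "Mail", "my_site"];

const hostResults = hostCandidates.map((candidate) => hostChars.test(candidate));

hostCandidates.forEach((candidate, i) => {
  console.log(JSON.stringify(candidate), "->", hostResults[i]);
});
```

The empty string fails because of the plus sign, "Mail" fails because uppercase letters are outside a-z, and "my_site" fails because the underscore is not in the set.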

Let's analyze the third group.
Group Three

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                               ([a-z\.]{2,6})

What is the range of this group?

lowercase letters from a through z (letter a dash z)
the dot (the dot escaped)

Great. But what are the numbers in curly braces? Those are the minimum and maximum number of characters allowed for the preceding bracket expression. Minimum 2 and maximum 6? Just letters and dots? What part of the URL is that? That is the Top Level Domain, TLD for short. Those are the letters in a URL that identify the country or the top-level organization it points to, like Europe or the US military (find more about it in the External Resources section at the end of this tutorial). Let's see how our pattern goes. Using our previous example:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  ( https://  ) (mail.google) .(    com     )(           ) 


.com aligns with minimum 2, maximum 6 characters
.edu aligns too
.eu
.mil
.co.uk
.mobi
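
The list above can be checked with a sketch that tests the third group in isolation. The anchored sub-pattern is an assumption for the example, and the last two entries are made-up strings that fall outside the 2 to 6 limit.

```javascript
// The bracket expression and quantifier of group three, anchored for an isolated test.
const tldChars = /^[a-z\.]{2,6}$/;

// The endings from the list above (without the leading dot), plus two that should fail.
const tlds = ["com", "edu", "eu", "mil", "co.uk", "mobi", "c", "website"];

const tldResults = tlds.map((tld) => tldChars.test(tld));

tlds.forEach((tld, i) => {
  console.log(`${tld} -> ${tldResults[i]}`);
});
```

"c" is too short with one character and "website" is too long with seven.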

Good. The third group solved, let's move on to the fourth.
Group Four

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                                             ([\/\w \.-]*)

What is the range of this group?

a slash (slash escaped)
all alphanumeric characters and the underscore (the w escaped)
a space (a literal space character)
the dot (a dot escaped)
a dash (character dash)

The bracket expression in this group is followed by an asterisk, and so is the group itself. The asterisk is the regex instruction that allows the preceding element to match any number of times, even zero. This group will allow alphanumeric characters, underscores, spaces, dots, dashes, and slashes any number of times. This group is important because, after the TLD letters, we really do not know the structure of the path on the server. We are here:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  ( https://  ) (mail.google) .(    com     )(/user-1.jpg) 

Look at the asterisk at the end of the group. It means that the group may not exist at all, or that it may exist and repeat many times. One way to look at it:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  ( https://  ) (mail.google) .(    com     ) 

Another way to see how this would also match:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
  ( https://  ) (mail.google) .(    com     )(/forders01/pictures/user-1.jpg) 
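
The fourth group can be observed through the full regex. This sketch captures the path for a few URLs. The inputs after the first three are hypothetical additions: one has a space in the path, and one has a question mark, which is not in the group's set.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

const urls = [
  "https://mail.google.com",
  "https://mail.google.com/user-1.jpg",
  "https://mail.google.com/forders01/pictures/user-1.jpg",
  "https://mail.google.com/my file.txt", // a space is allowed by the set
  "https://mail.google.com/a?b",         // "?" is not in the set
];

// For each URL: the text captured by group four, or null when the URL does not match.
const paths = urls.map((url) => {
  const match = url.match(urlPattern);
  return match ? match[4] : null;
});

console.log(paths);
```

The first URL matches with no path at all, so group four captures nothing and its entry is undefined. The last URL does not match, so its entry is null.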

Done with group four!
What Goes First, What Goes Last

But we are not really done yet. There are some characters in our pattern that still need some explanation.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
 ^                                                            $                              

Those ^ and $ symbols are anchors. The first one, ^, marks where the match must begin: at the very start of the string. The second one, $, marks where the match must finish: at the very end of the string. Let's think for a moment. The ^ appears right before the first group, so if 'http://' (with an optional 's') exists in the URL, it must go first, with nothing in front of it. Since that group is optional, a URL that starts directly with the domain name is also accepted. In the same way, the $ means that nothing may follow the last part of the pattern.
Now, there is only one piece still to analyze inside the regex literal:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                                                           \/?                            

It is an escaped slash followed by a question mark, which makes the slash optional. That means that a URL may end with a slash, and it is OK to do that.
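
To see the anchors and the optional trailing slash at work in the full regex, the sketch below compares it with a copy that has the ^ and $ removed. The unanchored copy and the sample sentence are assumptions made only for this comparison.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// The same pattern with the two anchors removed, for comparison only.
const unanchoredPattern = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/;

// Hypothetical input: a URL with extra text in front of it.
const sentence = "visit: https://example.org";

const edgeResults = {
  sentenceAnchored: urlPattern.test(sentence),          // extra text in front is rejected
  sentenceUnanchored: unanchoredPattern.test(sentence), // the URL is found inside the text
  withoutSlash: urlPattern.test("example.org"),
  withSlash: urlPattern.test("example.org/"),
};

console.log(edgeResults);
```
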
Global Flag

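Now, at last, let's analyze the character after the closing slash of the regex literal. What is that letter g? It is called a flag, and it changes how the search is carried out. The g stands for global: the regex is tested against all the possible matches in a string instead of stopping at the first one. Because our pattern is anchored with ^ and $, it can match a string at most once, so here the flag makes little practical difference.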
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
                                                                g                           
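
One side effect of the g flag in JavaScript is worth knowing: a regex with g remembers where its last match ended (in its lastIndex property), and test() continues from there. The sketch below, with a made-up URL and hypothetical names, calls test() three times on the same string.

```javascript
// The tutorial's regex, this time with the g flag.
const globalUrlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g;

const url = "https://mail.google.com";

const first = globalUrlPattern.test(url);       // matches from position 0
const indexAfterFirst = globalUrlPattern.lastIndex;
const second = globalUrlPattern.test(url);      // resumes at lastIndex, so ^ cannot match
const third = globalUrlPattern.test(url);       // lastIndex was reset, matches again

console.log(first, indexAfterFirst, second, third);
```

If the regex is only used to validate one string at a time, leaving the g flag off avoids this alternating result.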

Conclusion

We just reviewed how a seemingly difficult regex can be decoded. By dividing it into groups, it became clear that the task was not that hard. The old saying 'divide and conquer' still holds. We are ready for more!
External Resources


Wikipedia article on literals
List of Top Level Domains
Test your regex and get accurate feedback
A much longer tutorial on regex

Author


@giannifontanot