tlcoles/regex-tutorial.md

## regex-tutorial.md

      
    A quick guide to understanding regex

For this week's challenge of the ESMT Coding Boot Camp, I explain regular expressions, commonly known as regex. I will draw from diverse online resources, which I will link to for your further research, and I will detail the parts of a regex using a specific example: a regex that matches a URL.
Not interested in the history of regex or key definitions? Click here to skip directly to the tutorial.
Table of Contents


History of regex
Definitions
Tutorial

Regex components
Grouping and capturing
Quantifiers
Bracket expressions
Flags


Acknowledgements
Author

History

Before we dive into what regex can do, a brief word on its origins. According to Wikipedia, the American mathematician Stephen Cole Kleene formalized the concept of a regular expression in 1951 with his mathematical notation called "regular events." In the late 1960s, the use of regex in computer science and programming grew. Today, regex is widely used in programming languages and libraries, including JavaScript, on which this tutorial is focused.
Read more about the history of regex on Wikipedia.
Definitions

A regular expression, or regex, is a pattern composed of a sequence of characters and used for searching text. These characters are either literal characters or metacharacters.
Let's use the word "apple" as an example.
A search for a literal character a would show one match with the "a" in apple. A search for a p would show two matches in apple.
Metacharacters, on the other hand, have special meanings. For example, in the sentence
I like to eat Gala apples.
the word apples could be described and found with \w\w\w\w\w\w. The metacharacter \w stands for a word character, meaning letters A to Z (whether lowercase or uppercase), digits 0 to 9, and the underscore. The word apples is six characters long. Be careful to recognize, however, that \w\w\w\w\w\w stands for any six word characters in a row, not only the word apples. You can see how that would work on the preceding text in the image below.

There are many metacharacters (see the JavaScript RegExp Reference on W3Schools). We will examine them in additional detail in the exercise below.
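The literal and metacharacter searches described above can be tried directly in JavaScript. The following is a minimal sketch, not part of the original exercise; the variable names and the extra test string "cherry tomato" are assumptions chosen for illustration.

```javascript
// Literal characters: count the matches of "a" and "p" in the word "apple".
const word = "apple";
const aMatches = word.match(/a/g); // one match
const pMatches = word.match(/p/g); // two matches

// Metacharacters: \w matches a single word character, so six in a row
// match any run of six word characters.
const sentence = "I like to eat Gala apples.";
const sixInARow = sentence.match(/\w\w\w\w\w\w/g);

// Hypothetical extra string: the same pattern is not specific to "apples".
const otherWords = "cherry tomato".match(/\w\w\w\w\w\w/g);

console.log(aMatches.length, pMatches.length); // 1 2
console.log(sixInARow);  // [ 'apples' ]
console.log(otherWords); // [ 'cherry', 'tomato' ]
```

In the sample sentence only apples is long enough to match, which is why the pattern appears to find the word itself.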
Tutorial

For this tutorial, we will search for a domain in a body of lorem ipsum text refashioned in Markdown, a markup language for plain-text editors. (This tutorial was written using Markdown.) The sample text was generated by Fillerama, a lorem ipsum generator that uses content from the animated series Futurama. Click here for a screenshot of the sample text on Fillerama.com. The text has letters, numbers, and special characters and, of course, so does the URL. We thus want to find the unique combination of characters written as a regex to describe a URL.
In the Markdown version created for this tutorial, I added a line that stated "For more Futurama-derived lorem ipsum, visit [Fillerama.io](http://fillerama.io/)."
To find a URL in that body of text, I use the following regex:
/(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)/g
The regex above matches both Fillerama.io and http://fillerama.io as seen in this screenshot of the test using regexr.com.
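The same test can be reproduced outside of Regexr. This is a minimal sketch that assumes only the single Markdown line quoted above rather than the full Fillerama sample text; the variable names are illustrative.

```javascript
// The URL regex from this tutorial, with the global flag.
const urlPattern = /(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)/g;

// The line added to the Markdown sample text.
const sampleLine =
  "For more Futurama-derived lorem ipsum, visit [Fillerama.io](http://fillerama.io/).";

// With the g flag, match() returns every matching substring.
const found = sampleLine.match(urlPattern);

console.log(found); // [ 'Fillerama.io', 'http://fillerama.io' ]
```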

For this tutorial, I rely heavily on Regexr as well as on W3Schools, MDN, and RexEgg. Please see the acknowledgements below for all resources used in this tutorial.
Regex components

A regex can comprise anchors, boundaries, quantifiers (greedy and lazy), conditionals, character classes, character escapes, flags/modifiers, capturing and non-capturing groups, bracket expressions, backreferences, and lookarounds. In this tutorial, I will only address those used in the regex of a URL. However, I encourage you to read about them all at RexEgg, a comprehensive resource on regular expressions. (See the acknowledgements section for links to RexEgg and more.)
Grouping and capturing

Because our regex begins with grouping – specifically (?:http(s)?:\/\/) – let's begin our detailed breakdown by explaining what grouping does.
Parentheses (also known as round brackets) work in regular expressions much like they do in mathematical equations. That is, () are used to group expressions together so that the group can be evaluated and used as a single unit. In the example regex
/(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)/g
we have (?:http(s)?:\/\/) and (?:\.[\w\.-]+).
In the first grouping, the literal characters h, t, t, p, s, :, and // really only matter as a group, either as http:// or https://. If you were to search for the literal characters individually in that same sample text, you would find many, many instances of them – 177 matches, in fact, according to Regexr.
As you can see in the screenshot of our example, however, the regex finds two matches. One of them, Fillerama.io, includes neither http:// nor https://. Which brings us to...
Quantifiers

In our regex, (?:http(s)?:\/\/)?, you can see a single ? both within and directly behind the grouping. This is unlike the ?: at the start of the parentheses, which designates the group as a non-capturing group. Rather, the single ? is an example of a quantifier, one of four – +, *, ?, and {x,y} – that modify the literal character, metacharacter, or group directly to their left.
The site RexEgg has a great example of these:

in A+ the once or more quantifier + applies to the character A
in \w* the zero times or more quantifier * applies to the metacharacter \w
in carrots? the zero times or once quantifier ? applies to the character s not to carrots
in (?:apple,|carrot,){1,9} the x times at least, y times at most quantifier {x,y} applies to the entire subexpression (?:apple,|carrot,).

Click here to read more about Mastering Quantifiers on rexegg.com.
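The four quantifier examples above can be checked with a few lines of JavaScript. This is only a sketch: the patterns are the ones quoted from RexEgg, while the test strings and variable names are assumptions made up for illustration.

```javascript
// + : once or more, applied to the character A.
const plus = "AAA!".match(/A+/)[0];

// * : zero times or more, applied to \w. A zero-length match is allowed.
const star = "!!".match(/\w*/)[0];

// ? : zero times or once, applied only to the final s, not to all of "carrots".
const optionalS = "carrot carrots".match(/carrots?/g);

// {1,9} : between one and nine repetitions of the whole non-capturing group.
const counted = "apple,carrot,apple,".match(/(?:apple,|carrot,){1,9}/)[0];

console.log(plus);      // AAA
console.log(star);      // (empty string)
console.log(optionalS); // [ 'carrot', 'carrots' ]
console.log(counted);   // apple,carrot,apple,
```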
In our example, the (s)? says that the match includes an http that may or may not have an s. Similarly, the (?:http(s)?:\/\/)? says that the match may have an instance of https:// or http:// or – as in the case of the match [Fillerama.io] – neither.
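To see both ? quantifiers and the difference between capturing and non-capturing groups at work, here is a minimal sketch. It uses the tutorial's regex without the g flag so that exec() reports the captured group; the address https://example.com and the variable names are hypothetical and not part of the sample text.

```javascript
// Same pattern as in the tutorial, minus the g flag.
const urlWithGroups = /(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)/;

const secure = urlWithGroups.exec("https://example.com"); // hypothetical URL
const plain = urlWithGroups.exec("http://fillerama.io");
const bare = urlWithGroups.exec("Fillerama.io");

// Index 0 holds the full match; index 1 holds the only capturing group, (s).
console.log(secure[0], secure[1]); // https://example.com s
console.log(plain[0], plain[1]);   // http://fillerama.io undefined
console.log(bare[0], bare[1]);     // Fillerama.io undefined

// The two (?: ... ) groups do not capture, so each result has just two entries.
console.log(secure.length); // 2
```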
Bracket expressions

Following the group, our regex continues with [\w.-]+. As you learned above, the + is the quantifier that means once or more of whatever is directly to its left. In this case, the square brackets, [], designate what exactly should be matched.
In a regex, [] means match any one character listed within the brackets. Here, that is a letter, digit, or underscore, represented by \w; a period, represented by the literal character .; or a dash, represented by the literal character -. Combined with the +, the expression matches a run of one or more of these characters in any mix.
The German luxury automotive manufacturer Mercedes-Benz has the domain https://www.mercedes-benz.com. Our regex has no problems detecting that it is as valid a domain as Fillerama.io.
There's much more that you can do with bracket expressions. Check out how bracket expressions work with character classes, for example, on Gnu.org.
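The effect of the bracket expression on the Mercedes-Benz domain can be sketched as follows. The second pattern, which leaves the dash out of the brackets, is a hypothetical variant added only for comparison; it is not part of the tutorial's regex.

```javascript
const host = "www.mercedes-benz.com";

// The bracket expression from the tutorial: word characters, periods, dashes.
const withDash = host.match(/[\w.-]+/g);

// Hypothetical variant without the dash: the match breaks at "-".
const withoutDash = host.match(/[\w.]+/g);

// The full tutorial regex applied to the complete URL.
const fullUrl = "https://www.mercedes-benz.com".match(
  /(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)/g
);

console.log(withDash);    // [ 'www.mercedes-benz.com' ]
console.log(withoutDash); // [ 'www.mercedes', 'benz.com' ]
console.log(fullUrl);     // [ 'https://www.mercedes-benz.com' ]
```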
Flags

Last but not least to be understood in /(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)/g is the trailing g. According to Javascript.info, JavaScript uses just six regex flags: i, g, s, m, u, and y. (Newer versions of the language also add d and v.) Click that link to get the details.
In our example, g, which stands for global search, means that all instances that match the regex will be shown, not just the first instance. The g is why both Fillerama.io and http://fillerama.io were matched, not just Fillerama.io.
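The difference the g flag makes can be sketched by running the same pattern with and without it. As before, this assumes only the single Markdown line added to the sample text, and the variable names are illustrative.

```javascript
const line =
  "For more Futurama-derived lorem ipsum, visit [Fillerama.io](http://fillerama.io/).";

// Without g: match() stops at the first match (and also reports groups).
const firstOnly = line.match(/(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)/);

// With g: match() returns every match in the text.
const allMatches = line.match(/(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)/g);

console.log(firstOnly[0]); // Fillerama.io
console.log(allMatches);   // [ 'Fillerama.io', 'http://fillerama.io' ]
```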
Acknowledgements

There is obviously a lot more to regex than can be explained in this brief guide. For example, not all regex concepts – Anchors, Quantifiers, Character Classes, Grouping and Capturing, Bracket Expressions, Greedy and Lazy Match, Look-ahead and Look-behind – are relevant to the URL regex. For this exercise, the criteria include "Be as concise as possible." So let's do that.
Moreover, a Google search on "find a URL in text with regex" will uncover other regular expressions that cover a greater variety of URL examples. I encourage you to read and explore more, as I did, with the resources below.

Regular expression on Wikipedia
RexEgg
Regexr
RegExTester
Adding Images to markdown files in Gist.markdown
Regular Expressions Cookbook, second edition via O'Reilly.com

Author

Tammi L. Coles is a professional writer and editor in corporate and nonprofit communications. Her work has appeared in diverse publications – including press releases, annual reports, news articles, corporate blogs, and grant proposals – as well as in business media like Forbes, Harvard Business Review, MIT/Sloan, and European Business Review, for which she was either the editor or the ghost writer. She is currently learning full-stack web development to extend her technical writing skills. See her in action on GitHub via @tlcoles.