@jmail1014
Last active March 15, 2022 18:33

Regex Tutorial

Hi! I am going to explain regex using an example that identifies email addresses. A regex for email can be written many ways, depending on which kinds of addresses you want to accept or reject. Mine matches a simple, general email address.

Summary

The regex, or regular expression, I will be describing is ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$, which matches a simple email address. I will explain what each part means, what kind of address it accepts, and what it excludes. ^ anchors the match to the beginning of the string. \w+ matches one or more alphanumeric characters or underscores, which is the name before the @. @ matches a single literal @ character. The bracket [a-zA-Z_] matches any letter a thru z in upper or lowercase as well as underscore, and the +? after it means one or more of those characters, which is the domain name. \. matches one literal period. The second bracket [a-zA-Z]{2,3} matches a minimum of two and a maximum of three letters a thru z in upper or lowercase; this is where the top-level domain (such as com) goes. $ anchors the match to the end of the string.
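To see the pattern in action, here is a small sketch in JavaScript (my assumption, since the tutorial does not name a language). It also assumes the period in the pattern is meant to be escaped as \. so that it matches a literal dot. The sample addresses are made up for illustration.

```javascript
// Assumption: the period is escaped (\.) so it only matches a literal dot.
const emailPattern = /^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$/;

// Hypothetical sample inputs, not taken from the tutorial.
const samples = [
  "jane_doe@example.com",  // simple address
  "jane@example",          // no period and no top-level domain
  "jane@mail.example.com", // two periods after the @
  "jane@example.info",     // top-level domain longer than three letters
];

const results = samples.map((address) => emailPattern.test(address));
console.log(results); // [ true, false, false, false ]
```

Only the first address passes, which shows how narrow this simple pattern is: it rejects subdomains and longer top-level domains.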

Table of Contents

Regex Components

Anchors

Anchors do not match characters; they match positions. The most common are ^ and $. The ^ character matches the start of the string, so everything after it must match from the very first character. If ^ is the first character inside brackets it means not instead. The $ character matches the end of the string, so whatever precedes it must be the last thing in the string. For example, e$ only matches a string that ends with an e. In this email example the ^ and $ together mean the entire string has to be an email address, with nothing extra before or after it.
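A quick sketch of how anchors behave, assuming JavaScript. The patterns and words here are made up for illustration and are not part of the email regex.

```javascript
// Hypothetical patterns to show each anchor on its own.
const startsWithCat = /^cat/; // "cat" must be at the start of the string
const endsWithE = /e$/;       // "e" must be the last character
const exactlyCat = /^cat$/;   // the whole string must be "cat"

console.log(startsWithCat.test("catalog")); // true
console.log(startsWithCat.test("bobcat"));  // false
console.log(endsWithE.test("apple"));       // true
console.log(endsWithE.test("apples"));      // false
console.log(exactlyCat.test("cat"));        // true
console.log(exactlyCat.test("catalog"));    // false
```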

Quantifiers

Quantifiers set how many times the item before them may repeat. In this email example ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$ the quantifiers are +, +? and {2,3}. The + after \w means one or more word characters. The +? after [a-zA-Z_] also means one or more, but it is the lazy version and takes as few characters as it can. Neither one applies to the period: \. has no quantifier, so it matches exactly one period. The curly braces {2,3} mean that the characters in the bracket before them, [a-zA-Z], must appear 2 to 3 times.
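Here is a minimal sketch of the + and {2,3} quantifiers, assuming JavaScript. The patterns are anchored with ^ and $ so the counts apply to the whole string, and the test strings are made up.

```javascript
// + means one or more of the preceding item.
const oneOrMoreA = /^a+$/;
console.log(oneOrMoreA.test(""));    // false (zero is not enough)
console.log(oneOrMoreA.test("a"));   // true
console.log(oneOrMoreA.test("aaa")); // true

// {2,3} means at least two and at most three of the preceding item.
const twoOrThreeLetters = /^[a-zA-Z]{2,3}$/;
console.log(twoOrThreeLetters.test("c"));    // false
console.log(twoOrThreeLetters.test("co"));   // true
console.log(twoOrThreeLetters.test("com"));  // true
console.log(twoOrThreeLetters.test("info")); // false
```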

OR Operator

The OR operator in this example lives inside the bracket expressions, so it is implied rather than written. [a-zA-Z_] means a character can be a lowercase letter or an uppercase letter or an underscore _. The same happens in the second bracket [a-zA-Z], which accepts a lowercase or an uppercase letter. Outside of a bracket, OR is written with the | operator, for example com|org.
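The two ways of writing OR can be compared side by side. This is a sketch assuming JavaScript; the list of endings (com, org, net) is only an illustration and is not something the email regex checks for.

```javascript
// A bracket expression is an implied OR between single characters.
const bracketForm = /^[abc]$/;
// The same choice written with the | operator inside a group.
const pipeForm = /^(a|b|c)$/;

for (const letter of ["a", "b", "c", "d"]) {
  console.log(letter, bracketForm.test(letter), pipeForm.test(letter));
}

// | can also choose between whole words, which brackets cannot do.
const knownEnding = /^(com|org|net)$/; // hypothetical list of endings
console.log(knownEnding.test("org")); // true
console.log(knownEnding.test("io"));  // false
```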

Character Classes

Character classes define a set of characters. The shorthand ones begin with a backslash \. In this example ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$ the \w is a character class. \w is any alphanumeric character or underscore _, which is the same as writing [a-zA-Z0-9_]. Here it sets the rules for the part of the address before the @ character. An unescaped . is also a character class and matches any single character except a line break. That is why the period in the pattern is escaped: \. matches only a literal period. It has no quantifier after it, so there must be exactly one period in that spot, no more and no less.
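The difference between \w, an unescaped period, and an escaped period is easy to check. This is a small sketch assuming JavaScript, with made-up test strings.

```javascript
// \w and the explicit set [A-Za-z0-9_] accept the same characters.
const wordChars = /^\w+$/;
const explicitSet = /^[A-Za-z0-9_]+$/;
console.log(wordChars.test("user_1"), explicitSet.test("user_1")); // true true
console.log(wordChars.test("user-1"), explicitSet.test("user-1")); // false false

// An unescaped . matches any single character (except a line break).
const anyCharacter = /^a.c$/;
console.log(anyCharacter.test("abc")); // true
console.log(anyCharacter.test("a.c")); // true

// An escaped \. matches only a literal period.
const literalPeriod = /^a\.c$/;
console.log(literalPeriod.test("abc")); // false
console.log(literalPeriod.test("a.c")); // true
```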

Flags

There are no flags used in this particular example. A flag is added after the end of a regex to change how the search works. In JavaScript, an example of a flag would be a g, i or m after the regex's closing slash. The g makes the search global, so it finds every match instead of stopping at the first. The i makes it case insensitive. The m turns on multiline mode, so ^ and $ match at the start and end of every line. These are just three examples; JavaScript has several other flags, such as s, u and y.
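Here is a short sketch of what g, i and m change, assuming JavaScript regex literals. The three-line sample text is made up for illustration.

```javascript
// Three lines, each with a different capitalization of "cat".
const text = "Cat\ncat\nCAT";

// g alone: every match, but still case sensitive.
const globalOnly = text.match(/cat/g);
console.log(globalOnly); // [ 'cat' ]

// g and i: every match, ignoring case.
const globalIgnoreCase = text.match(/cat/gi);
console.log(globalIgnoreCase); // [ 'Cat', 'cat', 'CAT' ]

// Without m, ^ and $ only match the start and end of the whole string.
const wholeStringOnly = text.match(/^cat$/gi);
console.log(wholeStringOnly); // null

// With m, ^ and $ match at the start and end of each line.
const perLine = text.match(/^cat$/gim);
console.log(perLine); // [ 'Cat', 'cat', 'CAT' ]
```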

Grouping and Capturing

Grouping and capturing use parentheses to treat part of a regex as a single unit. The parentheses are the group, so a quantifier or an | placed next to them applies to the whole group. Whatever text the group matched is the capture, and it is saved so it can be read or reused later. This example does not have grouping or capturing. An example would be (program), which matches the characters program in that exact order and captures them.
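To show what capturing gives you, here is a sketch that adds parentheses to the email pattern. The parentheses are my addition for illustration (the tutorial's pattern has none), the period is assumed to be escaped, and the address is made up. JavaScript is assumed.

```javascript
// Hypothetical variant of the email pattern with three capturing groups.
const groupedPattern = /^(\w+)@([a-zA-Z_]+)\.([a-zA-Z]{2,3})$/;

const match = "jane_doe@example.com".match(groupedPattern);

console.log(match[0]); // jane_doe@example.com (the whole match)
console.log(match[1]); // jane_doe (first group)
console.log(match[2]); // example  (second group)
console.log(match[3]); // com      (third group)
```

The groups do not change which addresses match; they only make the pieces available afterwards.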

Bracket Expressions

Bracket expressions are sets or ranges of characters inside brackets []. Each one matches a single character from the set. In this example ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$ there are two bracket expressions, one on each side of the period. [a-zA-Z_] states that the characters a thru z, both upper and lowercase, as well as the underscore are allowed. [a-zA-Z] states that the characters a thru z are allowed in lower or uppercase. This second bracket expression is followed by the quantifier {2,3}, which means it must be two to three characters from that range. A bracket expression can also be negated to define what must not appear. This looks the same as a positive set except that it starts with a ^ character. [^no] matches any single character that is neither n nor o.
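A positive and a negated bracket expression can be tried out like this. This is a sketch assuming JavaScript, and the words used are made up.

```javascript
// Positive set: every character must be a letter or an underscore.
const lettersOrUnderscore = /^[a-zA-Z_]+$/;
console.log(lettersOrUnderscore.test("my_domain")); // true
console.log(lettersOrUnderscore.test("my-domain")); // false

// Negated set: matches one character that is neither n nor o.
const notNorO = /[^no]/;
console.log(notNorO.test("noon")); // false (only n and o are present)
console.log(notNorO.test("note")); // true  (t and e qualify)

// Removing everything the negated set matches leaves only n and o.
const onlyNandO = "notion".replace(/[^no]/g, "");
console.log(onlyNandO); // noon
```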

Greedy and Lazy Match

A greedy match means the regex will match as much as it can. A lazy match means it will match as little as it can. Quantifiers are greedy by default. In this example ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$ the + after \w and the {2,3} are greedy quantifiers. Putting a ? directly after a quantifier makes it lazy, so +? matches the fewest characters that still let the rest of the pattern match. In the example the +? belongs to [a-zA-Z_], not to the period, so it means one or more letters or underscores, taking as few as possible. Because the pattern is anchored with ^ and $, the lazy version ends up matching the same strings as a plain + would here. The {2,3} applies to the preceding character range [a-zA-Z] and means a minimum of two characters and a maximum of three. You could also write this as a single number {2} meaning exactly two characters, or with a comma after it {2,} meaning at least two characters.
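Greedy and lazy are easiest to tell apart on a string where both could succeed. This is a sketch assuming JavaScript; the HTML-like sample string is made up for illustration.

```javascript
const sample = "<b>one</b><b>two</b>";

// Greedy: .+ grabs as much as it can, up to the last > in the string.
const greedy = sample.match(/<.+>/)[0];
console.log(greedy); // <b>one</b><b>two</b>

// Lazy: .+? stops at the first > it can reach.
const lazy = sample.match(/<.+?>/)[0];
console.log(lazy); // <b>

// The same idea with a counted quantifier.
const greedyCount = "aaaa".match(/a{2,3}/)[0];
const lazyCount = "aaaa".match(/a{2,3}?/)[0];
console.log(greedyCount); // aaa
console.log(lazyCount);   // aa
```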

Boundaries

\b is a word boundary. It is similar to an anchor because it matches a position, not a character: the spot between a word character and a non-word character. It is placed before or after a set of characters to mark where a word starts or ends. This example does not use boundaries. An example of one would be \bbed or \bbed\b. \bbed matches bed only at the start of a word, so it matches in bed and bedroom but not in embedded. When the boundary is placed both before and after, only the whole word matches. So \bbed\b would only identify bed.
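The bed example can be checked directly. This is a sketch assuming JavaScript, with a few made-up words and one short sentence.

```javascript
const startsAWord = /\bbed/; // "bed" at the start of a word
const wholeWord = /\bbed\b/; // "bed" as a complete word

console.log(startsAWord.test("bedroom"));  // true
console.log(startsAWord.test("embedded")); // false ("bed" is in the middle)

console.log(wholeWord.test("bedroom"));        // false
console.log(wholeWord.test("the bed is new")); // true
```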

Back-references

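A back-reference matches the same text that an earlier capturing group already matched. Using grouping and capturing you capture some characters once and then use a back-reference such as \1 (for the first group) to require that those exact characters repeat. For example, (\w)\1 matches a doubled letter like the tt in letter. There are no back-references in this example.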
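Here is a small sketch of back-references, assuming JavaScript. Both patterns and the sample sentences are made up for illustration.

```javascript
// \1 must repeat exactly what the first group captured.
const doubledLetter = /(\w)\1/;
console.log("letter".match(doubledLetter)[0]); // tt
console.log(doubledLetter.test("later"));      // false

// The same idea with a whole word: find an accidental repeat.
const doubledWord = /\b(\w+) \1\b/;
const found = "this is is a test".match(doubledWord);
console.log(found[0]); // is is
console.log(found[1]); // is
console.log(doubledWord.test("this is a test")); // false
```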

Look-ahead and Look-behind

Lookaround assertions check what comes next to the current position without making it part of the match. There are positive and negative versions of each. A look-ahead checks the text after the current position: (?=...) requires that the pattern follows (positive) and (?!...) requires that it does not (negative). A look-behind checks the text before the current position: (?<=...) is positive and (?<!...) is negative. In this example there are no lookaround assertions. Like grouping and capturing they are written inside parentheses, but they start with a ? character.
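The four kinds of lookaround can be sketched like this, assuming a JavaScript engine that supports look-behind (current Node.js and browsers do). The sample strings are made up for illustration.

```javascript
const address = "jane@example.com";

// Positive look-ahead: word characters that are followed by an @.
const beforeAt = address.match(/\w+(?=@)/)[0];
console.log(beforeAt); // jane (the @ is checked but not included)

// Positive look-behind: word characters that come right after an @.
const afterAt = address.match(/(?<=@)\w+/)[0];
console.log(afterAt); // example

// Negative look-ahead: a q that is not followed by a u.
const qWithoutU = /q(?!u)/;
console.log(qWithoutU.test("Iraq")); // true
console.log(qWithoutU.test("quit")); // false

// Negative look-behind: a number that does not come right after a $.
const plainNumber = "cost $40 or 15 units".match(/(?<!\$)\b\d+/)[0];
console.log(plainNumber); // 15
```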

Author

My name is Jessica Long and I am currently a coding boot-camp student at UNCC. I love building things and solving bugs. Check out my work so far on [github](https://github.com/jmail1014 "my profile").
