One of the recurring challenges of programming web interfaces and web applications is parsing URLs consistently.
It can be difficult and messy to pull specific pieces out of a URL, or even to determine whether user input is a URL at all.
Thankfully, most general-purpose programming languages have built-in or third-party libraries that support matchers called regular expressions (often shortened to "regex" or "regexp"). These character sequences let developers detect and parse strings that fit a particular pattern.
In this gist, I'll break down one regex that helps with URLs:
/^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/
This regex includes four grouping constructs that capture specific pieces of a matched string, and it supports most basic URLs. Here is an example using it in JavaScript:
const regex = /^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/;
const [full, group1, group2, group3, group4] = regex.exec('https://github.com/Shengaero');
console.log(full); // https://github.com/Shengaero
console.log(group1); // https://
console.log(group2); // github
console.log(group3); // com
console.log(group4); // /Shengaero
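One caveat with the example above: `exec` returns `null` when the input does not match, and destructuring `null` throws a `TypeError`. A minimal sketch of a guarded wrapper follows; `parseUrl` and its property names are hypothetical, chosen here for illustration, and are not part of the original gist.

```javascript
// Same pattern as in the gist.
const urlRegex = /^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/;

// parseUrl is a hypothetical helper: it returns the captured pieces,
// or null when the input does not match the pattern.
function parseUrl(input) {
  const match = urlRegex.exec(input);
  if (match === null) {
    return null;
  }
  const [full, protocol, name, suffix, path] = match;
  return { full, protocol, name, suffix, path };
}

console.log(parseUrl('https://github.com/Shengaero').name); // github
console.log(parseUrl('not a url'));                         // null
console.log(parseUrl('https://GitHub.com'));                // null
console.log(parseUrl('https://example.com/search?q=1'));    // null
```

The last two calls show where "most basic URLs" ends: the pattern is case-sensitive for the host, and no group accepts a query string.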
The start and end anchors are `^` and `$`, respectively. They pin the pattern to the beginning and the end of the input, so the entire string has to match.
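To see what the anchors contribute, here is a small sketch comparing the gist's pattern with a copy that has `^` and `$` removed. The names `anchored`, `unanchored`, and `sentence` are illustrative only.

```javascript
// The gist's pattern, and the same pattern without ^ and $.
const anchored = /^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/;
const unanchored = /(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?/;

const sentence = 'visit github.com today';

// With anchors, the whole string must be a URL.
console.log(anchored.test(sentence));      // false

// Without anchors, a URL-like substring anywhere is enough.
console.log(unanchored.test(sentence));    // true
console.log(unanchored.exec(sentence)[0]); // github.com
```

Keeping the anchors is what makes the pattern usable for validating input rather than searching within it.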
The regex uses four quantifiers:
- `?` - Preceding element is optional: zero or one may be present.
- `*` - Preceding element is optional and repeatable: zero, one, or more may be present.
- `+` - Preceding element is required and repeatable: one or more must be present.
- `{2,6}` - Preceding element must repeat at least 2 and at most 6 times.
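The quantifiers are easier to compare on tiny patterns. The sketch below uses made-up patterns (not taken from the gist) to exercise `?`, `*`, and `+`, plus the `{2,6}` range quantifier that appears in group 3 of the URL regex.

```javascript
// ? : the "s" may appear zero or one time.
const optionalS = /^https?$/;
console.log(optionalS.test('http'));   // true
console.log(optionalS.test('https'));  // true
console.log(optionalS.test('httpss')); // false

// * : "b" may appear zero or more times.
const zeroOrMore = /^ab*$/;
console.log(zeroOrMore.test('a'));     // true
console.log(zeroOrMore.test('abbb'));  // true

// + : "b" must appear one or more times.
const oneOrMore = /^ab+$/;
console.log(oneOrMore.test('a'));      // false
console.log(oneOrMore.test('abbb'));   // true

// {2,6} : between 2 and 6 lowercase letters.
const twoToSix = /^[a-z]{2,6}$/;
console.log(twoToSix.test('a'));       // false
console.log(twoToSix.test('io'));      // true
console.log(twoToSix.test('museum'));  // true
console.log(twoToSix.test('website')); // false (7 letters)
```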
The regex has four grouping constructs. Here is an explanation of each, in order:
- Group 1: `(https?:\/\/)?`
  - Optional "http://" or "https://"
- Group 2: `([\da-z\.\-]+)`
  - A sequence of one or more characters, each being any of the following:
    - Digits
    - Lowercase letters
    - Periods
    - Hyphens
- Group 3: `([a-z\.]{2,6})`
  - A sequence of 2 to 6 characters, each being any of the following:
    - Lowercase letters
    - Periods
- Group 4: `([\/\w\.\-]*)?`
  - Optional sequence of zero or more characters, each being any of the following:
    - Slashes
    - Word characters
    - Periods
    - Hyphens
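The four groups can also be given names, which avoids counting positions when reading the result. This is a sketch using JavaScript's named capture groups (ES2018 and later); the names `protocol`, `name`, `suffix`, and `path` are labels assumed here for illustration, not names used by the gist.

```javascript
// The gist's pattern with each group labelled via (?<label>...).
const namedUrlRegex =
  /^(?<protocol>https?:\/\/)?(?<name>[\da-z\.\-]+)\.(?<suffix>[a-z\.]{2,6})(?<path>[\/\w\.\-]*)?\/?$/;

const result = namedUrlRegex.exec('http://www.example.co.uk/docs/index.html');
const groups = result.groups;

console.log(groups.protocol); // http://
console.log(groups.name);     // www.example.co
console.log(groups.suffix);   // uk
console.log(groups.path);     // /docs/index.html
```

Because group 2 is greedy and also accepts periods, it takes everything up to the last period before the path. A multi-part suffix such as "co.uk" is therefore split, with "co" landing in group 2.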
The regex uses a few character classes:
- `\d` - Digits (0 through 9)
- `a-z` - Lowercase letters (a range inside a bracket expression)
- `\w` - Word characters, which include the following:
  - Digits
  - Letters (both lowercase and uppercase)
  - Underscores
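A quick way to check what each class accepts is to test single characters against it. The sketch below does that with small illustrative patterns, then shows one consequence for the URL regex: the host groups only allow `a-z`, while the path group uses `\w`, so uppercase letters are accepted in the path but not in the host.

```javascript
// \d matches a single digit.
const digit = /^\d$/;
console.log(digit.test('7')); // true
console.log(digit.test('x')); // false

// a-z inside brackets matches a single lowercase letter.
const lower = /^[a-z]$/;
console.log(lower.test('g')); // true
console.log(lower.test('G')); // false

// \w matches a digit, a letter of either case, or an underscore.
const word = /^\w$/;
console.log(word.test('G'));  // true
console.log(word.test('_'));  // true
console.log(word.test('-'));  // false

// Effect on the gist's pattern.
const urlRegex = /^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/;
console.log(urlRegex.test('github.com/Shengaero')); // true
console.log(urlRegex.test('GitHub.com/Shengaero')); // false
```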
The regex also has several character escapes:
- `\/` - escape for a slash
- `\.` - escape for a literal period character
- `\-` - escape for a literal hyphen
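The escapes matter because the unescaped characters mean something else. A short sketch with illustrative patterns: an unescaped `.` matches almost any character, and an unescaped `/` would end a JavaScript regex literal early.

```javascript
// Unescaped, "." matches any character except a line terminator.
const unescapedDot = /^a.c$/;
console.log(unescapedDot.test('a.c')); // true
console.log(unescapedDot.test('abc')); // true

// Escaped, "\." matches only a literal period.
const escapedDot = /^a\.c$/;
console.log(escapedDot.test('a.c'));   // true
console.log(escapedDot.test('abc'));   // false

// In a regex literal, "/" delimits the pattern, so it must be escaped.
const literal = /^https?:\/\//;

// The RegExp constructor takes a string, so "/" needs no escape there.
const constructed = new RegExp('^https?://');

console.log(literal.test('https://github.com'));     // true
console.log(constructed.test('https://github.com')); // true
```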
Kaidan Gustave is a Java/JavaScript/Kotlin developer making web applications for browsers. Feel free to check him out on GitHub.