@Shengaero
Created October 20, 2022 20:56
URL Regex Walkthrough

One of the biggest challenges of programming with web interfaces and various web applications is consistently parsing through URLs.

It can be difficult and messy to find certain pieces of a URL, or even to determine whether a user's input is a URL to begin with.

Thankfully, most general-purpose programming languages these days have built-in or third-party libraries that support matchers called Regular Expressions (often called "regex" or "regexp"). These handy character sequences let developers detect and parse strings that fit a particular pattern.

Summary

In this gist, I'll be breaking down one regex that helps with matching URLs:

/^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/

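This regex includes four grouping constructs that grab specific pieces of information from a matching string. It supports most basic URLs: an optional http or https prefix, a lowercase host, and a simple path, but not ports, query strings, or fragments. Here is an example using it in JavaScript: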

const regex = /^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/;

const [full, group1, group2, group3, group4] = regex.exec('https://github.com/Shengaero');

console.log(full);      // https://github.com/Shengaero
console.log(group1);    // https://
console.log(group2);    //         github
console.log(group3);    //                com
console.log(group4);    //                   /Shengaero
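
The example above destructures the result of exec() directly, which throws a TypeError if the input does not match, because exec() returns null in that case. Here is a minimal sketch of a null-safe wrapper. The names parseUrl, urlRegex, protocol, domain, tld, and path are hypothetical and chosen only for illustration:

```javascript
// Same pattern as above; urlRegex and parseUrl are illustrative names.
const urlRegex = /^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/;

function parseUrl(input) {
  const match = urlRegex.exec(input);
  // exec() returns null when the input does not match the pattern.
  if (match === null) return null;
  const [full, protocol, domain, tld, path] = match;
  return { full, protocol, domain, tld, path };
}

const parsed = parseUrl('https://github.com/Shengaero');
console.log(parsed.domain);          // github
console.log(parsed.path);            // /Shengaero
console.log(parseUrl('not a url'));  // null
```

Returning null keeps the decision about how to handle bad input with the caller.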

Regex Components

Anchors

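The regex begins with ^ and ends with $, the start and end anchors (respectively). Together they require the pattern to match the entire input string rather than just a portion of it.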
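
To see what the anchors contribute, here is a small sketch comparing the regex with a copy that has ^ and $ removed. The unanchored copy and the variable names are made up for this comparison and are not part of the original regex:

```javascript
const anchored = /^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/;
// Same pattern with ^ and $ removed, for comparison only.
const unanchored = /(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?/;

const sentence = 'visit github.com today';

console.log(anchored.test(sentence));      // false, the whole string must be a URL
console.log(unanchored.test(sentence));    // true, a URL-like piece appears somewhere
console.log(anchored.test('github.com'));  // true
```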

Quantifiers

The regex uses four types of quantifiers:

  • ? - Preceding element is optional; either none or one can be present.
  • * - Preceding element is optional and repeatable; none, one, or more than one can be present.
  • + - Preceding element is required and repeatable; one or more than one can be present.
  • {2,6} - Preceding element is required and repeatable within bounds; at least 2 and at most 6 can be present.
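
The quantifiers are easier to compare on small patterns. The patterns below are simplified fragments written for illustration, not pieces copied verbatim from the regex. The last three lines show the bounded {2,6} form that group 3 uses:

```javascript
// ? : zero or one "s"
console.log(/^https?$/.test('http'));    // true
console.log(/^https?$/.test('https'));   // true
console.log(/^https?$/.test('httpss'));  // false

// * : zero or more, so an empty string is accepted
console.log(/^[a-z]*$/.test(''));        // true

// + : one or more, so an empty string is rejected
console.log(/^[a-z]+$/.test(''));        // false

// {2,6} : between 2 and 6 repetitions
console.log(/^[a-z]{2,6}$/.test('c'));        // false
console.log(/^[a-z]{2,6}$/.test('com'));      // true
console.log(/^[a-z]{2,6}$/.test('example'));  // false, 7 letters
```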

Grouping Constructs

The regex has four grouping constructs. Here is an explanation of them in order:

  • Group 1: (https?:\/\/)? - Optional "http://" or "https://"
  • Group 2: ([\da-z\.\-]+) - A sequence of one or more characters including any of the following:
    • Numbers
    • Lowercase Letters
    • Periods
    • Hyphens
  • Group 3: ([a-z\.]{2,6}) - A sequence of 2 to 6 characters including any of the following:
    • Lowercase Letters
    • Periods
  • Group 4: ([\/\w\.\-]*)? - Optional sequence of zero or more characters including any of the following:
    • Slashes
    • Word Characters
    • Periods
    • Hyphens
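
As a sketch of how the four groups divide up an input, here is a variant that uses JavaScript's named capture groups. The group names (protocol, domain, tld, path) are labels invented for this example; the original regex uses plain numbered groups. The second input shows that group 2 is greedy, so with a multi-part ending it takes everything up to the last period:

```javascript
// Named-group variant of the same pattern; the names are illustrative.
const named = /^(?<protocol>https?:\/\/)?(?<domain>[\da-z\.\-]+)\.(?<tld>[a-z\.]{2,6})(?<path>[\/\w\.\-]*)?\/?$/;

const simple = named.exec('http://example.org/docs/index.html');
console.log(simple.groups.protocol);  // http://
console.log(simple.groups.domain);    // example
console.log(simple.groups.tld);       // org
console.log(simple.groups.path);      // /docs/index.html

const nested = named.exec('www.example.co.uk');
console.log(nested.groups.domain);    // www.example.co
console.log(nested.groups.tld);       // uk
```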

Character Classes

The regex uses a few character classes:

  • \d - Digits (0 through 9)
  • a-z - Lowercase letters; inside brackets, the hyphen defines a range from a to z
  • \w - Word characters, includes the following:
    • Numbers
    • Letters (both lowercase and uppercase)
    • Underscores
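
One consequence of these classes is that the host part only accepts lowercase letters, while the path accepts uppercase ones because \w covers both cases. The sketch below shows this, along with one possible way to relax it using the i (case-insensitive) flag. The flag is an addition made for this example and is not part of the original regex:

```javascript
const strict = /^(https?:\/\/)?([\da-z\.\-]+)\.([a-z\.]{2,6})([\/\w\.\-]*)?\/?$/;
// Same source pattern, rebuilt with the case-insensitive flag.
const relaxed = new RegExp(strict.source, 'i');

console.log(strict.test('https://github.com/Shengaero'));  // true, \w allows the capital S
console.log(strict.test('https://GitHub.com'));            // false, a-z rejects G and H
console.log(relaxed.test('https://GitHub.com'));           // true
```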

Character Escapes

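The regex also has several character escapes:

  • \/ - escape for a slash, which would otherwise end the regex literal
  • \. - escape for specifying a period character, which outside of brackets would otherwise match any character
  • \- - escape for specifying a hyphen character, which inside brackets could otherwise define a range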
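
A short sketch of why the escapes matter. The patterns here are simplified ones written for illustration. Note that inside brackets a period is already literal, so the backslash in [\.] is harmless but not strictly required:

```javascript
// Escaped period: matches only a literal "."
console.log(/^github\.com$/.test('github.com'));  // true
console.log(/^github\.com$/.test('githubxcom'));  // false

// Unescaped period: matches any single character
console.log(/^github.com$/.test('githubxcom'));   // true

// Inside brackets the period is literal with or without the backslash
console.log(/^[.]$/.test('x'));                   // false
console.log(/^[\.]$/.test('.'));                  // true

// Slashes need escaping in a regex literal, but not in a RegExp constructor string
const viaConstructor = new RegExp('^https?://$');
console.log(viaConstructor.test('https://'));     // true
```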

Author

Kaidan Gustave is a Java/JavaScript/Kotlin developer making web applications for browsers. Feel free to check him out on GitHub.
