@yinonc
Last active January 21, 2023 16:29
Regular Expressions - Overview

Regex:

  1. Introduction
  2. Modes
  3. Literal characters
  4. Metacharacters
  5. Backreferences
  6. Special Characters
  7. Useful Expressions

Introduction

Regular expressions are symbols representing a text pattern. They are used for matching, searching and replacing text.
The goal in regular expressions is to match both what you want and only what you want!

Modes

  • Standard - /re/
  • Global - /re/g
  • Case-insensitive - /re/i
  • Multiline Anchors - /re/m
  • Dot-matches-all - /re/s

Modes are written after the closing / of the regular expression and can be combined. For example, using both global and case-insensitive modes: /re/gi
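A minimal sketch of the same modes, assuming Python's re module (an assumption, since the notation above is not tied to one language). Python has no trailing /g or /i: "global" corresponds to calling re.findall instead of re.search, and the other modes are passed as flags. The sample text is made up for illustration.

```python
import re

text = "Car\ncar\nCART"

# Standard (/re/): re.search returns only the first (leftmost) match.
standard = re.search(r"car", text).group()

# Global (/re/g): re.findall returns every non-overlapping match.
global_matches = re.findall(r"car", text)

# Global + case-insensitive (/re/gi): add the re.IGNORECASE flag.
global_ci = re.findall(r"car", text, flags=re.IGNORECASE)

# Multiline anchors (/re/m): ^ and $ also match at line boundaries.
multiline = re.findall(r"^car$", text, flags=re.MULTILINE)

# Dot-matches-all (/re/s): . also matches the newline character.
without_s = re.search(r"Car.car", text)
with_s = re.search(r"Car.car", text, flags=re.DOTALL)

print(standard)            # car
print(global_matches)      # ['car']
print(global_ci)           # ['Car', 'car', 'CAR']
print(multiline)           # ['car']
print(without_s)           # None
print(with_s is not None)  # True
```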

Literal characters

/car/ matches "car"
/car/ matches the first three letters of "carnival"

  • Case sensitive by default (best practice)
    For example: /car/ doesn't match anything in "Carnival"
  • Standard (non-global) matching - the earliest (leftmost) match is always preferred.
Example:

word: "pazzazz"
/zz/ - matches only the first "zz": pa[zz]azz
/zz/g - matches both occurrences of "zz": pa[zz]a[zz]
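The leftmost-match rule can be checked with a short sketch, again assuming Python's re module, where re.search plays the role of the standard mode and re.finditer the role of the global mode:

```python
import re

word = "pazzazz"

# Standard matching: only the leftmost "zz" is reported.
leftmost = re.search(r"zz", word)

# Global matching: every non-overlapping "zz" is reported.
all_spans = [m.span() for m in re.finditer(r"zz", word)]

print(leftmost.span())  # (2, 4)
print(all_spans)        # [(2, 4), (5, 7)]
```

The spans are zero-based (start, end) index pairs, so (2, 4) covers the third and fourth characters of the word.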

Metacharacters

There are only a few metacharacters to learn:
\ . * + - { } [ ] ^ $ | ? ( ) : ! =

  • . - Any character except newline
    Examples:
  • /h.t/ - matches "hot" , "hat" , "hit" but not "heat"
  • /.a.a.a/ - matches "banana" , "#aga!a" , " a asa"

A common mistake:
/9.00/ - matches "9.00", but also "9500" and "9-00", because the unescaped . matches any character
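A small sketch of this pitfall, assuming Python's re module. It compares the unescaped pattern with one where the dot is escaped by a backslash (the \ metacharacter described next). re.fullmatch is used so that each sample string must match as a whole; the sample list is illustrative.

```python
import re

prices = ["9.00", "9500", "9-00"]

# Unescaped dot: any character is accepted in the second position.
unescaped = [p for p in prices if re.fullmatch(r"9.00", p)]

# Escaped dot: only a literal "." is accepted.
escaped = [p for p in prices if re.fullmatch(r"9\.00", p)]

print(unescaped)  # ['9.00', '9500', '9-00']
print(escaped)    # ['9.00']

# re.escape builds the escaped pattern from a literal string.
print(re.escape("9.00") == r"9\.00")  # True
```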

  • \ - Escapes the next metacharacter, so it is treated as a literal character
    Note that literal characters shouldn't be escaped
    Examples:
  • /9\.00/ - matches "9.00" but not "9500" or "9-00"
  • /\/home\/usr\/doc\.txt/ - matches "/home/usr/doc.txt"
  • [ and ] - Begin and end a character set, which matches exactly one of the characters listed inside it
    Order of characters does not matter
    Note: Most metacharacters don't need escaping inside a character set, where they are already treated as literals (except ], -, ^ and \)
    Examples:

    • /gr[ea]y/ - matches both "grey" and "gray"
    • /gr[ea]t/ - doesn't match "great"
    • /h[abc.xyz]t/ - matches "hat" and "h.t" - the . is already literal inside the set.
    • /var[[(][0-9][)\]]/ - matches "var(3)" and "var[4]"
    • /file[0\-\\_]1/ - matches "file01", "file-1", "file\1" and "file_1"
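The character-set examples above can be tried out with a sketch like the following, assuming Python's re module. The word lists are made up for illustration, and re.fullmatch is used so that each word must match from start to end.

```python
import re

# A set matches exactly one character, so "great" and "greay" fail.
gray = re.compile(r"gr[ea]y")
gray_hits = [w for w in ["grey", "gray", "great", "greay"] if gray.fullmatch(w)]

# Inside a set the dot is a literal dot, not "any character".
dot_in_set = re.compile(r"h[abc.xyz]t")
dot_hits = [w for w in ["hat", "h.t", "hot"] if dot_in_set.fullmatch(w)]

# The hyphen and the backslash still need escaping inside a set.
file_set = re.compile(r"file[0\-\\_]1")
names = ["file01", "file-1", "file\\1", "file_1", "fileA1"]
file_hits = [n for n in names if file_set.fullmatch(n)]

print(gray_hits)  # ['grey', 'gray']
print(dot_hits)   # ['hat', 'h.t']
print(file_hits)  # ['file01', 'file-1', 'file\\1', 'file_1']
```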

    Shorthand character sets:

    • \d - any digit (same as [0-9])
    • \w - word character (same as [a-zA-Z0-9_])
    • \s - whitespace (roughly the same as [ \t\r\n]; most flavors also include form feed and vertical tab)
    • \D - not a digit (same as [^0-9])
    • \W - not a word character (same as [^a-zA-Z0-9_])
    • \S - not whitespace (roughly the same as [^ \t\r\n])
  • - - Range of characters - represents all characters between two characters
    Only inside a character set
    Examples:

    • /[0-9]/ - matches any digit
    • /[A-Za-z]/ - matches any letter
    • /[a-ek-ou-y]/ - matches any letter in the ranges a-e, k-o or u-y

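Caution:
/[50-99]/ - does not match the numbers from 50 to 99. It is a set containing "5", the range 0-9 and "9", so it matches any single digit, the same as /[0-9]/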
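A short sketch of this caution, assuming Python's re module. The second pattern, [5-9][0-9], is not from the original notes; it is one possible way to match the two-digit numbers 50 to 99, using one set per digit.

```python
import re

# [50-99] is a single-character set: '5', the range '0'-'9', and '9'.
misleading = re.compile(r"[50-99]")
single_digit = bool(misleading.fullmatch("7"))
two_digits = bool(misleading.fullmatch("75"))

# Hypothetical alternative: first digit 5-9, second digit 0-9.
fifty_to_99 = re.compile(r"[5-9][0-9]")
matched = [n for n in range(0, 120) if fifty_to_99.fullmatch(str(n))]

print(single_digit)                     # True
print(two_digits)                       # False
print(matched == list(range(50, 100)))  # True
```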

  • ^ - Negates a character set when placed as the first character inside it
    A negated set still matches exactly one character
    Examples:
    • /see[^mn]/ - matches "seek" and "sees" but not "seem" or "seen"

Caution:
/see[^mn]/ - matches "see " (with a trailing space) but not "see", because the negated set must still match one character
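The behaviour of the negated set can be confirmed with a sketch along these lines, assuming Python's re module and an illustrative list of candidates:

```python
import re

pattern = re.compile(r"see[^mn]")
candidates = ["seek", "sees", "seem", "seen", "see ", "see"]

# [^mn] must consume one character that is neither "m" nor "n",
# so the bare word "see" has nothing left for it to match.
matches = [c for c in candidates if pattern.search(c)]

print(matches)  # ['seek', 'sees', 'see ']
```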

  • * - Preceding item zero or more times
    Examples:

    • /apples*/ - matches "apple", "apples" and "applessss"
    • /\d\d\d\d*/ - matches numbers with three digits or more
  • + - Preceding item one or more times
    Examples:

    • /apples+/ - matches "apples" and "applessss", but not "apple"
    • /<[^>]+>/ - matches any HTML tag
  • ? - Preceding item zero or one time
    Examples:

    • /apples?/ - matches "apple" and "apples"; in "applessss" it matches only the leading "apples"
    • /colou?r/ - matches "color" and "colour"
  • { and } - Begin and end a quantified repetition of the preceding item
    Takes the form {min,max}, where both are non-negative integers. Min must always be given (it can be zero). Max is optional.
    Examples:

    • /\d{4,8}/ - matches numbers with four to eight digits
    • /\d{4}/ - matches numbers exactly four digits
    • /\d{4,}/ - matches numbers with four or more digits (max is infinite)
  • ( and ) - Group part of an expression
    Grouping makes expressions easier to read and lets a quantifier apply to the whole group. Groups cannot be used inside a character set.
    Examples:

    • /(abc)+/ - matches "abc" and "abcabcabc"
    • /(in)?dependent/ - matches "independent" and "dependent"
    • /run(s)?/ - is the same as /runs?/
  • | - Match previous or next expression
    Examples:

    • /apple|orange/ - matches "apple" and "orange"
    • /w(ei|ie)rd/ - matches "weird" and "wierd"
    • /(AA|BB|CC){6}/ - matches "AABBAACCAABB" and more..
    • /(\d\d|[A-Z][A-Z]){3}/ - matches "112233", "AA66ZZ", "11AA44" and more..
  • Anchor metacharacters:
    Anchors refer to a position, not an actual character. They are zero-width.

    ^: Start of string / line. (Not the same as ^ at the start of a character set)
    $: End of string / line
    Examples:

    • /^apple/ - matches "apple" only at the beginning of a string/line
    • /apple$/ - matches "apple" only at the end of a string/line
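A sketch of the quantifiers and anchors above, assuming Python's re module. re.fullmatch is used for the quantifier checks so that the whole string has to fit the pattern, and re.MULTILINE stands in for the /m mode. The sample strings are made up for illustration.

```python
import re

# {4,8}: between four and eight digits, inclusive.
four_to_eight = re.compile(r"\d{4,8}")
accepted_lengths = [n for n in range(1, 11) if four_to_eight.fullmatch("7" * n)]

# ?: the "u" is optional, but at most one is allowed.
colour = re.compile(r"colou?r")
spellings = [w for w in ["color", "colour", "colouur"] if colour.fullmatch(w)]

text = "green apple\napple pie"

# Without multiline mode, ^ and $ only anchor to the whole string.
start_plain = re.search(r"^apple", text)
end_plain = re.search(r"apple$", text)

# With multiline mode, they also anchor to each line.
start_multi = re.search(r"^apple", text, flags=re.MULTILINE)
end_multi = re.search(r"apple$", text, flags=re.MULTILINE)

print(accepted_lengths)                      # [4, 5, 6, 7, 8]
print(spellings)                             # ['color', 'colour']
print(start_plain, end_plain)                # None None
print(start_multi.start(), end_multi.start())  # 12 6
```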

Backreferences:

Parentheses store (capture) the portion of the text they match.
/a(p{2}l)e/ matches "apple" and stores "ppl". This happens automatically by default.
Refer to the first captured group with \1.
\1 through \9 - backreferences to groups 1 to 9.
Usage:

  • Can be used in the same expression as the group.
  • Can be accessed after the match is complete (programming language needed).
    Examples:
  • /(apples) to \1/ - matches "apples to apples"
  • /(ab)(cd)(ef)\3\2\1/ - matches "abcdefefcdab"
  • /<(i|em)>.+?<\/\1>/ - matches "<i>Hello</i>" and "<em>Hello</em>"
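Both usages (inside the expression, and after the match is complete) can be sketched as follows, assuming Python's re module, where captured groups are read back with group() and groups(). In a Python raw string the / needs no escaping, so the closing tag is written as </\1>.

```python
import re

# Backreference inside the expression: \1 must repeat group 1 exactly.
same = re.fullmatch(r"(apples) to \1", "apples to apples")
different = re.fullmatch(r"(apples) to \1", "apples to oranges")

# Several groups referenced in reverse order.
mirror = re.fullmatch(r"(ab)(cd)(ef)\3\2\1", "abcdefefcdab")

# The closing tag must use the same name as the opening tag.
tag = re.compile(r"<(i|em)>.+?</\1>")
balanced = tag.search("<em>Hello</em>")
mismatched = tag.search("<i>Hello</em>")

print(same is not None)   # True
print(different)          # None
print(mirror.groups())    # ('ab', 'cd', 'ef')
print(balanced.group(1))  # em
print(mismatched)         # None
```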

Special Characters:

  • Spaces - space is a regular character
  • Tabs - tabs are matchable by \t
  • Line breaks - \r, \n or \r\n, depending on the file's line-ending convention
  • Non-printable characters:
    • bell \a
    • escape \e

Useful expressions:

  • Names:
    • /^\w+/ - Not a very good solution
    • /^[A-Z][a-z.']+ [A-Z][a-z.']+/ - Matches a first and last name
  • Email Addresses:
    • /^[\w.\-]+@[\w.\-]+\.[A-Za-z]{2,3}$/ - Matches an email address
  • URLs:
    • /^(http|https):\/\/[\w.\-]+(\.[\w\-]+)+[/#?]?.*$/
  • IPs:
    • /^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/m - It is long, but it ensures that no number is higher than 255.
  • HTML tags:
    • /<([^>]+)>(.*?)<\/\1>/
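A sketch of how the IP and email expressions above might be used for validation, assuming Python's re module. The helper names is_ipv4 and is_email are hypothetical, and the IP pattern is assembled from one octet sub-pattern repeated four times, which produces the same expression as the long form above.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (with optional leading digits).
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
IPV4 = re.compile(r"^" + r"\.".join([OCTET] * 4) + r"$")
EMAIL = re.compile(r"^[\w.\-]+@[\w.\-]+\.[A-Za-z]{2,3}$")


def is_ipv4(value: str) -> bool:
    """Hypothetical helper: True if value looks like a dotted IPv4 address."""
    return IPV4.match(value) is not None


def is_email(value: str) -> bool:
    """Hypothetical helper: True if value fits the email pattern above."""
    return EMAIL.match(value) is not None


print(is_ipv4("192.168.0.1"))            # True
print(is_ipv4("256.1.1.1"))              # False
print(is_ipv4("1.2.3"))                  # False
print(is_email("user.name@example.com")) # True
print(is_email("user@example"))          # False
print(is_email("user@example.info"))     # False
```

The last line shows a limitation of the {2,3} quantifier in the email pattern: top-level domains longer than three letters are rejected.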