yinonc/regex.md

## regex.md

      
    Regex:


Introduction
Modes
Literal characters
Metacharacters
Backreferences
Special Characters
Useful Expressions

Introduction

Regular expressions are symbols representing a text pattern. They are used for matching, searching and replacing text.

The goal in regular expressions is to match both what you want and only what you want!
Modes


Standard - /re/
Global - /re/g
Case-insensitive - /re/i
Multiline Anchors - /re/m
Dot-matches-all - /re/s

Modes are written after the closing / of the regular expression and can be combined.
For example, using both global and case insensitive modes: /re/gi
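
As a rough illustration of how these modes behave, here is a minimal sketch in Python, assuming its standard `re` module (Python is not part of the original notes). Python has no `/re/g` literal syntax, so the global mode is approximated with `re.findall` and the other modes are passed as flags. The sample text is made up for the example.

```python
import re

text = "Car\ncar\nCAR"

# Standard: re.search returns only the first (leftmost) match.
first = re.search(r"car", text).group()

# Global (/re/g): re.findall returns every non-overlapping match.
all_matches = re.findall(r"car", text)

# Global + case-insensitive (/re/gi).
any_case = re.findall(r"car", text, flags=re.IGNORECASE)

# Multiline anchors (/re/m): ^ and $ match at every line boundary.
whole_lines = re.findall(r"^car$", text, flags=re.MULTILINE | re.IGNORECASE)

# Dot-matches-all (/re/s): "." also matches the newline character.
with_dotall = re.search(r"Car.car", text, flags=re.DOTALL) is not None
without_dotall = re.search(r"Car.car", text) is not None
```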
Literal characters

/car/ matches "car"

/car/ matches the first three letters of "carnival"

Case sensitive by default (best practice)

For example: /car/ doesn't match anything in "Carnival"
Standard (non-global) matching - the earliest (leftmost) match is always preferred.

Example:

word: "pazzazz"

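/zz/ - matches only the first "zz" in "pazzazz" (characters 3-4)

/zz/g - matches both the first "zz" (characters 3-4) and the second "zz" (characters 6-7)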
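
The difference between standard and global matching on "pazzazz" can be sketched as follows, assuming Python's `re` module, where `re.search` stands in for a non-global match and `re.finditer` for a global one. Spans are zero-based and end-exclusive.

```python
import re

word = "pazzazz"

# Standard matching: only the leftmost "zz" is reported.
first_span = re.search(r"zz", word).span()

# Global matching: every non-overlapping "zz" is reported, left to right.
all_spans = [m.span() for m in re.finditer(r"zz", word)]
```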
Metacharacters

There are only a few metacharacters to learn:

\ . * + - { } [ ] ^ & $ | ? ( ) : ! =

. - Any character except newline

Examples:


/h.t/ - matches "hot" , "hat" , "hit" but not "heat"
/.a.a.a/ - matches "banana" , "#aga!a" , " a asa"

A common mistake:

/9.00/ - matches "9.00", "9500" and "9-00"

\ - Escapes the next metacharacter, so it is treated as a literal character

Note that literal characters shouldn't be escaped

Examples:


/9\.00/ - matches "9.00" but not "9500" or "9-00"
/\/home\/usr\/doc\.txt/ - matches "/home/usr/doc.txt"
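
To make the effect of escaping concrete, here is a small sketch assuming Python's `re` module. `re.fullmatch` is used so that a pattern has to cover the whole sample, which mirrors how the examples above are written.

```python
import re

samples = ["9.00", "9500", "9-00"]

# Unescaped dot: "." stands for any character, so all three samples match.
unescaped = [s for s in samples if re.fullmatch(r"9.00", s)]

# Escaped dot: "\." matches only a literal period.
escaped = [s for s in samples if re.fullmatch(r"9\.00", s)]

# Python has no / delimiters, so the slashes need no escaping here;
# only the dot does.
path_ok = re.fullmatch(r"/home/usr/doc\.txt", "/home/usr/doc.txt") is not None

# re.escape produces the escaped form of a literal string.
auto_escaped = re.escape("9.00")
```

In a language that delimits patterns with slashes, the `/` characters in the path would still need a backslash, as in the example above.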


[ ] - Define a character set (begin and end). The set matches exactly one character out of those listed

Order of characters does not matter

Note: Metacharacters shouldn't be escaped inside a character set - there they are already treated as literals (except ], -, ^ and \, which still need escaping)

Examples:

/gr[ea]y/ - matches both "grey" and "gray"
/gr[ea]t/ - doesn't match "great"
/h[abc.xyz]t/ - matches "hat" and "h.t" - inside the set the . is a literal dot.
/var[[(][0-9][)\]]/ - matches "var(3)" and "var(4)"
/file[0\-\\_]1/ - matches "file01", "file-1", "file\1" and "file_1"
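
The character-set examples above can be checked with a short sketch, assuming Python's `re` module. The helper `full_matches` is a hypothetical name introduced only for this illustration. One Python-specific detail: a set that starts with a literal `[` triggers a FutureWarning there, so the sketch writes `[\[(]` instead of `[[(]`.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# One character from the set, order irrelevant.
gray = full_matches(r"gr[ea]y", ["grey", "gray", "gruy"])

# The set stands for ONE character, so "ea" together does not fit.
great = full_matches(r"gr[ea]t", ["great", "gret", "grat"])

# Inside a set the dot is a literal dot, not "any character".
dot = full_matches(r"h[abc.xyz]t", ["hat", "h.t", "hot"])

# Brackets and parentheses as set members (the [ is escaped for Python).
var = full_matches(r"var[\[(][0-9][)\]]", ["var(3)", "var[4]", "var{5}"])

# "-" and "\" must be escaped inside the set.
files = full_matches(r"file[0\-\\_]1", ["file01", "file-1", "file\\1", "file_1", "filex1"])
```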

Shorthand character sets:

\d - digit (same as [0-9])
\w - word character (same as [a-zA-Z0-9_])
\s - whitespace (same as [ \t\r\n])
\D - not a digit (same as [^0-9])
\W - not a word character (same as [^a-zA-Z0-9_])
\S - not whitespace (same as [^ \t\r\n])
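
A quick sketch of the shorthand sets, assuming Python's `re` module. By default Python's `\d` and `\w` also accept non-ASCII digits and letters, so the `re.ASCII` flag is passed here to stay close to the equivalences listed above. The sample text is invented.

```python
import re

text = "Room 42, floor_3\tEast"

digits = re.findall(r"\d", text, flags=re.ASCII)      # single digits
words = re.findall(r"\w+", text, flags=re.ASCII)      # runs of word characters
spaces = re.findall(r"\s", text, flags=re.ASCII)      # whitespace characters
non_words = re.findall(r"\W", text, flags=re.ASCII)   # everything \w rejects
```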


- - Range of characters - represents all characters between two characters

Only inside a character set

Examples:

/[0-9]/ - matches any digit
/[A-Za-z]/ - matches any letter
/[a-ek-ou-y]/ - matches any letter in the ranges a-e, k-o or u-y


Caution:

/[50-99]/ - is not all numbers from 50 to 99. It is a set containing 5, the range 0-9 and 9, so it matches any single digit (same as [0-9])
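
The caution above can be verified with a short sketch, assuming Python's `re` module. The pattern `[5-9][0-9]` is one possible way to express the numbers 50 to 99 and is an addition of this example, not part of the original notes.

```python
import re

# [50-99] is a set of single characters, so every lone digit matches it.
single_digits = [d for d in "0123456789" if re.fullmatch(r"[50-99]", d)]

# A two-digit number is two characters, so the set alone cannot match it.
matches_75 = re.fullmatch(r"[50-99]", "75") is not None

# One way to express 50..99: a tens digit 5-9 followed by any digit.
in_range = [n for n in range(40, 102) if re.fullmatch(r"[5-9][0-9]", str(n))]
```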

^ - Negates a character set when placed as the first character of the set

Still represents only one character

Examples:

/see[^mn]/ - matches "seek" and "sees" but not "seem" or "seen"


Caution:

/see[^mn]/ - matches "see " but not "see", because the negated set must still match one character
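
A minimal sketch of the negated set, assuming Python's `re` module, shows both the examples and the caution in one list:

```python
import re

words = ["seek", "sees", "seem", "seen", "see ", "see"]

# [^mn] needs one character that is neither "m" nor "n".
# A trailing space qualifies; a missing character does not.
matched = [w for w in words if re.fullmatch(r"see[^mn]", w)]
```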


* - Preceding item zero or more times

Examples:

/apples*/ - matches "apple", "apples" and "applessss"
/\d\d\d\d*/ - matches numbers with three digits or more


+ - Preceding item one or more times

Examples:

/apples+/ - matches "apples" and "applessss", but not "apple"
/<[^>]+>/ - matches any HTML tag


? - Preceding item zero or one time

Note that literal characters shouldn't be escaped

Examples:

/apples?/ - matches "apple", "apples" but not "applessss"
/colou?r/ - matches "color" and "colour"


{ } - Start and end of a quantified repetition of the preceding item

Takes {min,max}, both non-negative integers. Min must always be given (it can be zero). Max is optional.

Examples:

/\d{4,8}/ - matches numbers with four to eight digits
/\d{4}/ - matches numbers with exactly four digits
/\d{4,}/ - matches numbers with four or more digits (max is infinite)
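
The four quantifiers (`*`, `+`, `?` and `{min,max}`) can be compared side by side in a short sketch, assuming Python's `re` module. The helper `full_matches` is a hypothetical name used only here; it requires the pattern to cover the whole candidate.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

fruit = ["apple", "apples", "applessss"]
star = full_matches(r"apples*", fruit)      # zero or more "s"
plus = full_matches(r"apples+", fruit)      # one or more "s"
optional = full_matches(r"apples?", fruit)  # zero or one "s"

numbers = ["123", "1234", "12345678", "123456789"]
four_to_eight = full_matches(r"\d{4,8}", numbers)
exactly_four = full_matches(r"\d{4}", numbers)
four_or_more = full_matches(r"\d{4,}", numbers)
```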


( ) - Group part of an expression

Groups let a quantifier or alternation apply to several characters at once and make expressions easier to read. Cannot be used inside a character set.

Examples:

/(abc)+/ - matches "abc" and "abcabcabc"
/(in)?dependent/ - matches "independent" and "dependent"
/run(s)?/ - is the same as /runs?/


| - Match previous or next expression

Examples:

/apple|orange/ - matches "apple" and "orange"
/w(ei|ie)rd/ - matches "weird" and "wierd"
/(AA|BB|CC){6}/ - matches "AABBAACCAABB" and more..
/(\d\d|[A-Z][A-Z]){3}/ - matches "112233", "AA66ZZ", "11AA44" and more..
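
Grouping and alternation interact, and a short sketch (assuming Python's `re` module, with made-up sample words) shows why the parentheses matter: without them, `|` splits the entire pattern.

```python
import re

# Alternation between two whole words.
fruit = re.findall(r"apple|orange", "apple pie and orange juice")

# Parentheses limit the alternation to the middle of the word.
spellings = ["weird", "wierd", "wyrd"]
grouped = [w for w in spellings if re.fullmatch(r"w(ei|ie)rd", w)]

# Without the group, | splits the whole pattern into "wei" OR "ierd".
ungrouped = [w for w in spellings if re.fullmatch(r"wei|ierd", w)]

# A quantifier placed after a group repeats the whole group.
codes = ["112233", "AA66ZZ", "11AA4"]
triples = [c for c in codes if re.fullmatch(r"(\d\d|[A-Z][A-Z]){3}", c)]

# An optional group.
dep = [w for w in ["independent", "dependent"] if re.fullmatch(r"(in)?dependent", w)]
```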


Anchor metacharacters:

Anchors refer to a position, not an actual character. They are zero-width.
^: Start of string / line. (Not the same as ^ at the start of a character set)

$: End of string / line

Examples:

/^apple/ - matches "apple" only if it's at the beginning of a string/line
/apple$/ - matches "apple" only if it's at the end of a string/line
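
Whether `^` and `$` refer to the string or to each line depends on the multiline mode. The following sketch assumes Python's `re` module, where that mode is the `re.MULTILINE` flag; the three-line sample text is invented.

```python
import re

text = "apple pie\nred apple\napple"

# Without multiline mode, ^ is only the start and $ only the end of the string.
starts_default = len(re.findall(r"^apple", text))
ends_default = len(re.findall(r"apple$", text))

# With multiline mode, ^ and $ also match at the start and end of each line.
starts_multiline = len(re.findall(r"^apple", text, flags=re.MULTILINE))
ends_multiline = len(re.findall(r"apple$", text, flags=re.MULTILINE))

# Anchors are zero-width: the match itself contains only "apple".
anchored = re.search(r"^apple", text).group()
```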


Backreferences:

Parentheses store (capture) the portion of the text they match.

/a(p{2}l)e/ matches "apple" and stores "ppl". Capturing happens automatically by default.

Refer to first backreference with \1.

\1 through \9 - backreferences for positions 1 to 9.

Usage:

Can be used in the same expression as the group.
Can be accessed after the match is complete (programming language needed).

Examples:


/(apples) to \1/ - matches "apples to apples"
/(ab)(cd)(ef)\3\2\1/ - matches "abcdefefcdab"
/<(i|em)>.+?<\/\1>/ - matches "<i>Hello</i>" and "<em>Hello</em>"
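
Both usages of backreferences, inside the pattern and after the match, can be sketched with Python's `re` module (an assumption of this example). The sample HTML string is made up, and the slash in `</\1>` needs no escaping in Python because patterns are not slash-delimited there.

```python
import re

# Inside the expression: \1 must repeat exactly what group 1 matched.
same = re.fullmatch(r"(apples) to \1", "apples to apples") is not None
different = re.fullmatch(r"(apples) to \1", "apples to oranges") is not None
mirrored = re.fullmatch(r"(ab)(cd)(ef)\3\2\1", "abcdefefcdab") is not None

# The closing tag has to repeat the opening one, so "<i>...</em>" is skipped.
html = "<i>Hello</i> and <em>World</em> but not <i>mixed</em>"
tags = [m.group(0) for m in re.finditer(r"<(i|em)>.+?</\1>", html)]

# After the match: the captured text is available from the match object.
stored = re.search(r"a(p{2}l)e", "apple").group(1)

# In a replacement string, \1 and \2 refer to the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
```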

Special Characters:


Spaces - a space is a regular (literal) character
Tabs - tabs are matched by \t
Line breaks - \r, \n or \r\n **

** which one depends on the line-ending mode of your file
Non-printable characters:

bell \a
escape \e


Useful expressions:


Names:

/^\w+/ - Not a very good solution
/^[A-Z][a-z.']+ [A-Z][a-z.']+/ - Matches first name and last


Email addresses:

/^[\w.\-]+@[\w.\-]+\.[A-Za-z]{2,3}$/ - Matches email
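
A hedged sketch of the email expression in use, assuming Python's `re` module. The function name `looks_like_email` and the sample addresses are made up for illustration. `re.fullmatch` replaces the `^` and `$` anchors.

```python
import re

# Same expression as above, without the anchors (fullmatch implies them).
EMAIL = re.compile(r"[\w.\-]+@[\w.\-]+\.[A-Za-z]{2,3}")

def looks_like_email(candidate):
    """Return True if the whole candidate fits the pattern."""
    return EMAIL.fullmatch(candidate) is not None

results = {
    address: looks_like_email(address)
    for address in [
        "john.doe@example.com",
        "a-b@mail.co.uk",
        "user@example.info",   # rejected: the last part is limited to 2-3 letters
        "no-at-sign.com",
    ]
}
```

The third case shows a limitation of the `{2,3}` quantifier: addresses whose last domain part is longer than three letters are rejected by this pattern.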


URLs:

/^(http|https):\/\/[\w.\-]+(\.[\w\-]+)+[/#?]?.*$/


IPs:

/^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/m - It is long, but it ensures that none of the four numbers is higher than 255.
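
Since the same sub-expression is repeated four times, the pattern can be assembled from one piece. This is a minimal sketch assuming Python's `re` module; the names `OCTET`, `IPV4` and `is_ipv4` are hypothetical and not part of the original notes.

```python
import re

# One number from 0 to 255, exactly as in the expression above.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Four such numbers joined by literal dots.
IPV4 = re.compile(r"\.".join([OCTET] * 4))

def is_ipv4(candidate):
    """Return True if the whole candidate is a dotted quad of 0-255 values."""
    return IPV4.fullmatch(candidate) is not None

checks = {ip: is_ipv4(ip) for ip in ["192.168.1.1", "255.255.255.255", "0.0.0.0", "256.1.1.1", "1.2.3"]}
```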


HTML tags:

/<([^>]+)>(.*?)<\/\1>/
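
As a small sketch of what the two groups capture, assuming Python's `re` module (where the slash needs no escaping) and invented sample markup:

```python
import re

TAG = re.compile(r"<([^>]+)>(.*?)</\1>")

# Group 1 is the tag name, group 2 the text between the tags.
pairs = TAG.findall("<b>bold</b> and <i>italic</i>")

# Limitation: group 1 captures everything up to ">", attributes included,
# so the closing tag would have to repeat the attributes to match.
with_attribute = TAG.search('<a href="x">link</a>')
```

The second case suggests the expression is best suited to simple tags without attributes.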