sudkumar/regex.md

## regex.md

      
Regular Expression

Regular expressions are used to match a string or a set of strings.
General format

/matching_pattern/flags


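matching_pattern: the pattern to match against a string
flags: options that control how the match is performed

g: global, find every match instead of stopping at the first one
i: case-insensitive matching
m: multiline matching, ^ and $ also match at the start and end of each line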
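
As a quick illustration of the three flags, here is a minimal JavaScript sketch. The sample text is made up for this example; only the flags themselves come from the list above.

```javascript
// Hypothetical two-line sample text, chosen only to show the flags.
const text = "Sudhir\nsudhir";

// No flags: case-sensitive, so only the lowercase "sudhir" on line 2 matches.
const plain = text.match(/sudhir/);

// g + i: every occurrence, ignoring case.
const all = text.match(/sudhir/gi);

// m (together with g and i): ^ also matches at the start of each line.
const lineStarts = text.match(/^s/gim);

console.log(plain[0], plain.index); // sudhir 7
console.log(all);                   // ["Sudhir", "sudhir"]
console.log(lineStarts);            // ["S", "s"]
```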


Let's learn with some examples.
Simple expressions

/sudhir/

This matches the string sudhir.
/s.s/g

This matches sas, s2s, etc. The "." means: match any single character.
Characters such as "." are called meta characters; most of the time they have a special meaning rather than matching themselves.
Meta Characters

. 

Matches any character except a newline, e.g.: /1.5/g matches 105, 1.5, 1e5, etc.
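
A small sketch of the dot in JavaScript. The list of inputs is an assumption made for this example, and the g flag is left off because test() only needs to know whether a match exists.

```javascript
// "." stands for exactly one character other than a newline.
const dot = /1.5/;

// Hypothetical inputs: the last two do not match.
// "15" has nothing between 1 and 5, and "1\n5" has a newline there.
const dotInputs = ["105", "1.5", "1e5", "15", "1\n5"];
const dotResults = dotInputs.map((s) => dot.test(s));

console.log(dotResults); // [true, true, true, false, false]
```
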
* 

Zero or more occurrences of the preceding item, e.g.: /12*5/ matches 15, 125, 1225, etc.
+

One or more occurrences of the preceding item, e.g.: /12+5/ matches 125, 1225, 1222225, etc., but not 15.
?

Zero or one occurrence of the preceding item, e.g.: /12?5/ matches only 15 and 125, not 1225.
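
The three quantifiers above can be compared side by side. This is a minimal sketch; the patterns are wrapped in ^ and $ (explained further down) so that the whole input has to match, which is an addition made only for this example.

```javascript
const quantInputs = ["15", "125", "1225"];

// * : zero or more 2s between 1 and 5
const star = quantInputs.filter((s) => /^12*5$/.test(s));
// + : one or more 2s
const plus = quantInputs.filter((s) => /^12+5$/.test(s));
// ? : zero or one 2
const optional = quantInputs.filter((s) => /^12?5$/.test(s));

console.log(star);     // ["15", "125", "1225"]
console.log(plus);     // ["125", "1225"]
console.log(optional); // ["15", "125"]
```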

Quantifiers in regular expressions are greedy: they grab as much of the string as they can while still allowing the overall pattern to match. Sometimes that is useful and sometimes it is not. To make a quantifier non-greedy (lazy), add ? after it. For example:
In the input <p>Matched</p>, /<.*>/ matches the whole string <p>Matched</p>, but /<.*?>/ matches only <p>.
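
The difference between greedy and lazy matching is easy to see in code. A minimal sketch, using the same input as above:

```javascript
const html = "<p>Matched</p>";

// Greedy: .* runs to the last ">" in the input.
const greedy = html.match(/<.*>/)[0];
// Lazy: .*? stops at the first ">" it can.
const lazy = html.match(/<.*?>/)[0];
// With the g flag, the lazy pattern finds each tag separately.
const lazyAll = html.match(/<.*?>/g);

console.log(greedy);  // <p>Matched</p>
console.log(lazy);    // <p>
console.log(lazyAll); // ["<p>", "</p>"]
```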

[]

Matches one character from a set of allowed characters, e.g.: /1[abcd]2/ matches 1a2, 1b2, 1c2, 1d2 but not 12, 1e2, 132, etc.
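
A short sketch of a character set in JavaScript. The range form [a-d] is not mentioned above; it is included here as an equivalent shorthand for [abcd].

```javascript
const setInputs = ["1a2", "1d2", "12", "1e2", "132"];

// Exactly one character from the set must sit between 1 and 2.
const listed = setInputs.map((s) => /1[abcd]2/.test(s));
// A range is shorthand for listing the characters one by one.
const ranged = setInputs.map((s) => /1[a-d]2/.test(s));

console.log(listed); // [true, true, false, false, false]
console.log(ranged); // [true, true, false, false, false]
```
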
{}

Matches a specific number of repetitions of the preceding pattern, e.g.: /2{2}/g matches 22 but not a single 2; in 222 it matches only the first two characters.

a{n} : a exactly n times
a{n,} : a at least n times
a{n,m}: a at least n and at most m times
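
The three forms can be tried out like this. A minimal sketch: the inputs are made up, and ^ and $ (explained further down) are added so the whole input has to match.

```javascript
// Unanchored: in "222" only the first two characters are matched.
const pairs = "2 22 222".match(/2{2}/g);

const repeatInputs = ["a", "aa", "aaa", "aaaa"];
const exactly2 = repeatInputs.filter((s) => /^a{2}$/.test(s));
const atLeast2 = repeatInputs.filter((s) => /^a{2,}$/.test(s));
const from2To3 = repeatInputs.filter((s) => /^a{2,3}$/.test(s));

console.log(pairs);    // ["22", "22"]
console.log(exactly2); // ["aa"]
console.log(atLeast2); // ["aa", "aaa", "aaaa"]
console.log(from2To3); // ["aa", "aaa"]
```
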
()


Groups a part of the pattern, e.g.: /(ab)/ matches ab. The main use of "()" is to capture the sub-match: the regular expression saves it, and it can be referenced as \N, where N is the number of the group.

Sometimes we don't want a group to be saved, since capturing needs some memory and is a bit slower. For that we can use "?:" to turn off capturing, e.g.: /(?:ab)/ matches ab, but the sub-match is not saved. These are called non-capturing groups.
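
Here is a minimal sketch of the difference, as seen from JavaScript. In the array returned by match(), index 0 is the whole match and the following entries are the captured groups.

```javascript
// Capturing group: the sub-match is saved as group 1.
const capturing = "ab".match(/(ab)/);
// Non-capturing group: same match, but nothing is saved.
const nonCapturing = "ab".match(/(?:ab)/);

console.log(capturing[0], capturing[1]); // ab ab
console.log(capturing.length);           // 2
console.log(nonCapturing.length);        // 1

// \1 refers back to whatever group 1 matched.
const repeated = /(ab)\1/;
console.log(repeated.test("abab")); // true
console.log(repeated.test("abba")); // false
```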

^

Represents the start of the string (an anchor) or, as the first character inside [], negates the character set. e.g.:
/^sudh/g matches sudh but not asudh.
/[^abcd]/g matches k, m, 0, 1 but not a, b, c, d.
$

Represents the end of the string. Also called an anchor. e.g.:
/sudh$/g matches sudh and asudh but not sudha.
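
Both anchors, plus the negated set, in a short JavaScript sketch. The input "k0ab" is made up for this example.

```javascript
// ^ : the match must begin at the start of the string.
const startsOk = /^sudh/.test("sudh");   // true
const startsBad = /^sudh/.test("asudh"); // false

// $ : the match must finish at the end of the string.
const endsOk = /sudh$/.test("asudh");    // true
const endsBad = /sudh$/.test("sudha");   // false

// ^ inside [] negates the set: any character except a, b, c, d.
const notAbcd = "k0ab".match(/[^abcd]/g);

console.log(startsOk, startsBad, endsOk, endsBad); // true false true false
console.log(notAbcd);                              // ["k", "0"]
```
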
|

Means "or", e.g.: /(ab|ba)/g matches ab or ba.
Assertions

Assertions match a position in the string rather than characters, so the match has zero length.
\b

Word boundary. Matches the position between a word character and a non-word character (the start and end of the string count as non-word). e.g.:
With /\b/g and the input ad, it matches at both ends. With the input a$b, it matches at the start, at the end, between a and $, and between $ and b. With the input a$, it matches at the start and after a, but not at the end, because $ and the end of the string are both non-word.
/\bfoo\b/g matches foo in "boo foo bar" but not in "boofoobar".
\B

Non-word boundary, the opposite of \b. Matches between two word characters or between two non-word characters. e.g.:
With /\B/ and the input "a" there is no match, but with the input "ab" it matches between a and b.
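
Since \b and \B match positions, not characters, one way to see them is to insert a marker at every matched position. This is a minimal sketch; the "|" marker is just a choice made for this example.

```javascript
// Put "|" at every position where the assertion matches.
const mark = (input, re) => input.replace(re, "|");

console.log(mark("ad", /\b/g));  // |ad|
console.log(mark("a$b", /\b/g)); // |a|$|b|
console.log(mark("a$", /\b/g));  // |a|$   (no boundary after "$")
console.log(mark("ab", /\B/g));  // a|b
console.log(mark("a", /\B/g));   // a      (no match, nothing inserted)

// Whole-word search with \b on both sides.
const inWords = "boo foo bar".match(/\bfoo\b/g);
const inside = "boofoobar".match(/\bfoo\b/g);
console.log(inWords); // ["foo"]
console.log(inside);  // null
```
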
(?=a)

Must be followed by a. This is called a lookahead. e.g.:
/a(?=b)/g matches the a in ab; the b (the lookahead) must be there but is not part of the match.
(?!a) 

Must not be followed by a (negative lookahead). e.g.:
/b(?!a)/g matches the b in bc and bd but not in ba, because there b is followed by a.
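
Both kinds of lookahead in a minimal JavaScript sketch. The inputs are made up; note that the matched text never includes the character that was looked at.

```javascript
// Positive lookahead: only the "a" that is followed by "b" matches.
const followedByB = "ab ac".match(/a(?=b)/g);

// Negative lookahead: only the "b"s not followed by "a" match.
const notFollowedByA = "ba bc bd".match(/b(?!a)/g);

console.log(followedByB);    // ["a"]
console.log(notFollowedByA); // ["b", "b"]
```
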
Some predefined classes of patterns


\w : [a-zA-Z0-9_]
\d : [0-9]
\s : whitespace such as [\t\r\n ]
\W : other than \w : [^\w]
\D : other than \d : [^\d]
\S : other than \s : [^\s]
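
To see what each class picks out, the classes can be run against one sample string. A minimal sketch; the sample is made up and ends with a tab character.

```javascript
// Hypothetical sample: letters, underscore, digits, a space, a dash, a tab.
const sample = "a_1 b-2\t";

const word = sample.match(/\w/g);
const digit = sample.match(/\d/g);
const space = sample.match(/\s/g);
const nonWord = sample.match(/\W/g);
const nonDigit = sample.match(/\D/g).join("");
const nonSpace = sample.match(/\S/g).join("");

console.log(word);     // ["a", "_", "1", "b", "2"]
console.log(digit);    // ["1", "2"]
console.log(space);    // [" ", "\t"]
console.log(nonWord);  // [" ", "-", "\t"]
console.log(nonSpace); // a_1b-2
```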

Back referencing

What if we want to know what an earlier group matched? e.g.:
/('|").+?('|")/g matches "He is my friend" completely, but "This doesn't belong to me." only partially, because the match stops at the first '. What we need is to know which quote character opened the match.
In /('|").+?\1/g, \1 stands for whatever the first group matched at matching time, in our case either ' or ".
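
The two patterns can be compared on the sentence from the text. A minimal sketch; the g flag is left off because only the first match is needed.

```javascript
// The sentence from above, including its surrounding double quotes.
const quoted = `"This doesn't belong to me."`;

// Any quote may close the match, so it stops at the apostrophe.
const loose = quoted.match(/('|").+?('|")/)[0];
// \1 forces the closing quote to be the same as the opening one.
const strict = quoted.match(/('|").+?\1/)[0];

console.log(loose);  // "This doesn'
console.log(strict); // "This doesn't belong to me."
```
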
Some more examples

Q. Match anything that doesn't contain a string, say "foo"
/^(?!.*foo).+$/


Explanation: Lookahead and lookbehind do not consume any characters. So the above pattern says: start at the beginning and look ahead to check whether .*foo can match from there. If it can, there is no match; if it can't, let ".+" match the whole string.
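
A quick way to check the pattern, sketched in JavaScript with a few made-up inputs. Note that the empty string is rejected too, because ".+" needs at least one character.

```javascript
const noFoo = /^(?!.*foo).+$/;

console.log(noFoo.test("bar"));     // true
console.log(noFoo.test("a foo b")); // false ("foo" is present)
console.log(noFoo.test("fo o"));    // true  (not the exact string "foo")
console.log(noFoo.test(""));        // false (".+" needs one character)
```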

Q. An 8+ character password with at least one uppercase letter, one digit and one of *, #
/^(?=.*[A-Z])(?=.*\d)(?=.*[*#]).{8,}$/


Explanation: As in the previous example, a lookahead does not consume the part it checks. So from the start, we look ahead for [A-Z], then for \d and then for [*#], to check that each is present somewhere in the password. Then, as nothing has been consumed yet, the entire password is passed to .{8,}, which checks that the length is at least 8 characters.
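
To see each rule fail on its own, the pattern can be run against a few made-up passwords. This is a minimal sketch; the helper name isValidPassword is hypothetical.

```javascript
const passwordPattern = /^(?=.*[A-Z])(?=.*\d)(?=.*[*#]).{8,}$/;

// Hypothetical helper wrapping the pattern from the question.
const isValidPassword = (candidate) => passwordPattern.test(candidate);

console.log(isValidPassword("Passw0rd#")); // true
console.log(isValidPassword("passw0rd#")); // false (no uppercase letter)
console.log(isValidPassword("Password#")); // false (no digit)
console.log(isValidPassword("Passw0rdX")); // false (no * or #)
console.log(isValidPassword("Pa0#"));      // false (shorter than 8)
```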