hkolbeck/gist:6966069

## gistfile1.rst

      
    Raw
  

              gistfile1.rst
            
          
    Regular expressions are often very intimidating, in part because they are so information-dense and because a symbol's meaning can depend entirely on context. Unlike many things in computer science, regular expressions actually get easier (at least in my opinion) when you go back to the backing theory.
Regular expressions (regexes for short) are like equations for text. Specificaly they describe some number of similar pieces of text such that if we have some text and a regular expression we can ask "Does this text contain the pattern described by this regex?" We'll assume the text we want to examine is some basic english, which strangely has only the letters 'a', 'b', and 'c'. I'll surround regular expressions with forward slashes (/), and text with double quotes (") To start of with, we'll have only a few tools for doing that:
The bare minimum


empty - ε
Just like 0 is super important in math, we need some way of specifying "nothing", the absence of text. It's common to use a lower case epsilon or lambda for this. We'll see why we need this later.


concatenation :
There's not really a symbol for this, it just says that if we have two regular expressions, we can stick them together to make a new one.


literals - a,b,c
The most basic building block are letters, more generally called 'literals', they mean themselves! /abc/ is a regular expression, albeit a boring one. It matches exactly one piece of text, namely "abc". Does it match "aabc"? For now, we'll say 'no', but many regex engines would say 'yes'. We'll learn more about why when we talk about "anchors" later.


alternation - |
If we're going to match more than exact pieces of text, we need a way of saying "this OR that", alternation does that. The regex /a|b/ matches exactly two strings, "a" and "b". We can also chain them together, so /a|b|c/ matches "a", "b", or "c".


grouping - ():
What if we wanted to match a pattern like "either two 'a's in a row or three 'b's in a row". Just like in algebra, we use parenthesis to group things together. So /(aa)|(bbb)/ will match either "aa" or "bbb".


kleene star - *:
The last piece we need is some way of specifying repitition. If we wanted to say "a piece of text that's entirely the letter 'a', or empty" we could say something like /ε|a|(aa)|(aaa)|(aaaa)|(aaaaa)|(aaaaaa)/ and so on, but that's hard to read, and since we can only go on so long, we'll fail to recognize text that's longer than we put in our pattern. Instead, we can write /a*/. * attaches to the thing to its left, which might be a literal, but could also be a group, and specifies that that thing could appear 0 or more times


Try to write patterns that match:

Just the strings "abc", "bac", and "cab"
Any string of 1 or more 'a's. How will it be different from 0 or more?
Any string that starts with an 'a'


That's it! That's all we need to make any regular expression! Everything else is just syntactic sugar to make things a bit easier. Ok, a lot easier. How would you write a pattern that matched any possible text? In our simple world of just 'a', 'b', and 'c' it's not too long to write: /(a|b|c)*/. Take a moment to think about why that matches anything. Try to comeup with something it wouldn't match. Alas, most text isn't just 'a's, 'b's and 'c's, there are often more complicated letters like 'q's and 'y's and non-letters like '7' and ','. Suddenly our "Match any string" pattern gets a lot more complicated. /(a|b|c|d|...|A|B|C|D|...|1|2|3|4|...)*/ Not exactly practical to write out all the time, so most regex engines let you use some helpers. They're more symbols to remember, but they're all defined in terms of the operations we've already seen.
Extensions


any character - .
Instead of writing out the whole (a|b|c|...) mess, we can use . to mean "any single character", so we can match any string by saying /.*/, or use /a.b/ to match any of "aab", "abb", or "acb".


one or more - +
Quite often we want one or more of a thing, rather than zero or more. The standard way to answer question two in the last bit would be /aa*/ (one 'a', followed by zero or more), but instead we can just write /a+/. This works just like the '*' in that it attaches to the thing, whether that's a literal or a group, before it.


zero or one - ?
? attaches to literals or groups just like * or+, but says that a thing appears exactly zero or one time. If we wanted to match "An 'a', possibly followed by exactly one 'b'" We'd have to write /a(b|ε)/ before, but now we can write /ab?/.


character classes - []
Sometimes we want more than just a single literal, but less than any literal. Character classes give us a shorthand for writing that out. /[ab]/ is equivalent to /(a|b)/ will match either "a" or "b". It's also sometimes useful to say "anything but", so we can say /[^c]/ to say "anything but c". Finally, since it'd get tiresome to write out [0123456789] any time we wanted to say "a decimal digit", we can write a range class: [0-9] for the same effect.


escaping - \
What if we needed write a pattern to match things that contain '' or '?' or any of the other symbols we've assigned special meaning? In order to do that, we need to temporarily remove their special meaning. The usual way to do this is to prefix them with a backslash. ``/.?/will match any sequence of letters that ends with a '?'. If you wanted to match a literal '\\', you'd write/\/. Escapes are also used to specify single characters that might otherwise be heard to write or read, for instancetmeans a tab character,x03B5`` is the hex code for 'ε'.


predefined character classes
For bonus confusion, sometimes a \ adds special meaning instead of taking it away. Remember how we saved some time by saying [0-9]? It turns out that's really common, so most engines let us write \d for "any decimal digit". Some other common ones are \s for any whitespace character, and \w for any word character, which for historical reasons means any uppercase or lowercase letter, digit, or underscore ('_'). Usually the capitalized version (\S instead of \s) means the complement or opposite, so \S means "Anything but whitespace".


Try to write patterns for:

Any string with even length. Hint: /../ matches any two character string.
Any string of odd length.
One or more repetitions of "ab"
How would you write a pattern equivalent to /abb?b*/ that's a bit clearer?


Finally, many regex engines provide a few extra niceties:


anchors - ^ and $
It turns out that most of the time we're looking for some small chunk of a larger piece of text, instead of trying to match the whole thing. To make that more convenient, most engines you'll run into (at least in Ruby, Python, Perl, or Go, but not Java) will treat a pattern like /cab/ as if you'd written /.*cab.*/. To provide the behavior we've seen before, they provide "Anchors". Anchors are special characters that don't match any symbols in the text, but instead mean the beginning '^' or end '$' of a piece of text (usually a line). Using anchors, we can go back to the exact patterns we had before, like /^cab$/, which will match only the string "cab" again. From now on we'll assume that if we want our pattern to be anchored, we need to say so.


How would you use an anchor to answer question 3?


counts - {}
Often we want something other than the 0 or more of *, 1 or more of +, or 0 or 1 of ?. Many regex engines let us make our own using squiggly brackets. /^a{5}$/ will match "aaaaa" instead of writing out /^aaaaa$/. We can also get a bit fancier. /^a{4,10}$/ will match any string of 'a's that's at least 4 long, but not more than 10. If we just wanted to say "Longer than 4" with no upper bound, we could write /a{4,}/`.


capture groups - ()
Often we want to extract some small chunk which matches a pattern out of a larger text. Capture groups are a way of saying "save the text that matched this part of the pattern for later". They don't affect whether a text matches, but they're very useful. If we wanted to search through a text for US phone numbers, saving the area code, we might do something like /1-(\d{3})-\d{3}-\d{4}/. The parentheses around the area code portion form a capture group. After we successfully match our pattern against a piece of text, we can examine the text that matched just that portion. The accepted way of doing that varies by language.


backreferences
The "regular" in "regular expression" means something very specific to computer scientists and linguists. It turns out that by using backreferences, we're no longer actually dealing with "regular" expressions, but many engines include them because they're very handy. Backreferences let you refer to earlier capture groups while you're still in the pattern. That's useful if we want to ask something like "Does this text contain any adjacent repeated words?" Text like "A dog jumped over the the cat". Backreferences let us answer using patterns like / (\w+) \1 /. \1 is a backreference to whatever text matched the \w+, the spaces are the to make sure we don't pull off chunks of words.


Using Regular Expressions

Regular expressions are usually used for one of three purposes: Matching, Extraction, and Substitution.


Matching
The most basic use, just asking the yes/no question "Does this text contain this pattern?"


Extraction
If our pattern contains capture groups, after matching we can extract the portions that matched for use elsewhere in our program.


Substitution
Instead of just finding matching portions of a larger text, replace them with some other string. Let's say we wanted to replace all the tabs in a document with four spaces. In perl, we'd write something like: text =~ s/\t/    /g, where the g stands for global, meaning change every match you find, not just the first one. In Ruby you'd do the same thing with text.gsub!(/\t/, "     ").


Questions? Suggestions for improvement? I'm @cbeckpdx on twitter or cbeck on freenode IRC.