@joker314
Last active September 9, 2017 13:47

Ok, so, you might know about JavaScript regular expressions. Well, here is a tutorial about them, but written by a 13-year-old, so it isn't actually any good!

Regular expressions go between / characters. Here is an example, /hi/.

Ok, now then. Let's learn how to match the string abc. Well, that's quite simple.

/abc/. Yay! So putting letters next to each other makes them match one after the other.

Ok, now, after the second / we can put a g to make it match globally, that is, find every abc in a string instead of stopping at the first one. So, in xyzabcghiabc, it can find both of them.

/abc/g.

Great! Eh?
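Here is a small sketch you can paste into a browser console or Node.js to see what the g flag changes. The sample string and the variable names are made up for illustration.

```javascript
// A made-up string that contains "abc" twice.
const sample = "xyzabcghiabc";

// Without the g flag, match() stops at the first occurrence
// and also tells us where it was found.
const firstOnly = sample.match(/abc/);
console.log(firstOnly[0], firstOnly.index);

// With the g flag, match() returns every occurrence.
const everyMatch = sample.match(/abc/g);
console.log(everyMatch);
```

So even without g the regular expression finds abc in the middle of the string; the flag only decides whether it keeps looking after the first hit.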

Character class

Basics

What if we want to match either an A, a B, or a C, but not all three one after the other? Well, if you put them into these square brackets ([]) then you create what's called a character class, which is a group of characters, where any of them could be matched! Great, right?

/[abc]/g, now that matches a, b, or c, one character at a time. Great, right?
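A quick sketch of that character class in action. The word being searched is just an example I picked.

```javascript
// Each match is a single character that is a, b, or c.
const picked = "cabbage".match(/[abc]/g);
console.log(picked);

// A string with none of those letters does not match at all.
const anyInXyz = /[abc]/.test("xyz");
console.log(anyInXyz);
```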

Shortcuts

They are great, aren't they?

Well, I want to match a single digit! This should be easy, we already know how to make a character class.

/[0123456789]/g. Done, right?

Well, yes, it works. But it's a bit long, isn't it?

I wish there was a way of saying "a number between 0 and 9". Well, it turns out there is! Yay!

/[0-9]/g. Wow, that's much shorter. What if I want to match a digit, or a decimal point? Well, we can do that! /[0-9.]/g.

Huh, that looks a bit weird? What happened? Well, remember that [0-9] means [0123456789] so [0-9.] means [0123456789.].

That makes sense.

Can I do that without using all the numbers? Let's say I have the character class [34567]. How can we shorten that?

Well, [3-7] is the answer! Yay!

What about letters, can we do the alphabet? Yes! [a-z] matches any lowercase letter. WOW!
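To check that the ranges really behave like the long versions, here is a rough sketch. The test strings and variable names are my own inventions.

```javascript
// [0-9.] is a digit or a decimal point.
const digitOrDot = "v3.14!".match(/[0-9.]/g);
console.log(digitOrDot);

// [3-7] picks out only the digits from 3 to 7.
const threeToSeven = "0123456789".match(/[3-7]/g).join("");
console.log(threeToSeven);

// [a-z] picks out lowercase letters and skips capitals and spaces.
const lowercase = "Hi There".match(/[a-z]/g).join("");
console.log(lowercase);
```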

Case Insensitive

So, now that we're going for letters, we might want to be able to not care about whether a letter is uppercase or lowercase.

The way we do that is by putting an i after the closing /. So, let's say we have /abc/g, which matches abc ONLY. If we do /abc/gi (or /abc/ig, the order doesn't matter), then we can match

  • abc (still)
  • abC
  • aBc
  • aBC
  • Abc
  • AbC
  • ABc
  • ABC

That's way more possibilities!
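A small sketch comparing the two flags side by side, using a made-up string with three of the spellings from the list:

```javascript
const mixedCase = "abc aBc ABC";

// Only the all-lowercase spelling matches without the i flag.
const strictHits = mixedCase.match(/abc/g);
console.log(strictHits);

// With i, every spelling matches. The order of the flags does not matter.
const relaxedHits = mixedCase.match(/abc/gi);
const relaxedHitsSwapped = mixedCase.match(/abc/ig);
console.log(relaxedHits, relaxedHitsSwapped);
```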

The Backslash \

Introduction

Never, ever, underestimate the backslash. What it does is give a special meaning to characters that don't have one, and take away the special meaning from characters that do.

Removing special meaning

Let's do a quick example! /abc\[/g matches "abc[". Usually, [ means the beginning of a character class, but not if you put a \ before it!

And \ itself has a special meaning, so if you want to match the string "abc\[" then you need to escape both the \ and the [.

So, we get, /abc\\\[/g. abc for the abc, \\ for the \ and \[ for the [.
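Here is a sketch of both escapes. One thing to watch out for, which is about JavaScript strings rather than regular expressions: inside a string literal a backslash also has to be written as \\, so the five-character text abc\[ is typed as "abc\\[". The variable names are just for illustration.

```javascript
// \[ matches a literal [ instead of starting a character class.
const matchesBracket = /abc\[/.test("abc[");
console.log(matchesBracket);

// The text we want to find: a, b, c, backslash, [ (five characters).
const target = "abc\\[";
console.log(target.length);

// \\ matches the backslash and \[ matches the bracket.
const matchesBackslashBracket = /abc\\\[/.test(target);
console.log(matchesBackslashBracket);
```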

Adding special meaning

So, we have already shortened our digit-matching code to [0-9]. Can we get shorter? As it turns out, if you put \d then the d gets some special meaning! It means "digit".

Let's try it out: /abc\d/ is the same as /abc[0-9]/. Isn't this great? I, personally, think this is.

Even cooler, if you make the d a capital letter, then it negates its meaning. So, for example, \d means digit, \D means NOT a digit. Here are the backslash sequences:

  • \b, a word boundary, that is, a point where a word-character (see \w about word-characters) sits next to something that isn't one: a space, punctuation, or the start or end of the string. Important: word boundaries are points of length zero where the change between word-characters and non-word-characters occurs, and they don't match any characters themselves!
  • \B, any point that isn't a word boundary
  • \c<capital letter>, a control character (for example, \cJ is the newline). It's complicated, and I wouldn't worry about it 😄. Note that this doesn't have a negative, and also that there are two characters after the backslash, which is unusual.
  • \d a digit, the same as [0-9]
  • \D anything that isn't a digit
  • \f form feed. This is a character. It doesn't have a negative.
  • \n is a newline character. It's what separates lines on most operating systems. Doesn't have a negative.
  • \r is a carriage return, it's a bit like the \n character.
  • \s is a space-character, and it includes the tab character, the space character, the newline character, the carriage return character, and many more.
  • \S is everything that isn't a space character
  • \t is the tab character, you know, the one that takes out about 4 spaces worth of gap.
  • \v something called a "vertical tab". I know, right?
  • \w, a word-character! This is the same as /[a-z0-9_]/i (or, /[a-zA-Z0-9_]/).
  • \W everything that isn't a word character
  • \<number goes here> we'll cover these later!
  • \0 is a NUL character, which you shouldn't need to worry about.
  • there are a couple more, but we will cover those later.
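A few of the sequences from the list, tried out in a rough sketch. All the sample strings are made up, and this only covers some of the entries.

```javascript
// \d and \D: digits and everything else.
const digits = "Room 42b".match(/\d/g).join("");
const nonDigits = "Room 42b".match(/\D/g).join("");
console.log(digits, nonDigits);

// \w: letters, digits, and underscore (+ means "one or more").
const words = "snake_case var!".match(/\w+/g);
console.log(words);

// \s: any space-character, here a tab and a space.
const pieces = "a\tb c".split(/\s/);
console.log(pieces);

// \b only matches at the edge of a word, \B only away from one.
const wholeWord = "cat concat".match(/\bcat\b/g);
const insideWord = "cat concat".match(/\Bcat/g);
console.log(wholeWord, insideWord);

// \cJ is the same character as \n.
const controlJ = /\cJ/.test("\n");
console.log(controlJ);
```

Notice that \bcat\b finds only the standalone cat, while \Bcat finds only the one at the end of concat.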

Negated character classes

Ok, so, let's say you want to match everything except for a, b and c. Well, if you put a ^ as the first character in a character class, it negates it. [^abc] is what we want!

So \D is [^0-9] since \d is [0-9].
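A short sketch of negation, again with strings I made up:

```javascript
// Everything in "cabbage" that is NOT a, b, or c.
const leftovers = "cabbage".match(/[^abc]/g);
console.log(leftovers);

// [^0-9] and \D pick out exactly the same characters.
const viaClass = "a1b2".match(/[^0-9]/g).join("");
const viaShortcut = "a1b2".match(/\D/g).join("");
console.log(viaClass, viaShortcut);
```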

Quantifiers

Basics

Sometimes, we want to repeat the same pattern over and over again. And programmers are lazy, they don't want to write \d\d\d\d\d\d\d\d\d\d\d.

So, let's learn something new! If you add {X,Y}, where X and Y are numbers, after something that can be matched, then it tries to find between X and Y of that thing (inclusive). Note that {X, Y} with a space in it isn't treated as a quantifier (JavaScript matches it as literal text instead), so avoid any spaces in there.

If we want between 4 and 6 a characters, we can do a{4,6}.

What about exactly 7 digits? Let's do this: \d{7}.

What about a letter, 3 or more times? Well, we can do that like so: [a-z]{3,} (note the extra comma).
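Here is a sketch of the three forms, plus what happens when a space sneaks into the braces. The inputs are invented for the example.

```javascript
// {4,6}: at least 4, at most 6. Three a's are not enough.
const tooFew = "aaa".match(/a{4,6}/);
// Given eight a's, it takes six and stops.
const capped = "aaaaaaaa".match(/a{4,6}/)[0];
console.log(tooFew, capped);

// {7}: exactly seven digits.
const sevenDigits = "12345678".match(/\d{7}/)[0];
console.log(sevenDigits);

// {3,}: three or more. "ab" is too short, so "abcde" is the first match.
const longWord = "ab abcde".match(/[a-z]{3,}/)[0];
console.log(longWord);

// With a space, the braces are matched as literal text.
const spacedOnLetters = /a{4, 6}/.test("aaaaa");
const spacedOnLiteral = /a{4, 6}/.test("a{4, 6}");
console.log(spacedOnLetters, spacedOnLiteral);
```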

Shortcuts

Ok, now, again, we want some shortcuts. {0,1} can be shortened to ?, {1,} can be shortened to +, and {0,} can be shortened to *.
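A sketch showing each shortcut next to its long form. The words are only examples.

```javascript
// ? is {0,1}: the u is optional.
const optionalShort = "color colour".match(/colou?r/g);
const optionalLong = "color colour".match(/colou{0,1}r/g);
console.log(optionalShort, optionalLong);

// + is {1,}: at least one a is required, so "ct" is skipped.
const oneOrMore = "ct cat caat".match(/ca+t/g);
console.log(oneOrMore);

// * is {0,}: zero a's is fine, so "ct" matches too.
const zeroOrMore = "ct cat caat".match(/ca*t/g);
console.log(zeroOrMore);
```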

Greedy vs. Lazy

If we run a+ against aa, will it match the first a, the second a, or both of them together? Usually, it will act greedy: it will try to match as many characters as possible, so it matches both together.

If you want a quantifier to act lazy, that is, match as few characters as possible, then you should put ? after it.

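However, putting a?? at the very end of a pattern is pointless, because a non-greedy version of "0 or 1" picks "0" when nothing after it needs to match, so it matches nothing but makes the RegExp engine take more time to run. Likewise with a*?. Lazy quantifiers are only useful when something follows them, because then they grow just far enough for the rest of the pattern to match.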
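To see greedy and lazy side by side, here is a rough sketch. The quoted sentence is a made-up example, and the . in the last two patterns means "any character", which this tutorial hasn't covered yet.

```javascript
// Greedy: a+ grabs every a it can.
const greedyAs = "aaa".match(/a+/)[0];
// Lazy: a+? stops as soon as it has the one a it needs.
const lazyAs = "aaa".match(/a+?/)[0];
console.log(greedyAs, lazyAs);

// On its own, a?? is happy with nothing at all.
const lazyOptional = "aaa".match(/a??/)[0];
console.log(lazyOptional.length);

// Lazy quantifiers matter when something comes after them.
const quoted = 'say "one" and "two"';
const greedyQuote = quoted.match(/".+"/)[0];
const lazyQuote = quoted.match(/".+?"/)[0];
console.log(greedyQuote);
console.log(lazyQuote);
```

The greedy version runs from the first quote mark to the last one, while the lazy version stops at the first closing quote it can find.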

(To be continued...)

@bob1171
bob1171 commented Sep 9, 2017

nooo i want more
