RegEx: Understanding commonly used passwords

grep is a Unix command-line program that finds information in your filesystem by using regular expressions (RegEx) to sift through large amounts of text in files. Its name comes from the ed editor command g/re/p (globally search for a regular expression and print the matching lines). Knowing RegEx lets you be very specific about what you retrieve with grep, and equally specific about what you do NOT want to retrieve, which makes queries more powerful and flexible.
Summary

We will work through the following grep commands. Here grep searches a large set of publicly leaked passwords to count how many people use very basic passwords. Hopefully this exercise will entice you to improve the security of your passwords! Here's a link to the rockyou password list used here.
In most cases grep will be called with the -Ec flags, where -E selects extended regular expressions (which will be used heavily) and -c prints only the count of matching lines (i.e. how many people had passwords that matched our regex pattern). To see examples of matching passwords instead, use -Em 20, where -m 20 tells grep to stop after a maximum (m) of 20 matching lines. Note that shorthand such as \d and (?:...) is not part of POSIX extended regular expressions: GNU grep -E does not read \d as a digit, so on such systems substitute [0-9] or [[:digit:]], or use grep -P where available.
Table of Contents


Anchors
Quantifiers
Grouping Constructs
Bracket Expressions
Character Classes
The OR Operator
Flags
Character Escapes
Greedy vs Lazy Matching
Boundaries
Back References
Look-ahead and Look-behind

Regex Components

Anchors

grep -Ec '^.*\d$' rockyou.txt

Here we want to know how many people end their password with a digit. In regex this is done with the $ anchor: whatever immediately precedes the $ only matches if it sits at the end of a line. So \d$ searches for a digit (\d) at the end of the line ($). Its counterpart, ^, anchors a match to the start of a line.
Note that rockyou.txt puts each password on its own line, so we will often use both ^ and $ together to make sure a pattern accounts for the entire password rather than just a piece of it.
In this case 8,316,321 of the 14,344,173 passwords in the list (58%!) have a trailing digit.
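To see the anchors in isolation, here is a minimal JavaScript sketch. The `samplePasswords` array and the `countMatches` helper are made up for illustration: they stand in for rockyou.txt and for `grep -c`, and are not part of the original exercise.

```javascript
// Hypothetical sample standing in for a few lines of rockyou.txt.
const samplePasswords = ["password1", "iloveyou", "abc123", "1qaz", "monkey7"];

// Rough equivalent of `grep -c`: count the entries that match the pattern.
function countMatches(pattern, lines) {
  return lines.filter((line) => pattern.test(line)).length;
}

// \d$ : a digit immediately before the end of the string.
const trailingDigit = /\d$/;
// ^\d : a digit immediately after the start of the string.
const leadingDigit = /^\d/;

console.log(countMatches(trailingDigit, samplePasswords)); // 3
console.log(countMatches(leadingDigit, samplePasswords)); // 1
```

JavaScript understands \d natively, so the sketch sidesteps any differences between grep implementations.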
But what does the .* mean in this pattern? What does greedy matching mean? This brings us to...

Quantifiers

grep -Ec "^.{6}$" rockyou.txt

Please, please don't use passwords that are only 6 characters long (or shorter). Methods of cracking passwords are most effective on short passwords.
The restriction to 6-character passwords comes from the {6} component of the pattern, which matches exactly 6 repetitions of whatever element precedes the {. For now, ignore the . (we'll get to it later) and just know that it matches any single character. So ^.{6}$ translates to a pattern that matches the start of a line (^), followed by exactly 6 characters (.{6}), followed by the end of the line ($). To match 6 or more characters, use {6,}. For at most 6 characters, use {0,6} (some tools, GNU grep among them, also accept {,6}). For passwords of 6 to 12 characters, use {6,12}.
Other quantifiers of note include:

* matches a pattern 0 or more times
+ matches a pattern 1 or more times
? matches a pattern 0 or 1 time (effectively making that pattern portion optional)
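The quantifiers above can be tried out on a handful of strings. This is only a sketch: the `samples` array and the `matching` helper are hypothetical, and {0,6} is used for "at most 6" because JavaScript does not treat {,6} as a quantifier.

```javascript
// Hypothetical sample passwords of lengths 3, 6, 8 and 19.
const samples = ["abc", "123456", "letmein1", "correcthorsebattery"];

// Each pattern anchors both ends so the quantifier bounds the whole string.
const exactlySix = /^.{6}$/;
const sixOrMore = /^.{6,}$/;
const atMostSix = /^.{0,6}$/;
const sixToTwelve = /^.{6,12}$/;

// Return the samples that a given pattern matches.
const matching = (pattern) => samples.filter((s) => pattern.test(s));

console.log(matching(exactlySix)); // [ '123456' ]
console.log(matching(sixOrMore)); // [ '123456', 'letmein1', 'correcthorsebattery' ]
console.log(matching(atMostSix)); // [ 'abc', '123456' ]
console.log(matching(sixToTwelve)); // [ '123456', 'letmein1' ]

// ? makes the preceding element optional, so both spellings match.
const optionalU = /^colou?r$/;
console.log(optionalU.test("color"), optionalU.test("colour")); // true true
```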

Coming back to password lengths, people appear to choose lengths that follow something like a skewed Gaussian distribution.


Grouping Constructs

grep -Ec '^(?:password)(\d*)$' rockyou.txt

In this command the grouping constructs, denoted by ( and ), don't change which lines match. Each set of parentheses delimits a sub-pattern, so (patternA)(patternB) matches strings containing a chunk that matches patternA followed immediately by a chunk that matches patternB. In other uses of RegEx, grouping constructs can 'capture' the characters they match so they can be re-used later (in this example, perhaps recording the last 4 digits with (\d{4}) to see which numbers people like to put at the end of their passwords). Non-capturing groups are written by including ?: just inside the opening parenthesis, as in (?:password). Note that \d* also matches zero digits, so the bare word password counts as a match too.
In this case only 1201 people (<0.01%) use a password that's simply the word password followed by optional digits.
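Since grep -c only counts lines, capturing is easier to see in JavaScript. The sketch below applies the same pattern to a made-up `samples` list and collects whatever group 1 captured; the variable names are illustrative only.

```javascript
// Hypothetical sample lines.
const samples = ["password", "password123", "password2010", "mypassword1", "passw0rd"];

// (?:password) groups without capturing; (\d*) captures the trailing digits as group 1.
const pattern = /^(?:password)(\d*)$/;

const suffixes = [];
for (const s of samples) {
  const m = s.match(pattern);
  // m[0] is the whole match, m[1] is the text captured by the first capturing group.
  if (m !== null) suffixes.push(m[1]);
}

console.log(suffixes); // [ '', '123', '2010' ]
```

The empty string in the result comes from the bare word password, where \d* matched zero digits.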

Bracket Expressions

grep -Ec '^.*[^a-zA-Z0-9].*$' rockyou.txt

So far we've been using .* to indicate that we want zero or more (*) arbitrary characters (.). What if we want to define a specific set of characters to match? A bracket expression lists the characters allowed at one position in the pattern and can be followed by an optional quantifier. In this case we want any passwords that contain at least one special character ([^a-zA-Z0-9], where we assume any non-alphanumeric character is 'special'). Bracket expressions can also take character ranges, like [a-z] to match any lowercase ASCII letter. Conversely, a bracket expression excludes its characters if a ^ is placed just inside the left bracket, such as [^0-9] for any character that's not a digit 0-9.
In this case only around 7% of users use a password that has a special character in it.
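As a small illustration of a negated bracket expression next to a plain range, here is a JavaScript sketch over a made-up `samples` list (the names are hypothetical, not from the original exercise).

```javascript
// Hypothetical sample passwords.
const samples = ["hunter2", "p@ssword", "hello world", "Tr0ub4dor&3", "monkey"];

// [^a-zA-Z0-9] : any single character that is NOT an ASCII letter or digit.
const hasSpecial = /[^a-zA-Z0-9]/;
// ^[a-z]+$ : one or more lowercase ASCII letters and nothing else.
const onlyLower = /^[a-z]+$/;

console.log(samples.filter((s) => hasSpecial.test(s)));
// [ 'p@ssword', 'hello world', 'Tr0ub4dor&3' ]
console.log(samples.filter((s) => onlyLower.test(s)));
// [ 'monkey' ]
```

Note that the space in 'hello world' counts as 'special' under this definition, since it is neither a letter nor a digit.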

Character Classes

grep -Ec '^\w\d\s.*$' rockyou.txt

What if we don't want to use bracket expressions for everything to specify characters? We have several options:

. matches any character save for newlines (\n)
\d matches any digit 0-9 (and so is functionally equivalent to [0-9])
\w matches any letter, digit, or underscore (and so is functionally equivalent to [a-zA-Z0-9_])
\s matches any whitespace character (dangerously also including newlines!)

In the above example this translates to a password that begins (^) with a letter/digit/underscore (\w) followed by a digit (\d) followed by a whitespace (\s) followed by any number of arbitrary characters (.*) at the end ($).
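The equivalences listed above can be checked mechanically. The following sketch compares each shorthand with its bracket expression over every printable ASCII character, assuming JavaScript's default (non-Unicode, case-sensitive) behavior; the `pairs` and `allAgree` names are illustrative.

```javascript
// Each shorthand next to the bracket expression it abbreviates.
const pairs = [
  [/^\d$/, /^[0-9]$/],
  [/^\w$/, /^[a-zA-Z0-9_]$/],
];

// Check that both spellings agree on every printable ASCII character.
let allAgree = true;
for (let code = 32; code < 127; code++) {
  const ch = String.fromCharCode(code);
  for (const [shorthand, bracket] of pairs) {
    if (shorthand.test(ch) !== bracket.test(ch)) allAgree = false;
  }
}
console.log(allAgree); // true

// The section's pattern: word character, digit, whitespace, then anything.
const pattern = /^\w\d\s.*$/;
console.log(pattern.test("a1 bcd")); // true
console.log(pattern.test("a1bcd")); // false (third character is not whitespace)
```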

The OR Operator

grep -Ec '^(\w\s\w+|\w+\s\w)$' rockyou.txt

What if our pattern needs some flexibility, so that multiple types of matches are accepted by the same pattern? We can use a grouping construct (()) with a pipe (|) character to define alternatives. In this example we want passwords that have a single word character (\w), then one whitespace character (\s), then one or more word characters (\w+), OR the reverse case where the single isolated character is at the end.
If we switch to a slightly different version of this pattern, we can see how many people are using passwords with very basic character sets:
grep -Ec '^([a-z]*|[A-Z]*|[0-9]*)$' rockyou.txt

A whopping 44% of all passwords are either all lowercase, all uppercase, or all numbers! This is critical because brute force password cracking works much more effectively on simple character sets!
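Both alternation patterns from this section can be run against a few strings in JavaScript. The `samples` list is made up for illustration.

```javascript
// Hypothetical sample passwords.
const samples = ["password", "PASSWORD", "123456", "Password", "abc123", "a bc"];

// All lowercase, OR all uppercase, OR all digits (one alternative must cover the whole line).
const singleClass = /^([a-z]*|[A-Z]*|[0-9]*)$/;
console.log(samples.filter((s) => singleClass.test(s)));
// [ 'password', 'PASSWORD', '123456' ]

// One isolated word character before OR after a single whitespace character.
const isolatedChar = /^(\w\s\w+|\w+\s\w)$/;
console.log(samples.filter((s) => isolatedChar.test(s)));
// [ 'a bc' ]
```

'Password' and 'abc123' are rejected by the first pattern because each mixes two character sets, so no single alternative covers the whole string.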

Flags

grep -Ec -i '[A-Z0-9]+' rockyou.txt

Flags as commonly written in other RegEx flavors don't carry over directly to grep. In JS for instance you might define a regular expression with an optional flag tacked on to the end (outside the // bounds):
let myRegEx = /[a-z0-9]+/i;
These flags modify the regex in some unique ways:

g forces a global search - allowing for multiple matches
i makes the pattern case-insensitive
m enables multi-line mode, where ^ and $ match at the start and end of every line rather than only of the whole string

With grep, the equivalent of the i flag is the -i command-line option, placed before the expression like grep's other options. Note how, despite the pattern specifying capital letters ([A-Z]), most of the matching examples are lowercase passwords.
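For comparison, here is a short JavaScript sketch of what each of the three flags changes. The input strings are arbitrary examples chosen for illustration.

```javascript
// i: case-insensitive matching.
const caseSensitive = /^[A-Z]+$/;
const caseInsensitive = /^[A-Z]+$/i;
console.log(caseSensitive.test("password")); // false
console.log(caseInsensitive.test("password")); // true

// g: return every match rather than only the first one.
const firstOnly = "a1b22c333".match(/\d+/)[0];
const allNumbers = "a1b22c333".match(/\d+/g);
console.log(firstOnly); // '1'
console.log(allNumbers); // [ '1', '22', '333' ]

// m: ^ and $ match at every line break, not just at the ends of the string.
const text = "first1\nsecond\nthird3";
const endOfString = text.match(/\d$/g);
const endOfEachLine = text.match(/\d$/gm);
console.log(endOfString); // [ '3' ]
console.log(endOfEachLine); // [ '1', '3' ]
```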

Character Escapes

grep -Ec '^.*[\[\]]+.*$' rockyou.txt

RegEx uses special characters to define a pattern, but what if you want a character to be matched literally rather than to control RegEx behavior? The \ character escapes the character after it, removing any special RegEx meaning it might have (e.g. .* means 0 or more of any character, while \.* means 0 or more period characters). In the above example, we're looking for passwords that use literal brackets ([\[\]]) somewhere in the password's body. One caveat: in POSIX bracket expressions \ is itself a literal character, so how this escaped form is read depends on the grep implementation; the portable spelling is [][], which puts ] first inside the brackets.
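The difference an escape makes is easy to check in JavaScript, where \[ and \] inside a character class do mean literal brackets. The pattern names below are illustrative.

```javascript
// Unescaped: . is a wildcard, so .* accepts any run of characters.
const anyChars = /^.*$/;
// Escaped: \. is a literal period, so \.* accepts only periods (or nothing).
const onlyDots = /^\.*$/;

console.log(anyChars.test("abc")); // true
console.log(onlyDots.test("abc")); // false
console.log(onlyDots.test("...")); // true

// Escaped brackets inside a character class match a literal [ or ].
const hasBracket = /[\[\]]/;
console.log(hasBracket.test("pass[word]")); // true
console.log(hasBracket.test("password")); // false
```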

Greedy vs Lazy Matching

Unfortunately BSD grep (the default grep on macOS) does not support using ? to indicate non-greedy matching. However, a JavaScript example gives some idea of what to expect:
let quote = 'He said "hello", but she said "goodbye"';

// Greedy example: .+ grabs as much as possible before the final "
let greedy = /".+"/g;
quote.match(greedy); // ['"hello", but she said "goodbye"']

// Lazy example: .+? stops at the first " that completes a match
let lazy = /".+?"/g;
quote.match(lazy); // ['"hello"', '"goodbye"']
Source: javascript.info
The greedy example (no ?) lets .+ consume as many characters as it can, including " characters, so the single match spans both quoted words. The lazy example (with ?) ends each match at the shortest string that still satisfies the pattern, giving one match per quoted word.

Boundaries

grep -Ec '^1\b.+' rockyou.txt

The \b backslash expression matches an empty string that has a word character (anything that would match \w) on one side and a non-word character (or no character at all) on the other. Here we're looking for passwords with a leading 1, followed by a word boundary (\b), followed by 1 or more (+) arbitrary characters (.). Because 1 is itself a word character, the character after it must be a non-word character, such as a space or punctuation mark, for \b to match.
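JavaScript's \b behaves the same way for ASCII input, so the pattern can be tried directly. The `samples` list is made up to show one case on each side of the rule.

```javascript
// Hypothetical sample passwords that all start with 1.
const samples = ["1!abc", "1 love", "123456", "1_abc", "1"];

// Leading 1, then a word boundary, then at least one more character.
const pattern = /^1\b.+/;

console.log(samples.filter((s) => pattern.test(s)));
// [ '1!abc', '1 love' ]
```

'123456' and '1_abc' fail because digits and underscores are word characters, so there is no boundary after the 1. A lone '1' does have a boundary at its end, but .+ still needs at least one more character.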

Back References

grep -Ec '^(.+)\1+$' rockyou.txt

Remember grouping constructs (())? A key reason to use them is to re-use the captured content again.
In the above example (.+) (one or more characters) is captured in group 1 (group 0 is the whole match, while group 1 is whatever the first set of parentheses captured), followed by one or more (+) repetitions of the captured value (\1). Effectively this finds passwords that consist entirely of one repeated substring... and it turns out that about 0.5% of all passwords meet this criterion!
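To see what the back-reference actually captures, the same pattern can be run in JavaScript and group 1 inspected. The `samples` list and `repeatedUnits` name are hypothetical.

```javascript
// Hypothetical sample passwords.
const samples = ["abcabc", "111111", "hahaha", "abcabd", "a"];

// (.+) captures a chunk; \1+ requires that exact chunk again, one or more times.
const repeated = /^(.+)\1+$/;

// For each sample, record the captured chunk, or null when there is no match.
const repeatedUnits = samples.map((s) => {
  const m = s.match(repeated);
  return m === null ? null : m[1];
});

console.log(repeatedUnits); // [ 'abc', '111', 'ha', null, null ]
```

Because (.+) is greedy, it settles on the longest chunk that still allows a repeat, which is why '111111' reports '111' rather than '1'.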

Look-ahead and Look-behind

Alas, BSD grep does not support lookahead and lookbehind operators, but once again JavaScript has a reasonable example:
let quote = "3 big burritos? That'll cost you 40€";
quote.match(/\d+(?=€)/); // 40
Source: javascript.info
In this example the pattern is looking for one or more (+) digits (\d), but will only match if those digits are immediately followed by a € symbol ((?=€)). This causes the 3 to not be matched, instead matching with the 40. Lookahead (following the format of X(?=Y)) can be contrasted with lookbehind (following the format (?<=Y)X).
Negated versions of these (X(?!Y) and (?<!Y)X) work similarly, matching X only if Y is not immediately after or before it, respectively.
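The remaining three forms can be sketched the same way. This assumes a JavaScript engine with lookbehind support (current Node.js and modern browsers have it); the `text` string is made up, and single-digit numbers are used on purpose so that backtracking inside \d+ does not muddy the results.

```javascript
const text = "1 coffee costs $4, 2 cost $8";

// Lookbehind: digits only when a $ comes immediately before them.
const prices = text.match(/(?<=\$)\d+/g);
console.log(prices); // [ '4', '8' ]

// Negative lookbehind: digits only when no $ comes immediately before them.
const quantities = text.match(/(?<!\$)\d+/g);
console.log(quantities); // [ '1', '2' ]

// Negative lookahead: digits only when " cost" does not come immediately after them.
const notBeforeCost = text.match(/\d+(?! cost)/g);
console.log(notBeforeCost); // [ '1', '4', '8' ]
```

In every case the looked-for text ($ or " cost") is checked but never included in the match itself.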

Author

Gordon Magill
A casual observer of infosec and interesting tools
Github: Gordon-Magill