@eljayman
Last active January 23, 2023 03:01
JS RegEx Comment Search

JavaScript Regular Expression for Comment Search

In my studies of regular expressions (regex's), I came across an interesting use: looking for comments in JavaScript (JS), the text that isn't a necessary part of the program. Being a junior developer, I find comments edifying when looking through code. As for my own comments, I find it helpful to look at them again when debugging to see if I made a mistake implementing what I thought to be the right steps. And when comments as pseudocode don't make sense, what does that say about the code itself?

Summary

Whenever I search for something, odds are regex's are at work behind the curtain. So what are regex's and how do they apply to JS? Simply put, regex's are a means of searching for patterns in text. Need to validate a user's email address input before sending data to the back-end? Need to find all the URLs in a document dynamically? JS has built-in string methods that accept a regex as an argument, such as replace(), and the regex object has methods of its own, exec() and test(). The uses for regex's are so broad that learning them applies not just to JS, but to all coding and all text searching.

This gist serves as an introduction to regex. I want to understand and explain the following regex:

/\/\/.*|\/\*[^]*?\*\//g

I will discuss quantifiers, alternation, character classes and sets, flags, matching (lazy vs. greedy), and boundaries, and how putting all of this together lets me find all the comments in JS (and block comments in CSS as well; HTML comments use the different <!-- --> syntax, which this regex does not match).

Regex Components

The process JS goes through to match the regex in text is a linear sequence through the characters. This can be modified, and the full extent of modifications possible is beyond the scope of this document. Essentially, each character in the expression is checked against the text (or vice versa, depending on the method) from left to right until one fails or the whole expression matches.

Syntax for regex in JS depends on whether the literal is used, which I am using for this study, or the constructor function RegExp. The differences are minor, and there are use cases for the constructor function that are beyond the scope of this document. I will have a demonstration example at the end.

To verify matches, I will use the regex test() method, which takes a string as an argument and returns true if the expression finds a match and false otherwise. This demonstrates how JS uses regex's:

regEx = /\/\/.*|\/\*[^]*?\*\//;

console.log(regEx.test("// this is a comment"));
 ::> true
 
console.log(regEx.test(`/* 
this is also a comment,
that takes multiple lines
*/`));
 ::> true
 
console.log(regEx.test("What will this result?"));
 ::> false

The following components are used in the comment search regex. For more information on regex, visit regexr.com. It's a great regex reference (hence the name). To take a deeper dive into JS and regex, check out Eloquent JavaScript. The free online book by Marijn Haverbeke has a chapter on regex's that goes deep on this topic.

Character Classes and Sets

Any letter in English, any number, and all special characters can be used in regex's. Icons and other emoji-type special characters that exist in Unicode can be used, but their use is beyond the scope of this document.

Here is a simple example:

regEx = /1/;

console.log(regEx.test("1"));
 ::> true
 
console.log(regEx.test("a"));
 ::> false

The comment search regex uses a set, [^], to match any character at all. That use of a set is somewhat unusual, as is much else in this regex, so I will start with an ordinary set. In the following example, test() compares each character in the string against the values in the set and returns true if any of them matches. This is case sensitive unless the i flag is used. More on that later.

regEx = /[abc]/;

console.log(regEx.test("a"));
 ::> true
 
console.log(regEx.test("A"));
 ::> false

Another common use of sets is to define a range of characters separated by a dash within the brackets.

regEx = /[a-z]/;

console.log(regEx.test("test"));
 ::> true
 
console.log(regEx.test("123"));
 ::> false

Other character classes have special behavior. The dot . character matches any character except line terminators such as line breaks and carriage returns. A backslash preceding w like \w matches any word character, the equivalent of [A-Za-z0-9_], and its counterpart \W is the equivalent of [^A-Za-z0-9_]. A caret at the beginning of a set has special behavior: it negates the set.

regEx = /\W/;
text1 = "test";
text2 = "this is a text with spaces, a comma, and a period.";

console.log(regEx.test(text1));
::> false

console.log(regEx.test(text2));
::> true

The first test returns false because there are only word characters. The second returns true because of the spaces and punctuation. Other useful special character classes include \d for any digit, the same as [0-9], \D, the same as [^0-9], and the whitespace class \s and its counterpart \S.
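As a small sketch of these shorthand classes (the variable names and sample strings below are made up for illustration, not taken from any real code):

```javascript
// Shorthand character classes; the sample strings are illustrative only.
const hasDigit = /\d/;         // same as [0-9]
const hasNonDigit = /\D/;      // same as [^0-9]
const hasWhitespace = /\s/;    // spaces, tabs, line breaks
const hasNonWhitespace = /\S/; // anything that is not whitespace

const digitResults = [hasDigit.test("room 101"), hasDigit.test("no numbers")];
const nonDigitResults = [hasNonDigit.test("2023"), hasNonDigit.test("v2")];
const spaceResults = [hasWhitespace.test("two words"), hasWhitespace.test("oneword")];
const nonSpaceResults = [hasNonWhitespace.test("   "), hasNonWhitespace.test(" a ")];

console.log(digitResults);    // [ true, false ]
console.log(nonDigitResults); // [ false, true ]
console.log(spaceResults);    // [ true, false ]
console.log(nonSpaceResults); // [ false, true ]
```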

Notice that the - inside the set indicates a range. To include a literal - in a set, I place it as the first character, like so:

regEx = /[-.]/;

dashText = "-";
dotText = ".";

console.log(regEx.test(dashText));
 ::> true
 
console.log(regEx.test(dotText));
 ::> true

The dash - and dot . exhibit special behaviors depending on whether they are used inside or outside a set. The . outside a set is any character, while a - outside a set is a -. The . inside a set is a ., and a - inside a set between two characters indicates a range; at the first or last position it is a literal -. I will discuss more special behaviors below when I need to escape those behaviors.
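To see how the position of the dash changes its meaning, here is a minimal sketch (the three sets are arbitrary examples chosen for illustration):

```javascript
// A dash between two characters is a range; at either edge of the set it is literal.
const range = /[a-c]/;     // a, b, or c
const dashFirst = /[-ac]/; // a dash, a, or c
const dashLast = /[ac-]/;  // a, c, or a dash

const rangeResults = [range.test("b"), range.test("-")];
const dashFirstResults = [dashFirst.test("b"), dashFirst.test("-")];
const dashLastResult = dashLast.test("-");

console.log(rangeResults);     // [ true, false ]
console.log(dashFirstResults); // [ false, true ]
console.log(dashLastResult);   // true
```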

Quantifiers

  • * The star quantifier means 'zero or more' matches. When used with the dot like .* it is considered a 'wild card' because it matches any run of characters on a line. So the regex /.*/ captures everything up to the next line break. A more targeted approach is usually better.

  • ? The question mark is the 'lazy' modifier, and will be covered in more detail below. It is also the optional modifier. When placed after a character, the character becomes optional. For example:

regEx = /colou?r/;

console.log(regEx.test("color"));
::> true

console.log(regEx.test("colour"));
::> true
  • +, {} I am not using the plus or brace notation for quantifiers, but they are worth mentioning briefly. The + after a character or set means one or more matches. The {} can have one or two values inside, like {5} meaning exactly five matches, {5,8} meaning five to eight matches, or {5,} meaning five or more matches. There is no space after the comma; with a space, the braces are not read as a quantifier. The {} is typically used with a range of values.
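A short sketch of the + and {} quantifiers, using the string match() method to show how much text each one takes (the sample strings are assumptions for illustration):

```javascript
// + means one or more; {n}, {n,m}, and {n,} give explicit counts.
const digits = "1234567890";

const onePlus = "aaa".match(/a+/)[0];          // as many a's as available
const exactlyFive = digits.match(/\d{5}/)[0];  // exactly five digits
const fiveToEight = digits.match(/\d{5,8}/)[0]; // greedy, so it takes eight
const fiveOrMore = digits.match(/\d{5,}/)[0];  // greedy, so it takes all ten
const tooShort = /\d{5}/.test("1234");         // only four digits available

console.log(onePlus);     // aaa
console.log(exactlyFive); // 12345
console.log(fiveToEight); // 12345678
console.log(fiveOrMore);  // 1234567890
console.log(tooShort);    // false
```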

Escape

I showed that some characters and character classes exhibit special behavior. To search for the literal characters rather than use their special behavior, it becomes necessary to 'escape' these behaviors. This is normally done with a preceding backslash \, the exception being a - placed at the beginning of a set to include it in the set rather than indicate a range.

For example, if I want to search for a dot outside a set I need to escape the special behavior:

regEx = /\./;

console.log(regEx.test("."));
::> true

console.log(regEx.test("A"));
::> false

This example is simple enough. The dot would normally match A, but I escaped its normal behavior with a \.

I am looking for comments, which are preceded by two slashes //. If I do not escape their default behavior, our code becomes a comment itself! Similarly, I need a literal * rather than the quantifier. So // becomes \/\/, and /* and */ become \/\* and \*\/.
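The three escaped pieces can be tried on their own before they are combined. This is only a sketch; the names lineStart, blockStart, and blockEnd are hypothetical labels I chose for the pieces:

```javascript
// Each delimiter of a JS comment, written with escapes inside a regex literal.
const lineStart = /\/\//;  // matches the two slashes of a line comment
const blockStart = /\/\*/; // matches the opening of a block comment
const blockEnd = /\*\//;   // matches the closing of a block comment

const lineResults = [lineStart.test("x = 1; // note"), lineStart.test("a / b")];
const blockStartResults = [blockStart.test("/* note */"), blockStart.test("a / b * c")];
const blockEndResult = blockEnd.test("/* note */");

console.log(lineResults);       // [ true, false ]
console.log(blockStartResults); // [ true, false ]
console.log(blockEndResult);    // true
```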

Alternation

  • | The pipe character is used in JS regex's to denote a choice between any of the patterns separated by it. Since comments may be one of two syntaxes I want to make sure I can select either pattern with the same regex. Here is a simple example:
regEx = /dog|cat|mouse/;

text1 = "dogs are man's best friend";
text2 = "cats own the internet";
text3 = "mouse goes 'click'";

console.log(regEx.test(text1));
::> true

console.log(regEx.test(text2));
::> true

console.log(regEx.test(text3));
::> true

Here I show that one regex can be used to search for many different string values using the | for alternation. Notice the regex still matches in the first two strings: it finds dog inside dogs and cat inside cats, so the s on the end of both words does not prevent a match.

Greedy and Lazy Match

I mentioned in the quantifiers section that ? can be used to make a quantifier lazy. +, *, ?, and {} are greedy by default, meaning they will find as many matches as they can. For this regex it would cause a problem if I had multiple comments using the /* <comment> */ format. If I had something like this:

/* <comment> */
<code>
/* <comment> */

The default behavior of the regex /\/\*[^]*\*\// would start at the first /* and end at the last */, selecting all the code between the comments. Focus on [^]* to clarify the difference between lazy and greedy matching. This says any character that is not in the empty set (in other words, any character at all), zero or more times. The greedy-by-default behavior runs to the end of the input, then backs up to the last instance of the closing */ and selects everything in between, unintentionally grabbing our code.

Modifying the * with the lazy modifier ? changes its behavior to stop at the first instance of */ instead. Hence /\/\*[^]*?\*\// is a better regex because it isn't greedily selecting more than just comments.

When a closing delimiter can appear more than once in the input, a lazy quantifier is the safer choice to avoid bugs like this.
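The difference is easy to see by running both versions against the same text. A minimal sketch, assuming a made-up three-line sample with code between two block comments:

```javascript
// Greedy vs. lazy matching on a sample with code between two block comments.
const sampleSource = "/* first */\nlet x = 1;\n/* second */";

const greedy = /\/\*[^]*\*\//; // runs to the LAST closing */
const lazy = /\/\*[^]*?\*\//;  // stops at the FIRST closing */

const greedyMatch = sampleSource.match(greedy)[0];
const lazyMatch = sampleSource.match(lazy)[0];

console.log(greedyMatch === sampleSource); // true: the code was swallowed too
console.log(lazyMatch);                    // /* first */
```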

Boundaries

I briefly mentioned line breaks \n and carriage returns \r because I am using the wildcard .* in this regex and they are among the few characters the dot does not match (so the following lines are not matched either). These characters mark the end of lines, and I am matching them in the second part of the regex with [^].
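A small sketch of that difference, using made-up sample strings: the dot stops at a line break, while [^] carries on across it.

```javascript
// The dot stops at a line break, so a line comment match ends with its line.
const twoLines = "// one\n// two";
const firstLineOnly = twoLines.match(/\/\/.*/)[0];

// A block comment spans lines, so the dot cannot reach the closing */ but [^] can.
const block = "/* a\nb */";
const dotVersion = /\/\*.*\*\//.test(block);
const setVersion = /\/\*[^]*?\*\//.test(block);

console.log(firstLineOnly); // // one
console.log(dotVersion);    // false
console.log(setVersion);    // true
```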

They are not the only boundaries in regex's. The \b is a word boundary. It matches the position between a word character and a non-word character, such as a space, comma, dash, apostrophe, or quotation mark, without consuming any characters. Using \b can be very helpful when looking for whole words in plain text, as they usually sit between boundaries.
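As a quick sketch of \b in use (the sample sentences are invented for illustration), compare a plain search for cat with one that requires a boundary on both sides:

```javascript
// Without \b the pattern matches inside longer words; with \b it needs a whole word.
const loose = /cat/;
const wholeWord = /\bcat\b/;

const looseResult = loose.test("cats own the internet");
const wholeWordResults = [
  wholeWord.test("cats own the internet"), // "cats" continues with a word character
  wholeWord.test("the cat, asleep"),       // space before, comma after
  wholeWord.test("concatenate"),           // word characters on both sides
];

console.log(looseResult);      // true
console.log(wholeWordResults); // [ false, true, false ]
```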

Flags

In the final position of a regex can be one or more flags. Flags modify the regex in particular ways.

  • g makes the regex a global search, so rather than finding only the first match every time, it resumes just after the previous match each time it is called (that position is stored in lastIndex, a property of the JS regex object). This is particularly useful when I use the exec() method.

  • i makes the regex case-insensitive.

Those are the two most common, useful flags. There are a few others, but their use is outside the scope of this document.
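Both flags can be sketched in a few lines. The sample strings and variable names here are assumptions for illustration; lastIndex is the real property on the regex object:

```javascript
// i flag: case-insensitive matching.
const strict = /todo/;
const relaxed = /todo/i;
const note = "// TODO: tidy this up";
const strictResult = strict.test(note);
const relaxedResult = relaxed.test(note);

// g flag: each call resumes from lastIndex, which resets to 0 after a failed search.
const everyA = /a/g;
const word = "banana";
const steps = [];
let found;
do {
  found = everyA.test(word);
  steps.push([found, everyA.lastIndex]);
} while (found);

console.log(strictResult);  // false
console.log(relaxedResult); // true
console.log(steps);         // [ [ true, 2 ], [ true, 4 ], [ true, 6 ], [ false, 0 ] ]
```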

Regex as a Tool

The JavaScript literal syntax opens and closes with a slash /. Inside the slashes is where I define the regex (what I am searching for). In the second position is a backslash \, an escape that allows the use of a literal character that would otherwise have special meaning. The third position is another slash /, but this time it is the literal character. Repeating this pattern at positions four and five completes the start of a line comment, two slashes //. The sixth position is the dot ., which represents any character except a line break, followed in the seventh position by a star *, which quantifies the dot to match zero or more times, all the way to the end of the line.

In other words, the first seven positions of this regular expression would select the first instance of two slashes and everything after them on that line, if the eighth position were a closing slash. But the eighth character is a pipe |. This is alternation, which lets the regex match an alternate expression instead.

Now the alternate expression. In the ninth position is another escape backslash \, followed in the tenth position by another literal slash /. The eleventh position is another escape backslash \, followed in the twelfth position by a star *. This time it is a literal star.

So the alternate begins with /* which is how multiple line comments begin in JS.

In positions thirteen through fifteen is a caret in brackets [^]. Brackets indicate a set of characters JS will try to match against. A caret at the start of a set negates the set, so the set matches any character that is not listed in it, much as \W is the negated counterpart of \w. In this case nothing is listed, so the negated set is empty! In other words, any character following position twelve, line breaks included, matches the negated empty set. Following the negated empty set at position sixteen is the star *, back to its special behavior as a quantifier that matches zero or more instances of the negated empty set. This time, however, it has a modifier of its own in position seventeen, the question mark ?, acting as the lazy modifier (the question mark can also mark a character as optional, as covered under Quantifiers). This helps ensure that, in a document with multiple comments, the code between comments is not selected.

At position eighteen is another escape backslash \, followed by another literal star * at position nineteen. Position twenty is another escape backslash \, followed by another literal slash / at position twenty-one.

The comment closing */ comes after the lazy negated empty set.

At twenty-two is the regex closing slash /, but this isn't done yet, as there is a g in position twenty-three. This is the global flag. Remember how I didn't want to find just the first instance of a comment, but all the comments? The global flag g finds all instances of the regex in the text being searched.

Now that I have a way to access all of the comments in my JS, I can begin to use methods like replace(), match() or exec(). The replace() method takes two arguments. The first is a string or regex, and the second is a string or function that supplies the replacement for the match. I can use this to remove the comments from code by passing the regex above in as the first argument and an empty string as the second. I have omitted the g flag here because there is only one comment.

regEx = /\/\/.*|\/\*[^]*?\*\//;
text = '1 // comment';

console.log(text); 
  ::> prints out '1 // comment'
  
function removeComments(text) {
  return text.replace(regEx, "");
}

console.log(removeComments(text)); 
  ::> prints out '1 ' (the space before the comment remains)
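When the text holds more than one comment, the g flag is what lets replace() reach all of them. A minimal sketch, assuming a made-up sample and a hypothetical helper name stripAllComments:

```javascript
// With the g flag, replace() removes every match, not just the first one.
const allCommentsRegex = /\/\/.*|\/\*[^]*?\*\//g;
const commentedCode = "let a = 1; // first\n/* block\n   comment */\nlet b = 2; // second";

// Hypothetical helper: returns the text with every comment removed.
function stripAllComments(source) {
  return source.replace(allCommentsRegex, "");
}

const stripped = stripAllComments(commentedCode);
console.log(JSON.stringify(stripped)); // "let a = 1; \n\nlet b = 2; "
```

Only the comments are removed; the spaces and line breaks around them stay in place, so a later trim or cleanup step may still be wanted.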

Next I can use the regex to return the text of a comment. Using the match() method, I pass a commented string into a function and use the regex as the argument. Without the g flag, match() returns an array holding the first matching section of the string. In this case, the whole string:

regEx = /\/\/.*|\/\*[^]*?\*\//;
text = "// the match function returns an array of the first matching string";

function displayComment(text) {
  return text.match(regEx);
}

console.log(displayComment(text));
 ::>  [
  '// the match function returns an array of the first matching string'
]
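With the g flag added, match() returns every match instead of just the first. A short sketch, using a made-up sample that mixes both comment styles:

```javascript
// With the g flag, match() returns an array of all matching substrings.
const commentRegex = /\/\/.*|\/\*[^]*?\*\//g;
const sampleCode = "let a = 1; // first\n/* block\n   comment */\nlet b = 2; // second";

const allComments = sampleCode.match(commentRegex);
console.log(allComments);
// [ '// first', '/* block\n   comment */', '// second' ]
```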

In this next example I am passing multiple commented lines into the exec() method. Each call to exec() returns an array holding one match and its groups (grouping is a regex ability that is not in the scope of this document), or null when there is no match. The result array includes the index position of the first matching character (index), and with the g flag the regex object records where the match ended (lastIndex). This method is usually called in a while loop to iterate over all the matches, each search starting at the lastIndex of the previous match and going until there are no more matches. Notice the g flag:

text = `
this is the // first line of commented code
this is the // second line of commented code
this is the // third line of commented code
`;
regEx = /\/\/.*|\/\*[^]*?\*\//g;

let comment;
while ((comment = regEx.exec(text))) {
  console.log(comment[0]);
}
 ::> 
// first line of commented code
// second line of commented code
// third line of commented code

And finally, for demonstration purposes, here is what the comment-finding regular expression looks like as an argument for the constructor function. Notice the flags get passed in as the second argument, and each backslash is doubled because the pattern is written as a string:

const regEx = new RegExp("\\/\\/.*|\\/\\*[^]*?\\*\\/", "g");
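One way to check a constructor-built regex is to run it next to the literal on the same text. This is only a sketch, with a made-up sample string and variable names of my own choosing:

```javascript
// The literal and the constructor form should find the same comments.
const fromLiteral = /\/\/.*|\/\*[^]*?\*\//g;
const fromConstructor = new RegExp("\\/\\/.*|\\/\\*[^]*?\\*\\/", "g");

const snippet = "x(); // one\n/* two */";
const literalMatches = snippet.match(fromLiteral);
const constructorMatches = snippet.match(fromConstructor);

console.log(literalMatches);        // [ '// one', '/* two */' ]
console.log(constructorMatches);    // [ '// one', '/* two */' ]
console.log(fromConstructor.flags); // g
```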

Author

Leland Johnson is a junior developer who enjoys writing about code just as much as writing code, and reading about it.

Find me on github
