Last active May 26, 2021
Matching a string literal using regex

Blog 2019/9/1



When implementing a regex-based lexer / tokenizer, coming up with a regex which matches string literals can be a bit tricky.

Every time I do this, it has been long enough since my previous attempt that I've forgotten the particulars. So this is a note to my future self.

Note: it can be tricky to find the right phrase to put into Google to find good resources for this. Searching for "string literal regex" seems to work well.

The naive string matcher

You'll probably start with this:

    "[^"]*"


A string is:

  • the opening quote
  • zero or more of:
    • any character other than a quote
  • the closing quote

This works for simple strings, like "I said hello to the baker.".

However, it breaks for strings which contain escaped quotes, for example "I said \"Hello!\" to the baker.". Here the naive regex would match two strings:

  • "I said \"
  • " to the baker."
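To see the failure concretely, here is a minimal Python sketch of the naive matcher. The pattern is my reading of the description above (quote, any run of non-quote characters, quote), and the names NAIVE_STRING, simple and escaped are made up for this illustration.

```python
import re

# Naive pattern: opening quote, zero or more non-quote characters, closing quote.
NAIVE_STRING = re.compile(r'"[^"]*"')

# Raw strings keep the backslashes exactly as they would appear in source text.
simple = r'"I said hello to the baker."'
escaped = r'"I said \"Hello!\" to the baker."'

simple_matches = NAIVE_STRING.findall(simple)
escaped_matches = NAIVE_STRING.findall(escaped)

print(simple_matches)   # one match: the whole string
print(escaped_matches)  # two matches: the string is split at the escaped quotes
```

The second call returns two fragments because the pattern treats the quote in \" as a closing quote; it has no notion of escapes.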

The sub-string matcher

Here's an approach which seems to work:

    "([^"\\]|\\.)*"


A string is:

  • the opening quote
  • zero or more of:
    • either:
      • any character other than a quote or backslash
      • a backslash followed by any character
  • the closing quote

One way to think of this is that we disallow backslashes unless they are followed by another character. So, "\" is not a valid string, while "\\", "\t" and "\"" are valid strings.
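The valid and invalid cases listed above can be checked with a short Python sketch. The pattern follows the bullet description; the non-capturing group (?:...) is my own choice so that findall returns whole matches, and is_string_literal is a hypothetical helper name.

```python
import re

# Quote, then zero or more of: (a non-quote, non-backslash character)
# or (a backslash followed by any character except newline), then quote.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')

def is_string_literal(text):
    """Return True if the whole of `text` is a single string literal."""
    return STRING_LITERAL.fullmatch(text) is not None

# Raw strings, so each backslash below is a literal backslash in the input.
valid = [r'"\\"', r'"\t"', r'"\""']
invalid = [r'"\"']

for candidate in valid + invalid:
    print(candidate, is_string_literal(candidate))

# The earlier problem case is now matched as one string.
matches = STRING_LITERAL.findall(r'"I said \"Hello!\" to the baker."')
print(matches)
```

In "\" the backslash consumes the only remaining quote, so nothing is left to close the string and the match fails.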

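But there is one last gotcha: the . doesn't mean "any character"; it actually means "any character other than a newline". So this regex fails on multi-line strings in which a backslash is immediately followed by a newline. (Unescaped newlines are fine, because the negated character class already matches them.)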

The fix is to replace . ("any character other than a newline") with [\s\S] ("either a whitespace or non-whitespace character"):

    "([^"\\]|\\[\s\S])*"
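A small Python sketch comparing the two variants on a string where a backslash sits directly before a line break. SINGLE_LINE, MULTI_LINE and the sample inputs are names invented for this example, and the non-capturing groups are again my own choice.

```python
import re

# Escape branch uses `.`, which does not match a newline by default.
SINGLE_LINE = re.compile(r'"(?:[^"\\]|\\.)*"')
# Escape branch uses [\s\S], which matches any character including a newline.
MULTI_LINE = re.compile(r'"(?:[^"\\]|\\[\s\S])*"')

# A backslash immediately followed by a real newline character.
continued = '"first line\\\nsecond line"'
# A plain newline with no backslash in front of it.
plain_newline = '"first line\nsecond line"'

single_on_continued = SINGLE_LINE.fullmatch(continued) is not None
multi_on_continued = MULTI_LINE.fullmatch(continued) is not None
single_on_plain = SINGLE_LINE.fullmatch(plain_newline) is not None

print(single_on_continued, multi_on_continued, single_on_plain)
```

Only the backslash-newline case separates the two patterns. In Python specifically, compiling the first pattern with the re.DOTALL flag should have a similar effect, but [\s\S] has the advantage of not depending on a flag.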



Thanks to:
