@cellularmitosis
Last active December 2, 2023 02:30
Matching a string literal using regex

Blog 2019/9/1

Matching a string literal using regex

When implementing a regex-based lexer / tokenizer, coming up with a regex which matches string literals can be a bit tricky.

Every time I do this, it has been long enough since my previous attempt that I've forgotten the particulars. So this is a note to my future self.

Note: it can be tricky to find the right phrase to put into Google to find good resources for this. Searching for "string literal regex" seems to work well.

The naive string matcher

You'll probably start with this:

"[^"]*"

A string is:

  • the opening quote
  • zero or more of:
    • any character other than a quote
  • the closing quote

This works for simple strings, like "I said hello to the baker.".

However, it breaks for strings which contain escaped quotes, for example "I said \"Hello!\" to the baker.". The matcher stops at the first escaped quote and starts a new string at the second one, so it finds two strings:

  • "I said \"
  • " to the baker."
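
To see the failure concretely, here is a minimal sketch in Python (my choice of language for illustration; the variable names are made up for this example). It runs the naive pattern over both sample strings:

```python
import re

# The naive pattern: a quote, any run of non-quote characters, a quote.
NAIVE = re.compile(r'"[^"]*"')

# Raw strings keep the backslashes exactly as they would appear in source text.
simple = r'"I said hello to the baker."'
nested = r'"I said \"Hello!\" to the baker."'

simple_matches = NAIVE.findall(simple)
nested_matches = NAIVE.findall(nested)

print(simple_matches)  # one match: the whole string
print(nested_matches)  # two matches, split at the escaped quotes
```

The `Hello!\` in the middle is not part of either match, which is exactly the wrong tokenization for a lexer.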

The sub-string matcher

Here's an approach which seems to work:

"([^"\\]|\\.)*"

A string is:

  • the opening quote
  • zero or more of:
    • either:
      • any character other than a quote or backslash
      • a backslash followed by any character
  • the closing quote

One way to think of this is that we disallow backslashes unless they are followed by another character. So, "\" is not a valid string, while "\\", "\t" and "\"" are valid strings.
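
A quick way to check those four cases is a small Python sketch (Python and the helper name `is_string_literal` are assumptions for illustration only). It uses `fullmatch` so that the whole input has to be a single string literal:

```python
import re

# Escape-aware pattern: inside the quotes, allow either a character that is
# not a quote or backslash, or a backslash followed by any one character.
STRING = re.compile(r'"([^"\\]|\\.)*"')

def is_string_literal(text):
    """Return True if the entire text is one string literal."""
    return STRING.fullmatch(text) is not None

print(is_string_literal(r'"\"'))   # lone backslash swallows the closing quote
print(is_string_literal(r'"\\"'))  # escaped backslash
print(is_string_literal(r'"\t"'))  # escaped t
print(is_string_literal(r'"\""'))  # escaped quote

# The sentence that broke the naive matcher is now one match.
nested = r'"I said \"Hello!\" to the baker."'
whole = STRING.search(nested).group(0)
print(whole == nested)
```

Note that the pattern contains a capturing group, so `re.findall` would return the last group capture rather than the whole literal; `search`, `fullmatch`, or `finditer` with `.group(0)` avoids that surprise.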

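But there is one last gotcha: in most regex flavors, the . doesn't mean "any character" by default, it actually means "any character other than a newline". So this regex fails when a backslash is immediately followed by a newline, as in a multi-line string whose lines end with a backslash. (Plain newlines inside the string are fine, because [^"\\] matches them.)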

The fix is to replace . ("any character other than a newline") with [\s\S] ("either a whitespace or non-whitespace character"):

"([^"\\]|\\[\s\S])*"
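
The difference between the two patterns can be sketched in Python (again an assumed language; `DOT` and `ANY` are names invented for this example). A plain newline is already accepted by `[^"\\]`, so the case that separates them is a backslash directly followed by a newline:

```python
import re

DOT = re.compile(r'"([^"\\]|\\.)*"')        # backslash + any char except newline
ANY = re.compile(r'"([^"\\]|\\[\s\S])*"')   # backslash + any char at all

# A string containing an ordinary newline.
plain_multiline = '"line one\nline two"'
# A string where a backslash is immediately followed by a newline.
escaped_newline = '"line one\\\nline two"'

dot_plain = DOT.fullmatch(plain_multiline) is not None
dot_escaped = DOT.fullmatch(escaped_newline) is not None
any_escaped = ANY.fullmatch(escaped_newline) is not None

print(dot_plain)    # True: [^"\\] matches the newline
print(dot_escaped)  # False: . refuses the newline after the backslash
print(any_escaped)  # True: [\s\S] accepts it
```

In Python specifically, compiling the first pattern with `re.DOTALL` would have the same effect, but `[\s\S]` has the advantage of not depending on a flag.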

Thanks

Thanks to:

ghost commented Dec 8, 2022

Thank you for this!! Saved me so much time!

@oreganoli
Very cool, thanks! Comes in useful for all sorts of source code processing - I was trying to deobfuscate some code when I ran into this problem.