@nirvdrum
Created February 13, 2024 18:33

Regexp Encodings

There are two string values being updated as we go along:

  • RegularExpressionNode#unescaped
  • RegularExpressionNode#source

According to the interface, unescaped is supposed to be the source string. However, it does not hold up in many situations:

Regex | CRuby 3.3.0 Source | Prism RegularExpressionNode#unescaped (Pre-changes)
/\x00/ | "\\x00" | "\\x00"
/\x0/ | "\\x0" | "\\x0"
/\xa/ | "\\xa" | "\\xa"
/\M-\C-?/ | "\xFF" (invalid multibyte escape) | "\\xFF"
/\u{80}/ | "\\u{80}" | "\\u{80}"
/\u0080/ | "\\u0080" | "\\u0080"
/\x80\u{80}/ | "\x80\u{80}" (invalid multibyte escape) | "\\x80\\u{80}"
/\\x0/ | "\\x0" | "\\x0"
bin/prism parse -e '/\x00/'
bin/prism parse -e '/\x0/'
bin/prism parse -e '/\xa/'
bin/prism parse -e '/\M-\C-?/'
bin/prism parse -e '/\u{80}/'
bin/prism parse -e '/\u0080/'
bin/prism parse -e '/\x80\u{80}/'
bin/prism parse -e '/\\x0/'
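
The CRuby column can be spot-checked without Prism. The following is a minimal sketch, assuming a UTF-8 source file and only core Ruby's Regexp#source; the samples hash is a name made up for this illustration, and the rows that CRuby rejects with "invalid multibyte escape" are left out because they do not parse as literals.

```ruby
# Regexp#source returns the pattern text as written, with escapes left intact.
# `samples` is a hypothetical name used only for this illustration.
samples = {
  '/\x00/'   => /\x00/,
  '/\x0/'    => /\x0/,
  '/\xa/'    => /\xa/,
  '/\u{80}/' => /\u{80}/,
  '/\u0080/' => /\u0080/,
}

# Map each literal (as text) to the source string CRuby reports for it.
sources = samples.transform_values(&:source)

# Print in the same "regex, inspected source" layout as the table.
sources.each { |literal, source| puts "#{literal} #{source.inspect}" }
```

Each printed row pairs a literal with the inspected value of its source, which is the form the CRuby 3.3.0 column uses.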

After my first round of changes to better track the byte values behind the source strings in RegularExpressionNode, we ended up with:

Regex | Prism RegularExpressionNode#unescaped | Prism RegularExpressionNode#source
/\x00/ | "\u0000" | "\\x00"
/\x0/ | "\u0000" | "\\x0"
/\xa/ | "\n" | "\\xa"
/\M-\C-?/ | "\xFF" | "\\xFF"
/\u{80}/ | "\u0080" | "\\u{80}"
/\u0080/ | "\u0080" | "\\u0080"
/\x80\u{80}/ | "\x80\u0080" | "\\x80\\u{80}"
/\\x0/ | "\u0000" | "\\x0"

Regexp Encoding Modifiers

/u UTF-8
/e EUC / EUC-JP
/s SJIS / Windows-31J
/n ASCII-8BIT
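
As a quick illustration of the modifiers, the sketch below applies each one to the same escaped bytes and reads back Regexp#encoding. It assumes a UTF-8 source file; the hash names are made up for the example, and the \xC3\xA7 pattern is borrowed from the hex escape tables further down.

```ruby
# The same two escaped bytes under each encoding modifier.
# `by_modifier` and `encodings` are hypothetical names for this illustration.
by_modifier = {
  "u" => /\xC3\xA7/u,
  "e" => /\xC3\xA7/e,
  "s" => /\xC3\xA7/s,
  "n" => /\xC3\xA7/n,
}

# Regexp#encoding reports the encoding the regexp ended up with.
encodings = by_modifier.transform_values(&:encoding)

encodings.each { |flag, encoding| puts "/#{flag} #{encoding}" }
```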

Source Encoding: US-ASCII

No Character Escapes

Regexp Encoding
/garçon/ invalid multibyte char (US-ASCII) (SyntaxError)
/garçon/u invalid multibyte char (US-ASCII) (SyntaxError)
/garçon/e invalid multibyte char (US-ASCII) (SyntaxError)
/garçon/s invalid multibyte char (US-ASCII) (SyntaxError)
/garçon/n invalid multibyte char (US-ASCII) (SyntaxError)
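
Rows in the "No Character Escapes" tables can be reproduced from a single script by relying on eval, which takes its source encoding from the encoding of the string it is given. This is only a sketch of one way to do it, not necessarily how the tables here were produced; literal_encoding is a hypothetical helper, and the snippet itself is assumed to be saved as UTF-8 so that ç is the bytes C3 A7.

```ruby
# Evaluate a regexp literal as if its file had the given source encoding.
# Returns the resulting Regexp encoding, or the error class if it is rejected.
# `literal_encoding` is a hypothetical helper for this illustration.
def literal_encoding(code, source_encoding)
  # String#b copies the bytes; force_encoding relabels them without transcoding.
  eval(code.b.force_encoding(source_encoding)).encoding
rescue SyntaxError, RegexpError => error
  error.class
end

%w[US-ASCII ASCII-8BIT UTF-8 EUC-JP Windows-31J].each do |source_encoding|
  ["", "u", "e", "s", "n"].each do |modifier|
    code = "/garçon/#{modifier}"
    puts "#{source_encoding} #{code} #{literal_encoding(code, source_encoding)}"
  end
end
```

Because the bytes are relabeled rather than transcoded, the EUC-JP and Windows-31J runs see the same C3 A7 byte pair, which matches the /gar\x{C3A7}on/ and /gar\xC3\xA7on/ renderings quoted in those sections.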

Hex Character Escapes

Regexp Encoding
/\x80/ ASCII-8BIT
/\x80/u invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/e invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/s invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/n ASCII-8BIT
/gar\xC3\xA7on/ ASCII-8BIT
/gar\xC3\xA7on/u UTF-8
/gar\xC3\xA7on/e EUC-JP
/gar\xC3\xA7on/s Windows-31J
/gar\xC3\xA7on/n ASCII-8BIT

UTF-8 Character Escapes

Regexp Encoding
/gar\u{E7}on/ UTF-8
/gar\u{E7}on/u UTF-8
/gar\u{E7}on/e incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/s incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/n incompatible character encoding: /gar\u{E7}on/ (SyntaxError)

Source Encoding: ASCII-8BIT

No Character Escapes

Regexp Encoding
/garçon/ ASCII-8BIT
/garçon/u regexp encoding option 'u' differs from source encoding 'ASCII-8BIT' (SyntaxError)
/garçon/e regexp encoding option 'e' differs from source encoding 'ASCII-8BIT' (SyntaxError)
/garçon/s regexp encoding option 's' differs from source encoding 'ASCII-8BIT' (SyntaxError)
/garçon/n ASCII-8BIT

Hex Character Escapes

Regexp Encoding
/\x80/ ASCII-8BIT
/\x80/u invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/e invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/s invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/n ASCII-8BIT
/gar\xC3\xA7on/ ASCII-8BIT
/gar\xC3\xA7on/u UTF-8
/gar\xC3\xA7on/e EUC-JP
/gar\xC3\xA7on/s Windows-31J
/gar\xC3\xA7on/n ASCII-8BIT

UTF-8 Character Escapes

Regexp Encoding
/gar\u{E7}on/ UTF-8
/gar\u{E7}on/u UTF-8
/gar\u{E7}on/e incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/s incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/n incompatible character encoding: /gar\u{E7}on/ (SyntaxError)

Source Encoding: UTF-8

No Character Escapes

Regexp Encoding
/garçon/ UTF-8
/garçon/u UTF-8
/garçon/e regexp encoding option 'e' differs from source encoding 'UTF-8' (SyntaxError)
/garçon/s regexp encoding option 's' differs from source encoding 'UTF-8' (SyntaxError)
/garçon/n regexp encoding option 'n' differs from source encoding 'UTF-8' (SyntaxError)

/.../n has a non escaped non ASCII character in non ASCII-8BIT script: /garçon/

Hex Character Escapes

Regexp Encoding
/\x80/ invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/u invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/e invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/s invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/n ASCII-8BIT
/gar\xC3\xA7on/ UTF-8
/gar\xC3\xA7on/u UTF-8
/gar\xC3\xA7on/e EUC-JP
/gar\xC3\xA7on/s Windows-31J
/gar\xC3\xA7on/n ASCII-8BIT

UTF-8 Character Escapes

Regexp Encoding
/gar\u{E7}on/ UTF-8
/gar\u{E7}on/u UTF-8
/gar\u{E7}on/e incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/s incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/n incompatible character encoding: /gar\u{E7}on/ (SyntaxError)

Source Encoding: EUC-JP

No Character Escapes

Regexp Encoding
/garçon/ EUC-JP
/garçon/u regexp encoding option 'u' differs from source encoding 'EUC-JP' (SyntaxError)
/garçon/e EUC-JP
/garçon/s regexp encoding option 's' differs from source encoding 'EUC-JP' (SyntaxError)
/garçon/n regexp encoding option 'n' differs from source encoding 'EUC-JP' (SyntaxError)

/.../n has a non escaped non ASCII character in non ASCII-8BIT script: /gar\x{C3A7}on/

Hex Character Escapes

Regexp Encoding
/\x80/ invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/u invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/e invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/s invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/n ASCII-8BIT
/gar\xC3\xA7on/ EUC-JP
/gar\xC3\xA7on/u UTF-8
/gar\xC3\xA7on/e EUC-JP
/gar\xC3\xA7on/s Windows-31J
/gar\xC3\xA7on/n ASCII-8BIT

UTF-8 Character Escapes

Regexp Encoding
/gar\u{E7}on/ UTF-8
/gar\u{E7}on/u UTF-8
/gar\u{E7}on/e incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/s incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/n incompatible character encoding: /gar\u{E7}on/ (SyntaxError)

Source Encoding: Windows-31J

No Character Escapes

Regexp Encoding
/garçon/ Windows-31J
/garçon/u regexp encoding option 'u' differs from source encoding 'Windows-31J' (SyntaxError)
/garçon/e regexp encoding option 'e' differs from source encoding 'Windows-31J' (SyntaxError)
/garçon/s Windows-31J
/garçon/n regexp encoding option 'n' differs from source encoding 'Windows-31J' (SyntaxError)

/.../n has a non escaped non ASCII character in non ASCII-8BIT script: /gar\xC3\xA7on/

Hex Character Escapes

Regexp Encoding
/\x80/ invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/u invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/e invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/s invalid multibyte escape: /\x80/ (SyntaxError)
/\x80/n ASCII-8BIT
/gar\xC3\xA7on/ Windows-31J
/gar\xC3\xA7on/u UTF-8
/gar\xC3\xA7on/e EUC-JP
/gar\xC3\xA7on/s Windows-31J
/gar\xC3\xA7on/n ASCII-8BIT

UTF-8 Character Escapes

Regexp Encoding
/gar\u{E7}on/ UTF-8
/gar\u{E7}on/u UTF-8
/gar\u{E7}on/e incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/s incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
/gar\u{E7}on/n incompatible character encoding: /gar\u{E7}on/ (SyntaxError)
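
The escape-based tables above form a matrix of source encoding, pattern, and modifier, so they can be regenerated in one pass. The sketch below is an assumption about how one might do that with eval and core Ruby only; the constant names and the outcome helper are made up for the example, and error messages may differ between Ruby versions and parsers.

```ruby
# Hypothetical names for this illustration; none come from Prism or CRuby.
SOURCE_ENCODINGS = %w[US-ASCII ASCII-8BIT UTF-8 EUC-JP Windows-31J]
MODIFIERS = ["", "u", "e", "s", "n"]
PATTERNS = ['\x80', 'gar\xC3\xA7on', 'gar\u{E7}on']

# Returns the Regexp's Encoding, or the first line of the error message.
def outcome(code, source_encoding)
  # eval uses the string's own encoding as the source encoding.
  eval(code.dup.force_encoding(source_encoding)).encoding
rescue SyntaxError, RegexpError => error
  "#{error.class}: #{error.message.lines.first.to_s.chomp}"
end

# matrix[source_encoding][literal] => Encoding or error description
matrix = SOURCE_ENCODINGS.to_h do |source_encoding|
  rows = PATTERNS.product(MODIFIERS).to_h do |pattern, modifier|
    code = "/#{pattern}/#{modifier}"
    [code, outcome(code, source_encoding)]
  end
  [source_encoding, rows]
end

matrix.each do |source_encoding, rows|
  puts "Source Encoding: #{source_encoding}"
  rows.each { |code, result| puts "#{code} #{result}" }
end
```

Keeping the results as Encoding objects rather than names makes it easy to diff the output of two Ruby builds, for example one using parse.y and one using Prism.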