@dannymcgee
Last active February 19, 2024 04:08
VS Code Grammar Tips

Hey, would you mind giving me tips/information on improving the syntax highlighting for my programming language? I've been working on the language pretty seriously for months but the VS Code highlighting still isn't that good.

Sure! It's kind of a lot for a Reddit comment, but here are some resources that should get the ball rolling:

  • VS Code Grammar Bootstrap

    This is a template repo (not an actual GitHub template, so you have to do a manual find-and-replace after cloning/forking) with tooling to generate the *.tmLanguage.json files from TypeScript objects. This is a huge help because you can use regex literals to define the patterns (better syntax highlighting, and no more double-escaping everything!), split the grammar into multiple files, write helper functions to remove some boilerplate, etc. There's also a regex function there that you can invoke like a tagged template to combine multiple regular expressions, like:

    export const functionCall = {
      match: regex`/${identifier}(?=\()/`,
      name: "entity.name.function.mylang",
    }
  • @vscode-devkit/nx & @vscode-devkit/grammar

    These might be more useful if you're familiar with Nx. The first is an Nx plugin version of the infrastructure from the repo above, so you can nx generate/nx build your extension/grammar projects. The second is the type definitions and regex function from that repo, extracted to an npm library.

  • VS Code Klipper Support

    This is my latest VS Code project, and it's a pretty comprehensive case study — it uses recursive language embedding, multi-line matches, scoping based on indentation level, etc. If there's something tricky that can be done with a VS Code grammar, there's probably an example of it here.

  • TextMate 1.x Manual | Language Grammars

    This is the go-to reference for how VS Code/TextMate grammar definitions actually work. All the examples here are written in a different syntax, but it should be pretty trivial to mentally convert them to JSON/TypeScript.

  • TextMate 1.x Manual | Regular Expressions

    This is the reference for the exact flavor of regular expressions supported by VS Code/TextMate grammars.

  • Regexr

    This is a general-purpose regex debugging/authoring tool which I use all the time for more complicated expressions. The PCRE flavor isn't an exact match for the VS Code/TextMate engine, but it's close enough that it won't make a difference in the vast majority of cases.

  • Rubular

    For those corner cases where a PCRE expression isn't working the way you would expect, this is a regex tester app built on Ruby's regex engine, which is much closer to the Oniguruma flavor that VS Code/TextMate grammars use. The UX is not nearly as nice as Regexr, which is why I only use it as a backup.
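
For a sense of how the regex helper and the TypeScript-to-JSON step from the first link might work, here's a minimal sketch. It is not the bootstrap repo's actual implementation: the `regex` and `toJson` functions below are guesses at the general idea, and `identifier` is just a placeholder pattern.

```typescript
// Hypothetical stand-in for the `regex` helper, NOT the real implementation.
type Part = RegExp | string;

function regex(strings: TemplateStringsArray, ...parts: Part[]): RegExp {
  // Use the raw strings so escapes like `\(` survive untouched.
  let source = strings.raw[0];
  parts.forEach((part, i) => {
    source += (part instanceof RegExp ? part.source : part) + strings.raw[i + 1];
  });
  // Strip the surrounding slashes (and pick up any trailing flags).
  const wrapped = /^\/([\s\S]*)\/([a-z]*)$/.exec(source.trim());
  return wrapped ? new RegExp(wrapped[1], wrapped[2]) : new RegExp(source);
}

// Placeholder identifier pattern, assumed for illustration.
const identifier = /[_a-zA-Z][_a-zA-Z0-9]*/;

const functionCall = {
  match: regex`/${identifier}(?=\()/`,
  name: "entity.name.function.mylang",
};

// Serialize a pattern object, swapping each RegExp for its source string,
// which is the form a *.tmLanguage.json file stores.
function toJson(pattern: unknown): string {
  return JSON.stringify(
    pattern,
    (_key, value) => (value instanceof RegExp ? value.source : value),
    2,
  );
}

console.log(toJson(functionCall));
```

The backslash in `\(` comes out doubled in the JSON output, which is exactly the double-escaping you get to skip when authoring in TypeScript. One caveat with this sketch: it builds real JavaScript `RegExp` objects, so Oniguruma-only syntax that JavaScript rejects (atomic groups, for instance) would need to be passed through as plain strings instead.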

And here are some general tips:

  • When deciding what scope name to use for a particular token, I always check to see what TypeScript uses for something similar. TypeScript is Microsoft's baby, and it's the language used to build VS Code, so it has by far the most comprehensive support from the built-in themes and most third-party themes. So if you yoink all of your scope names straight from the TypeScript grammar you're pretty much guaranteed to have good results with any theme instead of having to ask folks to install one that specifically supports your language.

    You can see all of the TextMate scopes that are applied to a token by placing your cursor on that token and invoking the Developer: Inspect Editor Tokens and Scopes command from the Command Palette (Ctrl+Shift+P on Windows).

  • 99% of the time I use one of three patterns for tokenizing:

    1. Simple match, for when you can trivially tokenize a string of characters with a simple expression:

      {
        match: /\b(?:if|else|for|in|switch|case)\b/,
        name: "keyword.control.mylang",
      }
    2. Match / captures, for when certain tokens have specific meanings when they're grouped together a certain way. The numbers correspond to the capture groups created by the expression:

      {
        match: /(\.)([_a-zA-Z][_a-zA-Z0-9]*)/,
        captures: {
          // Matches the dot character
          1: { name: "punctuation.accessor.mylang" },
          // Matches the identifier after the dot
          2: { name: "variable.other.property.mylang" },
        }
      }
    3. Begin / End / Patterns. This is the most complicated one but also probably the one I use the most. The begin pattern marks the start of some chunk of code, the end pattern marks its end, and patterns lets you selectively include the patterns (from your repository or inline) that will be used to tokenize everything that comes in between.

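      This type of pattern does not give a damn about line breaks: it will keep tokenizing until it hits something that matches end. (The begin and end expressions themselves are still matched one line at a time, so neither of them can match text that spans a line break.)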

      beginCaptures/endCaptures let you tokenize the begin/end matches themselves, in the same format as the captures key in the previous example.

      export const funcSignature = {
        // Matches `fn foo(`
        begin: regex`/(fn)\s+(${identifier})\s*(\()/`,
        beginCaptures: {
          // Matches the `fn`
          1: { name: "storage.type.function.mylang" },
          // Matches the identifier
          2: { name: "entity.name.function.mylang" },
          // Matches the `(`
          3: { name: "punctuation.definition.parameters.begin.mylang" },
        },
        end: /\)/,
        endCaptures: {
          // `0` in a `captures` block matches the entire expression
          0: { name: "punctuation.definition.parameters.end.mylang" },
        },
        patterns: [
          // Include the `comments` pattern from your `repository`
          { include: "#comments" },
          // Inline pattern for a parameter declaration
          {
            begin: regex`/(${identifier})\s*(:)/`,
            beginCaptures: {
              // The parameter name
              1: { name: "variable.parameter.mylang" },
              // The `:`
              2: { name: "keyword.operator.type.annotation.mylang" },
            },
            // Using a lookahead to match either a comma or a closing paren. Because it's a lookahead,
            // the parent pattern will still be able to match the closing paren itself -- this pattern
            // gets popped as soon as it sees that the _next_ character matches
            end: /(?=[,\)])/,
            // Nested pattern inclusion for the type itself, since those can be pretty complicated
            // (type arguments, namespaces and scope operators, etc.)
            patterns: [
              // We'll need to repeat anything that was valid for the enclosing pattern
              // that's also valid here
              { include: "#comments" },
              // Types are used in lots of different places, so that pattern also has its own key in
              // the repository
              { include: "#types" },
            ],
          },
          // We could include a whole `punctuation` pattern here, but a comma is the only other token
          // that's actually valid in this pattern, so we'll just define it inline
          {
            match: /,/,
            name: "punctuation.separator.parameter.mylang",
          },
        ],
      }
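
To see why the lookahead `end` leaves the `)` for the parent pattern, here's a quick check using plain JavaScript regexes as a stand-in for the TextMate engine (these two particular expressions behave the same way in both). The `input` string is made up for illustration.

```typescript
// Made-up input: the tail end of a signature like `fn foo(a: i32)`.
const input = "a: i32)";
const parenIndex = input.indexOf(")");

// The nested pattern's `end`: a lookahead for `,` or `)`.
const nestedEnd = /(?=[,\)])/g;
// The parent pattern's `end`: the `)` itself.
const parentEnd = /\)/g;

// Try the nested `end` starting right where the `)` sits.
nestedEnd.lastIndex = parenIndex;
const nestedMatch = nestedEnd.exec(input);

// It matches, but with zero width, so nothing has been consumed...
console.log(nestedMatch !== null, nestedMatch?.[0].length);

// ...and the parent can still match the `)` at the very same position.
parentEnd.lastIndex = nestedEnd.lastIndex;
const parentMatch = parentEnd.exec(input);
console.log(parentMatch?.[0], parentMatch?.index === parenIndex);
```

If the nested `end` were `/[,\)]/` instead, it would swallow the `)` and the parent would keep tokenizing past the end of the parameter list.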

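The `{ include: "#comments" }` and `{ include: "#types" }` entries in that example point at keys in the grammar's repository. As a rough sketch of how the pieces might fit together at the top level (the `source.mylang` scope name, the `//` comment syntax, and the contents of `comments` and `types` are all invented for illustration):

```typescript
// Everything language-specific here is an assumption for illustration:
// `source.mylang`, `//` line comments, and the primitive type names.
const comments = {
  match: /\/\/.*$/,
  name: "comment.line.double-slash.mylang",
};

const types = {
  match: /\b(?:i32|f32|bool|string)\b/,
  name: "support.type.primitive.mylang",
};

const keywords = {
  match: /\b(?:if|else|for|in|switch|case)\b/,
  name: "keyword.control.mylang",
};

const grammar = {
  name: "MyLang",
  scopeName: "source.mylang",
  // Patterns applied at the root of the document.
  patterns: [
    { include: "#comments" },
    { include: "#keywords" },
  ],
  // Named patterns that `{ include: "#key" }` can reference from anywhere,
  // including from nested begin/end patterns.
  repository: { comments, types, keywords },
};

// Small sanity check: every top-level `#key` include should exist in the repository.
function missingIncludes(g: typeof grammar): string[] {
  return g.patterns
    .map((p) => p.include.slice(1))
    .filter((key) => !(key in g.repository));
}

console.log(missingIncludes(grammar));
```

A pattern like the function signature above would go into the repository the same way and get pulled in with its own include.
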
That's a lot, I know (and I don't even want to know how badly Reddit is going to butcher my Markdown formatting). Not everything, but between that and the links above hopefully it's enough to get you started. Feel free to DM me if you have any specific questions or issues!
