Resources for Pulsar's experimental modern-tree-sitter enhancement

Reference query documentation

This file serves as a shorthand reference for all custom predicates used via #set!, #is?, and #is-not?.

Tips for predicates

Terse queries

Currently, web-tree-sitter has a constraint that makes it difficult to use complex queries effectively in Pulsar. Consider what happens if we want to mark two captures at once:

(string
  "\"" @punctuation
  (#is? test.first)
) @string

Despite our placement of the #is? clause after the @punctuation capture, it will be applied to both the @punctuation capture and the @string capture. That means that the @punctuation capture will always work for the first quotation mark in the string, but the @string will always fail unless it’s the first node among its siblings.

This occurs because #is?, #is-not?, and #set! don’t accept a capture as an argument, which means they’ll always apply to every capture in a query. If there were a built-in #first? predicate that worked like #match? or #eq?, we’d be able to target a specific capture like this…

(string
  "\"" @punctuation
  (#first? @punctuation)
) @string

…and other captures would know to ignore that predicate because it’s targeting something else. But right now there’s no way to write a custom predicate in web-tree-sitter that takes a capture as an argument.

We could implement some sort of weird convention and inflict it on all predicates…

(foo
  (bar) @bar
  (#is? test.descendantOfType "@bar/zort")
) @string

…but the real solution is to wait until web-tree-sitter supports custom predicates. It’s feasible to implement; it just hasn’t been done yet.

Until then, shorter queries are better:

(string) @string

(string "\"" @punctuation
  (#is? test.first))

Arguments

All three kinds of predicates take up to two arguments: a key and an optional value.

(#set! foo) ; key only
(#set! foo bar) ; key plus value

Since these predicates merely store data for later processing, they function like key-value pairs, and it’s not possible to use the same test twice with the same predicate:

; (this won't work)
((_)
  (#is-not? test.type bar)
  (#is-not? test.type baz))

To work around this, certain scope tests accept more than one argument, as documented below. They do so by accepting a string with multiple values separated by spaces:

; (this will work)
((_)
  (#is-not? test.type "bar baz"))

If a scope test accepts multiple arguments, it will be indicated below; otherwise you should assume it does not.

Values

The value of a predicate will always be interpreted by Tree-sitter as a string. If the value contains no spaces, you can typically leave it unquoted; otherwise you’ll need to wrap it in double quotes:

(#is? test.type string) ; (valid)
(#is? test.type string template_string) ; (invalid)
(#is? test.type "string template_string") ; (valid)

If a predicate does not take an argument, the value will be ignored, and may therefore be omitted. For instance, the following forms are equivalent for using the test.first predicate, since it has no need for an argument:

(#is? test.first)
(#is? test.first true)
(#is? test.first "lorem ipsum dolor")

Highlight predicates

Highlight predicates use the highlight namespace and are used to control implementation details of syntax highlighting. You typically won’t have to apply these.

invalidateOnChange

((template_string) @_IGNORE_
  (#set! highlight.invalidateOnChange true))

Tells the highlighting engine that any edit falling within the bounds of the captured node should trigger a re-highlighting of the entire node.

By default, when a change is made in a buffer, Pulsar will re-highlight

  1. the entire buffer line on which the change happened, and
  2. all regions of the tree that Tree-sitter recognizes as being structurally different as a result of the buffer change.

Many highlighting changes would not be noticed by item 2 because they don’t change the structure of the tree… but that’s OK, because nearly all of them are handled properly by item 1.

The classic example of a highlighting rule that evades both prongs is switching a plain block comment /* to a documentation comment /**, or vice versa. This change doesn’t result in structural changes to the tree (it’s a comment node before and after the change), and it often affects more than one buffer line.

Use invalidateOnChange when (a) a node’s scope name will vary based on a content test (like a #match? predicate) and (b) the node’s range can span multiple lines. In these cases, invalidateOnChange signals that any future buffer changes that happen within the bounds of this node should force the node’s entire range to be re-highlighted instead of just the screen line on which the change was made.
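
For instance, a JavaScript-like grammar that distinguishes documentation comments from plain block comments with a #match? test might pair that test with this predicate, since such comments routinely span multiple rows. A sketch (the scope name is illustrative):

((comment) @comment.block.documentation.js
  (#match? @comment.block.documentation.js "^/\\*\\*")
  (#set! highlight.invalidateOnChange true))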

Capture predicates

Capture predicates are two special kinds of settings relating to captures. They determine whether a capture is applied or ignored based on whether anything else has already tried to capture the same buffer range. They function like scope tests, but they introduce their own side effects.

Capture predicates use the capture namespace.

final

(#set! capture.final true)

Passes only if another capture has not previously declared final for the exact same range. All captures are tested against final; the first one that declares it gets to “monopolize” a given range.

Tests a range, not a node; susceptible to scope adjustments.

shy

(#set! capture.shy true)

Passes only if another capture has not matched for the same range, whether it has declared final or not.

Tests a range, not a node; susceptible to scope adjustments.

Scope tests

Scope tests are #is?/#is-not? predicates that use the test namespace. They’re used to filter captures before applying special behaviors. Despite their name, they are used by several different kinds of queries.

An #is? predicate defines criteria that must pass in order for a capture to be accepted, and an #is-not? predicate defines criteria that must fail in order for a capture to be accepted. Thus any test defined below can be used in either kind of predicate.

type

(#is? test.type ERROR)
(#is? test.type "string template_string")

Argument: The name of a node, or multiple nodes separated by spaces.

Passes only if the captured node’s type matches any of the specified types. Works on both named and anonymous nodes.

Accepts any number of types separated by spaces. Remember that string parameters with spaces must be quoted.

hasError

(#is? test.hasError)

Passes only if node.hasError() returns true, which will occur when the node has any descendant of type ERROR. This is therefore a shorthand for (#is? test.ancestorOfType ERROR).

injection

(#is? test.injection)

Passes only if this capture occurs in an injection layer, rather than the root language layer.

root

(#is? test.root)

Passes only if the captured node is the root node in the tree.

first

(#is? test.first)

Passes only if the captured node is the first among its siblings — considering all siblings, both named and anonymous.

last

(#is? test.last)

Passes only if the captured node is the last among its siblings — considering all siblings, both named and anonymous.

firstOfType

(#is? test.firstOfType)

Passes only if the captured node is the first of its type among its siblings — considering all siblings, both named and anonymous.

lastOfType

(#is? test.lastOfType)

Passes only if the captured node is the last of its type among its siblings — considering all siblings, both named and anonymous.

firstTextOnRow

(#is? test.firstTextOnRow)

Passes only if there is no non-whitespace content before the node begins on its starting row.

lastTextOnRow

(#is? test.lastTextOnRow)

Passes only if there is no non-whitespace content after the node ends on its ending row.

descendantOfType

(#is? test.descendantOfType comment)
(#is? test.descendantOfType "comment string")

Argument: The name of a node, or multiple nodes separated by spaces.

Passes only if the captured node has at least one ancestor of the given type(s).

ancestorOfType

(#is? test.ancestorOfType escape_sequence)

Argument: The name of a node, or multiple nodes separated by spaces.

Passes only if the captured node has at least one descendant of the given type(s).

rangeWithData

(#is? test.rangeWithData isLegalPositionForDecorator)

Argument: The name of an arbitrary data key.

Passes only if the given range has previously had data set on it with the given key name via a #set! predicate.

Tests a range, not a node; susceptible to scope adjustments.

descendantOfNodeWithData

(#is? test.descendantOfNodeWithData isLegalPositionForDecorator)

Argument: The name of an arbitrary data key.

Passes only if at least one node in the captured node’s ancestor chain has had data set on its range for the given key name via a #set! predicate.

Checks the data for its ancestor nodes’ inherent ranges; not susceptible to scope adjustments.
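
Here is a sketch of how data stored with #set! might be consumed by descendantOfNodeWithData. The node names are borrowed from a TypeScript-like grammar, the key name is arbitrary, and the query is illustrative rather than taken from a shipping grammar:

; Mark every class body as a legal home for decorators…
((class_body) @_IGNORE_
  (#set! isLegalPositionForDecorator true))

; …then scope a decorator only when one of its ancestors has been marked.
((decorator) @entity.other.decorator
  (#is? test.descendantOfNodeWithData isLegalPositionForDecorator))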

startsOnSameRowAs

(#is? test.startsOnSameRowAs lastChild.startPosition)

Argument: A node position descriptor.

Passes only if the captured node starts on the same row as the row indicated in the given descriptor.

endsOnSameRowAs

(#is? test.endsOnSameRowAs lastChild.startPosition)

Argument: A node position descriptor.

Passes only if the captured node ends on the same row as the row indicated in the given descriptor.
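
For example, a query might scope a block differently when it ends on the same row its parent statement starts on, which usually means the whole construct fits on one line. A sketch with illustrative node and scope names:

((statement_block) @meta.block.single-line
  (#is? test.endsOnSameRowAs parent.startPosition))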

config

(#is? test.config language-foo.enableHangingIndent) ; implicit boolean
(#is? test.config "language-foo.enableHangingIndent false") ; explicit boolean
(#is? test.config "language-foo.startingIndentLevel 0") ; integer
(#is? test.config "language-foo.braceStyle 1tbs") ; string

Argument: A configuration key in the same style as used with atom.config, optionally separated by a space from its desired value.

Passes only if the given configuration key equals the given value — or, if value is omitted, whether the given configuration key is set to true.

The configuration value will be retrieved from atom.config with a scope descriptor containing exactly one scope: the grammar’s base scope name (e.g., source.js or text.html.basic). Scope-specific configurations will thus be retrieved, but at no more granular a level than the language itself.

Any retrieved configuration values will be cached until the next configuration change of any sort.

Since any value given in a predicate is parsed as a string, configuration values are interpreted as follows:

  • Values true and false are coerced to boolean true and false.
  • Values consisting only of digits are coerced to integers.
  • All other values are assumed to be string values.
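
For example, a query might apply a scope only when a user setting is enabled. In this sketch, the setting name and the scope name are made up:

; Flag TODO comments, but only when a hypothetical
; language-foo.flagTodoComments setting is turned on.
((comment) @invalid.deprecated.todo
  (#match? @invalid.deprecated.todo "TODO")
  (#is? test.config language-foo.flagTodoComments))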

Scope adjustments

Scope adjustments are #set! predicates that use the adjust namespace. They alter the buffer range against which a scope is applied. They’re useful for situations where the range to be scoped does not correspond to the exact range of a Tree-sitter node.

Scope adjustments have one major limitation: they can be used to reduce the range of a scope, but they cannot be used to enlarge the range of a scope past the node defined in the capture. If, for instance, you need to apply a scope to two consecutive sibling nodes, you cannot capture the first sibling and adjust the range to end after the next sibling — you must instead capture their common parent node and adjust both ends of the range.

Scope adjustments can be chained, but only in certain circumstances. Some adjustments are inherently “relative” and can operate on the result of earlier adjustments. Others are inherently “absolute” and will ignore any earlier adjustments.
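
For example, an absolute adjustment can be followed by a relative one. A sketch with arbitrary node and scope names:

; Start the scope just past the node's first child, then nudge it one more
; character to the right. adjust.startAt is absolute; adjust.offsetStart is
; relative and respects the adjustment made before it.
((string) @string.special
  (#set! adjust.startAt firstChild.endPosition)
  (#set! adjust.offsetStart 1))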

startAt

(#set! adjust.startAt firstChild.endPosition)

Argument: A node position descriptor representing the start of the range.

Alter the given range to start at the start/end position of a specific descendant node.

Cannot target non-descendants such as sibling or parent nodes. References an absolute position measured from the node — hence will ignore any earlier adjustments made in a given capture.

endAt

(#set! adjust.endAt lastChild.startPosition)

Argument: A node position descriptor representing the end of the range.

Alter the given range to end at the start/end position of a specific descendant node.

Cannot target non-descendants such as sibling or parent nodes. References an absolute position measured from the node — hence will ignore any earlier adjustments made in a given capture.

offsetStart

(#set! adjust.offsetStart 1)

Argument: An integer describing the number of characters to offset in either direction.

Alter the given range’s start position by a positive or negative number of characters.

Cannot extend the range past the bounds of the captured node. Respects earlier adjustments in a chain.

offsetEnd

(#set! adjust.offsetEnd -1)

Argument: An integer describing the number of characters to offset in either direction.

Alter the given range’s end position by a positive or negative number of characters.

Cannot extend the range past the bounds of the captured node. Respects earlier adjustments in a chain.

startAndEndAroundFirstMatchOf

(#set! adjust.startAndEndAroundFirstMatchOf "\\*\\*/")

Argument: A regular expression describing a match pattern.

Test the given pattern against the node’s text content. If matched, alters the given range’s start and end positions to reflect the exact string indices of the start and end of the match.

References an absolute position derived from a regular expression — hence will ignore any earlier adjustments made in a given capture.

startBeforeFirstMatchOf

(#set! adjust.startBeforeFirstMatchOf "\\*\\*/")

Argument: A regular expression describing a match pattern.

Test the given pattern against the node’s text content. If matched, alters the given range’s start position to reflect the exact string index of the start of the match.

References an absolute position derived from a regular expression — hence will ignore any earlier adjustments made in a given capture.

startAfterFirstMatchOf

(#set! adjust.startAfterFirstMatchOf "\\*\\*/")

Argument: A regular expression describing a match pattern.

Test the given pattern against the node’s text content. If matched, alters the given range’s start position to reflect the string index immediately after the end of the match.

References an absolute position derived from a regular expression — hence will ignore any earlier adjustments made in a given capture.

endBeforeFirstMatchOf

(#set! adjust.endBeforeFirstMatchOf "\\*\\*/")

Argument: A regular expression describing a match pattern.

Test the given pattern against the node’s text content. If matched, alters the given range’s end position to reflect the exact string index of the start of the match.

References an absolute position derived from a regular expression — hence will ignore any earlier adjustments made in a given capture.

endAfterFirstMatchOf

(#set! adjust.endAfterFirstMatchOf "\\*\\*/")

Argument: A regular expression describing a match pattern.

Test the given pattern against the node’s text content. If matched, alters the given range’s end position to reflect the string index immediately after the end of the match.

References an absolute position derived from a regular expression — hence will ignore any earlier adjustments made in a given capture.

Fold predicates

All scope tests are also available for captures in folds.scm. In addition, special predicate behaviors are available in the fold namespace:

invalidateOnChange

(#set! fold.invalidateOnChange true)

Tells the fold engine that any edit falling within the bounds of the captured node should force Pulsar to re-test this fold for validity.

Pulsar keeps a cache of row-by-row foldability so that it doesn’t have to re-test every fold in the document after every single buffer change. In general, it assumes that an edit on row Z won’t affect whether a fold can start on row Y.

But sometimes you might need to define a fold that can switch from valid to invalid — or vice versa — based on an edit on a row other than the fold’s starting row. If so, specify this predicate so that every edit within this fold’s range will invalidate the fold cache for each row of the range.

This is very similar to the highlight.invalidateOnChange setting.

This predicate has an effect on @fold captures, but not on @fold.start or @fold.end captures.

Fold adjustments

Fold adjustments are used to customize where a fold ends. (A fold’s beginning is always at the end of its starting line and can’t be adjusted.)

All of these adjustments use the fold namespace. They have an effect on @fold captures, but not on @fold.start or @fold.end captures.

Fold adjustments can be chained, but only in certain circumstances. Some adjustments are inherently “relative” and can operate on the result of earlier adjustments. Others are inherently “absolute” and will ignore any earlier adjustments.
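
For example, an absolute fold adjustment can be followed by a relative one. A sketch with a hypothetical node name:

; End the fold where the node's last child begins, then pull it back two more
; characters. fold.endAt is absolute; fold.offsetEnd respects it.
((element) @fold
  (#set! fold.endAt lastChild.startPosition)
  (#set! fold.offsetEnd -2))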

endAt

(#set! fold.endAt lastChild.startPosition)

Argument: A node position descriptor representing the end of a fold.

Specify the point at which a @fold capture ends. Defaults to lastChild.startPosition when omitted. Must resolve to a position on a row later in the buffer than the node’s starting row.

Ignores any adjustments made earlier in the capture.

offsetEnd

(#set! fold.offsetEnd -2)

Argument: An integer representing a character offset.

Shifts the end of the fold a fixed number of characters in either direction.

Respects earlier range adjustments.

adjustEndColumn

(#set! fold.adjustEndColumn 4)

Argument: An integer representing an absolute column of a specific row in the buffer.

Adjusts the column of the fold range’s end point to the given value.

Respects earlier range adjustments.

adjustToEndOfPreviousRow

(#set! fold.adjustToEndOfPreviousRow true)

Adjusts the end of the fold range such that it ends at the end of the previous row.

Equivalent to (#set! fold.adjustEndColumn 0) followed by (#set! fold.offsetEnd -1).

Respects earlier range adjustments.

Indent predicates

All scope tests are also available for captures in indents.scm. In addition, special predicate behaviors are available in the indent namespace:

matchIndentOf

(#set! indent.matchIndentOf parent.startPosition)

Argument: A node position descriptor.

Indicates a row whose indentation should be used as a reference point. Mandatory on @match captures; has no effect on other captures.

offsetIndent

(#set! indent.offsetIndent 1)

Argument: An integer, positive or negative.

Indicates an offset from the level suggested by matchIndentOf. Optional on @match captures; has no effect on other captures.

allowEmpty

(#set! indent.allowEmpty true)

Force this capture to pass even if its node's text is an empty string. (May still fail, or be ignored, for other reasons.)

force

(#set! indent.force true)

Force a @dedent or @match capture to assert itself (triggering a possible indentation change) even when it isn’t the first non-whitespace content on the line. (Not recommended unless the author has an alternative means of preventing the indentation change from happening after every keystroke.)

Creating a Grammar

Pulsar's modern syntax highlighting and code folding system is powered by Tree-sitter. Tree-sitter parsers create and maintain full syntax trees representing your code.

Modeling the buffer as a syntax tree gives Pulsar a comprehensive understanding of the structure of your code, which has several benefits:

  1. Syntax highlighting will not break because of formatting changes.
  2. Code folding will work regardless of how your code is indented.
  3. Editor features can operate on the syntax tree. For instance, the Select Larger Syntax Node and Select Smaller Syntax Node commands allow you to select conceptually larger and smaller chunks of your code.
  4. Community packages can use the syntax tree to understand and manipulate code more intelligently.

Tree-sitter grammars are relatively new. Many languages in Pulsar are still supported by TextMate grammars — and TextMate grammars are easier to write from scratch, so they may still be the best option for more obscure languages. But if an up-to-date tree-sitter parser already exists for a language, a tree-sitter grammar will be more performant and easier to author.

NOTE: Atom was the first editor to support Tree-sitter grammars, but the legacy implementation predated many of Tree-sitter’s current features. The need for Pulsar to switch to the web-tree-sitter bindings for compatibility reasons was an opportunity to update the architecture and the original tree-sitter grammars. The new grammars should deliver better syntax highlighting, indentation, and code folding, while still offering better performance than TextMate-style grammars. Legacy Tree-sitter grammars are deprecated and will soon be removed from Pulsar.

Getting Started

There are three components required to use Tree-sitter in Pulsar: a parser, a grammar file, and a handful of query files.

The Parser

Tree-sitter generates parsers based on context-free grammars that are typically written in JavaScript. The generated parsers are C libraries that can be used in other applications as well as Pulsar.

They can also be developed and tested on the command line, separately from Pulsar. Tree-sitter has its own documentation page on how to create these parsers. The Tree-sitter GitHub organization also contains a lot of example parsers that you can learn from, each in its own repository.

Pulsar uses web-tree-sitter — the WebAssembly bindings to tree-sitter. That means that you’ll have to build a WASM file for your parser before it can be used.

If you want to use an existing parser, you’ll probably be able to find it on NPM. If you’ve written your own parser, it’s a good idea to publish it to NPM yourself. Either way, you should install it as a devDependency for your language-* package.

You can then go into the directory for your parser and use the Tree-sitter CLI to build the WASM file:

cd node_modules/tree-sitter-foo
tree-sitter build-wasm .

Building WASM files from the tree-sitter CLI requires either a local installation of Emscripten or use of a Docker image. See this reference for details.

The Package

Once you have a WASM file, you can use it in your Pulsar package. Packages with grammars are, by convention, always named starting with language-. You'll need a folder with a package.json, a grammars subdirectory, and a single JSON or CSON file in the grammars directory, which can be named anything.

We’ve also decided to put our WASM file in the grammars/tree-sitter subdirectory, though this is just a convention. The SCM files alongside our WASM file will be explained in a moment.

language-mylanguage
├── LICENSE
├── README.md
├── grammars
│   ├── mylanguage.cson
│   └── tree-sitter
│       ├── grammar.wasm
│       ├── folds.scm
│       ├── highlights.scm
│       ├── indents.scm
│       └── tags.scm
└── package.json

The Grammar File

The mylanguage.cson file specifies how Pulsar should use the parser you created.

Basic Fields

It starts with some required fields:

name: 'My Language'
scopeName: 'source.mylanguage'
type: 'modern-tree-sitter'
parser: 'tree-sitter-mylanguage'

  • scopeName - A unique, stable identifier for the language. Pulsar users will use this identifier in configuration files if they want to specify custom configuration based on the language. Examples: source.js, text.html.basic.
  • name - A human-readable name for the language.
  • parser - The name of the parser node module that will be used for parsing. This should point to the NPM package from which the WASM file was built. (This value is currently unused, but is required as a way of future-proofing in case Pulsar should migrate to a different tree-sitter binding in the future.)
  • type - This should have the value modern-tree-sitter to indicate to Pulsar that this is a modern Tree-sitter grammar, as opposed to a legacy Tree-sitter grammar (soon to be removed from Pulsar) or a TextMate grammar.

Tree-sitter Fields

The treeSitter configuration key holds the fields that specify the paths on disk to the grammar and its query files:

treeSitter:
  grammar: 'tree-sitter/grammar.wasm'
  highlightsQuery: 'tree-sitter/highlights.scm'
  foldsQuery: 'tree-sitter/folds.scm'
  indentsQuery: 'tree-sitter/indents.scm'

All values are paths that will be resolved relative to the grammar configuration file itself. Of these, grammar is the only required field.

  • grammar — The path to the WASM file you generated earlier.
  • highlightsQuery — The path to a file (canonically called highlights.scm) that will tell Pulsar how to highlight the code in this language. (Most Tree-sitter repositories include a highlights.scm file that can be useful to consult, but should not be used in Pulsar, because its naming conventions are different from Pulsar’s.)
  • foldsQuery — The path to a file (canonically called folds.scm) that will tell Pulsar which ranges of a buffer can be folded.
  • indentsQuery — The path to a file (canonically called indents.scm) that will tell Pulsar when it should indent or dedent lines of code in this language.
  • tagsQuery — The path to a file (canonically called tags.scm) that will identify the important symbols in the document (class names, function names, and so on) along with their locations. If present, Pulsar will use this query file for symbol navigation. (Most Tree-sitter repositories include a tags.scm file that can be understood as-is by Pulsar and is a good starting point.)

You can skip indentsQuery if your language doesn’t need indentation hinting, foldsQuery if it doesn’t need code folding, or even highlightsQuery in the unlikely event that your language does not need syntax highlighting.

Any of the settings that end in Query can also accept an array of relative paths, instead of just a single path. At initialization time, the grammar will concatenate each file’s contents into a single query file. This isn’t a common need, but is explained further below.

Language Recognition

Next, the file should contain some fields that indicate to Pulsar when this language should be chosen for a given file. These fields are all optional and are listed in the order that Pulsar consults them when making its decision.

  • fileTypes - An array of filename suffixes. The grammar will be used for files whose names end with one of these suffixes. Note that the suffix may be an entire filename, like Makefile or .eslintrc. If no grammars match (or more than one grammar matches) for a given file extension, ties are broken according to…
  • firstLineRegex - A regex pattern that will be tested against the first line of the file. The grammar will be used if this regex matches. If no grammars match (or more than one grammar matches) for a given firstLineRegex, ties are broken according to…
  • contentRegex - A regex pattern that will be tested against the contents of the file. If the contentRegex matches, this grammar will be preferred over another grammar with no contentRegex. If the contentRegex does not match, a grammar with no contentRegex will be preferred over this one.

Comments

The last field in the grammar file, comments, controls the behavior of Pulsar's Editor: Toggle Line Comments command. Its value is an object with a start field and an optional end field. The start field is a string that should be prepended to or removed from lines in order to comment or uncomment them.

In JavaScript, it looks like this:

comments:
  start: '// '

The end field should be used for languages that only support block comments, not line comments. If present, it will be appended to or removed from the end of the last selected line in order to comment or un-comment the selection.

In CSS, it would look like this:

comments:
  start: '/* '
  end: ' */'

Syntax Highlighting

The HTML classes that Pulsar uses for syntax highlighting do not correspond directly to nodes in the syntax tree. Instead, Pulsar queries the tree using a file called highlights.scm, written in Tree-sitter’s own query language.

Here is a simple example:

(call_expression
  (identifier) @support.other.function)

This entry means that, in the syntax tree, any identifier node whose parent is a call_expression should be given the scope name support.other.function. In the editor, such an identifier will be wrapped in a span tag with three classes applied to it: syntax--support, syntax--other, and syntax--function. Syntax themes can hook into these class names to style source code via CSS or LESS files.

Some queries will be quite easy to express, but some others will be highly contextual. Consult some built-in grammars’ highlights.scm files for examples.

Scope tests: advanced querying

Tree-sitter supports additional matching criteria for queries called predicates. For instance, here’s how we can distinguish a block comment from a line comment in JavaScript:

((comment) @comment.block.js
  (#match? @comment.block.js "^/\\*"))

((comment) @comment.line.js
  (#match? @comment.line.js "^//"))

We’re using the built-in #match? predicate, along with a regular expression, to search the text within the comment node. Our regexes are anchored to the beginning of the string and test whether the opening delimiter signifies a block comment (/* like this */) or a line comment (// like this). In the block comment’s case, we don’t have to attempt to match the ending (*/) delimiter — we know it must be present, or else the tree-sitter parser wouldn’t have classified it as a comment node in the first place.

Unfortunately, there aren’t many built-in predicates in web-tree-sitter beyond #match? and #eq? (which tests for exact equality). But the ones that are present — #set!, #is?, and #is-not? — allow us to associate arbitrary key/value pairs with a specific capture. Pulsar uses these to define its own custom predicates.

For instance, you may want to highlight things differently based on their position among siblings:

(string
  "\"" @punctuation.definition.string.begin.js
  (#is? test.first))
(string
  "\"" @punctuation.definition.string.end.js
  (#is? test.last))

In most tree-sitter languages, a string node’s first child will be its opening delimiter, and its last child will be its closing delimiter. To add two different scopes to these quotation marks, we can use the test.first and test.last custom predicates to distinguish these two nodes from one another.

Prioritizing scopes

It’s common to want to add one scope to something if it passes a certain test, but a different scope if it fails the test.

; Scope this like a built-in function if we recognize the name…
(call_expression (identifier) @support.function.builtin.js
  (#match? @support.function.builtin.js "^(isFinite|isNaN|parseFloat|parseInt)$"))

; …or as a user-defined function if we don't.
(call_expression (identifier) @support.other.function.js)

This doesn’t do what you might expect because a given buffer range can have any number of scopes applied to it. That means the "parseFloat" in parseFloat(foo) will be given both of these scope names, since it matches both of these captures.

How can we get around this? One option is to use the #not-match? predicate — the negation of #match? — to ensure that anything that passes the first test will fail the second, and vice versa:

; Scope this like a built-in function if we recognize the name…
(call_expression (identifier) @support.function.builtin.js
  (#match? @support.function.builtin.js "^(isFinite|isNaN|parseFloat|parseInt)$"))

; …or as a user-defined function if we don't.
(call_expression (identifier) @support.other.function.js
  (#not-match? @support.function.builtin.js "^(isFinite|isNaN|parseFloat|parseInt)$"))

This is a fine solution for our oversimplified example, but would get pretty complicated if there were more than one fallback.

Another approach is to use Pulsar’s custom predicates called final and shy. For instance, we could use final on the first capture to “claim” it:

; Scope this like a built-in function if we recognize the name…
(call_expression (identifier) @support.function.builtin.js
  (#match? @support.function.builtin.js "^(isFinite|isNaN|parseFloat|parseInt)$")
  (#set! capture.final true))

; …or as a user-defined function if we don't.
(call_expression (identifier) @support.other.function.js)

The final predicate means that the first capture will apply its own scope name, then prevent all further attempts to add a scope to the same buffer range. This works because two different captures for the same node will be processed in the order in which their queries are defined in the SCM file — so if a token were to match both of these captures, it’s guaranteed that the first capture would be processed before the second.

Another option would be to use shy on the second capture:

; Scope this like a built-in function if we recognize the name…
(call_expression (identifier) @support.function.builtin.js
  (#match? @support.function.builtin.js "^(isFinite|isNaN|parseFloat|parseInt)$"))

; …or as a user-defined function if we don't.
(call_expression (identifier) @support.other.function.js
  (#set! capture.shy true))

The shy predicate creates a true fallback option; it only applies its scope if no other scope — not just one that uses capture.final — has previously been applied for the same buffer range. But it doesn’t “lock down” its buffer range the way that final does, so a later capture could add another scope to the same range.

There’s one caveat to mention. Consider this query:

(call_expression (identifier) @support.other.function.js @meta.something-else.js
  (#set! capture.final true))

This is a valid Tree-sitter query; you can assign more than one capture name to the same node. But the outcome might be surprising: support.other.function.js will be applied, and meta.something-else.js will not. This happens because these two capture names aren’t processed simultaneously; they’re processed in sequence. So the capture.final predicate will act after the first capture name and prevent the second from being applied.

Here’s one way to rewrite this to have the intended effect:

; Apply each capture in its own query…
(call_expression (identifier) @support.other.function.js)

; …and use capture.final only on the second capture name.
(call_expression (identifier) @meta.something-else.js
  (#set! capture.final true))

Other scope tests

Pulsar defines many custom predicates, otherwise known as scope tests, to help grammar authors define accurate syntax highlighting. All scope tests are prefaced with test. in #is? and #is-not? predicates.

You’ve already seen an example with first and last. Other examples include:

  • firstOfType and lastOfType, for matching only the first or last node of a certain type among siblings
  • ancestorOfType and descendantOfType, for testing whether a node contains, or is contained by, a node of a certain type
  • config, for capturing certain nodes conditionally based on the user’s configuration

Some tests take a single argument…

((identifier) @foo
  (#is? test.descendantOfType "function"))

…but any test that doesn’t require an argument can be expressed without one.

((identifier) @foo
  (#is? test.lastOfType))

Consult the ScopeResolver API documentation for a full list.

Scope adjustments: tweaking buffer ranges

There may be times when the range you want to highlight doesn’t correspond exactly with the range of a single node in the syntax tree. For those situations, you can use scope adjustments to tweak the range:

; Scope the `//` in a line comment as punctuation.
((comment) @punctuation.definition.comment.js
  (#match? @punctuation.definition.comment.js "^//")
  (#set! adjust.startAndEndAroundFirstMatchOf "^//"))

Some adjustments move the boundaries of the scope based on pattern matches inside a node’s text, like in the example above. Others may move the boundaries based on a node position descriptor — a string like lastChild.startPosition that points to a specific position in a tree relative to another node — so they can wrap a single scope around two or three adjacent sibling nodes.

There’s only one catch: adjustments can only narrow the range of a capture, not expand it. That’s important because Pulsar depends on tree-sitter to tell it when certain regions of the buffer are affected by edits and need to be re-highlighted. That system won’t work correctly if a capture that starts and ends on row 1 can stretch itself to add a scope to something on row 100.
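
For instance, to wrap one scope around two adjacent sibling nodes, you can capture their shared parent and then narrow the range. A sketch with illustrative node and scope names:

; Scope a pair's key and its separator, but not its value: capture the whole
; pair, then end the range where its last child (the value) begins.
((pair) @meta.structure.key-with-separator
  (#set! adjust.endAt lastChild.startPosition))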

Consult the ScopeResolver API documentation for a full list.

Sharing query files

You may find it appropriate for two different grammars to share a query file. This can work when one Tree-sitter parser builds upon the work of another; for instance, the tree-sitter-tsx parser is basically the tree-sitter-typescript parser with JSX additions, so it makes sense for the two grammars to share most of their query files.

For this reason, each of the fields that ends in Query in a grammar definition file can accept an array of paths instead of a single path. Consider a hypothetical grammar for TypeScript-plus-JSX:

treeSitter:
  grammar: 'tree-sitter-tsx/grammar.wasm'
  languageSegment: 'ts.tsx'
  highlightsQuery: [
    'highlights-common.scm'
    'tree-sitter-tsx/highlights.scm'
  ]
  foldsQuery: 'tree-sitter-tsx/folds.scm'
  indentsQuery: 'tree-sitter-tsx/indents.scm'

The highlights query loads two different files: one that is common to tree-sitter-typescript and tree-sitter-tsx, and one that is unique to tree-sitter-tsx. The latter file would contain queries that deal with JSX and anything else that would not be understood by tree-sitter-typescript.

You might also notice a new key: languageSegment. This optional property allows one to write a shared query file generically…

(class_declaration
  name: (type_identifier) @entity.name.type.class._LANG_)

…while retaining the ability to add a grammar-specific scope segment at the end of a capture. At initialization time, all _LANG_ segments in this SCM file would be dynamically replaced with ts.tsx, and the capture name above would become @entity.name.type.class.ts.tsx. In the ordinary TypeScript grammar, specifying a languageSegment of ts would allow that grammar to define a capture name of @entity.name.type.class.ts.

Language Injection

Often, a source file will contain code written in several different languages. An HTML file, for instance, may need to highlight JavaScript (if the file has an inline script element) or CSS (if the file has an inline style element).

Tree-sitter grammars support this situation using a two-part process called language injection. First, an 'outer' language must define an injection point - a set of syntax nodes whose text can be parsed using a different language, along with some logic for guessing the name of the other language that should be used. Second, an 'inner' language must define an injectionRegex - a regex pattern that will be tested against the language name provided by the injection point.

The inner language can, in turn, define any injection points it may need, such that different grammars can be “nested” inside of a buffer to an arbitrary depth.

The code to define language injections should be placed inside of lib/main.js within your language- package. That file should export a function called activate that defines the injection points:

exports.activate = () => {
  atom.grammars.addInjectionPoint(/* … */);
};

Be sure to include a main field in the package’s package.json that points to this file:

"main": "lib/main"

Using addInjectionPoint

In JavaScript, tagged template literals sometimes contain code written in a different language, and the tag’s name tends to hint at the language being used inside the template string:

// HTML in a template literal
const htmlContent = html`<div>Hello, ${name}</div>`;

// CSS in a template literal
const styles = styled.a`
  border: 2px solid #000;
  color: #fff;
`

The tree-sitter-javascript parser parses the first tagged template literal as a call_expression with two children: an identifier and a template_string:

(call_expression
  function: (identifier)
  arguments: (template_string
    (template_substitution
      (identifier))))

So here’s how we might allow syntax highlighting inside of template literals:

atom.grammars.addInjectionPoint("source.js", {
  type: "call_expression",

  language(callExpression) {
    const { firstChild, lastChild } = callExpression;
    if (firstChild?.type === "identifier" && lastChild?.type === "template_string") {
      return firstChild.text;
    }
  },

  content(callExpression) {
    return callExpression?.lastChild;
  }
});

So what happens when we use an html tagged template literal, as in the first example above?

  1. Every call_expression node in the tree would be assessed as a possible candidate for injection.

  2. Each of those nodes would be passed into our language callback to see if it can be matched to an injection language. In our example, first we’d inspect the tree to make sure this is a tagged template literal; then we’d return the text of the identifier node — html in this example. If that string can be matched with a known grammar, the injection can proceed.

  3. The content callback would then be called with the same call_expression node and return the last child of the call_expression node — which we’ve already proven is of the type template_string — since that node precisely describes the content that should be parsed as HTML. That node exists in our example, so the injection can proceed.

We skipped something important in step 2: how do we turn the string html into the HTML grammar? The HTML grammar file would need to specify an injectionRegex, so that the string html returned from the language callback can match itself to the right grammar:

injectionRegex: 'html|HTML'

If more than one grammar’s injectionRegex matches the string in question, Pulsar will pick the grammar whose injectionRegex produced the longest string match.

When defining your own injectionRegex, consider how specific you want your pattern to be. In our example, it’s a safe bet that any template literal tag or heredoc string delimiter that even contains the string HTML is describing content that should be highlighted like HTML. But a language like C, for example, might want to define a much more restrictive pattern that won’t get matched for an identifier like coffeescript or c-sharp:

injectionRegex: '^(c|C)$'

The injectionRegex property is only required for grammars that expect to be injected into other grammars. If this doesn’t apply to your grammar, you can omit injectionRegex altogether.

Advanced injection scenarios

Each individual injection point understands that it might have to operate on disjoint ranges of the buffer, ignoring certain ranges of content in between. Let’s look at our example code again:

// HTML in a template literal
const htmlContent = html`<div>Hello, ${name}</div>`;

Here, ${name} is a JavaScript template string interpolation, and it has no special meaning in HTML. But an interpolation could include arbitrary JavaScript, including content that would flummox a parser designed to interpret HTML.

It’s for this reason that, by default, injection points ignore the descendants of their content nodes. When an injection determines which ranges in the buffer it should parse, it takes the ranges described by its content nodes and subtracts the ranges of their children, unless instructed otherwise. So our injection will cover the range of the template_string minus the ranges of any interpolations (the template_substitution child nodes).

When the injection content is parsed, Tree-sitter will look at only the ranges Pulsar tells it to; anything outside those ranges will be invisible to the parser.

Ignoring child nodes is the correct decision for scenarios like tagged template literals and heredoc strings, but it might not be the correct decision for other injections, so this behavior is configurable.

Here are some other things you can do with injections, if needed:

  • block the parent grammar from highlighting anything in the injection’s content ranges
  • include the injected language’s base scope name (text.html.basic in our case), include a different base scope name instead, or omit it altogether
  • consolidate ranges that are separated only by whitespace
  • include newline characters when they appear between disjoint content ranges so that the injection’s parser doesn’t think those ranges are part of the same line

For more information on these features, read the API documentation for addInjectionPoint.

Code Folding

Code folding can only happen if the grammar helps Pulsar to understand which ranges of the buffer represent logical sections that can be collapsed. A tree-sitter grammar does this via folds.scm — a query file whose only purpose is to mark “foldable” sections of the buffer.

Simple folds

The complexity of a folds.scm will vary based on the language. At its simplest, it will look like the following:

(block) @fold

Believe it or not, that’s the entire contents of the folds.scm file inside the language-css package. Because CSS’s syntax is very regular, the block node can handle all situations where content is enclosed in a pair of curly braces.

The @fold capture is called a simple fold. It’s the easiest kind of fold to describe because Tree-sitter has done most of the work simply by identifying the regions to be folded. By default, here’s how Pulsar turns that capture into a foldable range:

  1. It will inspect the block node to find out the buffer rows it starts and ends on. If the block node starts on row X, the fold will begin at the very end of row X.
  2. It assumes the very last child of block is its closing delimiter (because that’s usually true) and sets the ending boundary of the fold to be just before that closing delimiter, so that both the opening and closing delimiters are visible when the range is folded.
  3. If the two ends of the fold are on different rows, the fold is valid, and will be indicated by a chevron in the gutter of row X. If the would-be fold range starts and ends on the same row, the fold is invalid and therefore ignored.

Sometimes simple folds need tweaking — for instance, in the case of a multi-line if/else-if/else construction. And in some languages, like Python, there aren’t any ending delimiters, so this logic won’t work out of the box.

That’s why Pulsar lets you customize the ends of simple folds: for instance, by specifying a different ending position in the tree relative to the starting node, or by altering a position by an arbitrary amount, nudging it a few characters in either direction or moving it to the end of the previous line.
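
For instance, here is a sketch of what an indentation-based language might do. The node name is borrowed from a Python-like parser, and the query is illustrative rather than copied from a shipping grammar:

; There is no closing delimiter, so end the fold at the end of the node's last
; child instead of just before it.
((function_definition) @fold
  (#set! fold.endAt lastChild.endPosition))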

Consult the API documentation for more information.

Divided folds

There’s a different kind of fold, called a divided fold, that can be used when simple folds aren’t an option. They use the capture names @fold.start and @fold.end to specify the boundaries of the fold in two separate captures.

Divided folds are needed when the region to be folded isn’t represented by a single node, or by some predictable path from one node to another. Examples include preprocessor definitions in C/C++ files and sections inside Markdown files.

But the best example might be complex conditionals in shell scripts. In most other built-in tree-sitter grammars, these conditionals can be handled with simple folds. But shell scripts are a bit “messy,” and the parsed tree reflects that:

(if_statement "then" @fold.start)
(elif_clause) @fold.end
(elif_clause "then" @fold.start)
(else_clause) @fold.end @fold.start
"fi" @fold.end

A @fold.start capture on a node means that a fold will start at the end of that node’s starting row. A @fold.end capture on a node means that a fold will end at the end of the row before that node’s starting row.

This behavior allows a given node to be captured with both @fold.start and @fold.end, as in the case of the else_clause above. If we see an else on row 10, it means that one fold has ended at the end of row 9, and another one will begin at the end of row 10.

Divided folds need to pair up, and Pulsar pairs them up by starting with a @fold.start capture and looking for a balanced occurrence of @fold.end, keeping in mind that folds can be nested inside other folds.

There are good reasons to prefer simple folds wherever possible, and to use divided folds only when there isn’t another option. For one thing, it becomes the grammar author’s responsibility to ensure that @fold.start and @fold.end are captured in equal numbers, and that each @fold.start matches up with its intended @fold.end.

You can read more about folds in the API documentation.

Indents

The third sort of query file is typically called indents.scm, and its purpose is to identify items in the tree that hint at indents or dedents.

To oversimplify, here’s how indentation typically works in Pulsar, regardless of which sort of grammar is used:

  • If the user is typing on row 9, then presses Enter, we’ll decide whether to indent row 10 based on what’s present on row 9. For example, if row 9 ends with an opening curly brace ({), that’s a clear sign that row 10 should start with the cursor one level deeper than row 9. Therefore: to decide whether to indent a row, we usually examine the row above it.

  • When the user starts typing on row 10, we might decide that row 10 shouldn’t be indented after all. For instance, if the first character the user typed is a closing curly brace (}), then Pulsar will immediately decrease the indent level of that line by one level. Therefore: to decide whether to dedent a row, we usually examine the content of the row itself.

In TextMate grammars, the decisions to indent and dedent are made by comparing the contents of lines to regular expressions. In Tree-sitter grammars, the decisions are made through query captures — typically captures named @indent and @dedent.

This is a good starting point for an indents.scm for a C-like language:

["{" "[" "("] @indent
["}" "]" ")"] @dedent

The fact that Tree-sitter grammars expose their delimiters in the tree as anonymous nodes makes it very easy to interpret indentation hints. You are encouraged to capture anonymous nodes in your indents.scm when possible — because (a) they’re usually the best signifiers of when indentation needs to happen, and (b) they tend to be present in a tree even when the tree is in an error state (like when the user is in the middle of typing a line).

Here’s how we’d use these queries to make indentation decisions:

  • Starting with an empty JavaScript file, a user types if (foo) { and presses Enter. Pulsar runs an indent query on row 1, gets a match for @indent, and responds by increasing the indent level on the next line.
  • The user types a placeholder comment like // TODO implement later and presses Enter again. A query runs against row 2, finds no matches, and therefore decides that row 3 should maintain row 2’s indentation level.
  • Finally, the user types }. After that keystroke, Pulsar runs an indent query on row 3, finds that the row now starts with a @dedent capture, and responds by dedenting row 3 by one level immediately.

Thus you can see that @indent means “indent the next line,” while @dedent typically means “dedent the current line.” But keep this in mind as well:

  • If the user had instead typed if (foo) { /* TODO */ } on row 1 and pressed Enter, we’d need to be smart enough to know that the { that signals an indent was “cancelled out” by the } that came after it. That’s the second purpose of @dedent: to balance out @indent captures when deciding whether to indent the next line.

  • A @dedent capture typically results in a dedent only when it’s the first non-whitespace content on the row. And if the row to be dedented is the one being typed on, Pulsar will trigger a dedent exactly once, rather than after each character typed on the row.

    Why? Because if you don’t want the row to be dedented after all, this behavior allows you to re-indent the row the way you want and continue typing without Pulsar stubbornly trying to re-dedent the row over and over in a perverse game of tug-of-war.

Some other languages don’t have it so easy. Ruby, for instance, doesn’t open its logical blocks with one consistent delimiter…

if foo
  bar
end

while x < y
  x += 2
end

…which means that its indents.scm looks more complex.

[
  "class" "def" "module" "if" "elsif" "else" "unless" "case" "when" "while"
  "until" "for" "begin" "do" "rescue" "ensure" "(" "{" "["
] @indent

[
  "end" ")" "}" "]" "when" "elsif" "else" "rescue" "ensure"
] @dedent

Advanced indents

Indent captures can use the same set of scope tests that were described earlier for syntax highlighting, because sometimes a node should only hint at an indent in certain situations.

For instance, we can handle “hanging” indents like this one…

  return this.somewhatLongMethodName() ||
    this.somehowAnEvenLongerMethodName();

…because we understand that || can’t possibly terminate a JavaScript statement, so the next line must be a logical continuation of the statement.

(["||" "&&"] @indent
  (#is? test.lastTextOnRow))

@indent and @dedent are often the only captures you need. But for unusual situations, Pulsar allows for other sorts of captures:

  • @dedent.next can be used for the situation where something in row X hints that row X + 1 should be dedented no matter what its content is.

    One example would be a conditional statement without braces…

    if (!e.shiftKey)
      return e.preventDefault();

    …because the line immediately after this code should always be dedented one level from the return statement. (A sketch of such a query appears after this list.)

  • @match is a powerful capture that can accept configuration. When a @match capture is present, it will set the indent level of the current row to equal the level of a specific earlier row. For instance, consider one way to indent a switch statement:

    switch (job) {
      case 'lint':
        lintFile();
      case 'fix':
        fixFile();
      default:
        console.warn("Unknown job");
    }

    This indentation style means that the closing brace (}) should be dedented two levels from the previous line. A @match capture can handle this as follows…

    ((switch_body "}" @match
      (#set! indent.matchIndentOf parent.startPosition)))

    …because this capture tells Pulsar to set the closing brace’s row to match the indent level of the row where the switch_body itself starts. Pulsar therefore sets row 8’s level to match row 1’s.

    @match captures can also define an offset — for scenarios where they want to indent themselves some number of levels more or less than a reference row.
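
Here is a sketch of the @dedent.next case mentioned above. It is not the query that ships with Pulsar's JavaScript grammar; the consequence field name comes from tree-sitter-javascript, and the scope tests simply exclude braced bodies and one-line conditionals:

; Dedent the row after a braceless `if` body, no matter what that row contains.
((if_statement
  consequence: (_) @dedent.next)
  (#is-not? test.type statement_block)
  (#is-not? test.startsOnSameRowAs parent.startPosition))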

Read the full indent query documentation to learn the details.

Tags

The fourth sort of query file is typically called tags.scm, and its purpose is to identify “symbols” — nodes in the tree that contain the names of important things.

Pulsar’s knowledge of symbols is what allows you to press Cmd+R or Ctrl+R and navigate a source code file by function name, or a CSS file by selector, or a Markdown file by heading name.

The tags.scm file present in most Tree-sitter repositories goes into a level of detail far greater than what Pulsar needs, but that file will nonetheless work pretty well as-is if used as your grammar’s tags query file.

Writing a tags.scm

Let’s write our own tags.scm just to understand how they work.

Suppose we’ve got a Markdown grammar and we want all Markdown headings to be recognized as symbols:

(atx_heading (heading_content) @name)
(setext_heading (heading_content) @name)

Markdown has two different styles of heading, so we’ve written two query expressions. We assign the @name capture to the specific text that is meant to function as the symbol.

The built-in package named symbol-provider-tree-sitter will use this file to query the tree whenever a user runs the Symbols View: Toggle File Symbols command, typically bound to Cmd+R or Ctrl+R. For Markdown, we’ve just done all we need to do to ensure that headings show up in that symbols list.

Advanced features

If the exact text of the match isn’t quite what you want to show in the symbol list view, there are ways to alter that text before it’s displayed.

For instance, if we wanted to indicate which kind of heading each symbol is, we could do something like this:

(atx_heading
  (atx_h5_marker)
  (heading_content) @name
  (#set! symbol.prepend "Heading 5: "))

Any string present as symbol.prepend will be prepended to the symbol name before it appears in the symbol list.

As you might expect, there’s also a symbol.append:

(atx_heading
  (atx_h5_marker)
  (heading_content) @name
  (#set! symbol.append " (Heading 5)"))

And, for removing certain content altogether from a name before display, there’s symbol.strip:

; If our parser didn't separate the heading punctuation from the heading text
; in the tree, we could do it ourselves.
((atx_heading) @name
  (#set! symbol.strip "^#{1,6}\\s"))

How to convert a grammar

This document is aimed at someone who wants to adapt an existing TextMate-style grammar to the new tree-sitter approach, but it will probably be useful even if you’re starting from scratch.

If you’re adapting a grammar that has both TM-style and old-tree-sitter style versions, prefer the TM-style; in general they’re much better at describing scopes.

Here are the steps I followed:

Find a tree-sitter parser

If one is available and up-to-date (some are unmaintained), then you’ll want to turn it into a wasm file. The tree-sitter CLI can do this, but is very particular about the versions of tree-sitter and Emscripten that should be used. This document is very good at describing which versions you want for each.

The emsdk approach described at the bottom of that document is working fine for me on macOS, but I know @mauricioszabo uses the Docker approach instead.

For me, this command is enough to get a usable wasm file for most parsers:

tree-sitter build-wasm .

Where . is the root of the tree-sitter-X repository.

Once you’ve got a wasm file, you’ll want to put it inside of a Pulsar package’s directory structure.

Creating the grammar

WASM-tree-sitter grammar files look a lot like their legacy-tree-sitter siblings. Here’s the one for Ruby:

name: 'Ruby'
scopeName: 'source.ruby'
type: 'tree-sitter-2'
parser: 'tree-sitter-ruby'

injectionRegex: 'rb|ruby'
treeSitter:
  grammar: 'ts/grammar.wasm'
  syntaxQuery: 'ts/highlights.scm'
  localsQuery: 'ts/locals.scm'
  foldsQuery: 'ts/folds.scm'
  indentsQuery: 'ts/indents.scm'

firstLineRegex: [
  # shebang line
  '^#!.*\\b(\\w*ruby|rake)\\r?\\n'

  # vim modeline
  'vim\\b.*\\bset\\b.*\\b(filetype|ft|syntax)=ruby'
]

fileTypes: [
  'rb',
  'rake',
  'Podfile',
  'Brewfile',
  'Rakefile',
  'Gemfile'
]

Some notes:

  • You may want to make the name field slightly different for the moment to make it easier to swap between the existing grammar and your grammar. I’ve been naming them (e.g.) Ruby (WASM) while working in development.

  • The type: 'tree-sitter-2' is what tells Pulsar to treat this as a WASMTreeSitterLanguageGrammar instead of a TreeSitterLanguageGrammar. The naming is silly, but we don’t yet have consensus on a better name.

  • The parser: 'tree-sitter-ruby' line is not used right now, but we’re keeping it because it may be used in the future. For instance, if node-tree-sitter became compatible with Electron again, we could switch back to it and keep most of the infrastructure we’ve built around web-tree-sitter.

  • The treeSitter key is where the important stuff goes:

    • grammar points to the grammar file.

    • All keys ending in Query point to the SCM files that handle highlights, folds, indents, and locals. All are optional except for syntaxQuery.

    • All paths are relative to the grammar file itself. The directory structure described here is not mandatory, and in fact I’ve been experimenting with other structures in other grammars.

Save this file with a descriptive name; I’ve started naming the grammar files modern-tree-sitter-x.cson (where x is the name of the language), but tree-sitter-2-x.cson also exists.

Save the file, then reload your window. Look for warnings in the console; if your grammar fails to activate, you’ll probably be able to figure out why.

Get started with highlighting

  • Make sure you’re in dev mode (pulsar --dev from the command line).

  • Find a good source code file that displays a wide variety of syntactic constructs. Open two copies of it in different panes so that you can compare them side-by-side. One file should use the new wasm-tree-sitter grammar, and the other should use the TM-style grammar.

  • Open the grammar’s highlights.scm file. Ideally, this would be in a third pane, but I’ve just had it in the same pane as the TM-style grammar and flipped between them as needed.

  • At this point, you should be able to make changes to the highlights.scm and see them take effect immediately when you save.

What are my goals in applying scopes?

  • The built-in syntax themes are not incredibly finicky. They divide things roughly into the same categories as the top-level namespaces of TextMate’s scope taxonomy.

  • But syntax themes in general can be arbitrarily finicky, and our goal is generally that a syntax theme should be able to make two things look different if there’s a true difference between them, no matter how slight.

  • Remember that the scope system is also used for semantic purposes. Snippets can be defined under arbitrarily deep selectors — e.g., “when the cursor is immediately to the left of a punctuation mark that ends a string” — and in that respect, tree-sitter grammars should be no less useful than TextMate-style grammars.

Do I start with the tree-sitter package’s built-in highlights.scm?

Yes and no.

Most repos define a basic highlights.scm file in the queries directory.

It’s a great reference for what sorts of things to highlight, and it’s a great way to make sure you didn’t leave out a particular keyword or language construct.

But often we’ll be making different choices on how to classify things, so if you do start with the built-in highlights.scm, please make sure you don’t leave any of the default capture names in place.

Make things look similar

There’s no one way to do this, so follow your instincts. I like to divide an SCM file into rough sections as follows:

; CLASSES

; FUNCTIONS

; COMMENTS

; STRINGS

; NUMBERS

; CONSTANTS

; KEYWORDS

; OPERATORS

; PUNCTUATION

Then, I proceed through the file and try to make scopes match up. In the TM-grammar pane, you can put your cursor anywhere and run Editor: Log Cursor Scope to see the scopes that apply at that buffer position. The last one is typically the one you want to match.

To inspect the syntax tree generated by tree-sitter:

  • install tree-sitter-tools;
  • make sure a buffer is using your modern-tree-sitter grammar; then
  • run the Tree Sitter Tools: Open Inspector For Editor command.

You’ll see the syntax tree in a pane to the right. You’ll probably want to show anonymous nodes.

Scope conventions

Scope naming is important because there are packages — mainly third-party, but some built-in — that rely on the conventions of scopes. They expect to be able to tell block comments from line comments, and single-quoted strings from double-quoted strings, and so on.

Bookmark this document; you’ll be referring to it a lot.

  • End all scope names with some shorthand form of the language name. In Ruby it’s ruby; in JavaScript it’s js, in shell scripts it’s shell. If you’re unsure, use whatever the TM-style grammar uses; if you’re building one from scratch, maybe use the language’s most common file extension.

  • Comments must be identified with comment.line or comment.block at bare minimum. Line comments should further be annotated with the type of delimiter — hence JavaScript line comments will be comment.line.double-slash.js (see the sketch after this list). The TM-style grammar will almost certainly abide by this convention, but if it doesn’t, be sure to fix that.

  • Strings should be marked with either string.quoted or string.unquoted. Quoted strings should further specify the type of delimiter — string.quoted.single, string.quoted.double, and so on. Oddballs exist: %q strings in Ruby, for example, are scoped as string.quoted.other.ruby.

  • You will come across a lot of scopes that begin with meta. Most of the time, TM-style grammars used them to mark and distinguish various kinds of pattern-matching strategies. By and large, these can be ignored, unless you think they have some sort of meaning that will be useful for tooling to know about. But there are a couple of exceptions:

    • meta.embedded is pretty commonly used to mark “embedded” kinds of things: interpolations in strings, ERB/EJS blocks in HTML, and so on. Some syntax themes style these sections by introducing a subtle background color. Keep these scopes. In fact, subdivide them into meta.embedded.block and meta.embedded.line, depending on whether they’re likely to be single-line or multi-line.

    • Some scopes can be kept, or even introduced, if you think they convey information that can’t easily be obtained another way.

      For instance, the JavaScript web-tree-sitter grammar scopes the entire inner contents of a class definition — everything between the braces — with meta.block.class.

      This can be helpful for several reasons, but most obvious is that the syntax for defining a method inside a class body differs from the syntax for defining a function outside of a class body. Imagine a def snippet that expands to function foo () {} in the general case, but then a def snippet in a more-specific scope selector that expands to foo () {} when you’re within a class body.

  • Some scope names in TM-style grammars are dynamically determined; a pattern that matches both if and else might use one of the capture matches so that it can assign keyword.control.if.x and keyword.control.else.x with just one pattern.

    There’s support for this in web-tree-sitter grammars: any capture with _TYPE_ in the scope name will have the node type interpolated into it. So you can put a keyword’s name into the scope name quite simply:

    ["if" "else"] @keyword.control._TYPE_.x

    _TYPE_ will interpolate a node’s “type” (name) into the scope name, and works the same with anonymous nodes and named nodes. _TEXT_ will interpolate the actual text of the node into the scope name, though only if the node’s text contains no spaces.
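
To make a few of these conventions concrete, here’s a rough sketch of what they might look like in a JavaScript highlights.scm. The node names assume the tree-sitter-javascript parser, and the exact scope names are illustrative rather than canonical:

; Line comments, annotated with their delimiter.
((comment) @comment.line.double-slash.js
  (#match? @comment.line.double-slash.js "^//"))

; Double-quoted strings.
(string "\"") @string.quoted.double.js

; Interpolations inside template strings, marked as embedded code.
(template_substitution) @meta.embedded.line.interpolation.js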

Divergences from TextMate grammars

There are a few things that happen inside of some grammars that we should try to get away from.

The big one (function calls)

entity.name.function should only ever refer to a function definition. Lots of grammars violate this rule, especially if they weren’t originally created for TextMate itself.

In TextMate, the entity namespace is devoted to “sections” of a larger document — e.g., things you might expect to see in a symbols list. Hence a function invocation should have a scope name like support.function.$lang for built-in functions, or support.other.function.$lang for functions that aren’t recognized as built-in.

For example, anything recognized as a common C function (sprintf, strcmp, malloc, etc.) is scoped as support.function.C99.c, whereas other functions whose names are not recognized are scoped as support.other.function.c.

The goal here is to make it possible to distinguish a function definition from a function call in a syntax theme. Secondarily, it’s also to clean up the entity.name namespace to at least make it possible for it to serve double-duty the way it did in TextMate by acting as a potential symbols list.
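
In query terms, the distinction might look something like this (a sketch that assumes tree-sitter-javascript’s node names; adjust the scope suffixes for your own language):

; Function definitions own the `entity.name.function` namespace…
(function_declaration
  name: (identifier) @entity.name.function.js)

; …while function calls live under `support`.
(call_expression
  function: (identifier) @support.other.function.js)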

(Things like HTML attributes and object keys are very often scoped as entity.other.attribute-name, which is a bit odd. But it’s fine to keep that the way it is, because (a) the entity.other namespace was typically excluded from symbols lists, and (b) I’m not trying to boil the ocean here.)

Punctuation

Some TM-style grammars add scopes to punctuation, but some don’t, and some do so only sparingly. The way tree-sitter grammars work makes it pretty easy to add scopes to any meaningful punctuation marks.

For instance, here’s a block I’ve been adding to most grammars:

"{" @punctuation.definition.begin.bracket.curly.js
"}" @punctuation.definition.end.bracket.curly.js
"(" @punctuation.definition.begin.bracket.round.js
")" @punctuation.definition.end.bracket.round.js
"[" @punctuation.definition.begin.bracket.square.js
"]" @punctuation.definition.end.bracket.square.js

The upside of tree-sitter grammars is that you can use these kinds of scopes more confidently because you’ll know that tree-sitter only exposes them as anonymous nodes when they’re actually meaningful.

The downside is that you have to watch out for anonymous nodes that can have different meanings in different contexts:

; Is it a division sign…
(binary_expression "/" @keyword.operator.arithmetic.x)

; …or a regex delimiter?
(regex "/" @punctuation.definition.string.regexp.x)

; Is it bitwise OR…
(binary
  "|" @keyword.operator.other.ruby)

; Or the arguments delimiter in a ruby block?
(block_parameters
  "|" @punctuation.separator.variable.ruby)

Furthermore, as you refine your highlights.scm, you’ll want to add more specificity to these punctuation scopes to make them more useful. For instance, it’s worth distinguishing between usages of parentheses in function foo(bar, baz) and if (foo) and return (foo === "bar"):

  • the parentheses around function parameters can be scoped as @punctuation.definition.parameters.(begin|end).$lang (sketched after this list);
  • the parentheses around the if condition can be scoped as @punctuation.definition.conditional.(begin|end).$lang;
  • the parentheses around the expression can be scoped as @punctuation.definition.expression.(begin|end).$lang.
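
For instance, the first of these might be written like so (a sketch that assumes tree-sitter-javascript’s formal_parameters node):

(formal_parameters
  "(" @punctuation.definition.parameters.begin.bracket.round.js
  ")" @punctuation.definition.parameters.end.bracket.round.js)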

You get the idea. If you think this is overkill, that’s fine.

Variables

“How do I scope variables,” I hear you ask, or maybe I’m just imagining it.

The answer is that I haven’t quite made up my mind.

In languages where variables are indicated with a sigil like $ or @, it’s fine to apply syntax highlighting to all variables. The sigil helps us separate variables from other identifiers in a way that isn’t as obvious in a C-style language.

In languages where variables are just bare identifiers, the risk is that nearly everything gets scoped as a variable, and then nearly every identifier is given the same color, and the value of syntax highlighting is diminished.

As the TextMate docs state: “Not all languages allow easy identification (and thus markup) of [variables].” TM-style grammars tend to be rather conservative about highlighting variables as a result. Consider the following block:

let highlight = (tokens, context) => {
  let last = tokens.pop();
  let serialized = serializeLexerFragment(tokens);
  let highlighted = HTML_STRINGS.parse(serialized, context);
  return [highlighted, last];
}

highlight([], null);

Which things are variables? Certainly the parameters tokens and context make sense to scope, and are scoped as variable.parameter.js. Certainly the declarations of last and serialized and highlighted make sense to scope, so they're scoped as variable.other.assignment.js. What about tokens on line 2? Or the reference to highlight on the last line — I did, after all, declare highlight as a variable on line 1, so shouldn’t I highlight it as one on line 8? In the hypothetical object chain foo.bar.baz, do we scope all three as variable?

And if I decide this is too much, and half of the text in my editor is periwinkle now, how do I scale it back?

In lieu of specific guidance, I’d only ask that you not interpret the idea of a “variable” so broadly that it could apply to nearly any identifier in your source code file. For now, I’ve been conservative: in the JavaScript grammar, variable declarations or reassignments are scoped in the variable namespace, as are import specifiers (the Foo in import Foo from 'bar') and parameters in function definitions. Most identifiers are unscoped as yet.
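
Here’s roughly what that conservative approach could look like in query form (a sketch using tree-sitter-javascript node names, not a verbatim excerpt from the JavaScript grammar):

; Parameters in function definitions.
(formal_parameters
  (identifier) @variable.parameter.js)

; The `foo` in `let foo = bar();`.
(variable_declarator
  name: (identifier) @variable.other.assignment.js)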

The rough plan is for the locals query — which is underimplemented right now — to help sort this out (if the user wants) by applying some sort of visual treatment to distinguish variables from one another in a useful way — variables defined in the current scope, or inherited from a parent scope, or imported from another file.

Injections

Injections work differently in tree-sitter grammars than they did in TextMate-style grammars. For an example, look at the language-ruby package. Old-style tree-sitter grammars used a method called atom.grammars.addInjectionPoint, and this is one of the few things we’re keeping about the old architecture, because it’s still probably the best way we have to define injections.

Injections themselves are complex enough to need their own document. So here’s the short version:

Pulsar’s built-in language-todo and language-hyperlink packages are designed to inject very targeted syntax highlighting into other languages: TODOs within comments, and URLs within comments and strings. Because of the way TextMate-style injections work, they can inject into any other language without specifying them by name or ID.

But tree-sitter injections don’t work that way. I’ve written tree-sitter grammars for language-todo and language-hyperlink that can be injected into any new-tree-sitter grammar, but they won’t be injected into your language unless you allow them to.

Hence most new tree-sitter language packages should have, at minimum, something like this in their lib/main.js:

UPDATE: I’ve put a pause on this while I investigate how to make these tree-sitter parsers more performant. You can look at packages/javascript/lib/main.js to see how I’ve tried to balance this — creating a separate injection “layer” for each node is more performant, but we end up with a lot of layers. So I’ve had some success with pre-screening possible injections and ignoring the ones that don’t look like they have a TODO or URL.

I’d much prefer to have one language-todo layer and one language-hyperlink layer per buffer that’s responsible for all injections, but that’s not workable until we figure out how to make it so that it can re-parse after a change in under ~10ms no matter the size of the file. First-party tree-sitter parsers seem to be able to do that, but I haven’t figured out their secret.
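
With those caveats in mind, here’s a rough sketch of what such a lib/main.js might contain. The source.x scope name is a placeholder for your grammar’s root scope, and the 'todo' and 'hyperlink' injection names are assumptions; check the language-todo and language-hyperlink packages to confirm what they actually respond to:

exports.activate = () => {
  // Let `language-todo` highlight TODO/FIXME words inside comments.
  atom.grammars.addInjectionPoint('source.x', {
    type: 'comment',
    language: () => 'todo',
    content: (node) => node
  });

  // Let `language-hyperlink` highlight URLs inside comments and strings.
  for (let type of ['comment', 'string_content']) {
    atom.grammars.addInjectionPoint('source.x', {
      type,
      language: () => 'hyperlink',
      content: (node) => node
    });
  }
};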

You’ll have to modify the scope ID (of course) and perhaps the type argument to suit your own language. The type option names specific kinds of nodes in a parse tree. The use of comment for comment nodes is practically ubiquitous, but there might be a few grammars out there that don’t use string_content for the insides of strings.

Make sure you’ve got this line in your package.json

  "main": "lib/main",

…or else that code won’t run.

Tree-sitter queries

Pulsar uses tree-sitter query files to handle three major language grammar features: syntax highlighting, indenting, and code folding.

You can read the tree-sitter documentation itself for a quick tutorial on how to write tree-sitter queries. The playground will allow you to test out some queries on actual code.

Prerequisite: understanding the tree

The output of a tree-sitter parser is, as you might expect, a tree. Each node in the tree reports its range, both in one-dimensional terms (characters X through Y in the file) and in two-dimensional terms (starting at row/column (U, V) and ending at row/column (W, X)). A node’s text property contains the literal text of the node, and its type property bears a name like comment or string that you can use to query that node.

Nodes can be named or anonymous, as the tree-sitter docs explain.

Nodes also contain references to their parents and children, making it easy to traverse the tree from one node to another. Pretend you have a reference to a tree node called node. node.parent will return the node’s parent (if it exists), and node.child(1) will return the node’s second child (since children are zero-indexed).

Here are some properties that can be used to navigate between nodes:

  • firstChild
  • lastChild
  • firstNamedChild
  • lastNamedChild
  • nextSibling
  • previousSibling
  • nextNamedSibling
  • previousNamedSibling
  • parent

You’ll understand why this is useful later on.

Highlights

Syntax highlighting in Pulsar uses the same scope system that was pioneered by TextMate; but instead of assigning scope names through a JSON or CSON file, you’ll assign them by capturing nodes in your query files.

Highlights are defined in a file typically called highlights.scm, the path to which is specified by the treeSitter.highlightsQuery key in a grammar definition file.

Some will be easy to map…

(null) @constant.language.null.js
(undefined) @constant.language.undefined.js
(number) @constant.numeric.js

…and some will be highly contextual.

; A variable array destructuring in a for…(in|of) loop:
; The "foo" and "bar" in `for (let [foo, bar] of baz)`
(for_in_statement
  left: (array_pattern
    (identifier) @variable.other.assignment.loop.js))

Again, the playground will help you find the exact query syntax for the thing you want to target.

Built-in predicates

The tree-sitter documentation on predicates also applies to tree-sitter queries in Pulsar: you can use #match? and #eq? to add scopes to certain nodes based on their text contents:

((comment) @comment.block.js
  (#match? @comment.block.js "^/\\*"))

In this example, comment.block.js will only apply to comments that start with /*; this is needed because the tree-sitter-javascript parser uses the comment node type for both line comments and block comments.

Another common use case for predicates is to highlight builtin functions as distinct from user-created functions:

(call_expression
  function: (identifier) @support.function.builtin.js
  (#eq? @support.function.builtin.js "require"))

Caveats

There is one major constraint on the use of predicates: if the scope name of a given range varies based on the result of a predicate, it should indicate this via a special #set! predicate so that Pulsar knows when it may need to re-highlight the whole range.

This contrived example won’t always work…

; WON'T WORK: Scoping a template string differently based on whether it contains
; an interpolation with a certain identifier.
(template_string
  (template_substitution
    (identifier) @identifier
    (#eq? @identifier "FOO"))) @string.quoted.other.has-foo-or-bar.js

…because there’s no automatic hint that tells Pulsar exactly which range to re-highlight when a user finishes typing F-O-O. In practice it would seem to work when the template string is small, but it would fail when the template string spans multiple lines.

To make this work in all cases, you’d use #set! to mark the capture with the highlight.invalidateOnChange setting, and you’d apply it both when the predicate matches and when it doesn’t match:

; Set `invalidateOnChange` without the predicate present…
((template_string
  (template_substitution
    (identifier))) @_IGNORE_
  (#set! highlight.invalidateOnChange true))

; …so that this will get properly re-highlighted when the predicate passes
; _and_ when it doesn't.
((template_string
  (template_substitution
    (identifier) @identifier
    (#eq? @identifier "FOO"))) @string.quoted.other.has-foo-or-bar.js)

(You’ll learn about @_IGNORE_ later; here we’re using it to apply a side effect instead of a scope name.)

Let’s look at a less contrived example: we’ll want to scope /** */ comments differently from /* */ comments in some languages…

((comment) @comment.block.documentation.js
  (#match? @comment.block.documentation.js "^/\\*\\*"))

((comment) @comment.block.js
  (#match? @comment.block.js "^/\\*(?!\\*)"))

…but, by default, the keystroke that changes /* to /** will re-highlight only the row on which the change was made. To ensure the entire comment can get rescoped when /* is changed to /** (or vice-versa), we must mark both scenarios with invalidateOnChange:

((comment) @comment.block.documentation.js
  (#match? @comment.block.documentation.js "^/\\*\\*")
  (#set! highlight.invalidateOnChange true))

((comment) @comment.block.js
  (#match? @comment.block.js "^/\\*(?!\\*)")
  (#set! highlight.invalidateOnChange true))

When a change happens anywhere within a node marked with invalidateOnChange, Pulsar will know that the node’s entire buffer range should be re-highlighted.

This has theoretical implications for performance, but in practice won’t be a problem unless you’re doing something very silly, like re-scoping a very large node based on a #match? predicate. In the future, this feature might be expanded so that you can provide more subtle hints as to when the whole node should be invalidated — for instance, not on every keystroke, but only under certain conditions.

Custom predicates

Ideally, we’d be able to define our own filter predicates in the style of #match? and #eq?, but the web-tree-sitter bindings don’t currently make that possible. They do, however, provide the #set! predicate, which allows us to attach arbitrary data to a query capture and process it after the capture:

((comment) @comment.block.js
  (#set! foo bar))

Thus, in addition to the invalidateOnChange setting above, we’re able to use #set! in two ways:

  • scope tests are filter-style predicates that restrict when a scope will be applied, and
  • scope adjustments are predicates which change the range that a scope applies to.

As you’ll see, we can also use #set! to store arbitrary keys and values that themselves can be used as filters for subsequent queries.

A #set! predicate cannot take a capture as an argument, so if your query expression has multiple captures inside it, the #set! will apply to all of them. To avoid this, you may occasionally need to get creative in how you write your queries.

#set! predicates are organized into namespaces as much as possible.

Capture settings

final and shy

By default, if a certain kind of node is captured multiple times, it will end up with multiple scope names.

; Things that LOOK_LIKE_CONSTANTS.
((identifier) @constant.other.js
  (#match? @constant.other.js "^[A-Z_$]+$"))
; All other identifiers.
(identifier) @variable.other.js

An identifier that LOOKS_LIKE_THIS will have constant.other.js applied to it, but it will also receive variable.other.js, because both captures match. If you want the first match to exclude the other, you could add #not-match? to the second rule, but that’ll get complicated quickly.

A simpler option is to use final to “claim” a capture’s range exclusively:

; Things that LOOK_LIKE_CONSTANTS.
((identifier) @constant.other.js
  (#match? @constant.other.js "^[A-Z_$]+$")
  (#set! capture.final true))
; All other identifiers.
(identifier) @variable.other.js

The final test is used to state that the given capture is the last node that will be allowed to set a scope for the given buffer range. Any later captures that try to scope the exact same range will fail, even if they also have (#set! capture.final true).

The final test is unique in how it applies a stricter criterion to later captures, rather than the one it’s applied to. But remember — the order of rules in a query file determines the ordering of the query captures. If we were to reverse the order of these two rules…

(identifier) @variable.other.js

; Things that LOOK_LIKE_CONSTANTS.
((identifier) @constant.other.js
  (#match? @constant.other.js "^[A-Z_$]+$")
  (#set! capture.final true))

…the later rule wouldn’t prevent variable.other.js from applying.

On the other hand, shy is used to apply scopes only when a given range doesn’t yet have scopes applied. So the above example could also be written as:

; Things that LOOK_LIKE_CONSTANTS.
((identifier) @constant.other.js
  (#match? @constant.other.js "^[A-Z_$]+$"))
; All other identifiers.
((identifier) @variable.other.js
  (#set! capture.shy true))

Remember that final and shy operate on ranges, not specific nodes. If a parent node and child node have the exact same boundaries defined (as sometimes happens with unquoted strings or special language constants), any final or shy rules applied to the parent will affect the child as well.

Scope tests

All scope tests follow the pattern (#is? test.foo bar), where foo is the test’s name and bar is an argument that the test may need. (If a test doesn’t need an argument, you may omit it instead.)

Any test that passes with #is? will fail with #is-not?, and vice versa.

These tests currently work on all kinds of queries except for folds queries. Folds queries do not currently have need for these kinds of tests, but if need can be demonstrated, they will be added.

first and last

Tree-sitter’s “anchors” would be quite useful to Pulsar if they applied to all nodes, but they only work for named nodes. Here’s how we can replicate them for anonymous nodes:

; Single-quoted string…
(string "'") @string.quoted.single.js

; …and its delimiters.
((string "'" @punctuation.definition.string.begin.js)
  (#is? test.first))

((string "'" @punctuation.definition.string.end.js)
  (#is? test.last))

A string's first and last children are anonymous nodes that represent its delimiters. (string "'" @foo) will match both of them with the same capture name, but we want to apply different capture names to each one.

Hence first will ignore the capture unless the captured node is its parent’s first child, and last will ignore the capture unless the captured node is its parent’s last child.

descendantOfType and ancestorOfType

Tree-sitter queries can be arbitrarily complex, but they need to be quite specific about parent/child relationships. There’s no easy way to describe something like “a string anywhere inside of a function body” because a string node can appear within the statement_block of a function_definition in a large number of ways.

Instead, descendantOfType lets you test for this relationship more simply:

((string) @string-inside-function
  (#is? test.descendantOfType function_definition))

The negation can come in handy, too. While the user is typing on a line, the code on that line might not yet be syntactically valid, in which case tree-sitter sometimes makes wrong judgements about how it should classify things. But you can exclude those tokens from being highlighted if they’re descended from an ERROR node:

; The tree-sitter-c parser usually thinks any new content on a blank line is a
; type definition.
((type_definition) @storage.type.c
  (#is-not? test.descendantOfType ERROR))
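
ancestorOfType goes the other way: as the name suggests, it should pass when the captured node is an ancestor of at least one node of the given type. For instance (a sketch; the scope name is purely illustrative, and the node names assume tree-sitter-javascript):

; Flag `switch` bodies that contain a `default` clause somewhere inside.
((switch_body) @meta.switch.has-default.js
  (#is? test.ancestorOfType switch_default))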

rangeWithData

The intent of #set! in web-tree-sitter was to allow the capture process to define arbitrary data for a capture. Here’s an implementation detail that you haven’t had to know until now: if a capture isn’t rejected by a scope test, the scope resolution process stores that arbitrary data according to a node’s range, and that storage persists for the life of a given scope-resolution process. That’s how we keep track of whether a range has been “marked” for rules like final and shy.

But you are free to use #set! to mark captures with any key and value you like, whether or not the key has a special meaning to Pulsar. This can be a very powerful way to match things that would otherwise be quite difficult to match.

The rangeWithData test allows you to check for the presence of an arbitrary key that has previously been applied to this range using #set!.

It’s easy to query for nodes that have children of a particular type, but #is-not? test.rangeWithData gives you a useful way to isolate nodes that do not have children of a particular type:

((do_block (block_parameters)) @_IGNORE_
  (#set! hasBlockParameters true))

((do_block) @block-without-parameters
  (#is-not? test.rangeWithData hasBlockParameters))

(You’ll learn about @_IGNORE_ properly later in this document, but here it allows us to capture certain nodes in order to set data on them, rather than because we want to mark them with a scope name.)

As the name of the test implies, the data is set on the range suggested by the capture, rather than by the inherent range of the node. If the capture involves scope adjustments, those adjustments will always be made before any scope tests are applied, no matter the ordering of predicates within the capture.

descendantOfNodeWithData

A similar test, descendantOfNodeWithData, can walk up a node’s ancestor chain looking for data of a certain kind.

In JavaScript, the “optional chaining” operator (?.) is quite useful, but is invalid in some syntactic contexts that tree-sitter can detect. For instance, it’s always invalid on the left-hand side of an expression…

foo?.bar?.baz?.thud = "14";

…but there’s no simple way to capture all ?.s in the above example with one capture, since the tree represents that chain as a series of Russian nesting dolls:

(assignment_expression
  left: (member_expression
    object: (member_expression
      object: (member_expression
        object: (identifier)
        optional_chain: (optional_chain)
        property: (property_identifier)
      )
      optional_chain: (optional_chain)
      property: (property_identifier)
    )
    optional_chain: (optional_chain)
    property: (property_identifier)
  )
  right: (string
    (string_fragment)
  )
)

And even if you felt like writing captures to match all of these, they’d miss something like…

foo?.bar?.['baz']?.thud = "14";

…because the bracket notation now swaps out a member_expression node for a subscript_expression node, and some of your queries would fail.

And we can’t use descendantOfType, since it’s not always invalid for an optional_chain to descend from an assignment_expression — only if it’s on the left side!

Instead, we can mark all these illegal contexts with arbitrary data, then test whether a node descends from a node that has that data:

; Optional chaining is illegal…

; …on the left-hand side of an assignment.
(assignment_expression
  left: (_) @_IGNORE_
  (#set! prohibitsOptionalChaining true))

; …within a `new` expression.
(new_expression
  constructor: (_) @_IGNORE_
  (#set! prohibitsOptionalChaining true))

((optional_chain) @invalid.illegal.optional-chain.js
  (#is? test.descendantOfNodeWithData prohibitsOptionalChaining))

Since predicates can’t have more than two arguments, descendantOfNodeWithData doesn’t care about the value at the given key — only whether the key itself is present. (In the future, this predicate might interpret its second argument the way that test.config does.)

As the name implies, descendantOfNodeWithData does not consider data defined on the captured node itself; it starts with that node’s parent (if it exists) and moves upwards.

Note that descendantOfNodeWithData cannot consider scope adjustments, unlike its sibling rangeWithData. As it goes up the ancestor chain, it uses the inherent range of each node to look up data, even when the capture includes scope adjustments — because those adjustments were meant for the captured node itself. There is no reliable way for this predicate to relate a node to any range other than its own — even adjusted ranges for which that node was the starting basis. If you need to store data for descendantOfNodeWithData to look up later, do not adjust the capture range.

root

Passes when the node in question is the root node. A node without a parent is considered to be the root.

Oddly, sometimes the root node is an ERROR node, and this test will help us detect those cases.
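
For instance, here’s a sketch of one way you might use it; whether a scope like this is actually desirable depends on the language:

; When the entire tree failed to parse, mark the whole buffer as invalid.
((ERROR) @invalid.illegal.js
  (#is? test.root))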

type

Passes when the node in question is of the given type.

You may wonder why this exists, as query syntax lets us describe this quite easily. In fact, the negation is more useful:

(binary_expression
  left: (_) @some-capture
  (#is-not? test.type "foo bar"))

This is a good way to weed out certain types in a wildcard query: this capture will pass whenever @some-capture is of a type other than foo or bar.

firstOfType and lastOfType

Passes when the node in question is the first/last of its type among its siblings. Works for both named and anonymous nodes.

injection

Passes when highlighting is being performed on an injection layer, as opposed to the root layer of a buffer.
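
For example, here’s a sketch of how a grammar might treat its root node differently when it’s been injected into another language. The scope name is illustrative, and program is tree-sitter-javascript’s root node type:

; Scope the root node as embedded source, but only when this grammar is
; acting as an injection (e.g., JS inside a <script> tag).
((program) @source.embedded.js
  (#is? test.injection))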

startsOnSameRowAs and endsOnSameRowAs

startsOnSameRowAs passes when the start of this node is on the same row as the position specified by a node position descriptor.

For example, (#is? test.startsOnSameRowAs endPosition) will pass on a node that does not span more than one row, and (#is? test.startsOnSameRowAs parent.startPosition) will pass on a node that starts on the same row on which its parent starts.

endsOnSameRowAs also exists.

config

config is powerful because it allows you to define captures conditionally, based on the user’s own settings.

Imagine your language-x package has a setting that controls which indentation style to use, and the user can pick one among several. The config test would let you define indentation captures conditionally, based on what the user has chosen:

((if_statement
  "{" @indent
  (#is? test.config "language-x.braceStyle one-true-brace-style")))

This is our way of allowing the last argument to specify both a configuration key and the value we expect: the argument is wrapped in quotes, and the key and value are separated by a single space.

A simpler syntax can be used when the setting is a boolean:

(binary_expression
  ["||" "&&"]
  right: (_) @dedent.next
  (#is? test.startsOnSameRowAs parent.startPosition)
  (#is? test.config language-javascript.enableHangingIndent))

Here we’re hinting that the next line should be dedented in a certain hanging-indent scenario — but only when the enableHangingIndent setting is enabled in our language package. The last argument doesn’t need quotes here because there are no spaces; when the value is omitted, the configuration key is assumed to point to a boolean.

Since all predicate arguments are parsed as strings, whether quoted or unquoted, here’s how we interpret the value:

  • If it’s omitted altogether, we assume the desired value is true.
  • When the value is true or false, it’s coerced to a boolean.
  • When the value consists only of numerals, it’s coerced to an integer.
  • Otherwise, it remains a string.

Scope adjustments

Pulsar’s highlighting system doesn’t care about nodes; it just cares about positions in the buffer. So when a captured node doesn’t quite represent the range you’re trying to add a scope to, scope adjustments let you tweak that range.

There are two very important rules about scope adjustments:

  1. A single capture can be adjusted, but it cannot be multiplied. One capture cannot be adjusted to scope multiple things.
  2. A capture’s adjusted range must stay within the bounds of its original node.

Pulsar enforces these rules because major aspects of the syntax highlighting system wouldn’t work right without them.

Other things to know:

  • Order matters with adjustments; you’ll understand how in a moment.
  • If any adjustment in a series fails for whatever reason, the capture will be ignored.
  • Any predicate that acts on ranges — the final and shy capture settings, plus certain scope tests — acts after a range is adjusted.

Scope adjustments work on highlighting queries and nothing else. Folds have a more restrictive adjustments system, and indents have no need for adjustments.

Scope adjustments are #set! predicates that use the adjust namespace.

Node descriptors and node position descriptors

Earlier we learned that nodes know their boundaries in the document. Those boundaries are defined on the startPosition and endPosition properties. So not only can we traverse from one node to another; we can traverse from one node to another node’s position.

Node descriptors are chains of property names separated by dots. Given a node called node, the string nextSibling would refer to that node’s next sibling — just like if you were to evaluate node.nextSibling in a REPL. Or…

  • parent.firstNamedChild refers to the original node’s first named sibling.
  • nextSibling.nextSibling.lastChild refers to the last child of the sibling-after-next of the original node.

And so on.

Node position descriptors are chains of property names that end in either startPosition or endPosition. Hence:

  • endPosition refers to the original node’s end position.
  • lastChild.startPosition refers to the start position of the node’s last child.

Some scope adjustments use node position descriptors as a way of specifying a different boundary for a scope. Keep in mind that if the descriptor fails — if it leads to a node that doesn’t actually exist — the entire capture will be ignored.

startAt and endAt

The startAt and endAt adjustments can be used individually or in tandem to specify a new place for the adjustment to begin or end.

((some_node) @some_capture
  (#set! adjust.startAt firstChild.endPosition)
  (#set! adjust.endAt lastChild.startPosition))

This query will scope the entire range of some_node except for its first and last children.

Since these adjustments refer to absolute positions in the tree, they ignore any adjustments that may have happened before them.

offsetStart and offsetEnd

The offsetStart and offsetEnd adjustments will move either end of the range a fixed number of characters, either positive or negative.

If the example above needed further tweaking, we could use these adjustments as follows:

((some_node) @some_capture
  (#set! adjust.startAt firstChild.endPosition)
  (#set! adjust.offsetStart 1)
  (#set! adjust.endAt lastChild.startPosition)
  (#set! adjust.offsetEnd -1))

Because offsetStart acts after startAt, it moves the start of the range forward one character from the adjustment already made by startAt — and likewise for offsetEnd.

start(Before|After)FirstMatchOf and end(Before|After)FirstMatchOf

These adjustments can tweak the boundaries of a scope based on the results of a regular expression match — describing the regex in the same way you would for a #match? predicate.

One common way to use them is to scope the delimiters of a comment:

; Scope the comment itself…
((comment) @comment.line.double-slash.js
  (#match? @comment.line.double-slash.js "^//"))

; …and its opening delimiter.
((comment) @punctuation.definition.comment.js
  (#match? @punctuation.definition.comment.js "^//")
  (#set! adjust.endAfterFirstMatchOf "^//"))

Note:

  • We’re capturing comment twice here, but altering the range of the second capture. Hence the second capture would still work even if we’d used final on the first rule, because the two ranges won’t be identical.
  • Repeating the #match? predicate in the second capture is not strictly necessary in this case because the capture will be ignored if endAfterFirstMatchOf can’t find the pattern described — but it’s a good habit to have.
  • If this were a block comment, and you needed to scope both the beginning and ending delimiters as well as the entire comment itself, you’d need three captures here instead of two.
  • These adjustments are named the way they are to remind you of Adjustments Rule 1: they cannot scope an arbitrary number of matches. They can only act on the first match.

startAndEndAroundFirstMatchOf

The startAndEndAroundFirstMatchOf adjustment is a useful shorthand when you want to move the beginning and end of the range based on the same pattern match.

; Scope the comment itself…
((comment) @comment.line.double-slash.js
  (#match? @comment.line.double-slash.js "^//"))

; …and its opening delimiter.
((comment) @punctuation.definition.comment.js
  (#match? @punctuation.definition.comment.js "^//")
  (#set! adjust.startAndEndAroundFirstMatchOf "^//"))

The example above behaves identically to the previous example.

Other scope features

_TYPE_ and _TEXT_

It’s often useful to interpolate text into a scope name to simplify your highlights.scm:

(null) @constant.language.null.js
(true) @constant.language.true.js
(false) @constant.language.false.js

This can be expressed more tersely:

[
  (null)
  (true)
  (false)
] @constant.language._TYPE_.js

The _TYPE_ token will interpolate the node’s type into the scope name. This works whether the node is named or anonymous…

[
  "if"
  "else"
] @keyword.control._TYPE_.js

…but make sure that the node’s type contains only lowercase alphabetical characters.

Sometimes it’s useful to interpolate the text itself:

((identifier) @support.builtin._TEXT_.js
  (#match? @support.builtin._TEXT_.js "^(arguments|module|console|window|document)$")
  (#set! capture.final true))

Here we’re matching some common builtin identifiers and interpolating their actual text into the scope name — e.g., support.builtin.arguments.js. This requires that the text being interpolated is a single word without spaces; if the text in question contains spaces, the interpolation won’t happen, and the _TEXT_ token will remain in the scope name. Use this prudently.

@_IGNORE_

Occasionally, it’s useful to capture a node just to exclude it from highlighting altogether.

HTML allows attribute values to be specified without quotes in many circumstances; <p class=foo> is just as good as <p class="foo"> to a web browser. In a tree-sitter tree, attribute_value is present in both cases, but the quoted version wraps it with a node called quoted_attribute_value that surrounds it with delimiters.

We want to scope a bare attribute value with string.unquoted, but how do we prevent that scope from being applied inside of a quoted attribute value? We have several options, but one of them is to use @_IGNORE_ and the final predicate to “block off” a node from being highlighted:

; Prevent quoted attribute values from having `string.unquoted` applied.
(quoted_attribute_value
  (attribute_value) @_IGNORE_
  (#set! capture.final true))

; The "foo" in `<div class=foo>`.
; Because of the preceding rule, if this matches and passes all tests, the
; value must be unquoted.
(attribute_value) @string.unquoted.html

@_IGNORE_ is treated specially in scope resolution; it’s allowed to define tests and apply metadata to its range, but it does not record any scope boundaries for later highlighting.

You can also use it in scenarios where you need to give a node a “safe” name so that you can use it in a predicate without triggering any other side effects:

; Fold self-closing elements.
((start_tag (tag_name) @_IGNORE_) @fold
  (#match? @_IGNORE_ "^(area|base|br|col|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)$"))

Here we needed to test the contents of tag_name, but its parent node is the one we wanted to mark as a fold range.

In the general case, if you have to name a certain query capture, but you don’t want it to have any special meaning, @_IGNORE_ is a safe name to use.

If you need more than one safe name — or if you just prefer more descriptive capture names — you can also use any capture name that starts with _IGNORE_.:

; Fold self-closing elements.
((start_tag (tag_name) @_IGNORE_.tag) @fold
  (#match? @_IGNORE_.tag "^(area|base|br|col|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)$"))

Indents

Pulsar also uses queries to decide when to indent or dedent a given line relative to the previous line. This functionality is used when typing, when pasting text, and when selecting a range of text and running the Editor: Auto Indent command.

Indents are defined in a file typically called indents.scm, the path to which is specified by the treeSitter.indentsQuery key in a grammar definition file.

The tree-sitter system for indentation is designed to conform to the pre-existing indentation workflow as much as possible. It allows you to define “hints” that will increase or decrease indentation one level at a time, much like the TextMate-style grammars do in regular expression form.

But tree-sitter indentation queries involve deep understanding of the structure of the code. They therefore enable things that weren’t possible before, like:

  • one-line-only indentation of conditionals without braces — but smart enough to allow a comment in between
  • “hanging” indents triggered under any of several common schemes for breaking a long line into multiple lines — for instance, after a logical operator like || or && — but smart enough to understand how many lines the hanging indent should last
  • hints that can trigger precise indentation behavior — even indenting two or more levels at once

Introduction to indentation logic

Here is the basic two-phase system that Pulsar — and many editors before it — have used to determine indentation on line x:

  1. To figure out whether we should indent row x, we usually look at the content of row x - 1.
  2. To figure out whether we should dedent row x, we usually look at the content of row x itself.

This makes sense when you think about it — after all, indentation typically signifies delimited blocks, so we can’t dedent a given line until we know whether the ending delimiter will be typed on that line.

Let’s look at one example of what this means for a language like JavaScript:

  • If the user opens a brace ({) on row x but doesn’t close it, row x + 1 should be indented by one level.
  • If the user closes a brace (}) on row x — and it’s the first character typed on row x — we should dedent that row by one level when the } is typed.

The logic under the hood gets a bit complicated, but ultimately it’s easy to reason through.

But why do these two rules contain the word usually? Because there are a couple of scenarios that aren’t covered by these rules:

  • When we need to indent or dedent by more than one level — for example, with switch statements and other constructs that have nonstandard indentation logic. Typically the goal is to ensure alignment with a specific other line.
  • When the content on row x - 1 guarantees that row x should be dedented no matter what its content is. This is rare, but we still need a way to handle it.

When to indent

Pretend we’re editing this file, and that | represents the position of the cursor:

if (foo) {|

}  

If the user were to hit Enter, they’d want the cursor to move to the next line and indent itself by one level, because the presence of { means that a statement block has started.

Now let’s pretend that the if is a one-liner:

if (foo) {| bar(); }

If the user were to hit Enter, they’d still want the cursor to indent itself by one level, even though both { and } were present on the original line, because the cursor was between the two braces, and any delimiters after the cursor aren’t relevant.

But if the cursor were at the end of the line…

if (foo) { bar(); }|

…pressing Enter should not indent the next line, because the next line isn’t part of the statement block. The { we saw doesn’t act as a hint to indent the next line because it was “cancelled out” by the matching } before the cursor.

One more example:

} else if (foo) {|
}

Pressing Enter here should still indent the line, even though there are one each of { and } on this line. The initial } doesn’t count because it happens before the {, so it can’t cancel it out.

That’s all. In a language that uses curly braces to delimit blocks, the presence of an anonymous { node in the tree should usually signal an indent; and the later presence of an anonymous } node on the same line should usually cancel out that signal.

When to dedent

Pretend you’re in the middle of turning a one-line if into a multi-line if:

if (foo) {| bar(); }

Your cursor is at the position indicated by |, and you press Enter. Pulsar indents the next line:

if (foo) {
  bar(); }

Now you move the cursor to just before the closing brace, and press Enter again, producing:

if (foo) {
  bar();
}

But how did Pulsar know to indent the new line the first time, but dedent it the second time? Wasn’t there a } present on the new line each time?

  • The } doesn’t come into play after the first Enter because it’s not at the beginning of the new line. Thus it doesn’t cancel out the indentation signalled by { on the previous line.
  • But it does come into play after the second Enter because it is now the first non-whitespace content on the line, so it causes Pulsar to dedent the new line one level.

Pulsar uses similar logic to dedent a line dynamically while you’re typing.

if (foo) {
  bar();
  |

Typing } should instantly dedent row 3 by one level. But if you’ve somehow got this situation…

if (foo) {
  bar();
  } // close the conditional|

…then continuing to type at the end of the line won’t trigger a dedent, because Pulsar wants to enforce the dedent exactly once, rather than after every keystroke on the line.

These are the basic ways that indentation has worked in Pulsar for years, except that TextMate-style grammars have used regular expressions to infer when lines should be indented and dedented.

@indent and @dedent

Instead, now we’re using query captures — which allow these indentation hints to be represented much more easily and precisely.

Meet @indent and @dedent, which aim to implement the intuitions we described above in a two-phase system:

  1. An @indent on row x - 1 hints that we should indent row x — unless it’s followed by a @dedent.
  2. A @dedent capture as the first text on row x hints that we should dedent row x.

We keep using the word “hint” because we may want to disregard these indicators in certain circumstances, but it tends to be pretty straightforward.

For example, here’s an indentation query file that will cover a vast majority of use cases in C-style languages:

["{" "[" "("] @indent
["}" "]" ")"] @dedent

How can it be this simple? Because nearly all tree-sitter grammars for C-style languages — C, C++, JavaScript, Java, and so on — will expose these characters as anonymous nodes, meaning that we can query against them directly without caring much about the context.

There are other ways to detect when blocks start and end — look for nodes that represent if statements, for statements, and so on — but it’s better to focus on the delimiters themselves unless there’s some reason you can’t. It makes it far easier to eliminate scenarios where an indentation hint is ignored or somehow captured more than once.

There are other advantages in targeting anonymous nodes:

  • You’ll usually know exactly where those nodes begin and end, and exactly when they’ll get matched; hence you can more easily reason about when they’ll trigger indents and dedents.
  • They’re more likely to match in situations where there are errors in the parse tree — for instance, when the user is in the middle of typing if (foo.

In Ruby, we’re not so fortunate; there isn’t a consistent set of delimiters that both begin and end blocks:

[
  "class"
  "def"
  "module"
  "if"
  "elsif"
  "else"
  "unless"
  "case"
  "when"
  "while"
  "until"
  "for"
  "begin"
  "do"
  "rescue"
  "ensure"
  "("
  "{"
  "["
] @indent

[
  "end"
  ")"
  "}"
  "]"
  "when"
  "elsif"
  "else"
  "rescue"
  "ensure"
] @dedent

The end keyword acts as an ending delimiter for most blocks, but otherwise we have to look for the specific anonymous nodes which hint that a block is coming.

Some of these keywords, like elsif and else, match both @indent and @dedent captures. That’s because they should trigger a dedent of their own row but also hint at an indent of the next row.

There’s one more thing we should fix here: in Ruby, certain conditionals and loops can act as "modifiers" placed after a given statement:

exit unless "restaurant".include?("aura")

begin
  x += 2
end while x <= y

We don’t want these to trigger indents. In this situation, the simplest way to filter them out is probably to add this block at the top of indents.scm:

; Prevent postfix modifiers from triggering indents on the next line.
(unless_modifier "unless" @_IGNORE_
  (#set! capture.final true))
(if_modifier "if" @_IGNORE_
  (#set! capture.final true))
(while_modifier "while" @_IGNORE_
  (#set! capture.final true))
(until_modifier "until" @_IGNORE_
  (#set! capture.final true))

Yes, these are the same features that we can use in scope resolution. The #set! final “claims” these nodes so that they aren’t captured by the @indent below; and the @_IGNORE_ functions as a capture that has no effect of its own.

This is the one caveat about hooking into anonymous nodes like { and }: if those anonymous nodes are present in other contexts, you might have to ignore them either by employing a tactic like @_IGNORE_ or by writing your @indent and @dedent captures more specifically such that those contexts are excluded.

@match captures

@indent and @dedent are simple on purpose, but they can only match the existing functionality of TextMate-style indentation. They can’t indent or dedent more than one level from the previous row.

@match captures bypass the heuristics of @indent and @dedent captures by defining the indentation of a row in relation to the indentation of an earlier row. They deliver very precise indentation at the expense of being a bit more difficult to write.

Here’s a situation that @indent and @dedent can’t quite handle:

switch (type) {
  case 'lint':
    // lint the file
    break;
  case 'fix':
    // fix the file
    break;
  default:
    // do nothing
}

This is a common approach for indenting a switch block, and it’s tricky for Pulsar because it requires that the closing brace be dedented two levels from the previous line, not just one.

Here’s how @match makes this possible:

(switch_statement
  ; Find the precise `}` that closes the entire `switch` statement…
  body: (switch_body "}" @match
  ; …and indent it as much as the line where the `switch` statement began.
  (#set! indent.matchIndentOf parent.startPosition)))

Again we’re capturing the closing brace itself, but we’re using a #set! predicate to tell it to refer to another position: the start of its parent. Its parent is a switch_body node that captures everything starting from the brace on line 1. The @match capture extracts a row from that node position descriptor and sets the indent level of its own row to match that row.

But there’s another thing to fix. The first case statement needs to be indented from line 1, but all other case and default statements will need to be dedented from the previous line. How do we make this work? By using @match so that all these statements compare themselves to the exact same line:

; The lines after `case` and `default` need to be indented one level…
["case" "default"] @indent

; But they themselves need to be indented one level relative to their containing
; `switch`.
(["case" "default"] @match
  (#set! indent.matchIndentOf parent.parent.startPosition)
  (#set! indent.offsetIndent 1))

This is much like the previous example — except that, relative to the anonymous "case" and "default" nodes, the switch_body is a grandparent node — but also includes an offsetIndent predicate so that we can indent these lines one level more than the line we’re referring to.

Notes

  • The matchIndentOf predicate is mandatory for a @match capture; the offsetIndent is optional.
  • The position specified by matchIndentOf must be an earlier position in the document than that of the captured node, or else the capture will be ignored.

Obscure captures

@none

Use @none when you need to signal that the current line should have zero indentation, no matter the indent level of the previous line. For instance, the ending token of a (traditional) heredoc string should use @none.
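
As a hedged example, tree-sitter-ruby exposes a heredoc_end node (treat the name as an assumption); the terminator of a traditional <<EOS heredoc must sit at column 0 regardless of the surrounding indentation:

; The heredoc terminator always returns to column zero.
(heredoc_end) @none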

@dedent.next

Once in a while, you may need the ability for a capture to hint that the next line should be dedented. This isn’t usually how it works, but here’s one example:

if (foo)
  this.destroy();|

Again, pretend that | represents the cursor. This is a rare scenario where we actually can predict when a dedent is coming — since there are no braces around the if, the indentation should only take effect for a single statement. So when we press Enter, the next line should be dedented no matter what. There’s no need to wait until the user starts typing before we dedent; by the time they typed anything, they’d probably have noticed that the dedent didn’t happen automatically and fixed it themselves.

Here’s how we pull off this entire code block — first by getting the if statement to trigger an indent even without the braces, then getting line 2 to trigger a dedent on line 3:

; An `if` statement without an opening brace should indent the next line…
(if_statement
  consequence: (empty_statement) @indent
    (#set! indent.allowEmpty true))

; …but dedent after exactly one line.
(if_statement
  condition: (_) @indent
  consequence: (expression_statement) @dedent.next)

Lots of this needs explaining.

  • When recovering from errors, tree-sitter parsers will often insert “ghost” nodes into the tree that don’t correspond to any actual text in the buffer. You can usually tell these nodes apart because their text property is an empty string. Thus the default behavior of the indentation engine is to ignore nodes that don’t have any text content, because that usually leads to more accurate results.

    Here, though, we’ve got a construct — empty_statement — that is validly empty, so we use allowEmpty to signal that this capture should not be skipped.

  • After we press Enter once and are typing the consequence of the conditional, Pulsar keeps track of the “expected” indentation level of the line so that it has a basis from which to dedent if it should need to.

    But once we start typing on the next line, our (empty_statement) capture will fail to match, because the statement is no longer empty! Our second query capture will still match a braceless if — an if with braces would have a statement_block as its consequence — but we need an @indent capture in the condition so that Pulsar knows that we still expect line 2’s baseline indentation to be 1 rather than 0.

  • We capture (expression_statement) — which will be present as soon as the user starts typing on line 2, even if it’s not yet valid — with the name @dedent.next to signal that a dedent should take place as soon as the user hits Enter to move to the next line.

How Pulsar uses indent captures

Now that you know all this, it might help to know how Pulsar figures out when to indent and dedent based on these captures.

Here’s exactly how Pulsar decides the indentation level for a given line — determined when you press Enter, paste text, or run Editor: Auto Indent:

  1. Start an indents capture at the start of the previous row (skipping over blank rows).
  2. Capture all indents.scm queries from that point until the end of the previous line.
  3. Toss out @dedent captures that happen before the first @indent capture.
  4. Total up the score:
    • If there are more @indents than @dedents, the score is 1; otherwise it’s 0.
    • After resolving those captures, subtract 1 for each @dedent.next we saw.
    • Indent or dedent the current line accordingly. If the score is 0, the current line keeps the same indent level; if it’s 1, the line is indented one level; and so on.
  5. Next, run a capture query for the current line.
  6. If a @match capture is found anywhere on the line, and it resolves to an indent level, indent the line accordingly and skip all other processing.
  7. Otherwise, if a @dedent capture is present, and its node is the first non-whitespace content on the line, the current line should be dedented by one level from what we expected.
  8. Failing that, the line should remain indented the way we concluded at the end of step 4.

Here’s exactly how Pulsar decides whether to dedent the current line while the user is typing on it:

  1. To determine the “expected” level of indentation of the current line, execute steps 1-4 above.
  2. For each key the user types on the current line, capture all indents.scm queries on that line.
  3. If a @match capture is present anywhere on the line, the current line should be dedented according to the first @match capture we encounter, and all other processing should be skipped.
  4. Otherwise, if a @dedent capture matches, and its text is the only non-whitespace content on the line, dedent the current line one level from the level we expect.

This is a stricter version of the dedent logic from above; its goal is to trigger the dedent exactly once, not to stubbornly re-apply it after every keystroke.
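
To make the two procedures concrete, here’s a hedged worked example that assumes a C-like grammar whose indents.scm captures only the curly braces:

; Suppose indents.scm contains only these two captures:
"{" @indent
"}" @dedent

; Pressing Enter at the end of `if (foo) {` runs the first procedure: the
; previous line yields one @indent and no @dedent, so the score is 1 and the
; new line starts one level deeper than the `if`.
;
; Typing a lone `}` on that new line runs the second procedure: the capture
; is a @dedent and its text is the only non-whitespace content on the line,
; so the line is dedented one level immediately, back to the `if`'s level.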

Folds

Folds let us collapse a certain range of rows into a logical unit.

Pulsar uses queries to decide when a given line can be folded and the exact range of the fold. Folds are defined in a file typically called folds.scm, the path to which is specified by the treeSitter.foldsQuery key in a grammar definition file.

Simple folds: the @fold capture

Pulsar is constantly asking the language mode whether the current buffer is foldable at row X — because if it is, Pulsar needs to put a fold indicator into the gutter on that row. Simple folds are vastly preferred because they allow us to define folds based solely on where they start, and because we can run one capture query on exactly one line to find out whether a fold exists on that line.

Consider the syntax:

(class_body) @fold

With that one capture, we’re able to describe how a JavaScript class body should be folded, thus turning…

class Foo {
  constructor (thing) {
    this.thing = thing;
  }
}

…into something that can fold into…

class Foo {}

…when collapsed.

A simple fold is easy to annotate because we assume some default behavior. You can imagine that it expands to something like this:

((class_body) @fold
  (#set! fold.endAt lastChild.startPosition))

Now we see how folds determine their end points: they follow a node position descriptor. The default is lastChild.startPosition because this works as intended for most languages with delimiters: since the fold will always start at the end of the starting row, this has the effect of folding up everything between the two delimiters.

IMPORTANT: The “no broadening” constraint of scope adjustments does not apply to folds! Since we only show a fold indicator in the gutter on a line where a fold begins, the only caveat for @fold captures is that their starting row cannot be moved. Adjustments like endAt and offsetEnd can move the fold’s end position to any point in the tree.

endAt

Some languages may not organize delimiters the same way in their tree-sitter trees. And some, like Python, might not have delimiters at all:

([(function_definition) (class_definition)] @fold
  (#set! fold.endAt endPosition))

Here we’re choosing to fold these blocks all the way to the end because there’s nothing else to show on the other side of the fold. Hence…

def foo(arg):
    print "This is a test"
    pass

… will fold into…

def foo(arg):…

Caveats
  • The last item in the chain must describe a position — that is, an object with row and column properties.
  • The described position must come after the fold range’s start position in the buffer.
  • If any link in the chain fails, or if the altered fold range is invalid, we’ll revert to the default strategy of lastChild.startPosition. If even that strategy fails — if the node doesn’t have a last child — then we’ll revert to endPosition, which we know must exist.
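
With those caveats in mind, here’s a hedged sketch of a longer descriptor chain, using HTML-flavored node names (element, start_tag) as assumptions rather than guarantees: capture an element’s start tag, then walk up to the parent and over to its last child (presumably the end tag) to decide where the fold stops.

; Fold an element starting at the end of its start tag and ending where the
; parent's last child (the end tag, if present) begins. Node names are
; assumptions borrowed from an HTML-style grammar.
((element (start_tag) @fold)
  (#set! fold.endAt parent.lastChild.startPosition))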

offsetEnd

Folds can describe any arbitrary buffer range; they don’t have to conform to the exact boundaries of tree-sitter nodes. If the node boundaries don’t match what you’d like to fold, you can use offsetEnd to shift either end of the range by a fixed number of characters.

(There is no offsetStart for folds, because it doesn’t make conceptual sense to move a fold’s start position, but this might be added if it’s demonstrated to be useful.)

For instance, the tree-sitter-c parser has most of its block nodes include their delimiting braces. This is a bit odd. Hence most of the folds in the language-c bundle would be incorrect if we did as follows…

[
  (for_statement)
  (if_statement)
  (while_statement)
] @fold

…because they’d eat the closing brace. So we do this instead:

(
  [
    (for_statement)
    (if_statement)
    (while_statement)
  ] @fold
  (#set! fold.offsetEnd -1)
)

Much like range adjustments for highlighting, offsetEnd is applied after any alteration of the fold range with endAt.

adjustEndColumn

Like offsetEnd, but absolute rather than relative. Alters the current end point of the fold range by changing the point’s column value to a different number. For instance, (#set! fold.adjustEndColumn 0) will move the end point to the beginning of the current row.
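
For instance, here’s a hedged sketch; block_comment is an illustrative node name, and the fold.adjustEndColumn spelling assumes the same fold namespace used by the other adjustments:

; End the fold at the comment's own end position, then pull the end column
; back to 0 so the row containing the closing delimiter stays outside the fold.
((block_comment) @fold
  (#set! fold.endAt endPosition)
  (#set! fold.adjustEndColumn 0))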

adjustToEndOfPreviousRow

This is a common enough pattern that it warrants its own adjustment.

How can we fold the following?

if (foo) {
  bar();
} else if (troz === 'feh') {
  // blerg
} else {
  exit();
}

To have each if branch be separately foldable, we’d want something like this:

((if_statement
  consequence: (statement_block) @fold))

Yet now we’ve got a problem. If we fold on line 1, the if clause now collapses into…

if (foo) {} else if (troz === 'feh') {
  // blerg
} else {
  exit();
}

…which prevents us from being able to fold the else if branch separately, since it no longer has its own line.

In the JavaScript grammar, we can solve this like so:

((if_statement
  consequence: (statement_block) @fold)
  (#set! fold.adjustToEndOfPreviousRow true))

Now each if branch will fold while keeping the next else if or else clause on its own line, where it can still be folded separately:

if (foo) {
} else if (troz === 'feh') {
  // blerg
} else {
  exit();
}

Divided folds: @fold.start and @fold.end

In a perfect world, all folds would be expressible as simple folds. But sometimes the tree is too complex. When you can’t find the end of a fold just by following a set traversal path from its starting node, that’s when you need divided folds.

It’s encouraged to use simple folds wherever possible for performance reasons. Matching a @fold.start on row X involves running a folds query on the buffer from row X to the end of the document, since we have no idea where the matching @fold.end could be. Thus you should use divided folds only when you can’t reliably use @fold and endAt to reach the end of the range.

Here’s how they work:

  • The node marked with @fold.start defines the start of a range; the range will begin at the end of the row on which that node starts.

  • The next balanced capture of @fold.end marks the end of that range.

    Pretend we’re looking for a matching end for a @fold.start defined on line 9, and we see the following:

    9:  @fold.start
    10: @fold.start
    11: @fold.end
    12: @fold.end
    

    We know that the @fold.end on line 11 pairs up with the one on line 10, meaning that the @fold.end on line 12 is the one that matches the @fold.start on line 9.

  • Once we know which node marks the end of the fold, we end the fold range at that node’s start position. If the node starts at the beginning of a line, we’ll adjust the range to end at the end of the previous line.

    Right now, divided folds don’t offer any ways to customize the exact boundaries of the fold. But since you can apply either side of the fold to a very specific tree query, odds are you won’t need them. If that proves untrue, we can reassess this decision.

Here’s how we use this feature in the C grammar to handle folding of preprocessor directives:

["#ifndef" "#ifdef" "#elif" "#else"] @fold.start
["#elif" "#else" "#endif"] @fold.end

Note how #elif and #else both end and start folds. This works because a fold ends where its @fold.end node begins (in practice, at the end of the previous line), so the #elif or #else line itself stays visible and can begin a fold of its own. This allows each branch of a complex #ifdef or #ifndef conditional to be folded atomically.
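
Here’s a hedged, comment-only illustration of how those captures pair up in a buffer, following the balancing rule described above:

; #ifdef DEBUG      <- @fold.start
;   log_stuff();
; #else             <- @fold.end (the first fold stops at the end of the
;                      previous line) and @fold.start (a new fold begins)
;   do_nothing();
; #endif            <- @fold.end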

Caveats

  • Currently, the @fold.start and @fold.end must appear in the same language layer — in other words, if you start a fold in an injected language layer (like JavaScript inside an HTML SCRIPT tag), you must end the fold within that same layer. It wouldn’t make much sense to be able to straddle an injection boundary like this — on top of which it allows us to optimize when we’re searching for folds within injection layers, since we can stop searching at the end of the layer’s boundaries rather than keep going through the rest of the buffer.

  • Because the start and end are declared separately, there is no built-in guarantee that there will be equal numbers of fold starts and fold ends, or that they will match up the way you expect them to. That is up to you to reconcile. A @fold.start that has no balanced @fold.end will not appear as a foldable row in the editor. And if you’ve got extra @fold.ends, one will likely match a @fold.start that it wasn’t meant to.

Scope taxonomy

The scope reference on the TextMate website is valuable, but does not cover the breadth of scope names seen in the wild. This is my effort to merge that taxonomy with other very common naming conventions in TextMate-style grammars.

Remember that every scope name should include a final segment describing the language; this is a useful shorthand so that one can just say (e.g.) string.quoted.double.js instead of source.js string.quoted.double. This segment will usually match the second part of the root selector — the js in source.js, the python in source.python, the html in text.html.basic.
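
As a hedged sketch, here’s what the convention looks like in a JavaScript-flavored highlights.scm; the node names come from tree-sitter-javascript, but the exact scope pairings are illustrative:

; Every capture name ends with a segment identifying the language.
; (A real grammar would distinguish line from block comments and single from
; double quotes before choosing these scopes.)
(comment) @comment.line.double-slash.js
(string) @string.quoted.double.js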

  • comment — for code comments.

    • line — line comments; we specialize further so that the type of comment start character(s) can be extracted from the scope. (The comment delimiter itself should be scoped as punctuation.definition.comment.)

      • double-slash — // comment
      • double-dash — -- comment
      • number-sign — # comment
      • percentage — % comment
      • asterisk — * comment
      • semicolon — ; comment
      • apostrophe — ' comment
      • at-sign — @ comment
      • double-backslash — \\ comment
      • double-dot — .. comment
      • double-number-sign — ## comment
      • exclamation — ! comment
      • slash — / comment
    • block — multi-line comments like /* … */ and <!-- … -->. (The comment delimiters themselves should be scoped as punctuation.definition.comment.start.$lang and punctuation.definition.comment.end.$lang.)

      • documentation — embedded documentation like JSDoc or JavaDoc.
  • constant — various forms of constants.

    • numeric — those which represent numbers; e.g., 42, 1.3f, 0x4AB1U.
      • decimal — for base-ten numbers, themselves subdivided into:
        • integer
        • float
      • hexadecimal
      • octal
      • binary
    • character — those which represent characters; e.g., &lt;, \e, \031.
      • escape — escape sequences like \n or \\.
    • language — constants (generally) provided by the language: true, false, nil, undefined, etc.
    • other — anything else; e.g., hex colors in CSS.
  • entity — refers to a larger part of the document. Examples: a chapter, class definition, function definition, or tag. As a rule of thumb, the specific thing that should be scoped as entity is the thing you would expect to see in a symbols list.

    • name — the name of the entity.
      • function — the name of a function being defined. Examples: the "foo" in function foo() {}, let foo = () => {}, and similar constructs in JavaScript; the "main" in int main() in C; the "pow" in @function pow() { in SCSS. (Do not scope a function invocation with entity.name.function; function calls should go under support.)
      • type — the name of a type declaration or class.
        • class — class name declarations carry this segment; e.g., entity.name.type.class.ruby for the "Foo" in Ruby’s class Foo.
      • tag — a tag name.
      • section — the name of a section/heading (e.g., a Markdown heading).
    • other — other metadata associated with the entity.
      • inherited-class — the superclass/base class name.
      • attribute-name — the name of an attribute (usually in tags). (Also used for the keys of objects when being defined — like in an object literal in JS — but not when being read or assigned to.)
  • invalid — stuff which is, for whatever reason, invalid. You are invited to disclose the type of invalidity in the penultimate scope segment — for example, invalid.illegal.optional-chain.js — but this won’t affect how the scope is displayed.

    • illegal — things that are unambiguously wrong: a trailing comma in a JSON object, an unencoded ampersand in an HTML document, etc.
    • deprecated — things that are deprecated: the with keyword in JavaScript, the marquee tag in HTML, etc.
  • keyword — words with special meanings. For keywords that are words rather than symbols, it is encouraged to include the keyword’s name in the scope — e.g., keyword.control.continue.js. In tree-sitter grammars, this can be done rather easily with the _TYPE_ interpolation without having to specify a different capture for each keyword — e.g., @keyword.control._TYPE_.js. In TextMate grammars, a capture group’s match can be interpolated into the scope name via (e.g.) keyword.control.$1.js.

    • control — related to flow control; e.g., if, continue, while, return, etc. (await in JavaScript is a control keyword, but async is a modifier applied to functions, and should be scoped under storage.modifier.)
    • operator — all operators, whether words or symbols, should be organized here.
      • unary — e.g., the symbols in -x and !x, the delete operator in JavaScript, etc.
      • assignment — the = in a variable assignment (or its equivalent in other languages)
        • compound — any shorthand operators like +=, *=, and the like
      • comparison — equality and inequality tests; e.g., ==, ===, !=, >, <
      • bitwise — bitwise operations; e.g., &, |, ^, <<
      • accessor — lookups of fields/properties on objects; e.g., -> in C++, . and ?. in JavaScript.
      • other — any operator that does not quite fit into an existing category; e.g., instanceof in JavaScript. You are also welcome to invent a new segment name under keyword.operator rather than using keyword.operator.other as a catch-all.
    • other — “other” keywords. The exact definition of “other” varies based on language. Examples: print in Python (statement in Python 2, function in Python 3); private/public/protected in Ruby (“feel like” keywords, yet implemented as methods). The CSS grammar uses keyword.other.unit for measurement units: %, px, em, vh, etc.
  • markup — the namespace for markup languages like HTML, XML, or its analogs (Markdown, Textile, BBCode).

    • underline — underlined text.
      • link — URLs. (This is scoped under markup.underline so that it will inherit underline styles if there is no theme rule specifically targeting links.)
    • bold — bold text. (Text which is “strong” and similar should fall under this name.)
    • heading — a section header. Optionally provide the heading level as the next element; for example, markup.heading.2.html for <h2>…</h2> in HTML. (Remember to scope the name of the heading itself with entity.name.section.)
    • italic — italic text. (Text which is “emphasized” and similar should fall under this name.)
    • list — list items. (Remember to scope the specific punctuation that signifies a list item when applicable, like in Markdown.)
      • numbered — numbered list items; e.g., <li>s within an <ol> in HTML, entire list items in Markdown that start with a number and a period.
      • unnumbered — unnumbered list items; e.g., <ul> in HTML, * in Markdown.
    • quote — quoted (sometimes block-quoted) text; e.g., <blockquote> or <q> in HTML, paragraphs starting with > in Markdown.
    • raw — text which is verbatim or preformatted; e.g., <pre> in HTML, code blocks in Markdown (both the code-fence style and the indented style). Normally spell checking is disabled for markup.raw.
    • other — other markup constructs.
  • meta — generally used to markup larger parts of the document. In TextMate-style grammars, often used as an implementation detail to mark larger patterns that scope multiple tokens at once. More generally, can be used to indicate semantically significant regions of the document.

    For example, since a JavaScript function has a different definition syntax within a class body than outside of it, a meta.block.class scope can be used to allow the same tab-trigger to expand different snippets, depending on context. Tree-sitter grammars should try to include only meta scopes that would be useful for the user to know.

    • embedded — anything that “embeds” one context into another. Examples: <?php … ?> within HTML, template-string interpolations in JavaScript. Some themes choose to style meta.embedded with a background color.

      • block — embedded things that span more than one line. Examples: a fenced code block in Markdown, a multi-line <?php … ?> block, a heredoc string.
      • line — embedded things that start and end on the same line. Examples: a <code> element, or its backtick equivalent in Markdown; most template-string interpolations in JavaScript.
    • block — any delimited block. For example, a CSS snippet might only be valid inside of a selector block, so scoping that snippet to meta.block.css would prevent it from being invoked at the root of the document. You can further specify what kind of block it is if you like, but only the kinds specified below would tend to be useful to know for most languages.

      • class — the body of a class definition. PROPOSAL
      • function — the body of a function definition. PROPOSAL
    • selector — the entirety of a CSS selector just before a statement block. Examples: all of a selector like div.foo, div.bar > :nth-child(2n+1):not(:disabled). A useful cursor context for snippets and commands to be aware of. Only relevant to CSS and CSS-like languages (LESS, SCSS), but it’s a convention worth continuing.

  • punctuation — delimiters, statement terminators, and the like. Often sub-scopes of larger constructs like strings.

    It is recommended to scope all punctuation one way or another, if possible — ideally with its purpose if the purpose can be discerned, but falling back to a literal description if necessary. For instance: a comma (outside of a string) should nearly always be scoped as punctuation.separator because of how often it serves as a separator of some sort. But if a grammar sees a { and cannot tell whether it’s opening a block or describing a data structure, it can fall back to something like punctuation.definition.begin.bracket.curly as a better alternative to no scoping whatsoever.

    (It is also somewhat common for grammars to include both a description of purpose and a description of the literal character in the scope; hence one should not expect rigid adherence to this taxonomy in the wild.)

    • definition — punctuation that marks the start, and usually the end, of something. Examples: {/} in many languages to delimit blocks or data structures; [/] to delimit arrays/lists; (/) to delimit a list of function arguments/parameters; "/" to delimit the bounds of a double-quoted string. Or, for unpaired punctuation: any line comment delimiter.

      After punctuation.definition, the segments should go in this order, omitting any segment when it is not relevant or not known:

      1. type of thing being delimited (string, tag, etc.);
      2. begin or end (if delimiters are paired);
      3. a description of the character itself (bracket.curly, bracket.square, dot);
      4. the language segment.

      Punctuation scopes tend to specify either segment 1 or segment 3, falling back to a literal description of the punctuation when it isn’t feasible to discern its purpose from context — but sometimes both are included. Segment 3 is definitely not necessary to specify when the exact kind of delimiter can be inferred from an ancestor scope name, as it can be with string and comment scopes.

      Tree-sitter grammars typically have enough contextual information to describe their punctuation semantically, rather than literally, but can start with a literal description as a first pass if necessary.

      • tag — punctuation in HTML/XML/JSX tags. The punctuation on either side of the tag name should be scoped as punctuation.definition.tag.begin and punctuation.definition.tag.end. This applies equally to the delimiters of opening tags (< and >), closing tags (</ and >), and self-closing tags (< and />).
      • string — string delimiters. The delimiters of a string should be scoped as punctuation.definition.string.begin and punctuation.definition.string.end, no matter whether the delimiters are single-quotes, double quotes, or something more esoteric like ruby’s %q{/}. The scope names themselves don’t need to describe what kind of delimiter they contain; the string’s own scope will already have done that.
      • block — block delimiters. Examples: { } in most C-style languages.
      • parameters — delimiters of a list of arguments when defining or invoking a function. These are (/) in most languages, but the scope also applies to the |/| pair that delimits block parameters in Ruby.
      • array — array/list/tuple delimiters, like [/] in JavaScript.
      • other — other kinds of delimiters, though generally these can be scoped as punctuation.definition.x rather than punctuation.definition.other.x.
    • terminator — punctuation that marks the end of something in an unpaired way; e.g., ; in most C-style languages.

      • statement — statement terminators, like the aforementioned ;.
      • rule — rule terminators; e.g., ; in CSS and CSS-like languages (LESS, SCSS).
    • separator — punctuation that goes between other things. (For property lookup on objects, or namespace operators like ::, prefer classification as keyword.operator.accessor.)

      • key-value — goes between a key and value in a pair; e.g., : in JavaScript objects, => in Ruby hashes or PHP associative arrays.
      • list — goes between items in a list; e.g., , in most languages for literal array/list/tuple syntax.
      • inheritance — goes between a class name and its superclass when being defined; e.g., the "<" in Ruby’s class Foo < Bar. (In languages that prefer Foo extends Bar syntax, extends is typically scoped under storage.modifier.)
  • storage — things relating to “storage.” (Not the most insightful name.)

    • type — the type of a declaration. Examples: the function in function foo () {} in JavaScript; the class in class Foo extends Bar in several languages. Also covers int, var, etc.

    • PROPOSAL: There has been very little clarity about whether value types should be classified as storage.type or support.type. The original TextMate docs say storage.type should refer to “the type of something — class, function, int, var, etc.” — and syntax highlighting in TextMate bears this out: things like int, char, and bool are storage.type in C. But it also says that support.type refers to “types provided by the framework/library.” I don’t see why the distinction should be drawn so sharply between those things, or where user-defined types fall in the taxonomy. And the TextMate C grammar doesn’t attempt to add scope names to any types it doesn’t actually recognize.

      • Legacy Tree-sitter squared the circle by scoping all type annotations in C as support.storage.type, thus putting them in the support namespace but triggering syntax highlighting (in most themes) for a storage token.
      • God help me, I think I’m willing to accept that compromise. Let’s try this:
        • Core language constructs — class, function, enum, struct, namespace, et cetera — can remain as storage.type.
        • Any value types — even primitives like int, bool, float, char, void, and so on — go under support.storage.type.
          • The primitives can be support.storage.type.builtin to signify that they’re provided by the framework.
          • The unrecognized and user-provided types can go under support.other.storage.type.
    • modifier — a modifier like static, final, abstract, etc. Other examples: async in JavaScript; get/set in JavaScript; global/nonlocal in Python. Also covers visibility annotations like public/private/protected in Java.

  • string — strings in all their forms. (Remember that strings with delimiters should be scoped with punctuation.definition.string.(begin|end), and that the range of the string scope should include its delimiters.)

    • quoted — quoted strings.
      • single — single quoted strings: 'foo'.
      • double — double quoted strings: "foo".
      • triple — triple quoted strings: """Python""".
      • other — other types of quoting: $'shell', %s{...}.
    • unquoted — for things like here-docs and here-strings. Also anything which functions like a string without the need for delimiters — e.g., an unquoted attribute value in HTML.
    • interpolated — strings which are “evaluated”: `date`, $(pwd).
    • regexp — regular expressions: /(\w+)/.
    • other — other types of strings (should rarely be used).
  • support — usages of things provided by a framework or library should be below support.

    • function — functions provided by the framework, library, or language. Examples: puts in Ruby, encodeURIComponent in JavaScript.
    • class — classes provided by the framework/library.
    • type — types provided by the framework/library. (This is probably only used for languages derived from C, which has typedef and struct. Most other languages would introduce new types as classes.)
    • constant — constants (magic values) provided by the framework/library/language. Examples: the PI of Math.PI in JavaScript.
    • variable — variables provided by the framework/library. For example: NSApp in AppKit.
    • storage.type — any value type that is known to be provided by the language or by a widely-used framework or library. This would include even basic things like int in C or string in TypeScript. (The storage segment is a compromise; refer to the storage.type entry above to learn why.)
    • other — things provided not by the framework but by the user. For example: the name of a function defined by the user should be scoped entity.name.function where it’s defined, but support.other.function wherever it’s invoked. (This is a suggestion of mine; it allows us to draw a distinction between function/class/type definition and usage, instead of overloading entity to hold both.)
      • function — any function that is not known to be provided by the framework, library, or language, whether or not we know where it is defined. PROPOSAL
      • storage.type — any type that is not known to be provided by the framework, library, or language, whether or not we know where it is defined. (The storage segment is a compromise; refer to the storage.type entry above to learn why.) PROPOSAL
      • object — any identifier that behaves as an object and does not have a more specific scope. Examples: the “foo” in foo.bar in JavaScript or Java. Not examples: $foo->bar in PHP (because $foo will have a variable.other scope); "1,2,3" in "1,2,3".split(',') in JavaScript (because "1,2,3" isn’t an identifier and will have a string.quoted scope applied). PROPOSAL
      • property — any property retrieval or assignment. Examples: the “bar” in foo.bar in JavaScript or Java; the “bar” in $foo->bar in PHP. May also be used to encapsulate “computed” properties — like the “["something"]” in foo["something"] in JavaScript. PROPOSAL
  • variable — variables. Not all languages allow easy identification (and thus markup) of these. Generally speaking, any variable with a sigil ($, @, etc.) should be scoped in all contexts, but in languages without sigils, the distinction between what is a “variable” and what isn’t is not so clear-cut.

    In general, it is strongly recommended to assign variable.parameter and variable.language scopes where they are warranted, and to avoid the temptation to classify every remaining identifier as variable.other and thereby dilute the significance of the scope. For example, in the modern tree-sitter JavaScript grammar, variable.other is currently only added to variable declarations and assignments, rather than all possible usages.

    • parameter — any parameters in a function definition.
    • language — reserved language variables like this, super, self, etc.
    • other — other variables, like $some_variable.
      • declaration — the names of variables as they are being declared (but not assigned). Examples: the "foo" in let foo; in JavaScript; the "i" in int i in C or C++. PROPOSAL
      • assignment — the names of variables as they are being assigned or reassigned. Examples: the "foo" in let foo = true; in JavaScript; the "foo" in foo = "bar" in Ruby or Python or JavaScript. PROPOSAL
      • readwrite — variables that can be written to or read from. (It’s not clear to me where this convention originated, nor how it adds information that wasn’t already present in the name variable.other. This convention is present in the C/C++, C#, Perl, CoffeeScript, Ruby, and TypeScript grammars.) (community convention)