Aerijo/making_language_grammar.md

## making_language_grammar.md

      
A guide to writing a language grammar (TextMate) in Atom

Tree sitter


Atom is transitioning to an entirely new way of defining grammars, using tree-sitter. It will be enabled by default quite soon. It is theoretically faster and more powerful than regex-based grammars (the kind described in this guide), but has a steeper learning curve. My understanding is that regex-based grammars will still be supported, however (at least until version 2), so this guide can still be useful.
To enable it yourself, go to Settings -> Core and check Use Tree Sitter Parsers.

Links for tree-sitter help:

tree-sitter: the main repo
tree-sitter-cli: converts a JavaScript grammar to the required C/C++ files
node-tree-sitter: module to use Tree-sitter parsers in NodeJS
My guide on starting a Tree-sitter grammar


Introduction

In Atom, syntax highlighting is a two-part job: the language package gives a scope to every character in the file, while the user's syntax theme tells the editor which colour each scope should be.
Themes are not the topic of this gist. To learn how to write a theme, I suggest starting at the flight manual.
Instead, this guide covers how to write a language grammar; specifically, a TextMate-style grammar. It is intended for complete novices, who might have the crazy idea that something like this could be fun and/or easy, and for those who want to remind themselves of what they can do. If you're reading this and you notice I've missed something, or I get something wrong, please don't hesitate to leave a comment. The more people sharing their knowledge and experience, the better.
Right now, I don't feel like the guide is finished. Rather, I felt I needed to get what I had written uploaded before something terribly wrong and unpredictable happens to the file I'm writing on.
Table of contents


Helpful links
Getting started

File structure
Regular expressions
Setting up the package
Things to be mindful of


Writing a basic grammar

Filling out the metadata
Making new rules

Basic structure
Simple single line rule
Match with captures
Simple multiline rule
Complex multiline rule


Intermediate tips

Repository
Embedded grammars
Style guide


Advanced tips

Variable / dynamic scoping (backreferencing)
Applying patterns to begin, end, and match captures


Helpful links

Here I've compiled a list of sites I used when writing my first language grammar. Some of these may not be intended for beginners, so think of them as a "second" step to look at when you don't get something here, or want to change things up.

This amazing guide: could not have finished my own package without this. It's worth reading, trust me.
TextMate Section 12: what the spec for Atom's rules is based on. Uses JSON instead of CSON, but the structure should be the same.
DamnedScholar's gist: a template with the accepted keys, and a short comment on their function.
Flight manual grammars entry: The official docs.
Any of the existing language packages for major languages. Python, JavaScript, HTML, and more.
regex101: a tool to test regex patterns. You need to convert between regular expressions defined here and ones used in regex101, as there are twice as many backslashes in the grammar rules. Also, the exact regex engine Atom uses is not available. Any of the options should do for most general cases, but there are differences in ability and syntax of the different engines.
oniguruma: the regex engine Atom uses. Use this to learn the specific syntax available to you.
first-mate: the package Atom uses to tokenize each line. Not necessary for writing a grammar, but a good technical reference if you want to know what's happening behind the scenes.

Getting started

File structure

You might like a basic understanding of the CSON data format. Knowing about JSON might help too. However, knowledge of either is not required to get started. Hopefully though, as you start to use it more, you will come to understand the formats if you don't already. I use the terms object, array, and string frequently, so you should understand what they are at a conceptual level at least.
A quick summary:

object: the fundamental data structure in JavaScript and JSON (JavaScript Object Notation). It is a set of key-value pairs, where accessing the object's key returns the corresponding value. In CSON (CoffeeScript Object Notation), objects are represented as follows

key: 'value'
name: 'your name'
age: 8
pets: [ # an array of pets
  'cat'
  'dog'
  'bird'
]
nestedObject:
  nestedKey: 'nestedValue'
  otherKey: 'more data'

array: seen in the above example, an array is an ordered list of values. They are denoted by square brackets, and must be comma separated if the values are on the same line. Objects in an array must be separated by using {} brackets, as will be seen later on.
string: represents a set of characters. Denoted by quotation marks (single or double) surrounding some text. Most, if not all, end values will be strings (end, as in when the value is not itself an object or array).
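As a small sketch of the point about objects inside arrays, here is how they might be written in CSON. The keys and values (pets, name, species) are made up for illustration; they are not grammar keys.

```cson
# Each object in the array is wrapped in its own {} pair.
pets: [
  { name: 'Tom', species: 'cat' }   # same line, so the entries need a comma
  {
    name: 'Rex'                     # separate lines, so no comma is needed
    species: 'dog'
  }
]
```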

Regular expressions

Never heard of regular expressions? Me neither. Turns out, they're pretty useful, and essential to writing the grammar rules. (They can also be used with Atom's finder if the Use Regex button is active.)
I'll give out a quick rundown here, but you really need to use the provided links to better familiarise yourself with what they are and how to write and test them.

https://www.regular-expressions.info/quickstart.html
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
https://www.icewarp.com/support/online_help/203030104.htm
https://regex101.com/ (use this to test them)

First, the concept: A regular expression (regex) is a group of characters that represents a "pattern" of text. It can be used to search a larger body of text for matches, and (when programming) each match can be passed to functions and handled as desired. In our case, we use regex to search for matches that are then passed to Atom's internals, to be tokenized and processed for the syntax theme to apply colours to.
A basic regex (using JavaScript syntax) might look like the following:
/hello/
Later on, we'll see that we actually use strings to define ours, so it'll look more like
"hello"
For now though, let's examine what search patterns this rule matches.
A general rule of thumb is that all letters are exact matches. Therefore, our above rule will find all instances of the letters h, e, l, l, o appearing consecutively in a body of text.
Here's a question: where are the matches in the following body of text?
Hello to you, Othello, and hello to you too, Iago!

Did you guess correctly? There are two matches: the hello at the end of Othello, and the standalone hello that follows it.
Note that (by default) regex are case sensitive (so no match for Hello) and do not respect word boundaries (so a match in Othello).
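To make those two defaults concrete, here is a sketch of three variations on the pattern, written as CSON strings (which is why the backslashes are doubled; more on that later). The key names are only labels made up for this illustration, not grammar keys. \b and (?i) are regex syntax supported by Oniguruma, the engine Atom uses.

```cson
# Matches "hello" anywhere, including inside "Othello"
plain: 'hello'
# \b marks a word boundary, so the match inside "Othello" is dropped
wholeWord: '\\bhello\\b'
# (?i) turns on case-insensitive matching, so "Hello" matches as well
anyCase: '(?i)\\bhello\\b'
```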
Now, for what makes regex so useful: special characters. There are many of these in regex. A few are as follows, but a proper regex guide should be used to learn them.

. (a decimal point) matches any single character (other than a newline)
* (a star) matches zero or more of the preceding token
? (a question mark) matches zero or one of the preceding token
\ (backslash) changes the behaviour of the following character. Used with punctuation, it makes that punctuation mark literal. Used with a letter, it normally gives the letter a special meaning.

Using these special characters, more advanced search patterns can be created. For example:
/((\\)(?:\w*[rR]ef\*?))(\{.*?\})/
Try this online
Hopefully, you're now comfortable with reading and writing regular expressions. If not, don't worry too much. You can always go to regex101 and test something you don't understand.
If you completely don't understand regular expressions, or how they are useful, this will be a major hurdle. It is not a stretch to say that regular expressions are the backbone of a language grammar.
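For reference, here is a sketch of how the larger pattern above might look inside a grammar rule, where it is written as a string and every backslash is doubled. The match and captures keys are explained later in this guide, and the scope names are hypothetical, chosen only to show which capture group is which.

```cson
{
  # As a plain regex:  ((\\)(?:\w*[rR]ef\*?))(\{.*?\})
  match: '((\\\\)(?:\\w*[rR]ef\\*?))(\\{.*?\\})'
  captures:
    1: name: 'support.function.reference.example'       # the whole command name
    2: name: 'punctuation.definition.function.example'  # the leading backslash
    3: name: 'meta.reference.label.example'             # the {...} argument
}
```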
Setting up the package

You can mostly just follow the flight manual's creating a grammar section for this. The rest of the tutorial will be for creating a grammar for the (fictional) example language.

Note: following the Atom guide will link the package to the dev package directory. This means your package will only be loaded when in development mode. If you wish to make it active in a normal window, navigate to the package directory in the command line and run the command apm link.

You should have a package folder, which contains the following directory structure (but with example replaced by your language's name):
language-example
|-- grammars
|   `-- example.cson
`-- package.json

And inside package.json:
{
  "name": "language-example",
  "version": "0.0.0",
  "description": "An example language grammar package",
  "repository": "https://github.com/user/package-name",
  "keywords": [
    "syntax",
    "highlighting",
    "grammar"
  ],
  "license": "MIT",
  "bugs": "https://github.com/user/package-name/issues",
  "engines": {
    "atom": ">=1.0.0 <2.0.0"
  }
}
example.cson should be blank at this point.
Things to be mindful of

Are you writing this for a popular language that already has a grammar package? If so, it is likely there will be several other packages that rely on the scopes provided by the language package (spell check & autocomplete, to name a couple). These packages use the scopes for contextual information, allowing them to be smarter and more "aware" of the language. If you decide to use a nonstandard set of scopes, you risk breaking compatibility with these other packages. When deciding on new scope names, it is better to use the preexisting ones in an established grammar package rather than coming up with your own.
Additionally, these packages rely on the grammar package being active to hook their own activation. This means that you will need to sort out the package activation hooks on a case by case basis.
Terminology

There are several similar terms to describe aspects of the grammar package.
Writing a basic grammar

This section walks you through setting up a basic grammar, with minimal rules. For more advanced features and rules, see the next section.
Filling out the metadata

The top of your example.cson file should have the following entries:
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false


scopeName: this key determines the root scope for all characters in a document using this grammar. The convention is to use source.<language_identifier>, where the language identifier is a unique, short word. For example, the core packages use source.python and source.js for Python and JavaScript. However, there exists an additional convention where text based languages get the root scope text.<...>. This means HTML gets scoped to text.html.basic, and LaTeX (currently) to text.tex.latex. When in doubt, just use source.<...>.


name: this is the entry that will appear in the language selection menu. It is purely aesthetic, but should simply be the language's name.


fileTypes: an array of file extensions that are used to determine if a given file should use this grammar. This lets Atom automatically select the correct grammar when the user opens a file.


limitLineLength: a Boolean value to tell the tokenizer whether or not to "give up" on long lines. If true, the tokenizer will only look at a maximum number of characters per line, and completely ignore the rest. This can lead to incorrect pattern matching, especially in text-like languages where paragraphs are present. Setting it to false effectively forces the tokenizer to look at the whole line, and apply the rules to everything.


There are more available properties, but they will be introduced in the intermediate section. For now, these properties will be sufficient.
Making new rules

Basic structure

Below the above entries, make a new key called patterns. Its value is an array of objects, each of which will hold the information for a search pattern.
patterns: [
  {
    # rule #1
  }
  {
    # rule #2
  }
  {
    # rule #3
  }
  # etc.
]
Simple single line rule

Now, we'll look at making a specific rule.
The basic outline for a single line matching rule is as follows:
{
  comment: 'Use this to explain the function of the rule, if necessary'
  name: 'comment.line.example'
  match: '#.*$'
}
Some things to note:

The scope name should follow one of the ones given in the TextMate manual. This is to maximise the chances that a syntax theme will have a corresponding rule to colour that scope. The final part of the scope should be the language name (the one set in scopeName at the top).
The match key holds the regex that defines the search pattern. It is a string, which means all backslashes must be escaped with another backslash. Therefore, to match a literal \, the normal regex for which is \\, one must use \\\\ (sigh... did I mention my first package was a grammar one for LaTeX?).
Also important to note is that a match will only work on one line. Even if you have a \\n inside the regex, it will not work.
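As a sketch of the backslash doubling described above, the following hypothetical rule matches a backslash followed by one or more letters (a LaTeX-style command). The scope name is an assumption for illustration only.

```cson
{
  name: 'support.function.general.example'
  # As a plain regex this would be:  \\[a-zA-Z]+
  # As a CSON string, each of those backslashes is written twice.
  match: '\\\\[a-zA-Z]+'
}
```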

If you're following along (which may be a good idea) you can start playing with this rule. Below, I'll give the current full contents of my example.cson file. I will not do this much, as you should be able to insert and maintain a list of rules yourself now.
# grammars/example.cson

scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false

patterns: [
  {
    comment: 'Use this to explain the function of the rule, if necessary'
    name: 'comment.line.example'
    match: '#.*$'
  }
]
Open a new file, set its grammar to your new one, and paste in the following. You should see that the # and all subsequent characters are scoped as a comment.
Normal text # comment

Match with captures

Now, onto a more complicated match. Try the following rule:
{
  match: '(\\*)(.*?)(\\*)'
  captures:
    0:
      name: 'meta.bold.example'
    1:
      name: 'punctuation.definition.bold.example'
    2:
      name: 'markup.bold.example'
    3:
      name: 'punctuation.definition.bold.example'
}
This introduces the captures key; its value is an object with keys corresponding to the capture groups of the match regex. Each of these keys then also has an object value, with the key name (which is like the name key in the rule above). What this does is allow different scopes to be applied to different parts of the same match. For this rule, it is applying meta.bold.example to everything (capture 0), but additionally applying punctuation.definition.bold.example to the * delimiters (captures 1 & 3) and markup.bold.example to the (arbitrary) contents of the second capture group. Note that captures: 0: is equivalent to using the name key in this case.
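To illustrate that last note, here is a sketch of the same rule with the whole-match scope moved from capture 0 to the name key. It should apply the same scopes as the rule above.

```cson
{
  name: 'meta.bold.example'   # takes the place of captures: 0:
  match: '(\\*)(.*?)(\\*)'
  captures:
    1:
      name: 'punctuation.definition.bold.example'
    2:
      name: 'markup.bold.example'
    3:
      name: 'punctuation.definition.bold.example'
}
```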

If you don't know what I mean by capture group, remember the section on regex? Where I told you to learn regular expressions? I wasn't kidding.

Before I continue, I'm going to show the "condensed" form of the same rule. I prefer it, as it wastes fewer lines on useless things like capture group numbers. A more detailed explanation is given in intermediate tips.

{
  match: '(\\*)(.*?)(\\*)'
  captures:
    0: name: 'meta.bold.example'
    1: name: 'punctuation.definition.bold.example'
    2: name: 'markup.bold.example'
    3: name: 'punctuation.definition.bold.example'
}
Simple multiline rule

By now, you should be able to make some basic rules for your grammar. But what if you need to match across several lines? You want the begin and end keys.
{
  name: 'meta.section.example'
  contentName: 'markup.other.section.example'
  begin: '((\\\\)section)(\\{)'
  beginCaptures:
    1: name: 'support.function.section.example'
    2: name: 'punctuation.definition.function.example'
    3: name: 'punctuation.definition.begin.example'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.example'
}
Some new keys:

name: as with a match rule, the name key applies to the entire match, including the text captured by the begin and end patterns in this case.
contentName: applies the scope to the text between, but not including, the begin and end captures.
begin: the pattern that defines when the rule begins.
beginCaptures: much like captures in a match rule, but only applies to the text captured by begin.
end: the pattern that defines when the rule ends.
endCaptures: like beginCaptures, but for the end text.

When you try this one, you might notice a distinct lack of colour. Maybe the \section part is coloured, but nothing else is (using the One Dark theme at least). Place the cursor on a spot you want to check and run the command Editor: Log Cursor Scope; it will show that the scopes have indeed been applied. This demonstrates the divide between grammar and theme perfectly; the scopes have all been applied, but they are not coloured because the theme ignores them. Bear in mind that scopes are not solely for themes though, and some themes may use these seemingly useless scopes. As the grammar author, it's your job to provide as much information as possible about the file, by scoping accurately.
Complex multiline rule

Another thing you might have noticed is that our other rules don't work inside of the section rule (and if you were experimenting, you'd have found they don't work inside of the bold match rule either). Basically, everything from the first to last character captured by a given rule is independent from the other rules in the main patterns array. To apply rules to the captured text, we need to make a patterns array inside the current rule. This patterns array behaves much like the outside one, except the rules it contains are only applied to the text between the begin and end captures of the rule it's in.
{
  name: 'meta.section.example'
  contentName: 'markup.other.section.example'
  begin: '((\\\\)section)(\\{)'
  beginCaptures:
    1: name: 'support.function.section.example'
    2: name: 'punctuation.definition.function.example'
    3: name: 'punctuation.definition.begin.example'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.example'
  patterns: [{
    name: 'comment.line.example'
    match: '#.*$'
  }]
}
An important behaviour to observe now is what happens if one of these inside pattern rules is not finished when the end pattern could be matched? Try the following to find out.
\section{ this is a section # }
  is this still a section? }

How about now?

Here's a step by step overview of what happened:

The text \section{ is matched as the beginning of the rule
The tokenizer started looking for the end pattern, or any matches to the rules in the local patterns array.
The tokenizer matched the # } at the end of the first line with the comment rule in patterns, effectively hiding the first }.
The tokenizer continued looking for pattern rule matches or end matches when the comment rule ended (the end of the line in this case).
It sees the } on the second line and matches it with the end pattern.
The rule is finished, so the final line has no special scopes.

But what if we actually wanted the rules in the main patterns array to be active inside a begin/end rule? For this, there is the include key. It takes the name of a rule defined in the repository (explained later), and pretends that rule was actually there. In this case, where we want it to match the main patterns array, we would use one of two values:

$self: (note this is not a regex) this value refers to the current grammar. That is, the context it's used in will have the rules in the main patterns array applied to it.
$base: similar to $self, but with some differences when embedded in another grammar. Not important right now, but just remember that $base is not the same as $self when your grammar is embedded in another. $self points to the grammar $self appears in (points to itself), whereas $base points to the base language of the file, which could be anything. If you don't know what I mean by embedded, don't use $base.

Right now, your example.cson file should look something like this:
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false

patterns: [
  {
    comment: 'Use this to explain the function of the rule, if necessary'
    name: 'comment.line.example'
    match: '#.*$'
  }
  {
    match: '(\\*)(.*?)(\\*)'
    captures:
      0: name: 'meta.bold.example'
      1: name: 'punctuation.definition.bold.example'
      2: name: 'markup.bold.example'
      3: name: 'punctuation.definition.bold.example'
  }
  {
    name: 'meta.section.example'
    contentName: 'markup.other.section.example'
    begin: '((\\\\)section)(\\{)'
    beginCaptures:
      1: name: 'support.function.section.example'
      2: name: 'punctuation.definition.function.example'
      3: name: 'punctuation.definition.begin.example'
    end: '\\}'
    endCaptures:
      0: name: 'punctuation.definition.end.example'
    patterns: [{ include: '$self' }]
  }
]
Try it out on the following text:
Normal text # comment

* bold # text * <- not commented

\section{
  text
  # comment
  * bo-#-ld * <- still not commented
  text
}

text

Can you see what needs to be done to get comments working in a bold match? Did it work? Why not?
Remember, the match pattern will only ever work on a single line (the tokenizer only looks at one line at a time; it literally doesn't see anything else). To get comments working in the bold rule, and get the bold rule to work across multiple lines, it needs to be converted to a begin/end rule as follows:
{
  name: 'meta.bold.example'
  contentName: 'markup.bold.example'
  begin: '\\*'
  beginCaptures:
    0: name: 'punctuation.definition.bold.example'
  end: '\\*'
  endCaptures:
    0: name: 'punctuation.definition.bold.example'
  patterns: [{ include: '$self' }]
}
And so concludes the beginners section of the guide. With the tools above, you should be able to produce a grammar of reasonable complexity. What follows are some tips for intermediate authors, for additional features and best practices.
Intermediate tips

Repository

A feature mentioned above, but not explained, is the repository. For a grammar of any reasonable size, the repository is vital to help organise your rules.
To make it, add the repository key after the main patterns array. Its value is an object, so do not add square brackets after it. For example:
scopeName: 'source.example'
name: 'Example'
fileTypes: [ 'exp' ]
limitLineLength: false

patterns: [{ include: '#lineComment' }]

repository:
  lineComment: {
    comment: 'This is a rule object, with the same abilities as any other'
    name: 'comment.line.example'
    match: '#.*$'
  }
  secondRule: {
    ...
  }
  thirdRule: {
    ...
  }
In the above example, a rule with the name lineComment has been added to the repository. Note that rules in the repository are not automatically applied. They must be include'd inside the main patterns array, or into another rule's child patterns array. To properly refer to this rule, the include key must have the value '#lineComment', as it does in the example. A rule may also include itself, in which case it will recursively apply itself as much as possible. This recursion also occurs with $self, but $self refers to the entire grammar, not just that specific rule.
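As a sketch of a rule including itself, the hypothetical braces rule below matches nested pairs of curly brackets by including '#braces' in its own patterns array. The rule name and scope names are made up for illustration.

```cson
repository:
  braces: {
    name: 'meta.braces.example'
    begin: '\\{'
    beginCaptures:
      0: name: 'punctuation.section.braces.begin.example'
    end: '\\}'
    endCaptures:
      0: name: 'punctuation.section.braces.end.example'
    # Each inner { starts a new copy of this rule, so every } closes
    # the most recently opened brace.
    patterns: [{ include: '#braces' }]
  }
```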
I'm of the opinion that all rules should be added to the repository. Then, they can be activated as desired by including them in the patterns array. I also like to group sets of rules into "meta" patterns, that are made up almost entirely of other include'd rules. This allows you to form customised sets of rules that can be applied consistently, without repetition. For an example of this, see my end result from when I tried writing one. It's not perfect, and I could probably do with following my own advice in some parts. Overall though, I'm reasonably happy with it. You'll notice I use comments to help organise the repository; this is another thing you should do to help yourself and others when trying to understand what you've done.
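A minimal sketch of such a "meta" pattern is below. The names inlineMeta and bold are hypothetical, and lineComment is the rule from the repository example above. A repository entry that holds only a patterns array acts as a named group of other rules.

```cson
patterns: [{ include: '#inlineMeta' }]

repository:
  # Groups the rules that should be active in ordinary text.
  inlineMeta:
    patterns: [
      { include: '#lineComment' }
      { include: '#bold' }
    ]

  lineComment:
    name: 'comment.line.example'
    match: '#.*$'

  bold:
    name: 'meta.bold.example'
    contentName: 'markup.bold.example'
    begin: '\\*'
    end: '\\*'
    # Reuse the same group inside bold text, instead of repeating the list.
    patterns: [{ include: '#inlineMeta' }]
```

Adding a new rule to inlineMeta then makes it available everywhere the group is included.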
Embedded grammars

Another feature alluded to above, when talking about $self vs $base. Basically, other grammars can embed your grammar into their rule set, and vice versa. When embedding another grammar, you need to use include: 'source.language' (where source.language is the root scope of the target language; watch out for the text. versions). For example
{
  begin: '```'
  end: '```'
  patterns: [{ include: 'source.js' }]
}
will scope the contents between a pair of triple back-ticks as JavaScript.
$base is important for embedded grammars, as it points to the file's root grammar. This means that if your grammar is embedded into another, e.g., a markdown grammar, $base will point to the markdown grammar, not yours. Sometimes this behaviour is desirable, and is used extensively in the C family of grammars.
One thing to be wary of is leakage: this occurs when a scope from the embedded grammar has not been closed, and it prevents your rule from seeing the end pattern. This is highly likely when the user will only be writing a portion of a code snippet, where there might be an opening brace but no closing one.
This can be seen using the rule above. In a file, leave an unmatched { in the JavaScript section. Now, rather than being picked up as a match for the end pattern, the closing back-ticks will be interpreted as the start of a JavaScript string. From there, all scoping that follows will likely be broken.
Currently, there is no solution to this problem (that I am aware of). Hopefully a key will be added that makes the end pattern more important, so it will be checked first before all others.
As a grammar author, you need to consider this from the other side too, i.e., thinking about others embedding your grammar. For every begin rule you add, it should have an end. Sometimes, less is more. If a match rule works just as well, use the match.
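One way to act on this advice, sketched under the assumption that the construct can never legitimately span lines, is to let the end pattern also match at the end of the line, so an unclosed delimiter cannot swallow the rest of the file. The rule and its scope name are hypothetical.

```cson
{
  name: 'string.quoted.double.example'
  begin: '"'
  # Ends at the closing quote, or at the end of the line if there is none.
  end: '"|$'
}
```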
Style guide

The condensed form introduced at the beginning is a good thing to use, but here I'll go through some style tips I developed after trying to read some grammars (both my own and others) myself.

Keep the patterns array clean. If there are a lot of rules building up in one, consider moving them to the repository and include'ing them. This keeps the active rule set clutter free, while succinctly expressing the function or intention of each rule.
When there is only one entry in an object or array, keep it all on one line. For example, compare below. While the first may look more spaced out and easier to read, the second really helps when scrolling through a long list of rules. It's effectively halved the number of lines, while representing the exact same rule.

# Spread out
{
  begin: '((\\\\)texttt)\\s*(\\{)'
  beginCaptures:
    1:
      name: 'support.function.texttt.latex'
    2:
      name: 'punctuation.definition.function.latex'
    3:
      name: 'punctuation.definition.arguments.begin.latex'
  end: '\\}'
  endCaptures:
    0:
      name: 'punctuation.definition.arguments.end.latex'
  contentName: 'markup.raw.texttt.latex'
  patterns: [
    {
      include: '$self'
    }
  ]
}

# Condensed
{
  begin: '((\\\\)texttt)\\s*(\\{)'
  beginCaptures:
    1: name: 'support.function.texttt.latex'
    2: name: 'punctuation.definition.function.latex'
    3: name: 'punctuation.definition.arguments.begin.latex'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.arguments.end.latex'
  contentName: 'markup.raw.texttt.latex'
  patterns: [{ include: '$self' }]
}

When there are multiple entries, break them across new lines. The spacing in this case (of the curly brackets column) helps with recognising when a group is together, and when it is separate. For example

# With multiple entries...
patterns: [ # multiple pattern entries, so new line
  { # multiple entries, so it gets a new line
    comment: 'Handles all types of comments'
    include: '#commentMeta'
  }
  { include: '#escapedCurlyBracket' } # single entry, so no newline
  { include: '#metaOpenBrace' }       # another single entry
]

# With one entry...
patterns: [{ # only one entry, so no new line between the [ and the {
    match: 'blah' # the rule object has multiple entries, so a new line is required after the {
    name: 'blah'
    ...
}]

The braces are optional when the array has one entry. I like them, so I use them. Keeping them also makes it easier to add rules later, so think carefully before omitting them.
Be consistent. Worse than using any one style is using an inconsistent mixture, and making the reader think about the format of what they are reading.
Don't use quotation marks for the key names. I haven't used them in this guide, but you will likely see some packages that do. To the best of my knowledge, these quotation marks do not contribute anything. In fact, they actively detract from comprehension because syntax highlighting themes will make everything the same colour. By contrast, with the unquoted form using language-coffee-script and the one-dark theme, I see key names as red, numbers as orange, and strings as green.

Additional root level properties

In addition to the name, patterns, repository, etc. properties, there are some others that are recognised

firstLineMatch: a regex string that assists Atom's automatic language selector. The selector generates a score for each language based on the file's extension and contents (so it was also using fileTypes behind the scenes).
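As a sketch, a firstLineMatch for the example language might recognise a shebang line. The interpreter name example is hypothetical.

```cson
# Would match a first line such as:  #!/usr/bin/env example
firstLineMatch: '^#!.*\\bexample\\b'
```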

Include specific rules

It is possible to include specific rules from another grammar's repository: simply use the syntax
{ include: 'source.example#ruleName' }
When the # character is not first, the part before it is taken as the grammar scope name. The part after is then read as the repository name, like with internal include statements.
The actual function determining this behaviour is here.
Scope names

Here I'll talk about valid scope names, and some good practices.
Advanced tips

Variable / dynamic scoping (backreferencing)


Note: Dynamic scope names do not play friendly with some packages that depend on reading the scope. For example, linter-spell will only accept a list of absolute scopes to blacklist, making it incompatible with this style of scoping. Caching of values based on scope (e.g., for autocomplete) will also be negatively affected by dynamic scoping. This guide presents them in a "this is possible" sense, rather than "you should do this".

One feature I haven't mentioned at all yet is using the capture groups in scope names and other parts of the regex pattern. This is possible using $n and \\n (not newline!) notation, where n is the capture group number. For example, using $n in scope names
{
  name: 'support.function.section.latex'
  begin: '((\\\\)(section|paragraph|part|chapter)(\\*)?)(?=[^a-zA-Z@])'
  beginCaptures:
    #                              v
    1: name: 'entity.name.section.$3.latex'
    2: name: 'punctuation.definition.function.latex'
  end: '\\}'
  endCaptures:
    0: name: 'punctuation.definition.end.latex'
  patterns: [{ include: '$self' }]
}
And using \\n in a regex match
{
  name: 'string.function.verbatim'
  #                                     v
  match: '\\\\verb([^a-zA-Z])(.*?)(?:(\\1)|$)'
  captures:
    0: name: 'support.function.verbatim.latex'
    1: name: 'punctuation.latex'
    2: name: 'markup.raw.verbatim.latex'
    3: name: 'punctuation.latex'
}
Some rules I've observed when experimenting with backreferencing:

Attempting to use \\n in a scope name results in the error invalid backref number/name thrown by first-mate. Only the $n can be used here.
The opposite is also impossible; $ is an active character in regex, so $n will never match with the single line matching we are restricted to. Only \\n can be used here.
If the $n capture group does not exist, it becomes a normal scope name. E.g., the scope would become literally support.function.$50.example, and not a reference to the 50th capture group. An empty match still counts as a match, and if this happens the scope would become support.function..example (note the double .).
name and contentName can only use the capture groups in the begin regex. Attempting to use higher numbers does not result in overflowing to the end capture groups.
beginCaptures and endCaptures will only use the capture groups in begin and end regular expressions respectively. There is no way to use a value captured in a begin group as a scope name in an end scope, and vice versa.
\\n only refers to capture groups in the begin regex. It can be used in the end regex, but will not refer to the end capture groups. Nor will it overflow to start meaning end capture groups if the number is higher than that of the number of capture groups in the begin regex.
match behaves as if it were a begin key, for the purposes of this numbering.
\\n only works for up to the number of capture groups there are. If there are fewer than nine capture groups, and \\9 is used, an error will be thrown. For numbers higher than 9, no errors are emitted, but that rule will not work.
Oniguruma (the regex engine Atom uses) provides alternative syntax for \\n matches: \\k<n>, where n is any integer (e.g., use \\k<2> for the second capture group). For more on the syntax, see the Oniguruma docs. This verbose syntax doesn't work in scope names either.
\\0 refers to the entire begin match.
Scopes probably shouldn't have punctuation in the sections, so make sure you don't just put arbitrary text in. E.g., use ([a-zA-Z\\d]*) as opposed to (.*). I ran into some bizarre errors when I had punctuation in the scope names, but they are difficult to reproduce (and I've forgotten the original cause).
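The $n rules above can be modelled in a few lines of JavaScript. This is a simplified sketch of the observed behaviour, not the first-mate implementation: resolveScopeName is a name I've picked for illustration, and treating a group that exists but matched nothing as empty text is my reading of the "empty match" rule.

```javascript
// Simplified sketch of $n substitution in a scope name.
// `captures` is a regex match array: index 0 is the whole match and
// index n is capture group n.
function resolveScopeName(template, captures) {
  return template.replace(/\$(\d+)/g, (placeholder, digits) => {
    const index = Number(digits);
    // A group number that does not exist leaves the placeholder as literal text.
    if (index >= captures.length) return placeholder;
    // A group that exists but matched nothing contributes empty text.
    return captures[index] === undefined ? '' : captures[index];
  });
}

const begin = /((\\)(section|paragraph|part|chapter)(\*)?)(?=[^a-zA-Z@])/;
const captures = begin.exec('\\section{Intro}');

console.log(resolveScopeName('entity.name.section.$3.latex', captures));
console.log(resolveScopeName('support.function.$50.example', captures));
console.log(resolveScopeName('support.function.$4.example', captures));
```

The three calls cover a normal substitution, a group number that doesn't exist, and a group that exists but matched nothing (which produces the double dot).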

If you have anything to add to this list, please leave a comment. I want to make this an exhaustive list of what's possible and impossible with backreferencing.
Applying patterns to begin, end, and match captures

Yes, it's possible. Try the following.
{
  contentName: 'keyword.example'
  begin: '\\-\\s*(.*?)\\s*\\-'
  beginCaptures:
    1:
      name: 'markup.heading.example'
      patterns: [{
          name: 'constant.character.example'
          match: 'b(.*?)b'
          captures:
            1:
              name: 'markup.italic.example'
              patterns: [{
                match: 'c'
                name: 'support.function.example'
              }]
        }]
  end: '-'
}
Try it with this! Check the scopes too with Editor: Log Cursor Scope.
1. - a b c d b c - hello -
2. -   b c d b c - hello -
3. - a   c d b c - hello -
4. - a b   d b c - hello -
5. - a b c   b c - hello -
6. - a b c d   c - hello -
7. - a b c d b   - hello -

Notice that the main rule will always match if the begin regex works. It is also immune to the patterns applied to the capture groups; any matches in this way will be isolated to the captured group and will not leak into the rest of the main rule. Not even if another begin pattern is used in the capture group (which is allowed).
For this particular rule, the first capture group is given the scope markup.heading.example. This is done with name, which is how the capture group has always been scoped in previous examples. What's new is the patterns array that is also in the capture group object. Its entry, a single rule (multiple are allowed; it's just like any other patterns array), attempts to find a pair of b characters. If it succeeds, it will then attempt to match a c character between the b's.
Applying this to the example text, we get:

A complete match:

a b c d b c is scoped as a heading

b c d b satisfies the 1st capture group pattern, and is further scoped as constant.character.example

The first c satisfies the 1st capture group pattern of this sub pattern, and is further scoped as support.function.example. The second c is not between the b's, so it is ignored.


Another complete match. The initial a was never a required part.
Only matches the initial begin pattern, as the match 'b(.*?)b' is not satisfied. Because of this, the c pattern is never even looked at.
The b pattern is matched, so the c pattern is looked for, but there are no c's within the b's, so it is ultimately ignored.
Another complete match, similar to the first and second. The missing d was not a required part of any pattern.
Similar to the third, as the b pattern cannot be completed (there is no second b).
Similar to the second, except it's the second c that is missing and not the a.
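The walkthrough above can be reproduced with a toy model in JavaScript. It is only a sketch of the nesting idea using the built-in RegExp: scopeLine and the shape of its result are hypothetical, and the real tokenizer does far more than this.

```javascript
// Toy model of the rule above: each stage only sees the text captured by
// the previous stage, so inner matches cannot leak outside their capture.
function scopeLine(line) {
  const result = { heading: null, constant: null, functions: [] };

  const begin = /-\s*(.*?)\s*-/.exec(line);       // begin: '\\-\\s*(.*?)\\s*\\-'
  if (begin === null) return result;
  result.heading = begin[1];                      // markup.heading.example

  const bPair = /b(.*?)b/.exec(begin[1]);         // searched only inside capture 1
  if (bPair === null) return result;
  result.constant = bPair[0];                     // constant.character.example

  // bPair[1] is markup.italic.example; the 'c' pattern only runs inside it.
  result.functions = bPair[1].match(/c/g) || [];  // support.function.example
  return result;
}

console.log(scopeLine('1. - a b c d b c - hello -'));
console.log(scopeLine('3. - a   c d b c - hello -'));
console.log(scopeLine('4. - a b   d b c - hello -'));
```

Line 3 stops after the heading because there is no pair of b's, while line 4 finds the pair but no c between them.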

You should experiment yourself with nesting rules to see what happens. If you observe any quirky or unexpected effects not mentioned here, please leave a comment explaining how to reproduce and I'll add an explanation to this section.
Injections

Introducing a new property: injections. This one sits at the root level of your grammar file, much like scopeName and patterns. Its value is an object whose keys are scope selectors. The value of each key is another object, whose sole property is a patterns array. This patterns array has the same form as any other shown in this guide, and can be considered functionally the same. All other properties are ignored.
The purpose of injections is to provide patterns based on scope, rather than by using include or nesting them. I found the best way to explain them was by example: consider the PHP grammar provided by language-php. It actually provides two grammar files: one for PHP syntax (php.cson), and a wrapper one for HTML syntax (html.cson). The pure PHP grammar does not get applied to any files automatically, as it lacks the fileTypes and firstLineMatch keys. Instead, the provided HTML grammar is applied to various PHP related files. This grammar sets the root scope name to text.html.php, and provides two rules: a new comment, and the entire text.html.basic grammar (provided by language-html by default).
What is special though, is the injections property it contains. Whenever the scope contains text.html.php (which is the root scope), and none of the scopes start with meta.embedded or meta.tag, it will attempt to match the given patterns, in this case being the php-tag rule in the repository. If these rules match, only then will the pure PHP grammar be inserted via an include statement.
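To make the shape concrete, here is a made-up wrapper grammar written as a plain JavaScript object (the same structure works in CSON or JSON). Every scope name and rule in it is invented for illustration and is not copied from language-php; describeInjections is just a hypothetical helper for inspecting the object.

```javascript
// Hypothetical wrapper grammar, loosely modelled on the description above.
const wrapperGrammar = {
  scopeName: 'text.html.example',
  patterns: [{ include: 'text.html.basic' }],
  injections: {
    // Key: a scope selector. Applies inside text.html.example, except where
    // a meta.embedded or meta.tag scope is also present.
    'text.html.example - (meta.embedded | meta.tag)': {
      // Value: an object whose only meaningful property is `patterns`.
      patterns: [{ include: '#example-tag' }]
    }
  },
  repository: {
    'example-tag': {
      begin: '<\\?example',
      end: '\\?>',
      name: 'meta.embedded.block.example',
      patterns: [{ include: 'source.example' }]
    }
  }
};

// List each injection selector with the number of patterns it contributes.
function describeInjections(grammar) {
  return Object.entries(grammar.injections || {}).map(
    ([selector, value]) => ({ selector, patternCount: (value.patterns || []).length })
  );
}

console.log(describeInjections(wrapperGrammar));
```

The embedded grammar (source.example here) is only pulled in once the injected example-tag rule matches, which mirrors the php-tag arrangement described above.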
Some technical notes:

Only the active grammar's injections are applied in a file. Injections in other grammars are not considered.
Matches to injected patterns will be looked for last, after the active grammar and any grammars added through injectionSelector. This can be influenced with scope prefixes though.
IMPORTANT: The injectionSelector causes a bug where any grammar with one will not automatically apply itself to an opened file. For a workaround, use the following style in an independent CSON file:

injectionSelector: 'source.embedded.latex' # when this scope is present in another grammar, inject this grammar
scopeName: 'source.embedded.latex'
patterns: [{ include: 'text.tex.latex' }]

I will follow up on this after some testing, but I believe injections in a grammar that has been inserted via an injectionSelector should work.


Scope selectors

To be added

Prefixes

To be added. For now, see my forum question.
Analysing the first-mate code

This section looks at how the cson file is converted into a grammar by first-mate, and how the grammar is used to apply scopes to each character in the file. It will however focus more on what you as a package author can do, rather than the exact steps the first-mate package takes to apply scopes to everything.
To see the source code in its final form, you need to extract .atom/.apm/first-mate/<version>/package.tgz. This has been transpiled from CoffeeScript to JavaScript, and built using Grunt, so the files and directory structure will be different to the online source code. When writing this, I found it easiest to read the CoffeeScript version, while keeping in mind that the actual paths are those in the transpiled version.

Note that JavaScript doesn't have classes, but I'll call them that because the objects are all defined using the class syntax.

Abridged version

A short summary of every recognised property of every construct you as a package author have access to.
In the root level of the grammar file:

name
fileTypes
scopeName
foldingStopMarker (not used)
maxTokensPerLine
maxLineLength
limitLineLength
injections
injectionSelector
patterns
repository
firstLineMatch

In a normal patterns array object:

name
contentName
match
begin
end
patterns
captures
beginCaptures
endCaptures
applyEndPatternsLast
include
popRule
hasBackReferences
disabled

In depth

First of all, require("first-mate") provides three classes:

ScopeSelector
GrammarRegistry
Grammar

Their behaviour is described in the following sections, along with that of the other classes they depend on.
The ScopeSelector class

I don't know much about this one. It appears to be responsible for the scopes, and the specs show it supports some interesting syntax. Much of it is generated from a PEG.js file though, so it will be difficult to understand without additional knowledge.
If someone can explain prefixes and injections, that would be appreciated.
The GrammarRegistry class

Atom automatically creates and populates a GrammarRegistry instance. This class is available as atom.grammars, so definitely look at it in dev tools. For a specific grammar (next section) use atom.grammars.getGrammarForScopeName("<scope_name>").
Its job is to hold a group of grammars together, and provide helper functions. The steps taken to add grammars from packages are roughly as follows:


At some time or another, the method loadGrammars of the Package class is run for each package. This looks inside the package for a grammars folder, and if present it will attempt to find any .cson or .json files in there.


The method readGrammar uses the season package to read in the .cson and .json files in the grammars directory. Note that this means you could write a grammar in JSON format. It throws an error if the object is invalid, and then checks that the scopeName property is a non-empty string (throwing an error if it isn't).


It calls the method createGrammar, with the arguments of the file path and the newly formed object. This method sets the maxTokensPerLine and maxLineLength properties of the object (if not already present; if they are, the existing values are used instead). Additionally, it will then check the object's limitLineLength property. If this is set to false, it will set maxLineLength to Infinity, regardless of earlier steps.


It makes a new Grammar with the arguments of the global registry itself (this) and the (slightly modified) object. The section on grammar object creation is below.


When the grammar object returns, the method createGrammar continues and sets the property path to the original file path, returning the grammar object.


When this is returned, the loadGrammars method of the Package class continues. It sets the property packageName of the grammar object to the package's name, and the property bundledPackage to the package's bundledPackage property value. Finally, it pushes the grammar object to the array of grammars provided by that package. It also runs the function grammar.activate(), which pushes the grammar to the global registry.
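The line-limit handling in the createGrammar step can be sketched like this. It is an approximation written from the description above rather than the first-mate source: applyLineLimits and registryDefaults are hypothetical names, and the default numbers are placeholders.

```javascript
// Sketch of the line-limit handling described in the createGrammar step.
// `registryDefaults` stands in for the registry's own settings.
function applyLineLimits(grammarObject, registryDefaults) {
  const result = Object.assign({}, grammarObject);
  // Existing values win; the defaults only fill in what is missing.
  if (result.maxTokensPerLine == null) {
    result.maxTokensPerLine = registryDefaults.maxTokensPerLine;
  }
  if (result.maxLineLength == null) {
    result.maxLineLength = registryDefaults.maxLineLength;
  }
  // limitLineLength: false overrides whatever maxLineLength was set above.
  if (result.limitLineLength === false) {
    result.maxLineLength = Infinity;
  }
  return result;
}

const defaults = { maxTokensPerLine: 100, maxLineLength: 1000 }; // placeholder values
console.log(applyLineLimits({ scopeName: 'source.example' }, defaults));
console.log(applyLineLimits(
  { scopeName: 'source.example', maxLineLength: 50, limitLineLength: false },
  defaults
));
```

If this reading is right, limitLineLength: false is the way to remove the limit from a grammar file, since writing Infinity there directly stops the grammar from loading.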


The Grammar class

This is where a single set of rules for a given language is defined. In the walk through above, it is called with the global grammar registry (which was passed in via the Package class) and a slightly modified version of the CSON file.
Immediately, the GrammarRegistry it is called with is added as the property registry to the grammar object. Additionally, some select properties are looked at in the CSON file object. The following are directly added as properties to the grammar object:


name: explained in beginner guide. Used as a human friendly label in the language selection window.


fileTypes: explained in beginner guide. Array of file extensions used to score grammars against a given file. If not provided, an empty array will be automatically created.


scopeName: explained in beginner guide. The root scope applied to all characters, regardless of pattern matches.


foldingStopMarker: will be covered in intermediate tips. Seems more or less useless for now.


maxTokensPerLine: the maximum number of tokens that will be produced per line. Potentially added by the createGrammar method in the above walk through if not already set.


maxLineLength: the maximum number of characters that will be tokenized per line. When internally set to Infinity, it will have no limit. Potentially added or modified by the createGrammar method in the above walk through. Setting to Infinity directly in the grammar file results in an error and the grammar will not load.


The following are also recognised properties of the file object, but are processed somewhat before being added to the grammar object:


injections: the grammar property injections is set to the result of a new Injections called with the grammar object and the injections property of the file object. The Injections class is addressed in another section. Basically, it is a set of scope selectors, and rules that are applied when a matching scope is reached. They only work when the grammar is the active one. For when it's not active, injectionSelector must be used. For more on injections, see advanced tips.


injectionSelector: scopes to insert the grammar into when they occur in another grammar. For example, language-hyperlink uses 'text - string.regexp, string - string.regexp, comment, source.gfm' (the - sign means the following scope is not allowed; so it injects in text, but not when also in string.regexp). The grammar property is defined as the result of calling new ScopeSelector on the file object's injectionSelector property. If this is not defined, a value of null is used.


patterns: the grammar object's rawPatterns property is directly set to the patterns property of the file object.


repository: the grammar object's rawRepository property is directly set to the repository property of the file object.


firstLineMatch: if defined, it is used to create a new (oniguruma) regex object, which is then made the value of the grammar object's firstLineRegex property.


Finally, there are some properties of the grammar object that are created directly in the construction of the new grammar:


emitter: an emitter is a JavaScript object that lets other code register callbacks, which are then run whenever the emitter emits the corresponding event.


repository: initialised to null, but the method getRepository (not called during construction) sets it to a rule set using information in the rawRepository property, which in turn was set by the file object's repository.


initialRule: initialised to null, but the method getInitialRule (called during tokenization) sets it to @createRule({@scopeName, patterns: @rawPatterns}) if it doesn't already have a value (that's not null). It is the first set of rules that will be checked.


includedGrammarScopes: initialised to an empty array. When a separate grammar is included by this grammar, its scope name is added to this array.


Also bear in mind that some properties were added by the GrammarRegistry and Package classes when formed that way.
When the grammar object (from here: grammar) is constructed, it is not quite ready. At some point, the method tokenizeLines is called on the text of a file. This splits the text by line (\n character) and passes each one to the tokenizeLine method, each time passing in the ruleStack variable from the previous line.
In tokenizeLine, it does the following:


First, it checks for a long line (as determined by the maxLineLength property), cutting it down if needed.


It then converts the line into an OnigString, which is written in C and presumably makes the following steps quicker.


Now it checks if the ruleStack is not null. When called by tokenizeLines, the first call will have this null value.


If not null: the ruleStack is copied (shallow) and the scopeName and contentScopeName properties of each object it contains are pushed to another array, if they exist.


If null: it executes the method getInitialRule, which sets the grammar's initialRule property to a new Rule, called with the grammar object (this), its GrammarRegistry, and the options object {@scopeName, patterns: @rawPatterns}.


It pulls the scopeName and contentScopeName properties from initialRule, and sets these values as the first object in a ruleStack array. The rule stack keeps track of all currently active rules, and will only try to match the latest rule to be added to the stack.
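The idea of handing the rule stack from one line to the next can be shown with a toy tokenizer. This is a heavily simplified sketch and not how first-mate is written: the stack holds plain scope names instead of Rule objects, and the only "rule" is a hypothetical block comment opened by /* and closed by */.

```javascript
const COMMENT = 'comment.block.example'; // hypothetical scope name

// Tokenize one line, starting from the stack left by the previous line.
function tokenizeLine(line, ruleStack) {
  // A null stack means this is the first line: start from the root scope.
  const stack = ruleStack === null ? ['source.example'] : ruleStack.slice();
  const tokens = [];
  const emit = (value) => {
    if (value.length > 0) tokens.push({ value, scopes: stack.slice() });
  };
  let position = 0;
  while (position < line.length) {
    // Only the rule on top of the stack decides what is looked for next.
    const inComment = stack[stack.length - 1] === COMMENT;
    const found = line.indexOf(inComment ? '*/' : '/*', position);
    if (found === -1) {
      emit(line.slice(position));
      break;
    }
    if (inComment) {
      emit(line.slice(position, found + 2)); // closing marker belongs to the comment
      stack.pop();
    } else {
      emit(line.slice(position, found));     // text before the comment
      stack.push(COMMENT);
      emit('/*');                            // opening marker belongs to the comment
    }
    position = found + 2;
  }
  return { tokens, ruleStack: stack };
}

// Split by newline and thread the rule stack through every line.
function tokenizeLines(text) {
  let ruleStack = null;
  return text.split('\n').map((line) => {
    const result = tokenizeLine(line, ruleStack);
    ruleStack = result.ruleStack;
    return result.tokens;
  });
}

console.log(JSON.stringify(tokenizeLines('a /* b\nc */ d'), null, 2));
```

The second line starts inside the comment only because the stack from the first line was passed in; with a null stack it would begin at the root scope again.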


Further behaviour can be determined by reading the source code directly.
The Injections class

This class handles the patterns that are applied based on scope, when this grammar is the active one.
Properties:


grammar: the grammar class it was called with.


injections: an array of objects, with the properties selector and patterns. selector is a ScopeSelector class instance, formed from the key of each property in the grammar file's injections value. patterns is the same as ever, with at least the top level include statements resolved. I'll need to test if nested include statements are also found properly.


scanners: an object that is presumably to hold instances of the Scanner class.


The Rule class

A Rule is an object with some metadata properties and an array of Patterns. The Rule itself does not match any text. This is for the Patterns, and the special endPattern.
The Pattern class

Perhaps the most relevant class to a package author, besides the grammar class. This is the one that holds the regex and other properties you add.
A hidden property: disabled is checked when creating a new Rule.
Immediately added properties:


grammar: the grammar class it was called with


registry: the grammar registry class it was called with


include: a reference to a rule stored in the grammar repository


popRule: whether a match with this pattern should remove the rule it is a part of from the rule stack.


hasBackReferences: overrides automatic detection; if null, the class will instead check for back references using /\\\d+/ on the match.
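The backreference check, together with the way \\n in an end regex picks up text from the begin match, can be sketched as follows. The function names and the \begin / \end example are my own, and leaving an unknown group number untouched is a simplification, so treat this as an approximation rather than the first-mate code.

```javascript
// The detection regex quoted above: a backslash followed by digits.
function hasBackReferences(regexSource) {
  return /\\\d+/.test(regexSource);
}

// Escape text so it matches literally when spliced into a regex.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace each \n in an end pattern with the escaped text of begin group n.
function resolveBackReferences(endSource, beginCaptures) {
  return endSource.replace(/\\(\d+)/g, (reference, digits) => {
    const text = beginCaptures[Number(digits)];
    return text === undefined ? reference : escapeRegex(text);
  });
}

// Hypothetical rule: begin '\\\\begin\\{(\\w+)\\}' and end '\\\\end\\{\\1\\}'
const beginCaptures = /\\begin\{(\w+)\}/.exec('\\begin{verbatim}');
const endSource = '\\\\end\\{\\1\\}';
const resolved = resolveBackReferences(endSource, beginCaptures);

console.log(hasBackReferences(endSource));
console.log(resolved);
console.log(new RegExp(resolved).test('\\end{verbatim}'));
```

Because the end regex depends on what the begin regex captured, it can only be built once the begin match is known.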


Grammar file properties that are processed:


name: this is used to make @scopeName, which is then passed to @grammar.createRule()


contentName: same as name, but the internal property is contentScopeName


match: if end is a property or popRule is true, and it has backreferences (either set explicitly or detected by a quick regex), this value is set as the property @match. If not, it is set as the property @regexSource


begin: (this is only looked at if match doesn't exist). @regexSource is set to the begin value.