@laughinghan
Last active December 7, 2023 21:52
MathSON - JSON for Math Formulae

MathSON

Status: Draft 1 In Progress. This document is undergoing its first revision. Initial implementation has begun alongside editing Draft 1. Your feedback is hoped and dreamed of.

Mathematical Structured Object Notation is a JSON-based representation for most of the common subset of what LaTeX and Presentation MathML can represent, but distilled down to the essential content and structure. It can also represent diffs between two formulas in a similar fashion to ottypes/text and ottypes/rich-text, and the associated cursor positions are a better fit for editing math than the DOM Ranges associated with MathML.

Just show me some code already

Okay.

// the quadratic formula
> MathSON.fromLatex('x = \\frac{ -b \\pm \\sqrt{ b^2 - 4ac } }{ 2a }').ops
['x=',
 { numer: ['-b±', { sqrt: ['b', { sup: ['2'] }, '-4ac'] }],
   denom: ['2a'] }]

// diff: simplifying the derivative of square root
// \frac{1}{2} x^{ - \frac{1}{2} }
> var a = MathSON([{numer: ['1'], denom: ['2']}, 'x', {sup: ['-', {numer: ['1'], denom: ['2']}]}]);
// \frac{ 1 }{ 2 \sqrt{x} }
> var b = MathSON([{numer: ['1'], denom: ['2', {sqrt: ['x']}]}]);

> a.diff(b).ops
[{ _denom: [1, { sqrt: ['x'] }] },
 { delete_: ['x', { sup: ['-', { numer: ['1'], denom: ['2'] }] }] }]

When editing the quadratic formula, '2.numer.3.sqrt.5' represents a cursor in this position:
[Image: quadratic_formula]

Rationale

Currently, the viable Web formats for math formulae are:

  • AMS-LaTeX subset (as rendered by MathJax and KaTeX) is a compromise between human- and machine-readability, plus legacy/compat concerns.
  • MathML is a compromise between people who like XML and people who are reasonable...who am I kidding, MathML is even further unnecessarily complex than just due to XML (see "What do you have against MathML?")
  • obscure formats like AsciiMath that optimize even more for human-readability than TeX & friends, at the expense of machine-readability and simplicity

MathSON is intended to fill the niche of being easily parseable (unlike TeX) into a tree structure that's easy to understand and use (unlike MathML).

My primary motivation is MathQuill, my formula editor whose API uses an AMS-LaTeX subset to represent math. It's impractical to use this API to implement stuff like what typing a slash / does in MathQuill, which is to scan backwards until a + or similar and move the group into the numerator of a fraction (so typing 1+1/x yields 1+\frac{1}{x}). You'd need to parse the LaTeX into an AST, scan & modify the AST, then serialize the AST back to LaTeX. Which sucks, because MathQuill already has a perfectly good (well, not perfectly, but still) internal AST that it parses the LaTeX to, so that'd be so much wasted, duplicated parsing & serialization. (More on why MathQuill needs MathSON)

Completely unintentionally, this format turns out to be surprisingly useful for accessibility: for most math its tree structure is isomorphic to the corresponding MathSpeak, the speech protocol used by mathematician Abe Nemeth (inventor of the widely-used Nemeth Braille Code for Math). Notably, whereas MathML has lots of extraneous information that'd be ignored when converting to MathSpeak (like <mo> vs <mi> vs <mn>), that's all implicit in MathSON just like in MathSpeak; in other words, virtually anything explicit in MathSON is also explicit in MathSpeak. Even the cursor positions represent closely what a screen reader would read when navigating an editing interface for MathSpeak.

(It wasn't until we started work on just such an accessible math editing interface that I noticed this.)

This suggests that MathSON does in fact perfectly extract the essential content and structure of any given math.

Details

Proposed examples of the subset that is just math, not math diffs:

> MathSON.fromLatex('x = \\frac{ -b \\pm \\sqrt{ b^2 - 4ac } }{ 2a }')
['x=',
 { numer: ['-b±', { sqrt: ['b', { sup: ['2'] }, '-4ac'] }],
   denom: ['2a'] }]
> MathSON.fromLatex('\\frac{ \\sin x }{ x }')
[{ numer: [{ inline_op: ['sin'] }, 'x'], denom: ['x'] }]
> MathSON.fromLatex('\\sin\\left( \\frac{1}{x} \\right)')
[{inline_op: ['sin']}, {$left: '(', group: [{numer: ['1'], denom: ['x']}], $right: ')'}]

// KaTeX's homepage example:
> MathSON.fromLatex('f(x) = \\int_{-\\infty}^\\infty \\hat f(\\xi) e^{2 \\pi i \\xi x} d\\xi')
[
    "f(x)=∫",
    { "sub": ["-∞"], "sup": ["∞"] },
    { "hat": ["f"] },
    "(ξ)e",
    { "sup": ["2πiξx"] },
    "dξ"
]

Notes:

  • The top-level MathSON object is always an array. Arrays represent snippets of math known as "blocks". Arrays contain strings which represent math symbols, and objects which represent "commands" i.e. complex math notation like fractions and paren groups.
  • Command objects' keys will usually be letters-only (not even _ allowed), in which case the value must be a math block (an array of strings and objects, as described above; in the future we may allow arrays of arrays to support LaTeX's cases and matrix).
  • Command objects' keys can also be any string starting with a dollar sign $, these can have any JSON value (these are "attributes" rather than "content", basically).
  • What do the special inline_* keys do? They're for blocks of math that don't have a boundary or border that the cursor has to cross. At the edge of a normal block of math, like a square root or paren group, the cursor can cross between inside and outside; but at the left edge of sin, there's no inside or outside, the s, i, and n are "inline" in the containing block. At most one is ever allowed, and if present, all other keys must start with $ (i.e. be "attribute" keys, so that cursor position can make sense, see below).
  • This just uses Unicode rather than TeX and friends' backslash names for fancy math symbols, which strikes me as both simpler (it's just text!) and better specced (Unicode is a mess but at least has a standards body, nobody likes hunting through Plain TeX, LaTeX, AMS-LaTeX, nonstandard MathJax commands and more for the right backslash name for every symbol).
    • We do still have to restrict to a subset of Unicode and ban stuff like Unicode subscripts and superscripts.
    • For ASCII-only environments, JSON has built-in Unicode escape sequences (e.g. \u2264 for ≤).
  • Uniqueness: any given MathSON value is represented by exactly one JSON value. The serialization of a JSON value isn't unique, of course (e.g. whitespace insensitivity), but this means that deep comparison of JSON values tells you all you need to know about MathSON. For MathML, what if two trees are equivalent except for an id attribute? What about different lspace or minsize/maxsize attributes? Who knows?
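The structural rules in the notes above can be sketched as a small validator. This is only an illustration of the rules as stated, not part of any MathSON implementation: the names isBlock and isCommand are hypothetical, and it deliberately does not check the Unicode subset restrictions or whether a key belongs to a known command.

```javascript
// Hypothetical helpers: check the "just math" subset (no diff operations).

// A block is an array of strings (symbols) and command objects.
function isBlock(block) {
  return Array.isArray(block) && block.every(function (item) {
    return typeof item === 'string' || isCommand(item);
  });
}

function isCommand(cmd) {
  if (typeof cmd !== 'object' || cmd === null || Array.isArray(cmd)) return false;
  var keys = Object.keys(cmd);
  var inlineKeys = keys.filter(function (key) {
    return /^inline_[a-z]+$/i.test(key);
  });
  // at most one inline_* key is ever allowed
  if (inlineKeys.length > 1) return false;
  return keys.every(function (key) {
    // "attribute" keys start with $ and may hold any JSON value
    if (key.charAt(0) === '$') return true;
    // the single inline_* key must hold a block
    if (inlineKeys.indexOf(key) !== -1) return isBlock(cmd[key]);
    // letters-only "content" keys hold a block, and may not
    // appear alongside an inline_* key
    return inlineKeys.length === 0 && /^[a-z]+$/i.test(key) && isBlock(cmd[key]);
  });
}

console.log(isBlock([{ inline_op: ['sin'] }, 'x']));          // true
console.log(isBlock([{ inline_op: ['sin'], sup: ['2'] }]));   // false
```

Because every check is a plain key or type test on parsed JSON, the same logic ports directly to any language with a JSON library.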

Now, extend that to a math diff:

// \frac{1}{2} x^{ - \frac{1}{2} }
var a = MathSON([{numer: ['1'], denom: ['2']}, 'x', {sup: ['-', {numer: ['1'], denom: ['2']}]}]);
// \frac{ 1 }{ 2 \sqrt{x} }
var b = MathSON([{numer: ['1'], denom: ['2', {sqrt: ['x']}]}]);

a.diff(b) // => [{_denom: [1, {sqrt: ['x']}]}, {delete_: ['x', {sup: ['-', {numer: ['1'],
          //                                                                denom: ['2']}]}]}]

Note that just like in ottypes/text, an insert of a piece of MathSON is represented by "itself", a retain/skip is represented by a raw number (different from a numeral string), and a delete is represented by an object with a special key (but it's an invertible delete because, c'mon, diffs should be invertible). One new thing is a syntax to mutate an existing thing, using keys prefixed with _ or insert_ or delete_:

// \frac{1}{2} + \frac{1}{2} + x_1 + x_1 + x_2
var c = MathSON([{numer: ['1'], denom: ['2']}, '+', {numer: ['1'], denom: ['2']}, '+x', {sub: ['1']},
                 '+x', {sub: ['1']}, '+x', {sub: ['2']}]);

// \frac{x}{y}\frac{1}{2} + \frac{x1}{y2} + x_1^2 + x^2{}_1 + x^2
//   (the second-to-last one can be typed in MathQuill by typing x^2 y_1 and backspacing the y)
var d = MathSON([{numer: ['x'], denom: ['y']}, {numer: ['1'], denom: ['2']}, '+', {numer: ['x1'], denom: ['y2']},
                 '+x', {sub: ['1'], sup: ['2']}, '+x', {sup: ['2']}, {sub: ['1']}, '+x', {sup: ['2']}]);

c.diff(d) // => [{numer: ['x'], denom: ['y']}, 2, {_numer: ['x'], _denom: ['y']}, 2, {insert_sup: ['2']},
          //     2, {sup: ['2']}, 3, {delete_sub: ['2'], insert_sup: ['2']}]
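To make the operations concrete, here is a rough sketch of applying such a diff to a block. It rests on assumptions read off the examples rather than a spec: a retain counts one unit per symbol and one per command, anything left after the last operation is implicitly retained, and inline_* commands are not handled. The names applyDiff, explode, implode and isMutation are hypothetical.

```javascript
// Split a block into units: one per character, one per command object.
function explode(block) {
  var units = [];
  block.forEach(function (item) {
    if (typeof item === 'string') units.push.apply(units, Array.from(item));
    else units.push(item);
  });
  return units;
}

// Merge adjacent characters back into strings.
function implode(units) {
  var block = [];
  units.forEach(function (unit) {
    var last = block.length - 1;
    if (typeof unit === 'string' && typeof block[last] === 'string') block[last] += unit;
    else block.push(unit);
  });
  return block;
}

// Does this object mutate an existing command (keys like _denom, insert_sup, delete_sub)?
function isMutation(op) {
  return Object.keys(op).some(function (key) {
    return /^(_|insert_|delete_)[a-z]+$/i.test(key);
  });
}

function applyDiff(block, ops) {
  var units = explode(block), out = [], pos = 0;
  ops.forEach(function (op) {
    if (typeof op === 'number') {            // retain `op` units
      out = out.concat(units.slice(pos, pos + op));
      pos += op;
    } else if (typeof op === 'string') {     // insert symbols
      out = out.concat(explode([op]));
    } else if ('delete_' in op) {            // delete the listed units
      pos += explode(op.delete_).length;
    } else if (isMutation(op)) {             // mutate the command at pos
      var cmd = Object.assign({}, units[pos++]);
      Object.keys(op).forEach(function (key) {
        if (key.indexOf('insert_') === 0) cmd[key.slice(7)] = op[key];
        else if (key.indexOf('delete_') === 0) delete cmd[key.slice(7)];
        else cmd[key.slice(1)] = applyDiff(cmd[key.slice(1)], op[key]);
      });
      out.push(cmd);
    } else {                                 // insert a whole command
      out.push(op);
    }
  });
  return implode(out.concat(units.slice(pos)));
}

var a = [{numer: ['1'], denom: ['2']}, 'x', {sup: ['-', {numer: ['1'], denom: ['2']}]}];
var aToB = [{_denom: [1, {sqrt: ['x']}]},
            {delete_: ['x', {sup: ['-', {numer: ['1'], denom: ['2']}]}]}];
console.log(JSON.stringify(applyDiff(a, aToB)));
// [{"numer":["1"],"denom":["2",{"sqrt":["x"]}]}]
```

This sketch trusts the diff: it does not verify that the content of a delete_ matches what is actually in the block, which a real implementation would presumably want to check.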

(Alternative: ottypes/text is deliberately noninvertible, we could do the same to slightly simplify our syntax:

a.diff(b) // => [{_denom: [1, {sqrt: ['x']}]}, {_delete_: 2}]
c.diff(d) // => [{numer: ['x'], denom: ['y']}, 2, {_numer: ['x'], _denom: ['y']}, 2,
          //     {insert_sup: ['2']}, 2, {sup: ['2']}, 3, {_delete_: 'sub', insert_sup: ['2']}]

)
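The payoff of the invertible form is that a diff can be reversed mechanically, without looking at the document it applies to. A minimal sketch of that idea follows; the name invert is hypothetical, and it assumes the invertible syntax (delete_ carrying the deleted content), not the alternative above.

```javascript
// Reverse a diff: inserts become deletes, deletes become inserts,
// retains stay as they are, and mutations are reversed key by key.
function invert(ops) {
  var inverse = [];
  ops.forEach(function (op) {
    if (typeof op === 'number') {
      inverse.push(op);                                  // retain
    } else if (typeof op === 'string') {
      inverse.push({ delete_: [op] });                   // inserted symbols
    } else if ('delete_' in op) {
      inverse.push.apply(inverse, op.delete_);           // deleted items come back
    } else if (Object.keys(op).some(function (key) {
      return /^(_|insert_|delete_)[a-z]+$/i.test(key);
    })) {
      var mutation = {};
      Object.keys(op).forEach(function (key) {
        if (key.indexOf('insert_') === 0) mutation['delete_' + key.slice(7)] = op[key];
        else if (key.indexOf('delete_') === 0) mutation['insert_' + key.slice(7)] = op[key];
        else mutation[key] = invert(op[key]);            // _key: nested diff
      });
      inverse.push(mutation);
    } else {
      inverse.push({ delete_: [op] });                   // inserted command
    }
  });
  return inverse;
}

var aToB = [{_denom: [1, {sqrt: ['x']}]},
            {delete_: ['x', {sup: ['-', {numer: ['1'], denom: ['2']}]}]}];
console.log(JSON.stringify(invert(aToB)));
// [{"_denom":[1,{"delete_":[{"sqrt":["x"]}]}]},"x",{"sup":["-",{"numer":["1"],"denom":["2"]}]}]
```

With the noninvertible {_delete_: 2} form this is impossible, since the deleted content is no longer in the diff.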

Finally, cursor positions:

A cursor position is just a sequence of indices and keys (typically alternating, but cases and matrix may change that), always starting and ending with an index. For example, consider:
\frac{1}{2}x^{-\frac{1}{2|}}

[{numer: ['1'], denom: ['2']}, 'x', {sup: ['-', {numer: ['1'], denom: ['2']}]}]

To get to the cursor position, we start in the root block, go to its 2nd item (0-indexed) which is the superscript, go into its sup block, go to its 1st item which is the fraction, go into its denominator, and go to slice index 1 (slicing from index 0 would slice from before the 2). In JavaScript this could be mathObj[2].sup[1].denom[1]; for simplicity, in MathSON this is represented by the string '2.sup.1.denom.1'.
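As a sketch of how such a string might be resolved against a MathSON value: the helper names resolveCursor and itemAtIndex are hypothetical, the path is assumed to alternate strictly between indices and keys, and inline_* commands are ignored. Indices count one per symbol and one per command, so a multi-character string spans several indices.

```javascript
// Find what occupies a given index of a block: each character of a string
// takes one index, each command object takes one index.
function itemAtIndex(block, index) {
  var pos = 0;
  for (var i = 0; i < block.length; i++) {
    var item = block[i];
    var chars = typeof item === 'string' ? Array.from(item) : null;
    var size = chars ? chars.length : 1;
    if (index < pos + size) return chars ? chars[index - pos] : item;
    pos += size;
  }
  return undefined;
}

// Walk a cursor string like '2.sup.1.denom.1' down from the root block.
// Returns the block the cursor is in and the slice index within it.
function resolveCursor(root, path) {
  var steps = path.split('.');
  var block = root;
  // every (index, key) pair before the final index descends one level
  for (var i = 0; i + 1 < steps.length; i += 2) {
    var cmd = itemAtIndex(block, Number(steps[i]));
    if (typeof cmd !== 'object' || !Array.isArray(cmd[steps[i + 1]])) {
      throw new Error('Invalid cursor position: ' + path);
    }
    block = cmd[steps[i + 1]];
  }
  return { block: block, index: Number(steps[steps.length - 1]) };
}

var mathObj = [{numer: ['1'], denom: ['2']}, 'x', {sup: ['-', {numer: ['1'], denom: ['2']}]}];
console.log(JSON.stringify(resolveCursor(mathObj, '2.sup.1.denom.1')));
// {"block":["2"],"index":1}
```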

Note that these indices aren't quite array indices, since strings can span a range of indices. Consider:
ax^2+|by^2

['ax', {sup: ['2']}, '+by', {sup: ['2']}]

The cursor is in the middle of the string '+by', which is the 2nd item in the array, but in MathSON there are cursor positions between adjacent symbols, so the cursor is at index 4.
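The arithmetic can be spelled out with a small helper that converts "this many characters into the item at this array position" into a cursor index. The name cursorIndex is made up for this illustration, and inline_* commands are not considered.

```javascript
// Sum the index span of every item before `itemIndex` (string length for
// strings, 1 for commands), then add the offset within the item itself.
function cursorIndex(block, itemIndex, offset) {
  var index = 0;
  for (var i = 0; i < itemIndex; i++) {
    index += typeof block[i] === 'string' ? Array.from(block[i]).length : 1;
  }
  return index + offset;
}

var block = ['ax', {sup: ['2']}, '+by', {sup: ['2']}];
// 'ax' spans 2 indices, the superscript spans 1, and the cursor sits
// 1 character into '+by': 2 + 1 + 1 = 4
console.log(cursorIndex(block, 2, 1)); // 4
```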

There is one special case, inline_* blocks. Whereas normal commands only count for one index increment, inline_* blocks are like strings, they can span a range of indices. Consider:
\frac{\sin x + \co|s x}{x}

[{numer: [{inline_op: ['sin']}, 'x+', {inline_op: ['cos']}, 'x'],
  denom: ['x']}]

In this case, the cursor position is '0.numer.7', there isn't a step in the cursor position where we go "into" the cos. This makes sense if you consider what happens between the cos and the x: there is no going "into" or coming "out of" the cos, from the cursor's perspective the c, o, and s are at the same "level" as the x.

This is also why inline_* commands may only have "attributes" but no "content" child blocks. If the cos block had a child: ['y'], what would be the cursor position of a cursor next to the y? The c is index 5, the o is index 6, the s is index 7, but what index is the command with a .child?
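One way to picture this is to flatten a block into the units that the cursor steps over, splicing the contents of inline_* commands into their container. This is a sketch only; indexUnits is a hypothetical name, and it assumes a command has at most one inline_* key as required above.

```javascript
// List the units a cursor index counts within a block: each character,
// each normal command, and the *contents* of each inline_* command.
function indexUnits(block) {
  var units = [];
  block.forEach(function (item) {
    if (typeof item === 'string') {
      units.push.apply(units, Array.from(item));
      return;
    }
    var inlineKey = Object.keys(item).filter(function (key) {
      return key.indexOf('inline_') === 0;
    })[0];
    if (inlineKey) units.push.apply(units, indexUnits(item[inlineKey])); // no boundary to cross
    else units.push(item);                                               // one index increment
  });
  return units;
}

var numer = [{inline_op: ['sin']}, 'x+', {inline_op: ['cos']}, 'x'];
var units = indexUnits(numer);
console.log(units.join(' '));            // s i n x + c o s x
console.log(units.slice(5, 8).join('')); // cos
```

Cursor position '0.numer.7' is then the gap just before units[7], the s of cos.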

Did you notice the extensibility?

Nothing about diffs or cursor positions is math-specific. We could use this for rich text:

['This sentence has both ', {inline_text: ['bold'], $bold: true}, ' and ',
 {inline_text: ['italic'], $italic: true}, ' words.']

and the diff and cursor position definitions (and hopefully, implementations) would work equally well.

I'm not sure what to call that—MathSON Level 0, Base MathSON, Core MathSON, DocSON, EditSON, EdSON—but I think it's very important. MathQuill's edit tree and associated cursor and selection model was originally designed by Jeanine and then haphazardly evolved by me basically just to generalize the cases we thought of at the time (fractions, square roots, and paren groups I guess—we also had supsubs but the tree model already didn't generalize well to them—hence the double-layered tree where blocks have a variable number of commands, each with some fixed number of blocks).

We didn't, couldn't, and can't think about all the other math notation supported by TeX and friends that we want to eventually support. There will be continuing work to add commands to MathQuill, and that needs to be possible without having to change the underlying tree and cursor model that everything else relies on. In fact, a safe tree and cursor model opens up entirely new API possibilities, since the lack of safety is a key reason that MathQuill's tree and cursor are super hidden away from the API (see "For MathQuill, this is more than just a notation." below).

By the by, this is why inline_* needs to be a block (array) and not just a string, even though the only immediate use-case is operator names like sin whose contents are only ever strings. MathSON Level 0 shouldn't know about only being strings, it only knows about cursor position semantics. And, I can totally imagine use cases that aren't just strings, like exotic sup/sub, or like if in the rich text example above, a bold region of text had some math in it:

{$bold: true, inline_text: ['7⋅10', {sup: ['2']}, ' weight bold']}

Separate from MathSON Level 0, there of course needs to be a spec for MathSON Level 1 listing the kinds of commands accepted, {numer, denom} for fractions, {$left, group, $right} for paren/bracket/brace groups, {sup, sub}, etc.

"Isn't this just the 'Extensible Markup Language' + MathML with a different syntax?"

First of all, syntax matters. Syntax is a UI, and shapes every interaction that people have with something.

Secondly, syntax isn't even that big a part of MathSON Level 0, as described. I've talked enough about semantics that it's more like XML + DOM (including DOM Ranges, kinda analogous to cursor positions).

Thirdly, by relegating e.g. Unicode escaping and most well-formedness concerns (matching braces etc) to the "lower level" JSON spec, JSON + MathSON Level 0 are better organized than XML which deals with all of that in one monolithic spec.

Finally, you know what's crazy? Even with all that, JSON + MathSON Level 0 combined is still simpler than XML alone. There are no intricate whitespace semantics, no Text vs CharacterData vs Comments vs Processing Instructions, no custom character entity references, no self-closing tags. Hell, the only consideration we have to make that XML doesn't (that isn't because XML is missing a feature we need), as far as I can think of, is that JSON inherited JS's UTF-16 surrogate pairs for "astral plane" Unicode characters, and I dunno how our indices should treat those.
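To illustrate the surrogate pair question (this only demonstrates the ambiguity, it does not settle it): a double-struck X, one of the Mathematical Alphanumeric Symbols, is a single character but two UTF-16 code units, so the two obvious ways of counting give different cursor indices.

```javascript
// U+1D54F MATHEMATICAL DOUBLE-STRUCK CAPITAL X lies outside the
// Basic Multilingual Plane, so JS strings store it as a surrogate pair.
var block = ['a' + String.fromCodePoint(0x1D54F) + 'b'];

var utf16Units = block[0].length;              // counts each surrogate half
var codePoints = Array.from(block[0]).length;  // counts whole characters

console.log(utf16Units, codePoints); // 4 3
```

Counting code points avoids cursor positions that would fall between the two halves of a character, but it means an index is no longer a plain JS string offset.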

(See also "Wait so, what do you have against XML?")

Open Questions

  • should commands have a type? Seems unnecessary to me
  • full words (numerator, subscript) or abbreviations (numer, sub)?
  • should the format be even more minimal? Currently arrays are required in more places than are strictly necessary for Level 0 to be unambiguous, for example one-half (1/2) is [{numer: ['1'], denom: ['2']}] when it could be {numer: '1', denom: '2'} instead. I prefer arrays because I think it makes it clearer why the cursor position rightward of the 2 is 0.denom.1, for instance, whereas without the outer array making the root block explicit it seems like it should just be denom.1 or something
  • should there be a "noncanonical" variant where prohibited Unicode characters like subscript and superscript characters are allowed, and a canonicalization that'll convert them into "proper" {sub: ...} objects? (Folding them into nearby ones as necessary)
    • what about the goddamned Mathematical Alphanumeric Symbols? Bold, italic, serif/sans-serif/monospace clearly need to be canonicalized as a font style thing, but what about calligraphic, fraktur, and double-struck? Do we have to use a different font? MathQuill doesn't; then again, MathQuill's font, Symbola, doesn't support that full range, only the subset that's actually in the Letterlike Symbols block.
    • the "noncanonical" variant could also feature unmerged consecutive strings (i.e. canonicalize(['ab', 'cd']) => ['abcd'])
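A possible shape for that canonicalization, as a sketch under several assumptions: the name canonicalize comes from the bullet above, but everything else is hypothetical, only Unicode superscript and subscript digits are handled, a run of consecutive script characters becomes one command, and folding into a neighboring {sup: ...} or {sub: ...} that was already present is left out.

```javascript
var SUPERSCRIPTS = { '⁰': '0', '¹': '1', '²': '2', '³': '3', '⁴': '4',
                     '⁵': '5', '⁶': '6', '⁷': '7', '⁸': '8', '⁹': '9' };
var SUBSCRIPTS = { '₀': '0', '₁': '1', '₂': '2', '₃': '3', '₄': '4',
                   '₅': '5', '₆': '6', '₇': '7', '₈': '8', '₉': '9' };

function canonicalize(block) {
  var out = [];
  var run = null; // the {sup} or {sub} command being built from consecutive script characters
  block.forEach(function (item) {
    if (typeof item !== 'string') {
      // copy the command, canonicalizing every child block but leaving $attributes alone
      var cmd = {};
      Object.keys(item).forEach(function (key) {
        cmd[key] = key.charAt(0) === '$' ? item[key] : canonicalize(item[key]);
      });
      out.push(cmd);
      run = null;
      return;
    }
    Array.from(item).forEach(function (ch) {
      var key = ch in SUPERSCRIPTS ? 'sup' : ch in SUBSCRIPTS ? 'sub' : null;
      if (key === null) {
        // ordinary symbol: merge into the preceding string if there is one
        run = null;
        if (typeof out[out.length - 1] === 'string') out[out.length - 1] += ch;
        else out.push(ch);
        return;
      }
      var digit = key === 'sup' ? SUPERSCRIPTS[ch] : SUBSCRIPTS[ch];
      if (run && run[key]) {
        run[key][0] += digit;      // extend the current run
      } else {
        run = {};
        run[key] = [digit];
        out.push(run);             // start a new {sup} or {sub} command
      }
    });
  });
  return out;
}

console.log(JSON.stringify(canonicalize(['ab', 'cd'])));
// ["abcd"]
console.log(JSON.stringify(canonicalize(['x²', '+y₁₀'])));
// ["x",{"sup":["2"]},"+y",{"sub":["10"]}]
```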

Asides

For MathQuill, this is more than just a notation.

This is a way of life.

Really though, I'm so excited about this as an API to manipulate MathQuill's tree structure, even by internal code. MathQuill's internal tree manipulation API is so prone to becoming ill-formed if you sneeze at it that there are 750 lines of 89 tests for paren typing behavior, to make sure that the tree and cursor don't become ill-formed in the course of the manipulation in all the different cases. There are intrinsically a lot of cases, don't get me wrong... but for any given case, there are 4 or 8 tests checking the same paren typing behavior in similar tree shapes. That's not cool.

One major source of bugs in particular has been that the cursor position is represented by pointers to nodes in the tree, and that can easily become ill-formed due to simple modifications to the tree. (#429 is an example of this class of bugs that was fixed not that long ago.) This is actually kind of a blocker for exposing the tree and cursor to manipulation by external API calls: how do we ensure well-formedness without the API feeling like moving piles of rice around with tweezers (like if all you had was cursor.moveLeft() and cursor.moveRight() or something)? Well, how come flat text fields don't have this problem? The answer is in the data model.

In flat text fields, a cursor position is an index, so even if it is ill-formed (i.e. out of bounds), the right way to normalize it is obvious, just clamp it to the nearest bound. By contrast, in MathQuill's current representation where the cursor position is pointers to tree nodes, if the cursor's parent is a detached node, there's no obvious way to normalize that into where the cursor "should" be. However, if the cursor position is a path through the ancestors like proposed here, normalization is obvious, put the cursor in the deepest ancestor that still exists.
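A sketch of what that normalization could look like for the path strings proposed here. The names normalizeCursor and unitsOf are hypothetical, inline_* commands are ignored, and placing the cursor at the clamped index where the path breaks is one reasonable choice rather than a settled rule.

```javascript
// Units of a block: one per character, one per command object.
function unitsOf(block) {
  var units = [];
  block.forEach(function (item) {
    if (typeof item === 'string') units.push.apply(units, Array.from(item));
    else units.push(item);
  });
  return units;
}

// Follow the path as far as the tree still allows, then stop in the
// deepest block that exists, clamping the index like a flat text field.
function normalizeCursor(root, path) {
  var steps = path.split('.');
  var block = root;
  var kept = [];
  for (var i = 0; ; i += 2) {
    var units = unitsOf(block);
    var requested = parseInt(steps[i], 10) || 0;
    var index = Math.min(Math.max(requested, 0), units.length);
    var key = steps[i + 1];
    var cmd = units[index];
    // only descend if the index was in bounds and the command still has that child block
    var child = key !== undefined && requested === index && typeof cmd === 'object'
      ? cmd[key] : undefined;
    if (!Array.isArray(child)) return kept.concat(index).join('.');
    kept.push(index, key);
    block = child;
  }
}

// The cursor was inside the square root, which has since been deleted:
var tree = [{numer: ['1'], denom: ['2']}];
console.log(normalizeCursor(tree, '0.denom.1.sqrt.1')); // 0.denom.1
// An out-of-bounds index is clamped, just like in a flat text field:
console.log(normalizeCursor(tree, '0.numer.9'));        // 0.numer.1
```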

And externally, of course, the LaTeX imported and exported by MathQuill isn't meant to be human-edited (the point of MathQuill is to edit math visually, which is more human-readable than a text format could ever be), so LaTeX compromising machine-readability for human-readability doesn't really serve MathQuill well. MathQuill needs a format where the overriding concern is being dead simple for machines to read, possibly at the expense of human-readability.

"Why don't super/subscripts have a base?"

"...like they do in both MathML and KaTeX's AST?"

Because that lets you do stuff like {\frac{ \frac{1}{2} }{3} + 1 + 2}^2:
[Image: {\frac{ \frac{1}{2} }{3} + 1 + 2}^2 rendered with KaTeX]

What the hell is that? How do you edit that? How do you show whether the cursor is inside or outside the base of the super/subscript? There's no analogue when writing math on a whiteboard.

Note that something like it is still possible with e.g.:

{inline_base: [{numer: ['1'], denom: ['2']}, '+1', {sup: ['2']}]}

and a special relationship between the containing thingy and the sup node.

"What do you have against MathML?"

Okay so, (Presentation) MathML is supposed to, more or less, represent the same data (structure and content) as the relevant AMS-LaTeX subset, but more machine-readable and amenable to the horrifying existing ecosystem of XML tools, right? What are the other reasons people think everything should be in XML? Tim B-L talked about "the fruits of well-formed systems" but like, TeX and friends don't suffer from the rapidly evolving incompatibilities that HTML had, nor the ill-formedness problems inherent to SGML descendants like <b><i>LOL</b></i>, it's not like influential TeX tools are forgiving of unmatched braces and handling them in undocumented, ill-understood ways.

Okay so great, MathML lets you leverage existing XML tools for parsing and stuff, maximizing your synergy for win-win solutions, etc. Which is great if you necessarily need a format that makes parsing and stuff hard. But wouldn't it be even better if you could use a format so trivially simple that parsing and stuff is easier to do by hand than it would be to configure and use giant heavyweight XML parsing tools?

Even beyond the whole XML thing, MathML is unnecessarily complex, encompassing aspects of semantics or presentation that fundamentally are neither structure nor content. Especially having to specify <mo> vs <mi> vs <mn>, whereas in LaTeX that's implicit in the normal case, yet no one worries that LaTeX isn't expressive enough compared to MathML.

This is even more apparent contrasting with MathSpeak. <mo> vs <mi> vs <mn>? Not even representable, should belong to Content MathML or OpenMath. Attributes like form or lspace or stretchy? Ignored, belongs solidly in the domain of visual display styling. <mrow>? Is that meaningful to anyone? LaTeX lets you put braces {} anywhere, which leads to shitty situations with super/subscripts that aren't representable in either MathSpeak or MathSON.

"Wait so, what do you have against XML?"

Well, I could attempt thoughtful, balanced reasoning of why XML's tradeoffs are a poor fit for math, but if I can't do better than this HN commenter, is it really worth it? Instead I shall present a more visceral argument.

"Is parsing XML really that hard and heavyweight?" Look, parsing XML isn't hard like parsing HTML is hard, but just look at this JSON:

[
  {
    numer: ['1'],
    denom: ['2']
  },
  'x',
  {
    sup: [
      '-',
      {
        numer: ['1'],
        denom: ['2']
      }
    ]
  }
]

In MathML, that'd be, what:

<math>
  <mfrac>
    <mn>1</mn>
    <mn>2</mn>
  </mfrac>
  <msup>
    <mi>x</mi>
    <mrow>
      <mo>-</mo>
      <mfrac>
        <mn>1</mn>
        <mn>2</mn>
      </mfrac>
    </mrow>
  </msup>
</math>

Don't worry if that's not valid MathML, this is about XML. My point is, here's how to get to the 2 in the exponent in MathSON:

mathObj[2].sup[1].denom

(That's JS but it'd be similarly straightforward in Python or Ruby or whatever.) By comparison, in MathML:

mathTree.children[1].children[1].children[1].children[1]

That gets you the <mn>, by the way, not the Text node containing the string '2'. There's a difference. Now, which would you rather deal with? These generic tree node things, or plain old dictionaries and arrays?

Credits

This gimmick:

Just show me some code already

Okay.

was blatantly stolen from @jneen's literary masterpiece.

this.MathSON = (function () {
  // Unicode symbols that need a LaTeX control sequence
  var unicodeToLatex = {
    '±': 'pm'
  };

  // which class serializes a command, looked up by the command's first key
  var keyToCommand = {
    numer: Fraction,
    denom: Fraction,
    sqrt: SquareRoot,
    sup: SupSub,
    sub: SupSub
  };

  function Fraction(cmd) { this.cmd = cmd; }
  Fraction.prototype.toLatex = function () {
    return [
      '\\frac{',
      this.cmd.numer.toLatex(),
      '}{',
      this.cmd.denom.toLatex(),
      '}'
    ].join('');
  };

  function SquareRoot(cmd) { this.cmd = cmd; }
  SquareRoot.prototype.toLatex = function () {
    return [
      '\\sqrt{',
      this.cmd.sqrt.toLatex(),
      '}'
    ].join('');
  };

  function SupSub(cmd) { this.cmd = cmd; }
  SupSub.prototype.toLatex = function () {
    // a command may have a sub, a sup, or both
    return [
      this.cmd.sub ? '_{' + this.cmd.sub.toLatex() + '}' : '',
      this.cmd.sup ? '^{' + this.cmd.sup.toLatex() + '}' : ''
    ].join('');
  };

  function MathSON(ops) {
    if (!Array.isArray(ops)) throw new Error('Need Array, got ' + JSON.stringify(ops));
    if (!(this instanceof MathSON)) return new MathSON(ops);
    this.ops = ops;
  }

  MathSON.prototype.toLatex = function () {
    return this.ops.map(function (op) {
      // for strings, translate Unicode chars to LaTeX, like ± to \pm
      if (typeof op === 'string') {
        return op.split('').map(function (ch) {
          if (ch in unicodeToLatex) {
            return '\\' + unicodeToLatex[ch] + ' ';
          }
          return ch;
        }).join('');
      }
      // for objects, wrap each child block in a MathSON and pass
      // $attributes through untouched, then let the command serialize itself
      if (typeof op === 'object' && op !== null) {
        var Command = keyToCommand[Object.keys(op)[0]];
        if (Command) {
          var cmd = {};
          Object.keys(op).forEach(function (key) {
            if (/^[a-z]+$/i.test(key)) cmd[key] = MathSON(op[key]);
            else if (key.charAt(0) === '$') cmd[key] = op[key];
            else throw new Error("Unexpected key '" + key + "'");
          });
          return new Command(cmd).toLatex();
        }
      }
      throw new Error('Unexpected ' + JSON.stringify(op));
    }).join('')
      // drop every space that isn't followed by a letter, e.g. the one after \pm before \sqrt
      .replace(/ (?![a-z])/ig, '');
  };

  MathSON.toLatex = function (ops) {
    return MathSON(ops).toLatex();
  };

  return MathSON;
}());

// tests
console.assert(MathSON([{numer: ['1'], denom: ['2']}]).toLatex() === '\\frac{1}{2}');
console.assert(MathSON.toLatex(['x=',
  { numer: ['-b±', { sqrt: ['b', { sup: ['2'] }, '-4ac'] }],
    denom: ['2a'] }]) === "x=\\frac{-b\\pm\\sqrt{b^{2}-4ac}}{2a}");
console.assert(MathSON.toLatex(['x', { sub: ['1'], sup: ['2'] }]) === 'x_{1}^{2}');

d3an commented Jan 31, 2021

Not sure if this factors into your research, but MathJSON seems to be a decent standard for Math represented in JSON. I haven't used it, but there's also a corresponding library called MathLive for rendering Math in web components. Examples on their website: https://mathlive.io/examples/. Again, not sure if this relates to your research.
