@msteen
Created April 11, 2020 11:27

Introduction

This document describes Tagdown, a lightweight markup language (LML) designed for rich-text editing within knowledge-based systems. The first application of the language will be a personal knowledge base (PKB), but it is also suited for information systems in general, for example a content management system (CMS).

The reason for designing an LML is to keep the language agnostic to any particular toolset. For example, although a rich-text editor will be made for the language, making it an LML means it will also be readable and writable in any generic text editor without the need for special support. This makes the language more future proof and more versatile.

LMLs are markup languages that are also intended to be easy to read and write in their raw form. In contrast to regular markup languages such as HTML, where the focus is on the final output being easy to read, LMLs should also be easy to work with in their raw form. Examples of LMLs are AsciiDoc, Markdown, MediaWiki, and Org-mode.

In contrast to (most) other LMLs, Tagdown is designed to be as minimal as possible, as long as this does not conflict too much with ease of use. Ease of use is after all one of the key features of an LML, so losing it would mean the language no longer qualifies as an LML. We choose minimalism as a main driver of our design because it in turn brings about the kind of properties we would like to see in language design. It also means that everything in the language needs a clear purpose and convincing arguments as to why it should be included.

Besides being minimal, Tagdown is designed to be consistent, for example by using consistent syntax and preferring generic concepts that can be reused in multiple scenarios over concepts that are specialized for their specific scenario.

To facilitate the more advanced needs of knowledge base systems, where plain text is not sufficient to encode the intended knowledge or would limit its use, Tagdown needs to be able to add semantical structure to text. Markdown does this by introducing custom markers for each concept of the language, such as surrounding a word with ** to give it a bold text style and prefixing headings with one or more # depending on the level. Although this approach works well for Markdown's intended use as an LML for rich text, it does not fit well with adding more complex semantical structure to text, i.e. embedding knowledge in text. Markdown is designed to support only a fixed set of semantical structures that are often needed within rich-text documents, while knowledge bases need to support any number of semantical structures that are not known ahead of time, i.e. generic semantical structures. Those structures would, for example, allow a knowledge base to more easily reuse the knowledge within it. Tagdown therefore uses tags to enrich text with support for generic semantical structures. Where Markdown uses custom markers to format text, Tagdown uses tags, hence the name Tagdown.

The kind of design that Markdown follows is one that is almost never complete. There will likely be new rich-text needs that require new custom markers to be added to the language, as showcased by the many Markdown extensions that exist. Of course, when aiming for a complete language, new insights might still lead to changes, but these will mostly be improvements, and new additions to the language will be unlikely unless there is a very good reason for them (at least when following minimal design).

Keeping in line with minimalism and being easy to use, Tagdown is thus designed to be a mixed-content language (such as HTML) of text and tags, where the syntax for tags is optimized to be easy to write and maintain even in its raw text format.

The syntactical choices of Tagdown should not be reflected in the internal representation, i.e. the representation is syntax agnostic. This leaves room for future improvements, because old syntax does not have to remain supported. A consequence of this choice is that Tagdown is not lossless with regard to its syntax (i.e. input != print(parse(input))), only lossless in its internal representation (i.e. parse(input) == parse(print(parse(input))) or repr == parse(print(repr))).

In Tagdown everything written is considered intentional, including the failure to parse a tag: in that case the writer apparently meant to write that particular bit as text. For this reason the parser will always succeed without reporting any errors. This also means that whether something is an attribute should not be context dependent; it either is or is not an attribute.
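The two properties above (losslessness only at the level of the internal representation, and a parser that never rejects its input) can be illustrated with a toy model. The following is a minimal sketch, not the Tagdown parser: it only understands inline tags of the form {name} and {name: content}, ignores attributes, escapes, line tags and block tags, and assumes that a single space after : is optional syntax. The names Tag, parse and render are made up for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Tag:
    name: str
    content: List["Node"] = field(default_factory=list)


Node = Union[str, Tag]


def _parse_tag(src: str, pos: int) -> Optional[Tuple[Tag, int]]:
    """Try to parse an inline tag starting at src[pos] == '{'; None on failure."""
    i = pos + 1
    while i < len(src) and src[i] not in "{}:":
        i += 1
    name = src[pos + 1:i]
    if not name or i >= len(src):
        return None
    if src[i] == "}":
        return Tag(name), i + 1
    if src[i] != ":":
        return None
    i += 1
    if i < len(src) and src[i] == " ":
        i += 1  # assumption of this sketch: one space after ':' is syntax, not content
    content, i = _parse_nodes(src, i, top=False)
    if i >= len(src) or src[i] != "}":
        return None
    return Tag(name, content), i + 1


def _parse_nodes(src: str, pos: int, top: bool) -> Tuple[List[Node], int]:
    nodes: List[Node] = []
    buf: List[str] = []
    while pos < len(src):
        ch = src[pos]
        if ch == "}" and not top:
            break  # closes the enclosing inline tag
        if ch == "{":
            result = _parse_tag(src, pos)
            if result is not None:
                if buf:
                    nodes.append("".join(buf))
                    buf = []
                tag, pos = result
                nodes.append(tag)
                continue
        buf.append(ch)  # a tag that fails to parse is simply text
        pos += 1
    if buf:
        nodes.append("".join(buf))
    return nodes, pos


def parse(src: str) -> List[Node]:
    """Never raises: anything that is not a valid tag ends up as text."""
    return _parse_nodes(src, 0, top=True)[0]


def render(nodes: List[Node]) -> str:
    """Print the representation back using one canonical spelling."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            out.append(node)
        elif node.content:
            out.append("{" + node.name + ": " + render(node.content) + "}")
        else:
            out.append("{" + node.name + "}")
    return "".join(out)
```

In this sketch, {foo:bar} and {foo: bar} parse to the same representation, so printing the parse result of the former does not reproduce the input, while parsing the printed text does reproduce the representation. An unterminated tag such as {foo: bar is returned as plain text instead of raising an error.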

Language concepts

As described in the introduction, the language will consist of mixed-content with text and tags, where the tags need to be able to represent semantical structures. Given that, we make the following points to further define the concepts found within the language:

  • Tags can be used to mark content, thus tags can have content.
  • The semantical structures can be nested, tags represent the semantical structures, thus tags can contain other tags. These tags act like attributes of the tag that contains them, so we call them as such. Note that attributes are themselves thus also tags.
  • Tags can have attributes which cannot be included as content, because it might incorrectly be interpreted as content. For example, a code tag takes its content literally.
  • In the case where the tag content is not taken literally, we still might want to write an attribute within the content, e.g. when you want to describe something about the attribute.

We thus end up with the following concepts:

  • text
  • tag
  • content
  • literal content
  • attribute
  • content attribute

Oxford dictionary

Looking at the dictionary definition of a word for a concept helps to confirm it is the right word for its use. These are the definitions of the words used as concepts:

  • text: a book or other written or printed work, regarded in terms of its content rather than its physical form.
  • tag: a label attached to someone or something for the purpose of identification or to give other information.
  • attribute: a quality or feature regarded as a characteristic or inherent part of someone or something.
  • content: the things that are held or included in something.
  • literal: representing the exact words of the original text.

Why not use XML instead?

On the surface, Tagdown is rather similar to XML: it consists of mixed content of text and tags, tags can have attributes, it supports literals through CDATA, etc. However there are some key differences:

  • XML is not a lightweight markup language, which is showcased by the many templating systems designed towards making it easier to write HTML in plain text.
  • Attributes in XML can only hold text; they are not tags themselves like in Tagdown. This makes XML less consistent, since some tags serve the same function as attributes do, for example the <title> HTML tag, but you are not allowed to instead define e.g. <head title="...">.

Even though XML will not be used as the language to write in, it is still very valuable as a translation target, since XML has a lot of mature tooling to choose from. To convert the way Tagdown approaches attributes to XML, metadata is necessary to differentiate between attributes and contents. Even though XML sometimes leverages tags to represent attributes of their parent tag (an obvious example would be the <meta> tag in HTML), XML does not differentiate between tags acting like attributes and any other tags. To prevent naming collisions, the metadata will need to be prefixed with an XML namespace.

For the XML namespace the following options were considered:

  • t: Short for tag, which would make for the shortest possible namespace, but does affect readability.
  • td: Short for Tagdown, but that would result in cementing the language name, making it harder to change it later on.
  • tag: Still short enough to not be cumbersome, but it leaves no room for interpretation. It also matches the jargon of its use case, i.e. a tag attribute is marked with <tag:attr>.

For the reasons listed under each of the XML namespace options, we will be using tag as the XML namespace.

The following two translations will be made available. The reason to support both is that each has distinct pros and cons and neither is objectively better than the other; which one fits best will depend on the use case.

Both translations translate text and tags directly, other than the necessary escaping. Tagdown attributes, however, cannot be directly translated to XML attributes, so the two translations differ in how they translate Tagdown attributes to XML.

The first translation turns attributes into XML tags, just like tags themselves, and places them as the very first children of the XML tag that owns them. On its own this would not allow differentiating between attributes and tags found in the contents, so we also wrap every attribute XML tag in <tag:attr>. For the edge case where a content attribute is the very first thing in the contents, which would make it indistinguishable from the other attributes, we add the XML attribute tag:content="true" to content attributes. The pros of this approach are that 1) it makes it easy to query any attribute, content attributes included, by searching for tag:attr, and 2) the hierarchy of the tags matches the XML tags directly, making XPaths like /foo/bar possible. The con is that the XML tag contents are poisoned with the additional attribute XML tags, making operations that work on contents potentially more convoluted. An example:

{foo:}
{@bar:} test
  {@baz:} test

Would result in the following XML (if pretty printed):

<foo>
  <tag:attr>
    <bar>test</bar>
  </tag:attr>
  <tag:attr tag:content="true">
    <baz>test</baz>
  </tag:attr>
</foo>
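As a rough illustration of the first translation, the following sketch builds the XML above with Python's xml.etree.ElementTree. The Tag class, the to_xml_first function and the namespace URI urn:example:tagdown are assumptions made for this example; the document only fixes the tag prefix. For brevity the sketch keeps regular attributes in attrs and models a content attribute as a Tag with is_attr set that sits inside content.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Union

TAG_NS = "urn:example:tagdown"  # hypothetical URI; only the "tag" prefix comes from the text
ET.register_namespace("tag", TAG_NS)


def q(local: str) -> str:
    """Qualified name in the tag namespace, in ElementTree's {uri}local notation."""
    return "{" + TAG_NS + "}" + local


@dataclass
class Tag:
    name: str
    attrs: List["Tag"] = field(default_factory=list)
    content: List[Union[str, "Tag"]] = field(default_factory=list)
    is_attr: bool = False


def _append_text(parent: ET.Element, text: str) -> None:
    """ElementTree stores mixed content as .text on the parent and .tail on children."""
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def to_xml_first(tag: Tag) -> ET.Element:
    el = ET.Element(tag.name)
    # Regular attributes become the very first children, each wrapped in <tag:attr>.
    for attr in tag.attrs:
        ET.SubElement(el, q("attr")).append(to_xml_first(attr))
    for item in tag.content:
        if isinstance(item, str):
            _append_text(el, item)
        elif item.is_attr:
            # Content attributes are additionally marked with tag:content="true".
            wrapper = ET.SubElement(el, q("attr"), {q("content"): "true"})
            wrapper.append(to_xml_first(item))
        else:
            el.append(to_xml_first(item))
    return el


doc = Tag(
    "foo",
    attrs=[Tag("bar", content=["test"], is_attr=True)],
    content=[Tag("baz", content=["test"], is_attr=True)],
)
root = to_xml_first(doc)
xml_text = ET.tostring(root, encoding="unicode")
```

Searching the result for every tag:attr element, e.g. with root.findall(q("attr")), finds both the regular and the content attribute, which is the first advantage mentioned above.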

The second translation makes the distinction between attributes and contents explicit by wrapping both. The attributes will be put under <tag:attrs> and the contents under <tag:contents>. Like in the internal representation, content attributes are copied over to <tag:attrs> as well, to make the way to access attributes consistent for all of them. Within the contents, content attributes will be wrapped by <tag:attr>, since they still need to be differentiated from regular tags. The pros of this approach are that 1) it matches the internal representation, and 2) there is no poisoning of the contents by attributes. The con is that it complicates queries, because e.g. XPaths would look like /foo/tag:contents/bar. An example:

{foo:}
{@bar:} test
  {@baz:} test

Would result in the following XML (if pretty printed):

<foo>
  <tag:attrs>
    <bar>test</bar>
    <baz>test</baz>
  </tag:attrs>
  <tag:contents>
    <tag:attr>
      <baz>test</baz>
    </tag:attr>
  </tag:contents>
</foo>
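The second translation can be sketched in the same way. Again, Tag, to_xml_second and the namespace URI are hypothetical names for this illustration. Following the prose description, content attributes are copied into <tag:attrs>. The sketch also assumes that the two wrappers are only emitted for tags that have at least one attribute, which is why bar and baz themselves come out unwrapped.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Union

TAG_NS = "urn:example:tagdown"  # hypothetical URI; only the "tag" prefix comes from the text
ET.register_namespace("tag", TAG_NS)


def q(local: str) -> str:
    """Qualified name in the tag namespace, in ElementTree's {uri}local notation."""
    return "{" + TAG_NS + "}" + local


@dataclass
class Tag:
    name: str
    attrs: List["Tag"] = field(default_factory=list)
    content: List[Union[str, "Tag"]] = field(default_factory=list)
    is_attr: bool = False


def _append_text(parent: ET.Element, text: str) -> None:
    """ElementTree stores mixed content as .text on the parent and .tail on children."""
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def to_xml_second(tag: Tag) -> ET.Element:
    el = ET.Element(tag.name)
    content_attrs = [c for c in tag.content if isinstance(c, Tag) and c.is_attr]
    all_attrs = [*tag.attrs, *content_attrs]
    if all_attrs:
        # All attributes, content attributes included, are listed under <tag:attrs>.
        attrs_el = ET.SubElement(el, q("attrs"))
        for attr in all_attrs:
            attrs_el.append(to_xml_second(attr))
        target = ET.SubElement(el, q("contents"))
    else:
        # Assumption of this sketch: no wrappers for tags without attributes.
        target = el
    for item in tag.content:
        if isinstance(item, str):
            _append_text(target, item)
        elif item.is_attr:
            # Inside the contents a content attribute is still marked by <tag:attr>.
            ET.SubElement(target, q("attr")).append(to_xml_second(item))
        else:
            target.append(to_xml_second(item))
    return el


doc = Tag(
    "foo",
    attrs=[Tag("bar", content=["test"], is_attr=True)],
    content=[Tag("baz", content=["test"], is_attr=True)],
)
root = to_xml_second(doc)
```

With this layout every attribute of foo can be read from the first child of the root element, while the second child holds only what was written as contents.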

A variation of the second translation would be to not copy content attributes over to <tag:attrs> and to wrap all attributes in <tag:attr>. That way content attributes would not need to be duplicated, since querying for tag:attr suffices to find all attributes. However, since all attributes are then wrapped in <tag:attr>, accessing the attributes under <tag:attrs> is complicated by an extra wrapper. Overall this would probably be more complicated than the original second translation, and it no longer matches the internal representation, so we will be using the original second translation instead.

Another variation of the second translation would be to not wrap the contents in <tag:contents>, but similarly to the first translation this would poison the contents with the attributes, in this case <tag:attrs>.

It is important to note that the translations to XML will not be lossless. Requiring them to be lossless would limit Tagdown to what XML can express, e.g. what is allowed in a tag name would be dictated by XML's naming rules for XML tags.

Syntax

The syntax will be decided following the design goals mentioned in the introduction. Those that are especially relevant for the syntax are consistency and being as minimal as possible as long as this does not conflict with ease of use.

To denote the parse results we will be using a language that is somewhat like JavaScript or JSON. Text is represented as string literals (e.g. "test"), tags are represented by object literals (e.g. {foo} and {foo: "test"}), sequences do not have delimiters (e.g. {foo} "test"), and attributes start with @ and are put in a list literal (again without sequence delimiters) as the very first content item (e.g. {foo: [{@bar} {@baz}]}).
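To make the notation concrete, here is a small sketch of one possible in-memory shape for parse results together with a function that prints it in the notation just described. The Tag class and the notate function are hypothetical and only serve this illustration; strings are quoted with json.dumps so that newlines show up as \n.

```python
import json
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Tag:
    name: str
    attrs: List["Tag"] = field(default_factory=list)
    content: List[Union[str, "Tag"]] = field(default_factory=list)
    is_attr: bool = False


def notate(node: Union[str, Tag]) -> str:
    """Render a node in the JSON-like notation used for parse results."""
    if isinstance(node, str):
        return json.dumps(node)  # text is shown as a string literal
    name = ("@" if node.is_attr else "") + node.name
    parts = []
    if node.attrs:
        # Attributes come first, in a list literal without sequence delimiters.
        parts.append("[" + " ".join(notate(a) for a in node.attrs) + "]")
    parts.extend(notate(c) for c in node.content)
    if not parts:
        return "{" + name + "}"
    return "{" + name + ": " + " ".join(parts) + "}"
```

For instance, a foo tag with the attributes bar and baz and no further content is printed as {foo: [{@bar} {@baz}]}.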

Text and tags

Considering the language is mixed content of text and tags, we will need to be able to differentiate between the two. Following minimalism we could go for a single delimiter to delimit text and tags. However, when we consider that tags can be nested, it would become hard to read with just a single delimiter. So regardless of whether it is possible to express nested tags properly with just a single delimiter, it would be best to use separate delimiters for starting and closing a tag. This is thus an example of ease of use over absolute minimalism.

Considering ease of use and the requirement that a lightweight markup language be usable in a generic code editor, we ought to limit the choices for the delimiters to those commonly found on keyboards:

  • ( and ): Parentheses (i.e. round brackets) are commonly used within regular text, and taking into account that it is the goal of these delimiters to differentiate between text and tags, they would make for horrible delimiters.
  • [ and ]: Outside their use to denote references, square brackets are not that common within regular text. One of their benefits is being the only ones that do not require Shift to get access to them on common keyboard layouts.
  • { and }: Curly brackets are rarely found within regular text, making them good candidates in that regard. Considering they would be used to mark text, you would end up with something like {tag: text}, which matches well with commonly found syntax within programming languages.
  • < and >: Angle brackets are rarely found within regular text to bracket things, but more commonly to indicate less-than and greater-than, so this could lead to ambiguity if they were to be used as delimiters.

For the reasons listed under each of the delimiter candidates, { and } have the most merits, so they will be the delimiters used to denote the boundaries of tags. Their shape also matches how tags will be used within the rich-text editor for the language, namely that they represent the source code and will be replaced with their rich-text representation once done editing, as if the representation is injected to replace the source code.

This is one of the reasons not to reuse the concept of opening and closing tags found in XML: that would turn the tags themselves into the delimiters and would make the intuition behind the shape of curly brackets moot. Another reason is that LMLs intended to replace XML or HTML do not have opening and closing tags in their syntax either.

Contents

Within a tag there is a need to differentiate between the tag name and its contents, for which the indicator : is used. The only commonly found alternative for this type of indicator would be =, but : is more commonly used together with { and }. It is also slightly more minimalistic, because : is conventionally only followed by a space, while = is surrounded by a space on both sides.

Nesting

The problem with the use of any nested grouping is that you can end up with a lot of ending delimiters together, e.g. }}}}, which is bad for ease of use, because it becomes harder to read and easier to add one too few or too many. That is why besides inline tags, there are also line and block tags.

Line tags are just like inline tags, but they close their tag prematurely (to prevent the dangling }) and their contents will be considered everything after a space (for readability and consistency with : ) and until the end of the line. For example:

{foo:} {bar:} test

Parses as: {foo: {bar: "test"}}.

Block tags also close their tag prematurely, but their contents start on a new line (i.e. not after a space) and use indentation as a continuation delimiter (i.e. the contents continue as long as the indentation is present). Their contents are everything until the end of the lines that are prepended with the indentation. The minimal indentation used tends to be 2 spaces: a single space is constantly used within text, so it is not noticeable enough and therefore not very readable. In keeping with minimalism we will hence be using 2 spaces for indentation. For example:

{foo:}
  {bar:}
    test

Parses as: {foo: {bar: "test"}}.
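The difference between line and block tags can be sketched with a small line-based parser. This is only an illustration under simplifying assumptions: it handles tags that close early ({name:}) and nothing else (no inline tags, attributes, literals or escapes), an empty line ends a block, and a tag is represented as a (name, children) tuple. The names parse, INDENT and the helper functions are made up for this sketch.

```python
import re
from typing import List, Tuple, Union

Node = Union[str, Tuple[str, list]]  # text, or (tag name, children)

INDENT = "  "  # two spaces, as chosen in the text
_OPEN = re.compile(r"\{([^{}:\\]+):\}")  # a tag that closes early: {name:}


def _add(nodes: List[Node], node: Node) -> None:
    """Append a node, merging adjacent text and skipping empty text."""
    if isinstance(node, str):
        if not node:
            return
        if nodes and isinstance(nodes[-1], str):
            nodes[-1] += node
            return
    nodes.append(node)


def _parse_line(rest: str, lines: List[str], i: int):
    """Parse the rest of a line; return (nodes, next line index, ended_in_tag)."""
    nodes: List[Node] = []
    m = _OPEN.search(rest)
    if m is None:
        _add(nodes, rest)
        return nodes, i, False
    before, after = rest[:m.start()], rest[m.end():]
    _add(nodes, before)
    if after == "":
        # Block tag: contents are the following lines that carry the indentation.
        block = []
        while i < len(lines) and lines[i].startswith(INDENT):
            block.append(lines[i][len(INDENT):])
            i += 1
        _add(nodes, (m.group(1), _parse_block(block)))
        return nodes, i, True
    if after.startswith(" "):
        # Line tag: contents are everything after the space until the end of the line.
        children, i, _ = _parse_line(after[1:], lines, i)
        _add(nodes, (m.group(1), children))
        return nodes, i, True
    # Anything else is outside this sketch; keep the opener as plain text.
    _add(nodes, m.group(0))
    tail, i, ended = _parse_line(after, lines, i)
    for node in tail:
        _add(nodes, node)
    return nodes, i, ended


def _parse_block(lines: List[str]) -> List[Node]:
    nodes: List[Node] = []
    i = 0
    ended_in_tag = True  # no newline is owed before the first line
    while i < len(lines):
        if not ended_in_tag:
            _add(nodes, "\n")  # a newline between two lines of text is part of the text
        line_nodes, i, ended_in_tag = _parse_line(lines[i], lines, i + 1)
        for node in line_nodes:
            _add(nodes, node)
    return nodes


def parse(src: str) -> List[Node]:
    return _parse_block(src.split("\n"))
```

Both examples above come out as a foo tag containing a bar tag containing the text test. A line tag whose contents end in a block tag, such as {foo:} {bar:} followed by an indented line, is handled by the same recursion.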

The tag ending delimiters that determine the contents of an inline tag are ignored within line and block tag contents. Otherwise the semantics would become more complicated, since a } could then also terminate line and block tag contents, but only if its parent was an open inline tag, which would go against the language being context independent. For example:

{foo: {bar:} }
}

Parses as: {foo: {bar: "}"}}

For both line and block contents, the last newline is not considered part of the text but part of the syntax. Otherwise it would be impossible to put a multiline tag within a line. This is very relevant to the way tags can be rendered as rich-text widgets within the custom rich-text editor for Tagdown.

Literals

It is not possible to implement literals by printing back a tag after parsing because:

  1. the language is not lossless with regards to the syntax
  2. whether contents should be taken literally is a property of the contents, not of a tag or something else
  3. there are edge cases leading to literals not holding the same value compared to it being builtin

To indicate literals, a single quote ' is used after :, i.e. :'. Instead of ' another option might have been \, but the use case of literals does not match up well with escaping: escaping is used to escape from the usual way something should be interpreted, while literals are just an alternative way of interpretation. It would also be ambiguous, since {foo:\} could then be a literal, but could also be an attempt to escape }. Using an extra indicator character after the one normally used to indicate contents, rather than a separate indicator for literal contents, results in one less special character to consider in tag names.

An example of an edge case where literals by printing the parse result will not work:

{foo:} {bar:}
  test

Parses as: {foo: {bar: "test"}}, but in comparison:

{foo:'} {bar:}
  test

Parses as: {foo: "{bar:}"} " test".

If we were to choose not to ignore tag ending delimiters within line and block contents, there would be two additional edge cases.

For line contents:

{foo: {bar:} }
}

Parses as: {foo: {bar: ""}} "}", but in comparison:

{foo: {bar:'} }
}

Parses as: {foo: {bar: "}"}}.

For block contents:

{foo: {bar:}
  }
}

Parses as: {foo: {bar: ""}} "}", but in comparison:

{foo: {bar:'}
  }
}

Parses as: {foo: {bar: "}"}}.

Tag names

Tag names have no limitations put on them, except for the indicators {, }, :, and \, but those can be escaped with \, like \{.
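A possible way to apply this rule when printing and reading tag names is sketched below. The function names escape_tag_name and unescape_tag_name are hypothetical, and the set of special characters is exactly the four indicators listed above.

```python
import re

# The four indicators named in the text: {, }, : and the escape character itself.
_SPECIAL = re.compile(r"[{}:\\]")
_ESCAPED = re.compile(r"\\([{}:\\])")


def escape_tag_name(name: str) -> str:
    """Prefix every indicator in a tag name with a backslash."""
    return _SPECIAL.sub(lambda m: "\\" + m.group(0), name)


def unescape_tag_name(escaped: str) -> str:
    """Undo escape_tag_name by dropping the backslash in front of an indicator."""
    return _ESCAPED.sub(lambda m: m.group(1), escaped)
```

Because the escape character is itself escaped, unescaping an escaped name always gives back the original name.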

Attributes

Attributes are tags, but not all tags are attributes, so we need to indicate whether a tag is an attribute. The indicator @ will be used for this, because it stands for "at", which is what "attribute" starts with, and it matches the idea that we want to add something at something else. This indicator needs to be within the tag, otherwise we would lose the simplicity of differentiating between text and tags.

As mentioned under language concepts, attributes can be placed as part of a tag or as part of its contents. When they are part of the tag, we call them inline attributes, because they are inline together with their parent tag, but also because they can only be inline tags: line and block tags would force the parent tag to be a block tag. For example:

{foo{@bar}}

Parses as: {foo: [{@bar}]}.

When an attribute is part of the content it would be for example:

{foo{@bar}: {@baz}}

Parses as: {foo: [{@bar} {@baz}] {@baz}}. The content attributes are thus appended to the end of the list of attributes. This makes it possible to consistently work on all attributes within the internal representation.
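How content attributes end up in the attribute list can be sketched as follows. The Tag class and the make_tag helper are hypothetical, and the sketch assumes that only attributes written directly in the contents of a tag (not those nested deeper) count as its content attributes.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Tag:
    name: str
    attrs: List["Tag"] = field(default_factory=list)
    content: List[Union[str, "Tag"]] = field(default_factory=list)
    is_attr: bool = False


def make_tag(name: str, inline_attrs: List[Tag], content: List[Union[str, Tag]]) -> Tag:
    """Build a tag whose attribute list holds inline attributes first, then content attributes."""
    content_attrs = [n for n in content if isinstance(n, Tag) and n.is_attr]
    # The content attributes stay in the contents and are also appended to the attributes.
    return Tag(name, attrs=[*inline_attrs, *content_attrs], content=list(content))


bar = Tag("bar", is_attr=True)
baz = Tag("baz", is_attr=True)
foo = make_tag("foo", [bar], [baz])
```

Code that works on attributes can then iterate over foo.attrs without caring whether an attribute was written inline or in the contents.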

As mentioned, a block tag is the only form in which a tag can contain multiline attributes. For example:

{foo:}
{@bar:} test
  test

Parses as: {foo: [{@bar: "test"}] "test"}.

The reason for having attribute tags on the same level as the parent tag is that it would otherwise become ambiguous:

{code:'}
  {@bar:} test
  test

Is {@bar:} test part of the literal contents or is it an attribute of code? This could be solved by always having the contents indented one level more than the block tag attributes, but we want to prevent having to use 2 indentations for every block tag's contents.

We also considered using something like:

{foo:}
@{bar:} test
  test

However, this conflicts with everything written being considered intentional: if due to some mistake foo fails to parse, the meaning of bar would change. Rather than an attribute, it would become a content tag.

Escaped tags

TODO: {\foo} or \{foo} both are ambiguous, because {\:} and what if you want to just escape { and not the whole tag?

To make it easy to disable the interpretation of a tag for the time being, i.e. commenting the whole thing out, it is possible to escape a tag as a whole, e.g. {\foo}. It will be parsed as normal, but will be marked as being escaped. It would not be correct to turn the escaped tag into text, because this would conflict with being syntax agnostic, for it would still need to be possible to parse it back into a tag again, so it still needs to be syntactically a correct tag, which is something we do not want to guarantee in the language, to make sure the internal representation remains future proof.

Parse failures

If a tag fails to parse correctly, it will be considered text. We cannot know whether the intention was that it should have been a tag, since maybe it was supposed to be text. For that reason all attributes that are valid as part of the failed tag will just be considered content attributes instead. This is why content attributes do not have to be differentiated from regular attributes.

Edge case (block tag with a block attribute as its last argument)

When a block tag has a block attribute as its last argument, it can become ambiguous to which of the two the indentation-delimited content should belong. For example:

{foo:}
@{bar:}
  test
  test

Introducing additional indentation to help disambiguate this situation is not going to help, because what if there is at least one additional line of contents that happens to have that same additional indentation?

If the extra indentation is added to the block attribute contents we would have a situation like this:

{foo:}
@{bar:}
    test
    test
  test

Which could be interpreted as foo having "test" and bar having "test\ntest", or less likely, as foo having " test\ntest" and bar having "test", but it should be possible to express both interpretations.

If instead the extra indentation is added to the block tag contents:

{foo:}
@{bar:}
  test
    test
    test

Then it could be interpreted as foo having "test\ntest" and bar having "test", or less likely, as foo having "test" and bar having "test\n test", but it should be possible to express both interpretations.

The convenience of using indentation as a continuation delimiter is a must when considering ease of use, because the language needs to be easy to work with in its raw form in a generic text editor, and most editors, code editors at least, have builtin support for indentation. It thus makes it really easy to add some content to a block tag by indenting it to the required level.

If changing the continuation delimiter is not possible, then an indicator needs to be introduced to indicate at what point either the block attribute contents ends or the block tag contents starts. For example if we were to do this:

{foo
@{bar:}
  test
:}
  test
  test

There would be no ambiguity. However the reason to have line and block contents is to prevent having to split a tag over multiple lines, which this would reintroduce, just to cover an edge case. For example:

{foo
@{bar:} test
:}
  test
  test

Would not require any extra indicator to help disambiguate it, but for consistency's sake it would still be required.

Instead of splitting the tag over multiple lines as a means to indicate the boundaries of the contents between the tag and the attribute, another solution would be to introduce an indicator solely for this purpose. However, this problem is also present in another edge case, whose solution is also applicable here, so for consistency's sake, see the edge case of having indentation that should be part of the text right after a block tag.

Edge case (matching indentation that should be part of the text after a block tag)

It should be possible to express {foo: "test"} " test". For example:

{foo:}
  test
  test

However this would be parsed as {foo: "test\ntest"} instead.

We thus need to be able to either delimit the end of block contents or escape the indentation. Although introducing another escape sequence is also an addition, introducing a new delimiter altogether would go against the design goal of minimalism, especially since it would be delimiting text, whose distinction from tags we want to keep as clear and simple as possible.

The escape sequence we will be using to indicate the indentation is no longer meaningful is \ . For example:

{foo:}
  test
\ test

Parses as: {foo: "test"} " test"

It being an escape sequence like the others also implies that it can be itself escaped by \\ .

The reason to use \ is to indicate that we want to escape the indentation. If the escape character \ was put at the end, it would lead to confusion, because you expect the thing to be escaped to follow right after it. For example:

{foo:}
  test
 \{bar}

Could be parsed correctly, but to the reader it would still look like {bar} is being escaped.

Putting it consistently at the very start would make it annoying to change the indentation level. For example, going from:

{foo:}
  test
\ test

To the following:

{bar:}
  {foo:}
    test
\   test

Would not be a matter of indenting the whole; it requires the writer to move the escape character on that line back to the very start.

There is also the matter of ambiguity when multiple levels of nesting are involved:

{foo:}
  {bar:}
    test
    \ test

Would this make the last test part of foo or top level?

Also, what if a literal is involved:

{foo:'}
  test
  \ test

Would \ be taken literally or escaped?

To resolve these two issues, the escape sequence is required to line up with the nesting level you want to break into. This still keeps the nice property of not needing to correct the escape sequence after indenting the whole further. When the escape sequence appears any deeper than the maximum nesting level, or after a previous escape sequence, it is regarded as a literal \ . For example:

{foo:}
  {bar:}
    {baz:}
      test
\ \ \ test

Parses to: {foo: {bar: {baz: "test"}}} " \ \ test".

The ambiguity between foo and the top level from before would then be resolved with for foo:

{foo:}
  {bar:}
    test
  \   test

Parses to: {foo: {bar: "test"} " test"}.

And for the top level:

{foo:}
  {bar:}
    test
\     test

Parses to: {foo: {bar: "test"}} " test".

The literal from before would also no longer be ambiguous:

{foo:'}
  test
  \ test

Parses to: {foo: "test\n\ test"}.

And for the top level:

{foo:'}
  test
\   test

Parses to: {foo: "test"} " test".
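The alignment rule for the indentation escape can be summarized in a few lines of code. The sketch below only decides which nesting level a line breaks into; it deliberately does not compute the remaining text. The function name escape_target is hypothetical, and depth is assumed to be the number of block tags that are open when the line is read.

```python
from typing import Optional

INDENT = "  "    # one indentation level, two spaces
ESCAPE = "\\ "   # the escape sequence: a backslash followed by a space


def escape_target(line: str, depth: int) -> Optional[int]:
    """Return the nesting level the line escapes into, or None if it carries no effective escape.

    Level 0 is the top level. An escape sequence only counts when it lines up
    with a level shallower than the current depth; anything deeper is literal.
    """
    for level in range(depth):
        prefix = INDENT * level
        if line.startswith(prefix + ESCAPE):
            return level
        if not line.startswith(prefix + INDENT):
            return None  # the line is not indented far enough to escape any deeper level
    return None
```

Only the first aligned escape sequence is considered, so any later backslashes on the same line remain ordinary text, in line with the rule that an escape sequence after a previous one is literal.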

Also, when you want to indent multiple lines at the same level, correcting the line that follows the line containing the escape character is relatively easy, as indentation is supported in most generic editors. For example, where | is the cursor:

{foo:}
  test
\ test
|

Is the situation after Enter, but is easily resolved by indenting once again:

{foo:}
  test
\ test
  |

In this case it was a matter of a single indentation, but depending on the nesting level you want to escape into, it might require multiple indentations. If we were to put the escape sequence always at the end, we could guarantee only needing a single indentation, but without aligning it at the indentation we want to break into, it would cause all ambiguities mentioned before (e.g. what level, when there are multiple levels involved).

Overview

The discussion about the syntax is spread over several sections, so here is an overview of the overall syntax for clarity's sake.

Tagdown consists of text and tags. Text is everything except for the escape character \ and the tag delimiters { and } when they form a valid tag. The characters \, {, and } can be escaped by prepending the escape character, e.g. \{. An indentation level can be escaped by replacing the initial indentation with \ such that it lines up with the indentation level you want to escape into. A tag can also be escaped by

TODO: What about indentation, does it match with the language properties we stated, and if so, why?

Custom rich-text editor

A rich-text editor for an LML is almost a contradiction, since the goal of an LML is that it is still easy to work with in its raw format within generic text editors. Although Tagdown is indeed designed to be workable in its raw format, we will be designing a rich-text editor that uses a hybrid editing approach. By default it will work as an editor specialized to support Tagdown as a language, almost like a code editor (e.g. with autocomplete and help with the syntax), but whenever a tag has a rich-text representation defined, it will automatically be converted to this representation when you are finished editing its raw format. For example, a tag about an appointment could produce a link to a calendar. It should be possible to easily switch between the raw text view and the rich-text view.

Creating a rich-text editor is hard work, so we will be leveraging ProseMirror, a rich-text editor toolkit for the web.

Personal knowledge base (PKB)

XQuery BaseX GraphQL PostgreSQL Datomic
