Skip to content

Instantly share code, notes, and snippets.

@rain-1
Last active September 22, 2023 21:13
Show Gist options
  • Star 9 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
  • Save rain-1/a253e47b939fc0769524d8716541c96e to your computer and use it in GitHub Desktop.
Save rain-1/a253e47b939fc0769524d8716541c96e to your computer and use it in GitHub Desktop.
Dotted Canonical S-expressions - DCSexps
#lang racket
;; printing s-exps as DCS and TDCS, plus examples of what DCS and TDCS look like
(define (dcs l)
(cond ((pair? l)
(begin
(display ".")
(dcs (car l))
(dcs (cdr l))))
((null? l)
(display "0:"))
((symbol? l)
(display (string-length (symbol->string l)))
(display ":")
(display l))))
(define (tdcs l)
(cond ((pair? l)
(begin
(display ".")
(tdcs (car l))
(tdcs (cdr l))))
((null? l)
(display "Z0:"))
((symbol? l)
(display "A")
(display (string-length (symbol->string l)))
(display ":")
(display l))
((string? l)
(display "S")
(display (string-length l))
(display ":")
(display l))
))
(dcs '(function T get_x () (body (return (const 3)))))
(newline)
;; .8:function.1:T.5:get_x.0:..4:body..6:return..5:const.0:0:0:0:
(tdcs '(function T get_x () (body (return (const 3)))))
(newline)
;; .A8:function.A1:T.A5:get_x.Z0:..A4:body..A6:return..A5:const.Z0:Z0:Z0:Z0:
(tdcs '(var-decl string (foo "")))
(newline)
;; .A8:var-decl.A6:string..A3:foo.S0:Z0:Z0:

Dotted Canonical S-expressions - DCSexps

This is a specification for an s-expression interchange format that attempts to improve upon [2]rivest's canonical s-expressions.

It is an output format for a subset of s-expressions. Those containing only pairs and atoms.

It was designed with the following desirable properties in mind:

  • It has the canonicity property that (EQUAL? A B) implies the DCS output of A is byte equal to the DCS output of B.
  • It has the non-escaping property that arbitrary binary blobs can be contained as atoms without any processing. A consequence of this is that dcsexps can be nested easily.
  • Simple to parse: It is much simpler to parse compared to rivest's canonical s-expressions because we use . instead of ( and ).

The empty symbol (length 0) may be used as a stand-in for ().

<DCS> ::= <length> ':' <data[length]>
        | '.' <DCS> <DCS>

Rationale:

Why would you use this instead of regular s-expressions with the WRITE feature (that you could in theory turn off indentation, pretty printing to produce a function with the canonicity property)?

The value of this over that is that it is much more efficient in machine to machine interchange. For example between a web server and client.

Why would you use this instead of rivest canonical s-exps? It has a much simpler specification and the parsing algorithm is a fraction of the complexity.

Disadvantages

This format is very raw, it only has pairs and atoms. We may need more data types. For that we can use tagged canonical s-exps.

<TCS> ::= <tag> <length> ':' <data[length]>
        | '.' <TCS> <TCS>

<tag> ::= 'A'    ;; Atom
        | 'S'    ;; String
        | 'N'    ;; Number
        | 'C'    ;; Character: content must have length one.
        | 'B'    ;; Boolean: content must be 't' or 'f'
        | 'Z'    ;; nil (): content must have length 0

tag is a single character that explains which type to interpret the content of the atom as.

Related work

  • messagepack - "It's like JSON but fast an small"
  • bencode - netstring based format that can encode lists, used in bittorrent.
  • flatbuffers - interchange without any parsing
@alaricsp
Copy link

The lengths are redundant in several cases in TCS, might I suggest changing the format to:

<TCS> ::= 'A' <length> ':' <data[length]> ;; Atom
        | 'S' <length> `:` <data[length]> ;; String
        | 'N' <length> `:` <data[length]> ;; Number
        | 'C' <data[1]> ;; Character
        | '.' <TCD> <TCD> ;; Pair
        | 'V' <length> ':' <TCD*length> ;; Vector
        | 'T' ;; Boolean true
        | 'F' ;; Boolean false
        | 'Z' ;; nil ()

The syntax is a little less regular, but we already had an atom/pair split anyway. I've taken the liberty to also support vectors... SRFI-4 homogenous vectors might also be a useful extension, using direct binary representations of their fixed-element-width contents.

@rain-1
Copy link
Author

rain-1 commented Sep 17, 2019

Thanks for the comment.

Good idea to support vectors.

I would like to support them using '#' because we write vectors like this #().

   <TCS> ::= <tag> <length> ':' <data[length]>
           | '.' <TCS> <TCS>
           | '#' n <TCS>^n

How should we do hash tables?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment