@isaacs
Last active March 25, 2017 22:15

JSONar

A highly compressible, appendable, indexed, fast-parsing, flexible, extensible, human-debuggable, machine-verifiable, tamper-resistant archive format.

JSONar takes some of the best parts of tar, without also being a forensic history of computing.

Entries

There are two types of records in JSONar: entries and indexes.

Subsequent entries with the same path value as previous entries override those previous entries.

The structure of an entry is:

  1. intro - the ASCII string ">JSONar\n" (8 bytes)
  2. pathLen - 4 bytes - size of the path name as an unsigned 32-bit big-endian integer (UInt32BE)
  3. headerLen - 4 bytes - size of the header portion as UInt32BE
  4. bodyLen - 4 bytes - size of the body as UInt32BE
  5. \n (1 byte, value 0x0A)
  6. path - path name as a UTF-8 encoded string, pathLen bytes long
  7. \n (1 byte, value 0x0A)
  8. header - header as a JSON string, headerLen bytes long
  9. \n (1 byte, value 0x0A)
  10. body - body bytes, bodyLen in length
  11. \n (1 byte, value 0x0A)
  12. shasum - the 64-byte raw SHA-512 digest of parts 1 through 11
  13. \n (1 byte, value 0x0A)

All 13 parts are always present, but body and header can be 0 bytes.
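The 13-part layout above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `encode_entry` is invented here, and the header is assumed to be pre-serialized JSON bytes.

```python
import hashlib
import struct

def encode_entry(path: str, header: bytes, body: bytes) -> bytes:
    # Pack one entry in the 13-part layout described above.
    # `header` is pre-serialized JSON (may be b""); the shasum covers parts 1-11.
    path_bytes = path.encode("utf-8")
    out = b">JSONar\n"                                                   # 1. intro
    out += struct.pack(">III", len(path_bytes), len(header), len(body))  # 2-4. lengths
    out += b"\n" + path_bytes + b"\n" + header + b"\n" + body + b"\n"    # 5-11.
    digest = hashlib.sha512(out).digest()                                # 12. 64 raw bytes
    return out + digest + b"\n"                                          # 13.
```

Note that the digest is written as 64 raw bytes, not as a 128-character hex string, which matches the fixed 64-byte field the reading steps below consume.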

JSON Header

The JSON header should contain the following fields, but arbitrary additional data is allowed.

  • type - One of the following strings, indicating the type of file that the entry represents. The default is file.

    • file
    • directory - Entry body MAY contain directory listing
    • fifo
    • symboliclink - Entry body contains link target
    • link - Entry body contains link target
    • characterdevice
    • socket
    • tombstone - Explicitly removed from archive.
    • index - A JSONar index (see below)
  • dev - The device id of the file system entry

  • ino - The inode value of the file system entry

  • mode - The numeric mode (including suid and sticky bit)

  • nlink

  • uid, gid - User and group IDs of the file owner

  • rdev (optional for non-device files)

  • atime - access time (optional)

  • ctime - change time

  • mtime - modification time

  • birthtime - file creation time (optional)
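For illustration, a header for a regular file might look like the following. All values here are hypothetical, and the timestamps are shown as Unix epoch seconds even though the spec does not fix a timestamp representation:

```json
{
  "type": "file",
  "dev": 16777220,
  "ino": 48273,
  "mode": 33188,
  "nlink": 1,
  "uid": 501,
  "gid": 20,
  "mtime": 1489968000,
  "ctime": 1489968000
}
```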

Skipping an Entry

To skip over a record:

  1. Read the first 20 bytes. Assert that the first 8 bytes are the ASCII string ">JSONar\n". Interpret the next 12 bytes as 3 unsigned UInt32BE values (pathLen, headerLen, bodyLen).
  2. Add those three numbers, plus 5 for the \n delimiters, plus 64 for the sha512sum.
  3. Skip ahead that many bytes.
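Those three steps translate directly into code. A minimal sketch (the name `skip_entry` is invented here; `f` is any seekable binary file object):

```python
import struct

def skip_entry(f):
    # Advance the file object `f` past one entry without verifying it.
    head = f.read(20)
    assert head[:8] == b">JSONar\n", "not a JSONar entry"
    path_len, header_len, body_len = struct.unpack(">III", head[8:20])
    # three variable-length parts + five '\n' delimiters + 64-byte sha512 digest
    f.seek(path_len + header_len + body_len + 5 + 64, 1)
```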

Reading an Entry

To read a record securely from start to finish:

  1. Start a SHA-2 512 checksum stream.
  2. Read the first 8 bytes. Assert that they are the string ">JSONar\n".
  3. Read the next 12 bytes. Interpret these as 3 unsigned UInt32BE values. Assign them to pathLen, headerLen, and bodyLen, respectively.
  4. Write the 20 consumed bytes to the checksum stream.
  5. Read next byte. Assert it is '\n'. Write '\n' to checksum stream.
  6. Read pathLen bytes. Interpret as utf-8 string. This is the entry path. Write bytes to checksum stream.
  7. If the path is @JSONar Index, then skip to the next entry. (Indexes are only relevant in random access mode.)
  8. Read next byte. Assert it is '\n'. Write '\n' to checksum stream.
  9. Read headerLen bytes. Interpret as utf-8 string. This is the header JSON. Write bytes to checksum stream.
  10. Decode header JSON. This is the metadata. (If it does not parse as valid JSON, then skip to end of record, or abort entirely.)
  11. Consume next byte. Assert it is '\n'. Write '\n' to checksum stream.
  12. Consume bodyLen bytes. This is the entry body. Write each byte to checksum stream.
  13. Consume next byte. Assert it is '\n'. Write '\n' to checksum stream.
  14. Read the next 64 bytes. This is the expected checksum digest. End the checksum stream. Verify that the actual digest matches the expected digest.
  15. Consume next byte. Assert it is '\n'.
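The steps above can be sketched as a single Python function. This is illustrative only (the name `read_entry` is invented); note one judgment call: a zero-length header is treated as an empty metadata object, since the spec allows a 0-byte header but step 10 expects JSON.

```python
import hashlib
import json
import struct

def read_entry(f):
    # Read and verify one entry; returns (path, header_dict, body).
    # Raises ValueError on any framing or checksum mismatch.
    h = hashlib.sha512()
    head = f.read(20)
    if head[:8] != b">JSONar\n":
        raise ValueError("bad intro")
    path_len, header_len, body_len = struct.unpack(">III", head[8:20])
    h.update(head)

    def take(n):
        chunk = f.read(n)
        h.update(chunk)
        return chunk

    def delim():
        if take(1) != b"\n":
            raise ValueError("missing delimiter")

    delim()
    path = take(path_len).decode("utf-8")
    delim()
    header = json.loads(take(header_len) or b"{}")  # 0-byte header -> empty metadata
    delim()
    body = take(body_len)
    delim()
    if h.digest() != f.read(64):
        raise ValueError("checksum mismatch")
    if f.read(1) != b"\n":
        raise ValueError("missing trailing delimiter")
    return path, header, body
```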

Indexes

An index is a map from path names to positions within the file where the entry can be found.

An index is a special kind of entry where:

  1. The path field is @JSONar Index
  2. The bodyLen value is always 8.
  3. The header.type field is "index"
  4. The header.entries is an object which maps filenames to file offsets where the most recent entry for that pathname can be found.
  5. The body is 8 bytes indicating the file offset of the index as an unsigned 64-bit big-endian integer.
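A sketch of building the header and body for an index record (the helper name `make_index_parts` is hypothetical; the returned parts would then be framed as a normal entry with path @JSONar Index):

```python
import json
import struct

def make_index_parts(entries, index_offset):
    # Build the header JSON and 8-byte body for an index entry.
    # `entries` maps path -> offset of the most recent entry for that path;
    # `index_offset` is the file offset at which this index entry begins.
    header = json.dumps({"type": "index", "entries": entries}).encode("utf-8")
    body = struct.pack(">Q", index_offset)  # UInt64BE, always 8 bytes
    return header, body
```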

When writing a JSONar, an index should be written after entries are added.

When reading a JSONar file from disk, it is possible to seek throughout the file to access items randomly using the index.

To access files randomly in a JSONar file,

  1. Read the last 74 bytes of the file. The first 8 bytes are the file offset of the index as a UInt64BE, then a '\n', then the 64-byte sha512 checksum, then a final '\n'. If the delimiters aren't in the right places, give up.
  2. Seek back to the position indicated in the index body, and read to end of the record.
  3. Check the index checksum, verify that the path name is @JSONar Index, and pull out the header.entries object.
  4. At this point, file metadata and contents can be accessed by seeking to the appropriate point in the file and reading the entry.
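Since the index body is an 8-byte UInt64BE (as defined above), the tail of the file is a fixed 74 bytes: body (8) + '\n' (1) + sha512 digest (64) + '\n' (1). Locating the index from that trailer can be sketched as (the name `find_index_offset` is invented here):

```python
import struct

def find_index_offset(f):
    # Parse the 74-byte trailer of the final index entry:
    # 8-byte UInt64BE offset + '\n' + 64-byte sha512 digest + '\n'.
    f.seek(0, 2)
    size = f.tell()
    if size < 74:
        raise ValueError("file too small to end in an index")
    f.seek(size - 74)
    tail = f.read(74)
    if tail[8:9] != b"\n" or tail[73:74] != b"\n":
        raise ValueError("delimiters not where expected; give up")
    (offset,) = struct.unpack(">Q", tail[:8])
    return offset
```

From the returned offset, seek there, read the index entry, verify its checksum and path, and then use header.entries to seek to individual files.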

Credits, Improvements, and License

This is a bad idea and you should probably not implement this, except for fun.

As @tef points out in the comments below, negative file offsets relative to the start of the index (and the index body having a negative offset relative to the end of the index) is a better idea, because it means that archives can be concatenated, or garbage prepended to the start (thus supporting self-extraction).

To the extent that this is "software", you may use it under the following license:

The ISC License

Copyright (c) Isaac Z. Schlueter and Contributors

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR
IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
isaacs commented Mar 19, 2017

The name is obviously a huge problem. It didn't occur to me that JSONar is j-sonar, which is a bunch of things, some involving Java and/or sound navigation and ranging.

It's a working title.

@abritinthebay

JSAR? Maybe?


tef commented Mar 19, 2017

Here's some random and quick feedback

  • File paths are weird and special and you might want to also normalize the unicode, beyond just utf-8

  • Is \r\n valid? It should be. People do that.

  • Why use binary packing with endian problems when you can just use ascii digits "Length: 12345", and not encode any maximum lengths into your archive format?

  • Why hard code the checksum algorithm, and also why use SHA-2 and not SHA-3 (the latter is quite lovely for not being vulnerable to extension attacks)

  • One nice property of .gz files is that you can concat them together, but sadly tar won't let you.

With your end of directory format, you'd have to re-append the index each time you added a file. Random access to small archives is overrated. Random access to large archives usually involves storing index separately - especially if they're compressed afterwards.

(Index records are a complete pain in the ass and tend to get corrupt or miswritten or partially updated—ask anyone who has worked with PDF )

ps:

After working heavily with WARC files and ARC files, and a bunch of archiving, i'd lean towards something that doesn't mix binary and human encoding: something more like HTTP-Headers + JSON header values + chunked encoding

FILE /path/name <size of header+payload> <checksum of header+payload>
Content-Type: mime/header
atime: 12345
property: ["123", "456"]

<length> <hash>
<payload>

<length> <hash>
<payload>

0

i.e. you put the ` ' as the header line, each header can be a json-value, and the contents of the file can be split into blocks. With really large files, well, it's nice to have fixed blocks for checksums + maybe do a merkle tree

But the big thing is you should be able to cat a bunch of archives together and things still work. You can even compress them with gzip and it still works


tef commented Mar 19, 2017

here's a refinement of that sketch to try and preserve the properties i've been told:

Same ideas:

  • Start line that covers entire payload
  • using ascii instead of fixed width length encodings, human as much as possible
  • http like, chunked encoding for large files

with:

  • index record, including start ref
  • (optional) offsets in trailer to start of record
record :== record_header (payload_header)* newline (payload)* trailer

record_header :== record_type file_path record_length record_checksum newline
record_type :== "file" | "directory" | .... | "index"
file_path :== "..." # a utf-8, NFC encoded string

payload_header :== identifier ":" json_value newline

payload :== <length of chunk> chunk_checksum ":" newline raw_chunk newline

trailer :== "0" ":" <optional negative offset of record_header> newline newline

record_checksum :== <hash algorithm> ":" <hex digest of merkle hash of (record_header, chunk_checksums)>
chunk_checksum :== <hash algorithm> ":" <hex digest of hash of (raw_chunk)>

so a file might look like

FILE /readme 123 sha:8383838
Header: [1,2,3]

12 sha:292929292
Hello, World!

0: -298383

INDEX 2939 sha:238388338
Header:[1,2,3]

3838 sha:393993939
{type:"start", offset: -293}
{type:"file", path:"/readme" , offset:-293, length:123, checksum:"sha:3888383"} 
0: -64


dstufft commented Mar 19, 2017

Blake2 is likely better than any of the sha2 or sha3 family for this.

@aredridel

I was just thinking "make the hash replaceable" is a smart thing.

@tamzinblake

Yeah, the output could specify what hash it uses and the cli tool when writing could default to whatever is considered 'best' right now.
