@isaacs
Last active March 25, 2017 22:15

JSONar

A highly compressible, appendable, indexed, fast-parsing, flexible, extensible, human-debuggable, machine-verifiable, tamper-resistant archive format.

JSONar takes some of the best parts of tar, without also being a forensic history of computing.

Entries

There are two types of records in JSONar: entries and indexes.

Subsequent entries with the same path value as previous entries override those previous entries.

The structure of an entry is:

  1. intro - the ASCII string ">JSONar\n" (8 bytes)
  2. pathLen - 4 bytes - size of the path name as an unsigned 32-bit big-endian integer (UInt32BE)
  3. headerLen - 4 bytes - size of the header portion as UInt32BE
  4. bodyLen - 4 bytes - size of the body as UInt32BE
  5. \n (1 byte, value 0x0A)
  6. path - path name as a UTF-8 encoded string, pathLen bytes long
  7. \n (1 byte, value 0x0A)
  8. header - header as a JSON string, headerLen bytes long
  9. \n (1 byte, value 0x0A)
  10. body - body bytes, bodyLen in length
  11. \n (1 byte, value 0x0A)
  12. shasum - the 64-byte raw SHA-512 digest of parts 1 through 11
  13. \n (1 byte, value 0x0A)

All 13 parts are always present, but body and header can be 0 bytes.
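The 13-part layout above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `encode_entry` is invented here, and the header is assumed to be pre-serialized JSON bytes.

```python
import hashlib
import struct

def encode_entry(path: str, header: bytes, body: bytes) -> bytes:
    # Pack one entry in the 13-part layout described above.
    # `header` is pre-serialized JSON (may be b""); the shasum covers parts 1-11.
    path_bytes = path.encode("utf-8")
    out = b">JSONar\n"                                                   # 1. intro
    out += struct.pack(">III", len(path_bytes), len(header), len(body))  # 2-4. lengths
    out += b"\n" + path_bytes + b"\n" + header + b"\n" + body + b"\n"    # 5-11.
    digest = hashlib.sha512(out).digest()                                # 12. 64 raw bytes
    return out + digest + b"\n"                                          # 13.
```

Note that the digest is written as 64 raw bytes, not as a 128-character hex string, which matches the fixed 64-byte field the reading steps below consume.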

JSON Header

The JSON header should contain the following fields, but arbitrary additional data is allowed.

  • type - One of the following strings, indicating the type of file that the entry represents. The default is file.

    • file
    • directory - Entry body MAY contain directory listing
    • fifo
    • symboliclink - Entry body contains link target
    • link - Entry body contains link target
    • characterdevice
    • socket
    • tombstone - Explicitly removed from archive.
    • index - A JSONar index (see below)
  • dev - The device id of the file system entry

  • ino - The inode value of the file system entry

  • mode - The numeric mode (including suid and sticky bit)

  • nlink

  • uid, gid - User and group IDs of the file owner

  • rdev (optional for non-device files)

  • atime - access time (optional)

  • ctime - change time

  • mtime - modification time

  • birthtime - file creation time (optional)
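For illustration, a header for a regular file might look like the following. All values here are hypothetical, and the timestamps are shown as Unix epoch seconds even though the spec does not fix a timestamp representation:

```json
{
  "type": "file",
  "dev": 16777220,
  "ino": 48273,
  "mode": 33188,
  "nlink": 1,
  "uid": 501,
  "gid": 20,
  "mtime": 1489968000,
  "ctime": 1489968000
}
```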

Skipping an Entry

To skip over a record:

  1. Read the first 20 bytes. Assert that the first 8 bytes are the ASCII string ">JSONar\n". Interpret the next 12 bytes as 3 unsigned UInt32BE values (pathLen, headerLen, bodyLen).
  2. Add those three numbers, plus 5 for the \n delimiters, plus 64 for the sha512sum.
  3. Skip ahead that many bytes.
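Those three steps translate directly into code. A minimal sketch (the name `skip_entry` is invented here; `f` is any seekable binary file object):

```python
import struct

def skip_entry(f):
    # Advance the file object `f` past one entry without verifying it.
    head = f.read(20)
    assert head[:8] == b">JSONar\n", "not a JSONar entry"
    path_len, header_len, body_len = struct.unpack(">III", head[8:20])
    # three variable-length parts + five '\n' delimiters + 64-byte sha512 digest
    f.seek(path_len + header_len + body_len + 5 + 64, 1)
```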

Reading an Entry

To read a record securely from start to finish:

  1. Start a SHA-2 512 checksum stream.
  2. Read the first 8 bytes. Assert that they are the string ">JSONar\n".
  3. Read the next 12 bytes. Interpret these as 3 unsigned UInt32BE values. Assign them to pathLen, headerLen, and bodyLen, respectively.
  4. Write the 20 consumed bytes to the checksum stream.
  5. Read next byte. Assert it is '\n'. Write '\n' to checksum stream.
  6. Read pathLen bytes. Interpret as utf-8 string. This is the entry path. Write bytes to checksum stream.
  7. If the path is @JSONar Index, then skip to the next entry. (Indexes are only relevant in random access mode.)
  8. Read next byte. Assert it is '\n'. Write '\n' to checksum stream.
  9. Read headerLen bytes. Interpret as utf-8 string. This is the header JSON. Write bytes to checksum stream.
  10. Decode header JSON. This is the metadata. (If it does not parse as valid JSON, then skip to end of record, or abort entirely.)
  11. Consume next byte. Assert it is '\n'. Write '\n' to checksum stream.
  12. Consume bodyLen bytes. This is the entry body. Write each byte to checksum stream.
  13. Consume next byte. Assert it is '\n'. Write '\n' to checksum stream.
  14. Read the next 64 bytes. This is the expected checksum digest. End the checksum stream. Verify that the actual digest matches the expected digest.
  15. Consume next byte. Assert it is '\n'.
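The steps above can be sketched as a single Python function. This is illustrative only (the name `read_entry` is invented); note one judgment call: a zero-length header is treated as an empty metadata object, since the spec allows a 0-byte header but step 10 expects JSON.

```python
import hashlib
import json
import struct

def read_entry(f):
    # Read and verify one entry; returns (path, header_dict, body).
    # Raises ValueError on any framing or checksum mismatch.
    h = hashlib.sha512()
    head = f.read(20)
    if head[:8] != b">JSONar\n":
        raise ValueError("bad intro")
    path_len, header_len, body_len = struct.unpack(">III", head[8:20])
    h.update(head)

    def take(n):
        chunk = f.read(n)
        h.update(chunk)
        return chunk

    def delim():
        if take(1) != b"\n":
            raise ValueError("missing delimiter")

    delim()
    path = take(path_len).decode("utf-8")
    delim()
    header = json.loads(take(header_len) or b"{}")  # 0-byte header -> empty metadata
    delim()
    body = take(body_len)
    delim()
    if h.digest() != f.read(64):
        raise ValueError("checksum mismatch")
    if f.read(1) != b"\n":
        raise ValueError("missing trailing delimiter")
    return path, header, body
```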

Indexes

An index is a map from path names to positions within the file where the entry can be found.

An index is a special kind of entry where:

  1. The path field is @JSONar Index
  2. The bodyLen value is always 8.
  3. The header.type field is "index"
  4. The header.entries is an object which maps filenames to file offsets where the most recent entry for that pathname can be found.
  5. The body is 8 bytes indicating the file offset of the index as an unsigned 64-bit big-endian integer.
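A sketch of building the header and body for an index record (the helper name `make_index_parts` is hypothetical; the returned parts would then be framed as a normal entry with path @JSONar Index):

```python
import json
import struct

def make_index_parts(entries, index_offset):
    # Build the header JSON and 8-byte body for an index entry.
    # `entries` maps path -> offset of the most recent entry for that path;
    # `index_offset` is the file offset at which this index entry begins.
    header = json.dumps({"type": "index", "entries": entries}).encode("utf-8")
    body = struct.pack(">Q", index_offset)  # UInt64BE, always 8 bytes
    return header, body
```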

When writing a JSONar, an index should be written after entries are added.

When reading a JSONar file from disk, it is possible to seek throughout the file to access items randomly using the index.

To access files randomly in a JSONar file,

  1. Read the last 74 bytes of the file. The first 8 bytes are the file offset of the index as a UInt64BE, then a '\n', then the 64-byte sha512 checksum, then a final '\n'. If the delimiters aren't in the right places, give up.
  2. Seek back to the position indicated in the index body, and read to end of the record.
  3. Check the index checksum, verify that the path name is @JSONar Index, and pull out the header.entries object.
  4. At this point, file metadata and contents can be accessed by seeking to the appropriate point in the file and reading the entry.
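Since the index body is an 8-byte UInt64BE (as defined above), the tail of the file is a fixed 74 bytes: body (8) + '\n' (1) + sha512 digest (64) + '\n' (1). Locating the index from that trailer can be sketched as (the name `find_index_offset` is invented here):

```python
import struct

def find_index_offset(f):
    # Parse the 74-byte trailer of the final index entry:
    # 8-byte UInt64BE offset + '\n' + 64-byte sha512 digest + '\n'.
    f.seek(0, 2)
    size = f.tell()
    if size < 74:
        raise ValueError("file too small to end in an index")
    f.seek(size - 74)
    tail = f.read(74)
    if tail[8:9] != b"\n" or tail[73:74] != b"\n":
        raise ValueError("delimiters not where expected; give up")
    (offset,) = struct.unpack(">Q", tail[:8])
    return offset
```

From the returned offset, seek there, read the index entry, verify its checksum and path, and then use header.entries to seek to individual files.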

Credits, Improvements, and License

This is a bad idea and you should probably not implement this, except for fun.

As @tef points out in the comments below, negative file offsets relative to the start of the index (and the index body having a negative offset relative to the end of the index) is a better idea, because it means that archives can be concatenated, or garbage prepended to the start (thus supporting self-extraction).

To the extent that this is "software", you may use it under the following license:

The ISC License

Copyright (c) Isaac Z. Schlueter and Contributors

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR
IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
isaacs commented Mar 19, 2017

The name is obviously a huge problem. It didn't occur to me that JSONar is j-sonar, which is a bunch of things, some involving Java and/or sound navigation and ranging.

It's a working title.

@abritinthebay

JSAR? Maybe?


tef commented Mar 19, 2017

Here's some random and quick feedback

  • File paths are weird and special and you might want to also normalize the unicode, beyond just utf-8

  • Is \r\n valid? It should be. People do that.

  • Why use binary packing with endian problems when you can just use ascii digits "Length: 12345", and not encode any maximum lengths into your archive format?

  • Why hard code the checksum algorithm, and also why use SHA-2 and not SHA-3 (the latter is quite lovely for not being vulnerable to extension attacks)

  • One nice property of .gz files is that you can concat them together, but sadly tar won't let you.

With your end of directory format, you'd have to re-append the index each time you added a file. Random access to small archives is overrated. Random access to large archives usually involves storing index separately - especially if they're compressed afterwards.

(Index records are a complete pain in the ass and tend to get corrupt or miswritten or partially updated—ask anyone who has worked with PDF )

ps:

After working heavily with WARC files and ARC files, and a bunch of archiving, i'd lean towards something that doesn't mix binary and human encoding: something more like HTTP-Headers + JSON header values + chunked encoding

FILE /path/name <size of header+payload> <checksum of header+payload>
Content-Type: mime/header
atime: 12345
property: ["123", "456"]

<length> <hash>
<payload>

<length> <hash>
<payload>

0

i.e. you put the ` ' as the header line, each header can be a json-value, and the contents of the file can be split into blocks. With really large files, well, it's nice to have fixed blocks for checksums + maybe do a merkle tree

But the big thing is you should be able to cat a bunch of archives together and things still work. You can even compress them with gzip and it still works


tef commented Mar 19, 2017

here's a refinement of that sketch to try and preserve the properties i've been told:

Same ideas:

  • Start line that covers entire payload
  • using ascii instead of fixed width length encodings, human as much as possible
  • http like, chunked encoding for large files

with:

  • index record, including start ref
  • (optional) offsets in trailer to start of record
record :== record_header (payload_header)* newline (payload)* trailer

record_header :== record_type file_path record_length record_checksum newline
record_type :== "file" | "directory" | .... | "index"
file_path :== "..." # a utf-8, NFC encoded string

payload_header :== identifier ":" json_value newline

payload :== <length of chunk> chunk_checksum ":" newline raw_chunk newline

trailer :== "0" ":" <optional negative offset of record_header> newline newline

record_checksum :== <hash algorithm> ":" <hex digest of merkle hash of (record_header, chunk_checksums)>
chunk_checksum :== <hash algorithm> ":" <hex digest of hash of (raw_chunk)>

so a file might look like

FILE /readme 123 sha:8383838
Header: [1,2,3]

12 sha:292929292
Hello, World!

0: -298383

INDEX 2939 sha:238388338
Header:[1,2,3]

3838 sha:393993939
{type:"start", offset: -293}
{type:"file", path:"/readme" , offset:-293, length:123, checksum:"sha:3888383"} 
0: -64


dstufft commented Mar 19, 2017

Blake2 is likely better than any of the sha2 or sha3 family for this.

@aredridel

I was just thinking "make the hash replaceable" is a smart thing.

@tamzinblake

Yeah, the output could specify what hash it uses and the cli tool when writing could default to whatever is considered 'best' right now.
