@retroplasma · Last active Jul 29, 2019

Private Incremental File Storage on Usenet


This is just a concept. Don't spam the Usenet.

How Usenet works

Usenet is an old replicated network that is basically a giant forum. There are many providers that offer access to it. Clients use NNTP to read and write articles in newsgroups. A newsgroup is like a directory that contains a stream of articles that each have a unique message ID.

There are many NNTP commands. Here are the most important ones for our purpose (simplified syntax):

  • HEAD <message_id>: Returns header of article with message ID <message_id>. Can be used to check if article exists.
  • BODY <message_id>: Returns body of article with message ID <message_id>.
  • POST <headers> <content>: Posts article with <content>. The <headers> can contain a custom message ID.

(Commands for listing newsgroups, articles etc. and various optional extensions are not listed here.)

We can already see that these three commands behave like a key-value store. The problem, however, is that articles can't be changed once they are posted. Articles are also deleted after some years (retention depends on the provider), and some simply get lost along the way.
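To make the key-value analogy concrete, here is a minimal sketch with an in-memory stand-in for a news server (a real client would speak NNTP over a socket; the class and message IDs here are made up for illustration):

```python
class FakeNewsServer:
    """Maps message IDs to (headers, body), mimicking HEAD/BODY/POST."""
    def __init__(self):
        self.articles = {}

    def post(self, message_id, headers, body):
        # POST: write-once; message IDs must be unique network-wide
        if message_id in self.articles:
            raise ValueError("duplicate message ID")
        self.articles[message_id] = (headers, body)

    def head(self, message_id):
        # HEAD <message_id>: headers only; doubles as an existence check
        return self.articles[message_id][0] if message_id in self.articles else None

    def body(self, message_id):
        # BODY <message_id>: the article body, i.e. the stored value
        return self.articles[message_id][1] if message_id in self.articles else None

server = FakeNewsServer()
server.post("<abc123@example.com>", {"Subject": "part 1"}, b"payload")
assert server.head("<abc123@example.com>") is not None   # key exists
assert server.body("<abc123@example.com>") == b"payload" # read back the value
assert server.body("<missing@example.com>") is None      # key absent
```

Note the write-once semantics in `post`: they are exactly why the rest of this concept needs journals and counters instead of in-place updates.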

How binaries are commonly uploaded and downloaded

To upload files, a user typically archives them with WinRAR, splits the archive into parts and creates additional PAR files for error correction. If some parts are lost, the PAR files help to re-create them. For example, with 100 RAR parts and 10 PAR files, any 10 of those 110 files can be thrown away and the whole thing is still recoverable.
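The recovery idea can be shown with the simplest possible erasure code: a single XOR parity block can rebuild any one lost data block. PAR2 actually uses Reed-Solomon, which generalizes this to recovering as many blocks as there are parity blocks; this sketch only covers the one-block case:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parts = [b"AAAA", b"BBBB", b"CCCC"]   # stand-ins for equally sized RAR parts
parity = xor_blocks(parts)            # one "PAR"-style parity block

# Lose part 1, then rebuild it from the survivors plus the parity block:
recovered = xor_blocks([parts[0], parts[2], parity])
assert recovered == parts[1]
```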

Uploaders post those parts as articles (encoded in yEnc) and often indicate the name of the upload and a part number in the subject header. NZB files can be created (sometimes automatically) which contain a list of message IDs that make up an upload. The NZB is then used to download the parts.

People who want to backup their private files need to remember their encryption key and keep their NZBs somewhere. They could also use distinct subjects and search for them later but this might be impractical, prone to duplication and time-consuming. Also, they need to keep track of more information if new files are uploaded later.


We know that articles can always be retrieved by their message ID which is formatted like an email address. Any message ID can be chosen before posting an article, as long as it's unique and has this format:

Message-ID: <unique-string@example.com>

Now, the idea is essentially that a message ID and all encryption keys can be derived deterministically from a single (private) seed, similar to how deterministic Bitcoin wallets work. Instead of NZBs, an index of the files can be stored and accessed deterministically as well. When new data is added, a journal of changes is uploaded, and its message ID is derived from an incremented counter.

A user generates a random secret seed and three main keys are derived from it as follows:

  • EncryptionKey = Derive(seed, 1)
  • AuthenticationKey = Derive(seed, 2)
  • IndexKey = Derive(seed, 3)
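One way to realize Derive(seed, i) deterministically, assuming HMAC-SHA256 as the derivation function (the exact KDF is not specified here):

```python
import hashlib
import hmac

def derive(seed: bytes, index: int) -> bytes:
    # HMAC keyed with the seed, with the index as the message
    return hmac.new(seed, index.to_bytes(4, "big"), hashlib.sha256).digest()

seed = b"example-secret-seed"   # in practice: high-entropy random bytes

encryption_key     = derive(seed, 1)
authentication_key = derive(seed, 2)
index_key          = derive(seed, 3)

# Same seed always yields the same keys; different indices yield unrelated keys.
assert derive(seed, 1) == encryption_key
assert encryption_key != authentication_key
```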

There are four kinds of data that need to be uploaded:

  • JournalData
    • msg_id: HMAC(IndexKey, type(Journal) || replication || counter)
    • Points to uploaded MetaData, RawData, ParityData and indicates parity relations
  • MetaData
    • msg_id: HMAC(IndexKey, type(Meta) || replication || counter || meta_type(file|dir) || path)
    • Directory structure and file structure (links to RawData), encoded as diff commands.
  • RawData
    • msg_id: HMAC(IndexKey, type(RawOrParity) || replication || SHA(content))
    • File contents
  • ParityData
    • msg_id: HMAC(IndexKey, type(RawOrParity) || replication || SHA(content))
    • Additional data for error correction (Reed-Solomon) that helps reconstruct broken MetaData and RawData.
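A hedged sketch of the msg_id derivation for JournalData, assuming HMAC-SHA256, plain byte concatenation for "||", and a made-up domain for the ID suffix:

```python
import hashlib
import hmac

def journal_msg_id(index_key: bytes, replication: int, counter: int) -> str:
    # type(Journal) || replication || counter, concatenated as bytes
    material = b"Journal" + bytes([replication]) + counter.to_bytes(8, "big")
    digest = hmac.new(index_key, material, hashlib.sha256).hexdigest()
    return f"<{digest}@example.invalid>"   # message-ID format: <unique@domain>

index_key = b"\x00" * 32   # placeholder; really Derive(seed, 3)

# Deterministic: anyone holding IndexKey can recompute the same ID later,
# and each (replication, counter) pair maps to a distinct article.
assert journal_msg_id(index_key, 0, 1) == journal_msg_id(index_key, 0, 1)
assert journal_msg_id(index_key, 0, 1) != journal_msg_id(index_key, 0, 2)
```

The other three kinds work the same way, with their respective fields (path, SHA of content) fed into the HMAC.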

All data content is encrypted using EncryptionKey and authenticated using AuthenticationKey. ParityData corrects errors in MetaData and RawData. JournalData needs to be replicated by default because it does not have its own parity data. Other data can also be replicated, which helps it survive retention limits or upload failures.

Users only need to remember their initial seed. When they want to retrieve their files later or add something new, they start from that seed. The client then searches for available JournalData by looping through the counter and replication values, chaining the results and stopping once no more data is found (after a threshold constant). Since MetaData can be accessed via paths, it is possible to mount everything fairly efficiently as a virtual hard drive (e.g. via FUSE) without loading all directories.
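The discovery loop can be sketched as follows; `article_exists` stands in for a real HEAD request, and the threshold and replication limit are made-up constants:

```python
MISS_THRESHOLD = 3     # stop after this many consecutive missing counters
MAX_REPLICATION = 2    # how many replicas to probe per counter

def find_journal_entries(article_exists, make_msg_id):
    entries, counter, misses = [], 0, 0
    while misses < MISS_THRESHOLD:
        found = None
        for replication in range(MAX_REPLICATION):   # try each replica
            msg_id = make_msg_id(replication, counter)
            if article_exists(msg_id):
                found = msg_id
                break
        if found:
            entries.append(found)
            misses = 0
        else:
            misses += 1                              # tolerate small gaps
        counter += 1
    return entries

# Toy check against a fake server holding journals for counters 0, 1 and 3:
present = {(0, 0), (0, 1), (1, 3)}                   # (replication, counter)
ids = find_journal_entries(
    article_exists=lambda mid: mid in {f"<{r}-{c}@x>" for r, c in present},
    make_msg_id=lambda r, c: f"<{r}-{c}@x>",
)
assert ids == ["<0-0@x>", "<0-1@x>", "<1-3@x>"]
```

Counter 2 is missing entirely but counter 3 still survives via its replica, which is exactly why replication of JournalData matters.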

Future Work

The concept could be applied to public uploads as well; a seed could be shared with people. There should probably be some additional public key signature or at least some kind of "finalization article" (similar to this) so that other users can't append new data. Parts of private uploads could also be shared by creating new meta data for them and sharing appropriate keys.

In order to deal with finite retention times while keeping the same seed, we need to replicate data as it ages, but also keep the journal locatable. For example, we could index the journal by YYYY-MM instead of a counter, so that a future search has a definitive end once it reaches a certain date in the past (e.g. the binsync release date).
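The month-indexed variant can be sketched like this; the epoch month is a hypothetical stand-in for the release date:

```python
EPOCH = (2019, 7)   # hypothetical release month (year, month): search floor

def months_to_probe(now_year, now_month):
    """All YYYY-MM labels from the current month back to EPOCH, inclusive."""
    labels, y, m = [], now_year, now_month
    while (y, m) >= EPOCH:                  # tuple comparison: year, then month
        labels.append(f"{y:04d}-{m:02d}")
        y, m = (y - 1, 12) if m == 1 else (y, m - 1)
    return labels

# The search space is finite and known in advance, unlike an open-ended counter:
assert months_to_probe(2019, 9) == ["2019-09", "2019-08", "2019-07"]
```

Each label would then be fed into the HMAC in place of the counter when deriving the journal's message IDs.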


There is some working code, but it is extremely hacky and basically a brain dump. However, it should suffice as a PoC, and parts of it are planned to be released eventually for the sake of completeness. Partial rewrite: GitHub Repo
