Documentation for scat Proc string - see https://github.com/Roman2K/scat

Proc string

A proc string looks like "foo a b", where foo is the name of a proc and a b is its space-separated list of arguments. Below are simple, backup-agnostic examples of how to write one (passed as the last argument to scat). See the README for real use-case examples: full proc strings for backup and restore.

Hello, World:

# stdout of echo serves as data of the seed chunk, fed to proc
# "write", which writes to stdout:

$ echo "Hello, World!" | scat "write -"
Hello, World!

Procs may be chained as a pipe-separated list:

# Proc "cmd" feeds chunk data to stdin of a command and captures its
# stdout as data of a new chunk:

$ echo "Hello, World!" | scat "cmd cat | write -"
Hello, World!

# Proc "cmdout" produces new data:

$ scat "cmdout echo Hello, World! | write -" < /dev/null
Hello, World!

# More chaining:

$ echo -n "Hello, " | scat "cmd cat | write - | cmdout echo World! | write -"
Hello, World!

$ echo "Hello, World!" | scat "cmd gpg --batch -e -r 00828C1D | cmd gpg --batch -d | write -"
Hello, World!

$ echo "Hello, World!" | scat "cmdin tee hello" && cat hello
Hello, World!

A chain is actually just another proc, with convenience syntax: its args (0..n procs) are separated by pipes instead of spaces, relaxing the need for parentheses. Since a chain is itself a proc, it may be passed as an argument to other procs by surrounding it with curly brackets ({}), as in:

"split | { checksum | index - }"

Important: Procs are non-blocking. In the above, the chain piped to split is run for every chunk output by split without waiting for the last one to be processed. To avoid resource hogging, limit the number of concurrent instances of a proc with backlog:

"split | backlog 8 { checksum | index - }"

Parentheses may surround the arguments to avoid ambiguity when passing procs as argument to other procs:

"backlog 8 cp(foo)"

Example: Split file foo, write chunks to bar/:

$ echo hello > foo
$ scat "split | { checksum | index - | cp bar }" < foo > foo_index
$ ls bar
5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
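
The file name is the chunk's SHA-256 checksum (see List of stores below). Assuming GNU coreutils is available, this can be verified with sha256sum:

$ echo hello | sha256sum
5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03  -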

For restoring, we need a list of all the chunks produced during backup. Proc index does that: it lists checksums of chunks output by its containing chain, preserving order. Note that it's part of a subchain ({}), following split: see index.

Re-create foo from chunk files in bar/:

$ scat "uindex | ucp bar | uchecksum | join -" < foo_index > foo
$ cat foo
hello

The following lists document procs, their purpose and arguments. Some arguments are more complex types than strings or ints: they may be procs themselves, or other types such as dynprocs, stores, copiers, etc. See the corresponding lists.

List of procs

A proc can be thought of as a function that takes a chunk as input. It may use the chunk's data (feed it to a command, check integrity, etc.), modify its properties (checksum, target size, etc.) and possibly produce new data, returned as one or more new chunks (output of a command, parity shards, etc.) that are fed to the next proc in the chain. Procs are classified into types according to the nature of their action.

Proc types shouldn't really be a concern when choosing which proc to use, beyond understanding exactly what happens to the data transiting a chain and getting the order right. For instance, it does matter to know which procs produce new data in order to place index and checksum correctly within a chain; see the annotated example after the list below.

Types:

  • mutator: modifies properties leaving data as-is, returns the chunk
    • ex: assigning a checksum
  • producer: produces new data by returning one or more new chunks
    • ex: compressing data: 1→1 (new data, no checksum)
    • ex: splitting into smaller chunks: 1→n (new data, no checksum)
    • ex: reading an index from a chunk's data: 1→n (checksum, empty data)
    • ex: joining data and parity shards: n→1 (new data, no checksum)
  • passthrough: doesn't modify properties nor produce new data, returns the chunk
    • ex: integrity check
  • delegator: doesn't modify properties nor produce new data, passes the chunk through other proc(s)
    • ex: limiting the number of concurrent instances of a proc
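
For instance, the following chain (a sketch assembled from procs documented below) contains all four types:

# producers: split, gzip; mutator: checksum; passthrough: index;
# delegator: backlog
"split | backlog 8 { checksum | index - | gzip | checksum | cp my_dir }"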

More procs exist but aren't exposed for use in a proc string: cascade on error, path-based command, etc. See procs/.

Note about examples:

  • The examples below aren't usable as standalone proc strings; they should be read as extracts of larger proc strings. See the README for usable examples to adapt.
  • For simplicity, the backlog recommendation is deliberately not respected either.

split

Usage: split()

Content-Defined Chunking with default chunk size (min: 512KiB, max: 8MiB)

  • type: producer

Ex:

"split | checksum | cp my_dir"

split2

Usage: split2(min max)

Idem split with custom min/max chunk size

  • args:
    • min (bytes)
    • max (bytes)
  • type: producer

Ex:

"split2 1mib 4mib"

index

Usage: index(path)

De-duplicates chunks and writes an index file to path

Tracks chunks output by the containing chain and writes a list of their checksums to path, preserving order.

Note: index is special in that it's called at the end of the chain as well, with a reference to the chunk that entered the chain. That chunk must have its checksum assigned, otherwise the chain's output chunks can't be tracked properly. As a consequence, the main chain couldn't look like split | checksum | index - because the seed chunk doesn't have a checksum before split. Rather: split | { checksum | index - }.
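
To illustrate the note above:

# Wrong: index sits in the main chain, whose input (the seed chunk)
# has no checksum:
"split | checksum | index -"

# Right: index sits in a subchain whose input (each chunk output by
# split) gets its checksum assigned first:
"split | { checksum | index - }"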

  • args:
    • path (string) path to index file, or - for stdout
  • type: passthrough
  • requires: checksum

Ex:

"split | { checksum | index - | cp my_dir }"

Checksums should generally be placed twice in a chain: an initial checksum before index, ahead of the first producer proc, to detect duplicates; and a final checksum after the last producer proc.

Ex:

"split | { checksum | index - | gzip | checksum }"

Important: Some commands are not idempotent, such as cmd gpg -e. Two identical chunks encrypted by this proc, though decrypted back to identical original data, will result in different encrypted data; their checksums will differ as well, so the output chunk is considered new and re-written/uploaded. To prevent this behaviour, place the final checksum before it:

"split | { checksum | index - | gzip | checksum | cmd gpg --batch -e -r 00828C1D }"

uindex

Usage: uindex()

Reads the index from a chunk's data

Returns empty chunks (no data) with their checksum and target size assigned, ready for data retrieval by following procs.

  • type: producer

Ex:

"uindex | ucp my_dir | uchecksum | join -"

backlog

Usage: backlog(nslots proc)

Limits the number of concurrent instances of proc to nslots at a time

Since procs are non-blocking, it is highly recommended to wrap a proc immediately following split or uindex with backlog as chunks usually come out of them faster than they get processed by the rest of the chain, causing goroutines to be spawned uncontrollably. Without backlog, expect high memory usage, "too many open files" errors, etc.

To serialize the execution of a proc, pass 1 as nslots. This is the equivalent of a mutex, ensuring only a single instance runs at a time.

If proc is a chain, concurrency may be further limited by nesting backlogs within it.

  • args:
    • nslots (int) max number of instances
    • proc (proc)
  • type: delegator

Ex:

# integrity check with 8 workers:
"uindex | backlog 8 uchecksum"

# writing ordered chunks requires a mutex:
"split | backlog 1 { sort write - }"

# ...equivalent of:
"split | join -"

# process with 8 workers, write 4 files at a time:
"split | backlog 8 checksum | backlog 4 cp(my_dir)"

# ...or:
"split | backlog 8 { checksum | backlog 4 cp(my_dir) }"

checksum

Usage: checksum()

Computes and assigns checksums

  • type: mutator

Ex: see index

uchecksum

Usage: uchecksum()

Integrity check

  • type: passthrough
  • requires: checksum

Ex:

"uindex | ucp my_dir | uchecksum"

gzip

Usage: gzip()

Compresses data in gzip format

  • type: producer

Ex:

"gzip | checksum | cp my_dir"

ugzip

Usage: ugzip()

Uncompresses data compressed by gzip

  • type: producer

Ex:

"ucp my_dir | uchecksum | ugzip"

parity

Usage: parity(ndata nparity)

Reed-Solomon erasure coding

Splits chunks into ndata data shards and nparity parity shards for error correction.

  • args:
    • ndata (int) number of data shards
    • nparity (int) number of parity shards
  • type: producer

Ex:

"parity 2 1 | checksum"

uparity

Usage: uparity(ndata nparity)

Joins chunks split by parity into the original bigger chunk, recovering from errors (failed integrity check, missing data)

  • args: see parity
  • type: producer
  • requires: checksum, group (ndata + nparity)

Ex:

"uchecksum | group 3 | uparity 2 1"

group

Usage: group(size)

Aggregates size contiguous chunks into one for procs that work with fixed-sized groups of chunks

For instance, parity(2 1) creates 3 shard chunks from one original and uparity needs those 3 grouped together to recreate the original. Use group before uparity: see example.

  • args:
    • size (int) group size
  • type: producer

Ex:

"group 3 | uparity 2 1"

cmd

Usage: cmd(name arg...)

Filters a chunk's data through a command

  • args:
    • name (string) command executable name: relative to $PATH or absolute path
    • 0..n arg (string) command arguments
  • type: producer
  • stdin ← chunk data
  • stdout → chunk data

Ex:

"cmd gpg --batch --encrypt -r 00828C1D"
"cmd gpg --batch --decrypt"

cmdin

Usage: cmdin(name arg...)

Runs a command using a chunk's data as stdin

  • args: see cmd
  • type: passthrough
  • stdin ← chunk data
  • stdout → (discarded)

Ex:

"cmdin tee /tmp/out"
"cmdin ssh bankmon dd of=/tmp/out"

cmdout

Usage: cmdout(name arg...)

Runs a command to produce new data

  • args: see cmd
  • type: producer
  • stdin ← (none)
  • stdout → chunk data

Ex:

"cmdout date | write - | cmdout echo Hello | write -"

concur

Usage: concur(max dynproc)

Feeds chunks to procs returned by dynproc, running only max of them at a time, concurrently

  • args:
    • max (int) max number of instances
    • dynproc (dynproc)
  • type: delegator

Ex:

# one transfer at a time:
"concur 1 mincopies(2
	a=scp(bankmon:tmp/a)
	b=rclone(drive:tmp/b)
)"

multireader

Usage: multireader(copier...)

Retrieves data from copiers, randomly alternating between them and cascading on error (failover)

  • args:
    • 0..n copier (copier)
  • type: delegator

Ex:

"multireader(
	a=rclone(drive:tmp/a)
	b=scp(bankmon tmp/b)
)"

sort

Usage: sort()

Sorts chunks by their original order

Since procs are non-blocking, chunks get out of order as they advance through a chain. But order matters when re-assembling them into the original stream. sort buffers chunks until it has a contiguous series, then returns them in order.

  • type: passthrough

Ex:

"sort | write -"

write

Usage: write(path)

Writes a chunk's data to path

  • args:
    • path (string) path to write to, or - for stdout
  • type: passthrough

Ex:

"write -"

join

Usage: join(path)

Joins chunks' data in their original order, writing the concatenation to path. Short for backlog 1 { sort | write path }.

  • args: see write
  • type: passthrough

Ex:

"uindex | ucp my_dir | uchecksum | join -"

cp, ucp, rclone, etc.

Every store "foo" is also available as two procs:

  • "foo" (write)
  • "ufoo" (read)

See corresponding stores

Ex:

"rclone drive:tmp"
"urclone drive:tmp"

List of dynprocs

A dynproc is similar to a function that takes a chunk as input and returns a variable number of procs to process that chunk.

stripe

Usage: stripe(min excl copier...)

Striping and N-copies duplication

Ensures there exist at least min copies of each chunk among all given copiers, creating missing ones as needed. Chunks are striped across stores by interleaving them in a Round-Robin fashion.

If chunks are grouped with group, stripe may additionally guarantee that at least excl chunks within each group are put on stores distinct from the others'. This is required to guarantee recoverability from parity: any nparity stores may be lost while retaining the ability to recompute the original data from the remaining >= ndata shards.

On consecutive runs, existing copies will be reused as much as possible while meeting the min and excl requirements, making new copies as necessary to meet them. Returns an error if not possible, whether for lack of provided stores, or not enough of them available with quota left.

Stores are filled up to their quota and a little bit over due to concurrency during writes/uploads causing imprecision in calculation. In theory, quota overage may reach up to group size × max chunk size × concurrency.
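
As a rough illustration using values appearing elsewhere in this document (group 3, the default 8MiB max chunk size, backlog 8), overage could in theory reach 3 × 8MiB × 8 = 192MiB per store.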

  • args:
    • min (int) guarantee of minimum number of copies
    • excl (int) guarantee of minimum number of exclusive chunks within a group
    • 0..n copier (quotaRes)
  • type of returned procs: passthrough
  • requires: checksum, group (for excl > 0)

Ex:

# RAID 1: make 2 copies
"stripe(2 0
	a=scp(bankmon tmp/a)
	b=rclone(drive:tmp/b)=2gib
)"

# RAID 5: ensure exclusivity for ndata shards
"parity 2 1 | group 3 | stripe(1 2
	a=scp(bankmon tmp/a)
	b=rclone(drive:tmp/b)=2gib
	c=rclone(drive2:tmp/c)
)"

mincopies

Usage: mincopies(min copier...)

N-copies duplication

Idem stripe with no guarantee of exclusivity of chunks across stores. Short for stripe(min 0 copier...).

  • args: see stripe
  • type of returned procs: see stripe
  • requires: see stripe

Ex:

"mincopies(2
	a=scp(bankmon tmp/a)
	b=rclone(drive:tmp/b)=2gib
)"

List of stores

A store represents a storage facility, local or remote. It provides a proc for writes/uploads, another for reads (or downloads) and can list existing entries, such as files or objects in buckets.

Filenames are hexadecimal SHA256 checksum hashes (64 chars). Ex: aeef70b69d4e9dc8eb95bea114c4e992831e4185ec93145c4c893b5811079bea

rclone

Usage: rclone(remote)

Cloud storage via rclone command

Note: The remote must be already configured via rclone config.

  • args:
    • remote (string) name of remote and directory in the form of "<remote>:<dir>"
  • requires: checksum

Ex:

"rclone drive:tmp/backup"

cp

Usage: cp(dir level...)

Local filesystem storage in directory dir

If levels are specified, chunks are nested within subdirectories named after the leading characters of their checksum, one subdirectory of level characters per level (see the examples below).

  • args:
    • dir (string) path to directory
    • 0..n level (int) nesting levels
  • requires: checksum

Ex:

"cp path/to/foo"

# ...writes chunks to: (relative to dir)
fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1

"cp path/to/foo 4"

# ...writes to:
fd9f/fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1

"cp path/to/foo 3 2"

# ...writes to:
fd9/fe/fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1

scp

Usage: scp(host dir level...)

Remote file system storage via SSH

Requires the following GNU-compatible commands:

  • (local) ssh for remote command execution
  • (remote) dd for streaming file transfer
  • (remote) find for listing existing files

Note: ssh and helper commands are used for file transfer instead of scp or sftp because scp only takes path arguments, which would require writing many temp files. Instead, files are streamed through dd without buffering to disk. For listing, neither sftp nor ls are used either, due to the necessary path escaping and inflexible output formatting, which would have required error-prone parsing. With ssh + find, paths and file info are passed around in a manner that eliminates ambiguity: environment variables and NUL-separated strings.
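
A sketch of the kind of commands involved (illustrative only; the local file chunk is hypothetical and the exact invocations scat uses may differ):

# write a chunk by streaming through dd on the remote:
$ ssh bankmon "dd of=/tmp/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03" < chunk

# list existing chunks, NUL-separated:
$ ssh bankmon "find /tmp -type f -print0"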

  • args:
    • host (string) first argument to ssh: [user@]hostname
    • dir (string) path to directory
    • 0..n level (int) see cp
  • requires: checksum

Ex:

"scp bankmon /tmp"
"scp bankmon /tmp 4"

Other types

Arguments to above types

Note: In pairs separated by an equal sign, such as copier=limit, spaces are not allowed around =.
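
For instance:

# valid:
"b=rclone(drive:tmp/b)=2gib"

# invalid (spaces around =):
"b = rclone(drive:tmp/b) = 2gib"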

quotaRes

Quota resource, with or without quota limit

Format: copier=max or copier

  • args:
    • copier (copier)
    • max (bytes) default = unlimited

Ex:

"a=scp(bankmon tmp/a)"
"b=rclone(drive:tmp/b)=2gib"

copier

Format: id=store

  • args:
    • id (string) used for internal book-keeping such as quota and stats computation
    • store (store)

Ex:

"foo=rclone(drive:tmp/bar)"

string

Format: sequence of non-space characters

Ex:

"path/to/file"

bytes

Format: <int><unit>

Size in bytes

Ex:

1024MiB
1GiB
1000MiB
1gb

int

Format: numeric characters

Ex:

"123"
gentoc (generates the table of contents, 00_TOC.md, from this document's headings):

#!/usr/bin/env ruby
$stdin.each_line do |line|
  # match markdown headings of level 2 and deeper ("##", "###", ...)
  line.chomp =~ /^##(#*)\s*/ or next
  # nesting level from the extra #'s; heading text is what follows the match
  level, text = $1.size, $'
  # GitHub-style anchor: lowercase, spaces to dashes, strip other punctuation
  anchor = text.downcase.tr(" ", "-").gsub(/[^\w-]/, "")
  print "%s* [%s](#%s)\n" % ["\t" * level, text, anchor]
end

Makefile rule invoking it:

00_TOC.md: 01_PROCSTRING.md
	./gentoc < $< > $@