A proc string looks like "foo a b", where foo is the name of a proc and a b is a space-separated list of arguments. Below are simple backup-agnostic examples of how to write one (the proc string is the last argument to scat). See README for real use-case examples: full proc strings for backup and restore.
Hello, World:
# stdout of echo serves as data of the seed chunk, fed to proc
# "write", which writes to stdout:
$ echo "Hello, World!" | scat "write -"
Hello, World!
Procs may be chained as a pipe-separated list:
# Proc "cmd" feeds chunk data to stdin of a command and captures its
# stdout as data of a new chunk:
$ echo "Hello, World!" | scat "cmd cat | write -"
Hello, World!
# Proc "cmdout" produces new data:
$ scat "cmdout echo Hello, World! | write -" < /dev/null
Hello, World!
# More chaining:
$ echo -n "Hello, " | scat "cmd cat | write - | cmdout echo World! | write -"
Hello, World!
$ echo "Hello, World!" | scat "cmd gpg --batch -e -r 00828C1D | cmd gpg --batch -d | write -"
Hello, World!
$ echo "Hello, World!" | scat "cmdin tee hello" && cat hello
Hello, World!
A chain is actually just another proc, with a convenience syntax for specifying its args (0..n procs): they are separated by pipes instead of spaces, which relaxes the need for parentheses. Since a chain is itself a proc, it may be passed as an argument to other procs, surrounded with curly brackets ({}), as in:
"split | { checksum | index - }"
Important: Procs are non-blocking. In the above, the chain piped to split is run for every chunk output by split without waiting for the last one to be processed. To avoid resource hogging, limit the number of concurrent instances of a proc with backlog:
"split | backlog 8 { checksum | index - }"
Parentheses may surround the arguments to avoid ambiguity when passing procs as argument to other procs:
"backlog 8 cp(foo)"
Example: Split file foo, write chunks to bar/:
$ echo hello > foo
$ scat "split | { checksum | index - | cp bar }" < foo > foo_index
$ ls bar
5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
For restoring, we need a list of all the chunks produced during backup. Proc index does that: it lists checksums of chunks output by its containing chain, preserving order. Note that it's part of a subchain ({}), following split: see index.
Re-create foo from chunk files in bar/:
$ scat "uindex | ucp bar | uchecksum | join -" < foo_index > foo
$ cat foo
hello
The following lists document procs, their purpose and arguments. Some arguments have more complex types than strings or ints: they may be procs too, or other types like dynprocs, stores, copiers, etc. See the corresponding lists.
A proc can be thought of as a function that takes a chunk as input, may use its data (feed to a command, check integrity, etc.), modify its properties (checksum, target size, etc.) and possibly return new data as one or more new chunks as output (output of command, parity shards, etc.) fed to the next proc in the chain. Procs are classified into different types according to the nature of their action.
Proc types shouldn't really be a concern when choosing which proc to use, apart from understanding exactly what happens to the data transiting a chain and getting the order right. For instance, it does matter to know which proc produces new data in order to correctly place index and checksum within a chain.
Types:
- mutator: modifies properties leaving data as-is, returns the chunk
  - ex: assigning a checksum
- producer: produces new data by returning one or more new chunks
  - ex: compressing data: 1→1 (new data, no checksum)
  - ex: splitting into smaller chunks: 1→n (new data, no checksum)
  - ex: reading an index from a chunk's data: 1→n (checksum, empty data)
  - ex: joining data and parity shards: n→1 (new data, no checksum)
- passthrough: doesn't modify properties nor produce new data, returns the chunk
  - ex: integrity check
- delegator: doesn't modify properties nor produce new data, passes the chunk through other proc(s)
  - ex: limiting the number of concurrent instances of a proc
More procs exist but aren't exposed for use in a proc string: cascade on error, path-based command, etc. See procs/.
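To make the classification above concrete, here is a minimal Python sketch that models a proc as a function from one chunk to zero or more chunks. It is an illustration only, not scat's implementation: the Chunk class, split_fixed (fixed-size splitting, not content-defined chunking) and the sequential chain helper are hypothetical names, and real procs run concurrently rather than one after the other.

```python
from dataclasses import dataclass, replace
from hashlib import sha256
from typing import Iterator, Optional


@dataclass(frozen=True)
class Chunk:
    data: bytes
    checksum: Optional[str] = None  # hex digest, None while unassigned


def checksum(chunk: Chunk) -> Iterator[Chunk]:
    """Mutator: assigns a property, leaves the data as-is."""
    yield replace(chunk, checksum=sha256(chunk.data).hexdigest())


def split_fixed(chunk: Chunk, size: int = 4) -> Iterator[Chunk]:
    """Producer: 1→n new chunks, none of which has a checksum yet."""
    for i in range(0, len(chunk.data), size):
        yield Chunk(chunk.data[i:i + size])


def uchecksum(chunk: Chunk) -> Iterator[Chunk]:
    """Passthrough: verifies integrity, returns the chunk unchanged."""
    if sha256(chunk.data).hexdigest() != chunk.checksum:
        raise ValueError("integrity check failed")
    yield chunk


def chain(*procs):
    """Feeds every chunk output by one proc to the next one (sequentially)."""
    def run(chunk):
        chunks = [chunk]
        for proc in procs:
            chunks = [out for c in chunks for out in proc(c)]
        return chunks
    return run


out = chain(split_fixed, checksum, uchecksum)(Chunk(b"Hello, World!"))
print([c.data for c in out])
```

Placing uchecksum before checksum in this sketch raises an error, which mirrors why the order of mutators and producers matters in a real chain.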
Note about examples:
- The examples below aren't usable as standalone proc strings and are to be interpreted as extracts of larger proc strings. See README for usable examples and adapt from this list.
- The backlog recommendation is deliberately not followed in the examples either, to keep them simple.
Usage: split()
Content-Defined Chunking with default chunk size (min: 512KiB, max: 8MiB)
- type: producer
Ex:
"split | checksum | cp my_dir"
Usage: split2(min max)
Same as split, with custom min/max chunk size
- args:
  - min (bytes)
  - max (bytes)
- type: producer
Ex:
"split2 1mib 4mib"
Usage: index(path)
De-duplicates chunks and writes an index file to path
Tracks chunks output by the containing chain and writes a list of their checksums to path, preserving order.
Note: index is special in that it's called at the end of the chain as well, with a reference to the chunk that entered the chain. That chunk must have its checksum assigned, otherwise the chain's output chunks can't be tracked properly. As a consequence, the main chain can't look like split | checksum | index - because the seed chunk doesn't have a checksum before split. Rather: split | { checksum | index - }.
- args:
  - path (string) path to index file, or - for stdout
- type: passthrough
- requires: checksum
Ex:
"split | { checksum | index - | cp my_dir }"
Checksums should generally be placed twice in a chain: an initial checksum before index, before the first producer proc, to detect duplicates, and a final checksum after the last producer proc.
Ex:
"split | { checksum | index - | gzip | checksum }"
Important: Some commands are not deterministic, such as cmd gpg -e. Two identical chunks encrypted by this proc, though decrypted as identical original data, will result in different encrypted data and thus different checksums, causing the output chunk to be considered new and re-written/uploaded. To prevent such behaviour, place the final checksum before it:
"split | { checksum | index - | gzip | checksum | cmd gpg --batch -e -r 00828C1D }"
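The effect described above can be reproduced with a small Python sketch. fake_encrypt is a hypothetical stand-in for a non-deterministic command (it is not real encryption and has nothing to do with how gpg works internally); it only shares the relevant property that two runs over the same input give different output.

```python
import os
from hashlib import sha256


def fake_encrypt(data: bytes) -> bytes:
    """Hypothetical stand-in for a non-deterministic command such as gpg -e."""
    nonce = os.urandom(16)  # fresh randomness on every call
    body = bytes(b ^ nonce[i % 16] for i, b in enumerate(data))
    return nonce + body


chunk = b"same chunk data"

# Checksums computed before "encryption": identical for identical chunks.
before = {sha256(chunk).hexdigest() for _ in range(2)}

# Checksums computed after "encryption": differ from one run to the next.
after = {sha256(fake_encrypt(chunk)).hexdigest() for _ in range(2)}

print(len(before), len(after))
```

With the final checksum computed before the non-deterministic step, identical chunks keep identical checksums and can be recognized as already stored.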
Usage: uindex()
Reads the index from a chunk's data
Returns empty chunks (no data) with their checksum and target size assigned, ready for data retrieval by following procs.
- type: producer
Ex:
"uindex | ucp my_dir | uchecksum | join -"
Usage: backlog(nslots proc)
Limits the number of concurrent instances of proc to nslots at a time
Since procs are non-blocking, it is highly recommended to wrap a proc immediately following split or uindex with backlog, as chunks usually come out of them faster than they get processed by the rest of the chain, causing goroutines to be spawned uncontrollably. Without backlog, expect high memory usage, "too many open files" errors, etc.
To serialize the execution of a proc, pass 1 as nslots. This is the equivalent of a mutex, ensuring only a single instance is being run at a time.
If proc is a chain, concurrency may be further limited by nesting backlogs within it.
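The idea behind backlog can be sketched as a counting semaphore wrapped around a proc. The Python below is an illustration under that assumption, not scat's actual code: the backlog and slow_proc functions and the stats dictionary are hypothetical, and threads stand in for goroutines.

```python
import threading
import time


def backlog(nslots, proc):
    """Wraps proc so that at most nslots calls run at the same time."""
    slots = threading.BoundedSemaphore(nslots)

    def limited(chunk):
        with slots:  # blocks while all slots are taken
            return proc(chunk)

    return limited


stats = {"running": 0, "peak": 0}
lock = threading.Lock()


def slow_proc(chunk):
    """Pretends to process a chunk while recording the concurrency level."""
    with lock:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
    time.sleep(0.01)
    with lock:
        stats["running"] -= 1
    return chunk


limited = backlog(2, slow_proc)
threads = [threading.Thread(target=limited, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("peak concurrency:", stats["peak"])
```

With nslots set to 1 the semaphore degenerates into a mutex, which matches the serialization case described above.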
- args:
  - nslots (int) max number of instances
  - proc (proc)
- type: delegator
Ex:
# integrity check with 8 workers:
"uindex | backlog 8 uchecksum"
# writing ordered chunks requires a mutex:
"split | backlog 1 { sort | write - }"
# ...equivalent of:
"split | join -"
# process with 8 workers, write 4 files at a time:
"split | backlog 8 checksum | backlog 4 cp(my_dir)"
# ...or:
"split | backlog 8 { checksum | backlog 4 cp(my_dir) }"
Usage: checksum()
Computes and assigns checksums
- type: mutator
Ex: see index
Usage: uchecksum()
Integrity check
- type: passthrough
- requires: checksum
Ex:
"uindex | ucp my_dir | uchecksum"
Usage: gzip()
Compresses data in gzip format
- type: producer
Ex:
"gzip | checksum | cp my_dir"
Usage: ugzip()
Uncompresses data compressed by gzip
- type: producer
Ex:
"ucp my_dir | uchecksum | ugzip"
Usage: parity(ndata nparity)
Reed-Solomon erasure coding
Splits chunks into ndata data shards and nparity parity shards for error correction.
- args:
  - ndata (int) number of data shards
  - nparity (int) number of parity shards
- type: producer
Ex:
"parity 2 1 | checksum"
Usage: uparity(ndata nparity)
Joins chunks split by parity into the original bigger chunk, recovering from any error (failed integrity check, missing data)
- args: see parity
- type: producer
- requires: checksum, group (ndata + nparity)
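To illustrate the shape of a "2 data + 1 parity" scheme, the Python sketch below uses plain XOR parity. This is a deliberately simpler technique swapped in for Reed-Solomon coding, so it only demonstrates the idea of rebuilding one lost shard out of three. The function names and the explicit size argument (standing in for the chunk's target size property) are hypothetical.

```python
def parity_2_1(data: bytes):
    """Splits data into 2 data shards plus 1 XOR parity shard."""
    half = (len(data) + 1) // 2
    a = data[:half]
    b = data[half:].ljust(half, b"\0")  # pad so both shards have equal size
    p = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, p]


def uparity_2_1(shards, size: int) -> bytes:
    """Rebuilds the original data from 3 shards, at most one being None."""
    a, b, p = shards
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, p))
    return (a + b)[:size]  # drop the padding


original = b"Hello, World!"
a, b, p = parity_2_1(original)

# Any single shard may be lost:
print(uparity_2_1([None, b, p], len(original)))
print(uparity_2_1([a, None, p], len(original)))
print(uparity_2_1([a, b, None], len(original)))
```

The three shards must be handed to the recovery function together, which is the role group plays before uparity in a proc string.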
Ex:
"uchecksum | group 3 | uparity 2 1"
Usage: group(size)
Aggregates size contiguous chunks into one for procs that work with fixed-sized groups of chunks
For instance, parity(2 1) creates 3 shard chunks from one original and uparity needs those 3 grouped together to recreate the original. Use group before uparity: see example.
- args:
  - size (int) group size
- type: producer
Ex:
"group 3 | uparity 2 1"
Usage: cmd(name arg...)
Filters a chunk's data through a command
- args:
  - name (string) command executable name: relative to $PATH or absolute path
  - 0..n arg (string) command arguments
- type: producer
  - stdin ← chunk data
  - stdout → chunk data
Ex:
"cmd gpg --batch --encrypt -r 00828C1D"
"cmd gpg --batch --decrypt"
Usage: cmdin(name arg...)
Runs a command using a chunk's data as stdin
- args: see cmd
- type: passthrough
  - stdin ← chunk data
  - stdout → (discarded)
Ex:
"cmdin tee /tmp/out"
"cmdin ssh bankmon dd of=/tmp/out"
Usage: cmdout(name arg...)
Runs a command to produce new data
- args: see cmd
- type: producer
  - stdin ← (none)
  - stdout → chunk data
Ex:
"cmdout date | write - | cmdout echo Hello | write -"
Usage: concur(max dynproc)
Feeds chunks to procs returned by dynproc, running only max of them at a time, concurrently
- args:
  - max (int) max number of instances
  - dynproc (dynproc)
- type: delegator
Ex:
# one transfer at a time:
"concur 1 mincopies(2
a=scp(bankmon tmp/a)
b=rclone(drive:tmp/b)
)"
Usage: multireader(copier...)
Retrieves data from copiers, randomly alternating between them and cascading on error (failover)
- args:
  - 0..n copier (copier)
- type: delegator
Ex:
"multireader(
a=rclone(drive:tmp/a)
b=scp(bankmon tmp/b)
)"
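The "random alternation with failover" behaviour can be sketched in a few lines of Python. This is an assumption about the general technique rather than a description of scat's code: multiread and the two reader functions are hypothetical, and a reader is modelled as a plain function that takes a checksum and either returns data or raises OSError.

```python
import random


def multiread(readers, checksum, rng=random):
    """Tries readers in random order, falling back to the next one on error."""
    candidates = list(readers)
    rng.shuffle(candidates)  # random alternation between stores
    errors = []
    for read in candidates:
        try:
            return read(checksum)
        except OSError as err:  # failover: remember the error, try the next
            errors.append(err)
    raise OSError(f"all {len(candidates)} readers failed: {errors}")


def broken(checksum):
    raise OSError("store unreachable")


def working(checksum):
    return b"chunk data"


data = multiread([broken, working], "5891b5b5")
print(data)
```

Whichever reader is picked first, the call succeeds as long as at least one of them can return the chunk.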
Usage: sort()
Sorts chunks by their original order
Since procs are non-blocking, chunks get out of order as they advance through a chain. But order is important at the time of re-assembling them into the original stream. sort buffers them until it has a contiguous series and returns them in order.
- type: passthrough
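The buffering strategy described above can be sketched as follows in Python. The sketch assumes that every chunk carries a sequence number starting at 0, which is a hypothetical representation chosen for the example; sort_chunks is likewise a made-up name.

```python
def sort_chunks(chunks):
    """Yields (num, data) pairs in order of num, buffering early arrivals."""
    pending = {}
    next_num = 0
    for num, data in chunks:
        pending[num] = data
        # Release the longest contiguous run that is now available.
        while next_num in pending:
            yield next_num, pending.pop(next_num)
            next_num += 1


arrived = [(2, b"c"), (0, b"a"), (1, b"b"), (3, b"d")]
ordered = list(sort_chunks(arrived))
print(ordered)
```

Chunk 2 arrives first and stays buffered until chunks 0 and 1 have been seen, so memory use grows with how far out of order chunks arrive.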
Ex:
"sort | write -"
Usage: write(path)
Writes a chunk's data to path
- args:
  - path (string) path to write to, or - for stdout
- type: passthrough
Ex:
"write -"
Usage: join(path)
Joins chunks' data in their original order, writing the concatenation to path. Short for backlog 1 { sort | write path }.
- args: see write
- type: passthrough
Ex:
"uindex | ucp my_dir | uchecksum | join -"
Every store "foo" is also available as two procs:
- "foo" (write)
- "ufoo" (read)
See the corresponding stores.
Ex:
"rclone drive:tmp"
"urclone drive:tmp"
A dynproc is similar to a function that takes a chunk as input and returns a variable number of procs to process that chunk.
Usage: stripe(min excl copier...)
Striping and N-copies duplication
Ensures there exist at least min copies of each chunk among all given copiers, creating missing ones as needed. Chunks are striped across stores by interleaving them in a Round-Robin fashion.
If chunks are grouped with group, then stripe may guarantee that at least excl chunks within that group are put on distinct stores from the others. Required for guaranteeing recoverability from parity so that any nparity stores may be lost while maintaining ability to recompute original data from the remaining >= ndata shards.
On consecutive runs, existing copies will be reused as much as possible while meeting the min and excl requirements, making new copies as necessary to meet them. Returns an error if not possible, whether for lack of provided stores, or not enough of them available with quota left.
Stores are filled up to their quota and a little bit over due to concurrency during writes/uploads causing imprecision in calculation. In theory, quota overage may reach up to group size × max chunk size × concurrency.
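As a worked example of that upper bound, the Python below plugs in illustrative values: a group size of 3, a maximum stored chunk size of 8 MiB and 8 concurrent writes. All three values are assumptions picked for the arithmetic, not measurements.

```python
MIB = 2**20

group_size = 3       # e.g. chunks grouped with "group 3"
max_chunk = 8 * MIB  # assumed upper bound on the size of a stored chunk
concurrency = 8      # assumed number of concurrent writes/uploads

# Theoretical worst case: group size × max chunk size × concurrency
overage = group_size * max_chunk * concurrency
print(overage // MIB, "MiB")
```

Under these assumptions a store could end up at most 192 MiB over its quota.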
- args:
  - min (int) guaranteed minimum number of copies
  - excl (int) guaranteed minimum number of exclusive chunks within a group
  - 0..n copier (quotaRes)
- type of returned procs: passthrough
- requires: checksum, group (for excl > 0)
Ex:
# RAID 1: make 2 copies
"stripe(2 0
a=scp(bankmon tmp/a)
b=rclone(drive:tmp/b)=2gib
)"
# RAID 5: ensure exclusivity for ndata shards
"parity 2 1 | group 3 | stripe(1 2
a=scp(bankmon tmp/a)
b=rclone(drive:tmp/b)=2gib
c=rclone(drive2:tmp/c)
)"
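The Round-Robin placement with a minimum number of copies can be pictured with the simplified Python sketch below. It is a toy model under strong assumptions: it ignores quotas, the excl guarantee and the reuse of existing copies, and the stripe function and its return value are hypothetical.

```python
def stripe(chunks, stores, min_copies):
    """Assigns each chunk to min_copies distinct stores, round-robin."""
    if min_copies > len(stores):
        raise ValueError("not enough stores for the requested copies")
    placement = {}
    start = 0
    for chunk in chunks:
        placement[chunk] = [
            stores[(start + i) % len(stores)] for i in range(min_copies)
        ]
        # The next chunk continues where this one stopped.
        start = (start + min_copies) % len(stores)
    return placement


placement = stripe(["c0", "c1", "c2"], ["a", "b", "c"], 2)
for chunk, where in placement.items():
    print(chunk, where)
```

Each store ends up with two of the three chunks here, so the load is spread evenly while every chunk still has two copies.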
Usage: mincopies(min copier...)
N-copies duplication
Same as stripe, with no guarantee of exclusivity of chunks across stores. Short for stripe(min 0 copier...).
- args: see stripe
- type of returned procs: see stripe
- requires: see stripe
Ex:
"mincopies(2
a=scp(bankmon tmp/a)
b=rclone(drive:tmp/b)=2gib
)"
A store represents a storage facility, local or remote. It provides a proc for writes/uploads, another for reads (or downloads) and can list existing entries, such as files or objects in buckets.
Filenames are hexadecimal SHA256 checksum hashes (64 chars). Ex: aeef70b69d4e9dc8eb95bea114c4e992831e4185ec93145c4c893b5811079bea
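Such a filename can be computed by hand. The Python below assumes the checksum is the plain SHA256 digest of the chunk's data, which is consistent with the bar/ listing in the earlier split example, where file foo contains the output of echo hello and fits in a single chunk.

```python
from hashlib import sha256

data = b"hello\n"  # what `echo hello` writes, trailing newline included
name = sha256(data).hexdigest()

print(name)
print(len(name))
```

The digest is 64 hexadecimal characters long and matches the chunk file listed under bar/ in that example.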
Usage: rclone(remote)
Cloud storage via rclone command
Note: The remote must be already configured via rclone config.
- args:
  - remote (string) name of remote and directory in the form of "<remote>:<dir>"
- requires: checksum
Ex:
"rclone drive:tmp/backup"
Usage: cp(dir level...)
Local filesystem storage in directory dir
If levels are specified, chunks are nested within subdirectories derived from their checksum: one subdirectory per level, named after the next level characters of the checksum.
- args:
  - dir (string) path to directory
  - 0..n level (int) nesting levels (subdirectory name length at each level)
- requires: checksum
Ex:
"cp path/to/foo"
# ...writes chunks to: (relative to dir)
fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1
"cp path/to/foo 4"
# ...writes to:
fd9f/fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1
"cp path/to/foo 3 2"
# ...writes to:
fd9/fe/fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1
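The nesting rule shown in these examples can be expressed as a short Python function. nested_path is a hypothetical helper written to reproduce the three paths above; it is not taken from scat's source.

```python
def nested_path(checksum, *levels):
    """Returns the path of a chunk file relative to dir, for given levels."""
    parts = []
    pos = 0
    for level in levels:
        # Each level consumes the next `level` characters of the checksum.
        parts.append(checksum[pos:pos + level])
        pos += level
    parts.append(checksum)  # the file itself keeps the full checksum as name
    return "/".join(parts)


checksum = "fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1"
print(nested_path(checksum))
print(nested_path(checksum, 4))
print(nested_path(checksum, 3, 2))
```

The three calls correspond to "cp path/to/foo", "cp path/to/foo 4" and "cp path/to/foo 3 2" respectively.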
Usage: scp(host dir level...)
Remote file system storage via SSH
Requires the following GNU-compatible commands:
- (local) ssh for remote command execution
- (remote) dd for streaming file transfer
- (remote) find for listing existing files
Note: ssh and helper commands are used for file transfer instead of scp or sftp because scp only takes path arguments and that would require writing many temp files. Instead, files are streamed through dd without buffering to disk. For listing, neither sftp nor ls is used, due to necessary path escaping and inflexible output formatting which would have required error-prone parsing. With ssh + find, paths and file info are passed around in a manner that eliminates ambiguity: environment variables and NUL-separated strings.
- args:
  - host (string) first argument to ssh: [user@]hostname
  - dir (string) path to directory
  - 0..n level (int) see cp
- requires: checksum
Ex:
"scp bankmon /tmp"
"scp bankmon /tmp 4"
Arguments to above types
Note: In equal sign-separated pairs such as copier=limit, spaces are not allowed around =.
Quota resource, with or without quota limit
Format: copier=max or copier
- args:
  - copier (copier)
  - max (bytes) default = unlimited
Ex:
"a=scp(bankmon tmp/a)"
"b=rclone(drive:tmp/b)=2gib"
Format: id=store
- args:
  - id (string) used for internal book-keeping such as quota and stats computation
  - store (store)
Ex:
"foo=rclone(drive:tmp/bar)"
Format: sequence of non-space characters
Ex:
"path/to/file"
Format: <int><unit>
Size in bytes
Ex:
1024MiB
1GiB
1000MiB
1gb
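A parser for this kind of size string might look like the Python sketch below. The unit table is an assumption: it treats the "ib" suffixes as powers of 1024 and the plain suffixes as powers of 1000, case-insensitively, which is a common convention but is not confirmed by this document. parse_bytes and UNITS are hypothetical names.

```python
import re

# Assumed unit table: binary multiples for "*ib", decimal for the others.
UNITS = {
    "b": 1,
    "kb": 10**3, "mb": 10**6, "gb": 10**9,
    "kib": 2**10, "mib": 2**20, "gib": 2**30,
}


def parse_bytes(text: str) -> int:
    """Parses strings such as "1GiB" or "4mib" into a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", text.strip().lower())
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"invalid size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


for example in ("1024MiB", "1GiB", "1000MiB", "1gb"):
    print(example, parse_bytes(example))
```

Under the assumed table, 1024MiB and 1GiB denote the same number of bytes, while 1gb is slightly smaller than either.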
Format: numeric characters
Ex:
"123"
#!/usr/bin/env ruby
$stdin.each_line do |line|
  line.chomp =~ /^##(#*)\s*/ or next
  level, text = $1.size, $'
  anchor = text.downcase.tr(" ", "-").gsub(/[^\w-]/, "")
  print "%s* [%s](#%s)\n" % ["\t"*level, text, anchor]
end
00_TOC.md: 01_PROCSTRING.md
	./gentoc < $< > $@