Documentation for scat Proc string - see https://github.com/Roman2K/scat

Proc string

A proc string looks like "foo a b", where foo is the name of a proc and a b is its space-separated list of arguments. Below are simple, backup-agnostic examples of how to write one (passed as the last argument to scat). See the README for real use-case examples: full proc strings for backup and restore.

Hello, World:

# stdout of echo serves as data of the seed chunk, fed to proc
# "write", which writes to stdout:

$ echo "Hello, World!" | scat "write -"
Hello, World!

Procs may be chained as a pipe-separated list:

# Proc "cmd" feeds chunk data to stdin of a command and captures its
# stdout as data of a new chunk:

$ echo "Hello, World!" | scat "cmd cat | write -"
Hello, World!

# Proc "cmdout" produces new data:

$ scat "cmdout echo Hello, World! | write -" < /dev/null
Hello, World!

# More chaining:

$ echo -n "Hello, " | scat "cmd cat | write - | cmdout echo World! | write -"
Hello, World!

$ echo "Hello, World!" | scat "cmd gpg --batch -e -r 00828C1D | cmd gpg --batch -d | write -"
Hello, World!

$ echo "Hello, World!" | scat "cmdin tee hello" && cat hello
Hello, World!

A chain is actually just another proc, with convenience syntax: its args (0..n procs) are separated by pipes instead of spaces, relaxing the need for parentheses. Since a chain is itself a proc, it may be passed as an argument to other procs by surrounding it with curly brackets ({}), as in:

"split | { checksum | index - }"

Important: Procs are non-blocking. In the above, the chain piped to split is run for every chunk output by split without waiting for the last one to be processed. To avoid resource hogging, limit the number of concurrent instances of a proc with backlog:

"split | backlog 8 { checksum | index - }"

Parentheses may surround the arguments to avoid ambiguity when passing procs as argument to other procs:

"backlog 8 cp(foo)"

Example: Split file foo, write chunks to bar/:

$ echo hello > foo
$ scat "split | { checksum | index - | cp bar }" < foo > foo_index
$ ls bar
5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
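
The file name is the chunk's SHA-256 checksum (see List of stores below). Assuming GNU coreutils is available, this can be verified with sha256sum:

$ echo hello | sha256sum
5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03  -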

For restoring, we need a list of all the chunks produced during backup. Proc index does that: it lists checksums of chunks output by its containing chain, preserving order. Note that it's part of a subchain ({}), following split: see index.

Re-create foo from chunk files in bar/:

$ scat "uindex | ucp bar | uchecksum | join -" < foo_index > foo
$ cat foo
hello

The following lists document procs, their purpose and arguments. Some arguments are more complex types than strings or ints: they may be procs themselves, or other types such as dynprocs, stores, copiers, etc. See the corresponding lists.

List of procs

A proc can be thought of as a function that takes a chunk as input. It may use the chunk's data (feed it to a command, check integrity, etc.), modify its properties (checksum, target size, etc.) and possibly produce new data, returned as one or more new chunks (output of a command, parity shards, etc.) that are fed to the next proc in the chain. Procs are classified into types according to the nature of their action.

Proc types shouldn't really be a concern when choosing which proc to use, beyond understanding exactly what happens to the data transiting a chain and getting the order right. For instance, it does matter to know which procs produce new data in order to place index and checksum correctly within a chain; see the annotated example after the list below.

Types:

  • mutator: modifies properties leaving data as-is, returns the chunk
    • ex: assigning a checksum
  • producer: produces new data by returning one or more new chunks
    • ex: compressing data: 1→1 (new data, no checksum)
    • ex: splitting into smaller chunks: 1→n (new data, no checksum)
    • ex: reading an index from a chunk's data: 1→n (checksum, empty data)
    • ex: joining data and parity shards: n→1 (new data, no checksum)
  • passthrough: doesn't modify properties nor produce new data, returns the chunk
    • ex: integrity check
  • delegator: doesn't modify properties nor produce new data, passes the chunk through other proc(s)
    • ex: limiting the number of concurrent instances of a proc
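
For instance, the following chain (a sketch assembled from procs documented below) contains all four types:

# producers: split, gzip; mutator: checksum; passthrough: index;
# delegator: backlog
"split | backlog 8 { checksum | index - | gzip | checksum | cp my_dir }"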

More procs exist but aren't exposed for use in a proc string: cascade on error, path-based command, etc. See procs/.

Note about examples:

  • The examples below aren't usable as standalone proc strings; they should be read as extracts of larger proc strings. See the README for usable examples to adapt.
  • For simplicity, the backlog recommendation is deliberately not respected either.

split

Usage: split()

Content-Defined Chunking with default chunk size (min: 512KiB, max: 8MiB)

  • type: producer

Ex:

"split | checksum | cp my_dir"

split2

Usage: split2(min max)

Idem split with custom min/max chunk size

  • args:
    • min (bytes)
    • max (bytes)
  • type: producer

Ex:

"split2 1mib 4mib"

index

Usage: index(path)

De-duplicates chunks and writes an index file to path

Tracks chunks output by the containing chain and writes a list of their checksums to path, preserving order.

Note: index is special in that it's called at the end of the chain as well, with a reference to the chunk that entered the chain. That chunk must have its checksum assigned, otherwise the chain's output chunks can't be tracked properly. As a consequence, the main chain couldn't look like split | checksum | index - because the seed chunk doesn't have a checksum before split. Rather: split | { checksum | index - }.
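
To illustrate the note above:

# Wrong: index sits in the main chain, whose input (the seed chunk)
# has no checksum:
"split | checksum | index -"

# Right: index sits in a subchain whose input (each chunk output by
# split) gets its checksum assigned first:
"split | { checksum | index - }"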

  • args:
    • path (string) path to index file, or - for stdout
  • type: passthrough
  • requires: checksum

Ex:

"split | { checksum | index - | cp my_dir }"

Checksums should generally be placed twice in a chain: an initial checksum before index, ahead of the first producer proc, to detect duplicates; and a final checksum after the last producer proc.

Ex:

"split | { checksum | index - | gzip | checksum }"

Important: Some commands are not idempotent, such as cmd gpg -e. Two identical chunks encrypted by this proc, though decrypted back to identical original data, will result in different encrypted data; their checksums will differ as well, so the output chunk is considered new and re-written/uploaded. To prevent this behaviour, place the final checksum before it:

"split | { checksum | index - | gzip | checksum | cmd gpg --batch -e -r 00828C1D }"

uindex

Usage: uindex()

Reads the index from a chunk's data

Returns empty chunks (no data) with their checksum and target size assigned, ready for data retrieval by following procs.

  • type: producer

Ex:

"uindex | ucp my_dir | uchecksum | join -"

backlog

Usage: backlog(nslots proc)

Limits the number of concurrent instances of proc to nslots at a time

Since procs are non-blocking, it is highly recommended to wrap a proc immediately following split or uindex with backlog as chunks usually come out of them faster than they get processed by the rest of the chain, causing goroutines to be spawned uncontrollably. Without backlog, expect high memory usage, "too many open files" errors, etc.

To serialize the execution of a proc, pass 1 as nslots. This is the equivalent of a mutex, ensuring only a single instance runs at a time.

If proc is a chain, concurrency may be further limited by nesting backlogs within it.

  • args:
    • nslots (int) max number of instances
    • proc (proc)
  • type: delegator

Ex:

# integrity check with 8 workers:
"uindex | backlog 8 uchecksum"

# writing ordered chunks requires a mutex:
"split | backlog 1 { sort write - }"

# ...equivalent of:
"split | join -"

# process with 8 workers, write 4 files at a time:
"split | backlog 8 checksum | backlog 4 cp(my_dir)"

# ...or:
"split | backlog 8 { checksum | backlog 4 cp(my_dir) }"

checksum

Usage: checksum()

Computes and assigns checksums

  • type: mutator

Ex: see index

uchecksum

Usage: uchecksum()

Integrity check

  • type: passthrough
  • requires: checksum

Ex:

"uindex | ucp my_dir | uchecksum"

gzip

Usage: gzip()

Compresses data in gzip format

  • type: producer

Ex:

"gzip | checksum | cp my_dir"

ugzip

Usage: ugzip()

Uncompresses data compressed by gzip

  • type: producer

Ex:

"ucp my_dir | uchecksum | ugzip"

parity

Usage: parity(ndata nparity)

Reed-Solomon erasure coding

Splits chunks into ndata data shards and nparity parity shards for error correction.

  • args:
    • ndata (int) number of data shards
    • nparity (int) number of parity shards
  • type: producer

Ex:

"parity 2 1 | checksum"

uparity

Usage: uparity(ndata nparity)

Joins chunks split by parity into the original bigger chunk, recovering from errors (failed integrity check, missing data)

  • args: see parity
  • type: producer
  • requires: checksum, group (ndata + nparity)

Ex:

"uchecksum | group 3 | uparity 2 1"

group

Usage: group(size)

Aggregates size contiguous chunks into one for procs that work with fixed-sized groups of chunks

For instance, parity(2 1) creates 3 shard chunks from one original and uparity needs those 3 grouped together to recreate the original. Use group before uparity: see example.

  • args:
    • size (int) group size
  • type: producer

Ex:

"group 3 | uparity 2 1"

cmd

Usage: cmd(name arg...)

Filters a chunk's data through a command

  • args:
    • name (string) command executable name: relative to $PATH or absolute path
    • 0..n arg (string) command arguments
  • type: producer
  • stdin ← chunk data
  • stdout → chunk data

Ex:

"cmd gpg --batch --encrypt -r 00828C1D"
"cmd gpg --batch --decrypt"

cmdin

Usage: cmdin(name arg...)

Runs a command using a chunk's data as stdin

  • args: see cmd
  • type: passthrough
  • stdin ← chunk data
  • stdout → (discarded)

Ex:

"cmdin tee /tmp/out"
"cmdin ssh bankmon dd of=/tmp/out"

cmdout

Usage: cmdout(name arg...)

Runs a command to produce new data

  • args: see cmd
  • type: producer
  • stdin ← (none)
  • stdout → chunk data

Ex:

"cmdout date | write - | cmdout echo Hello | write -"

concur

Usage: concur(max dynproc)

Feeds chunks to procs returned by dynproc, running only max of them at a time, concurrently

  • args:
    • max (int) max number of instances
    • dynproc (dynproc)
  • type: delegator

Ex:

# one transfer at a time:
"concur 1 mincopies(2
	a=scp(bankmon:tmp/a)
	b=rclone(drive:tmp/b)
)"

multireader

Usage: multireader(copier...)

Retrieves data from copiers, randomly alternating between them and cascading on error (failover)

  • args:
    • 0..n copier (copier)
  • type: delegator

Ex:

"multireader(
	a=rclone(drive:tmp/a)
	b=scp(bankmon tmp/b)
)"

sort

Usage: sort()

Sorts chunks by their original order

Since procs are non-blocking, chunks get out of order as they advance through a chain. But order matters when re-assembling them into the original stream. sort buffers chunks until it has a contiguous series, then returns them in order.

  • type: passthrough

Ex:

"sort | write -"

write

Usage: write(path)

Writes a chunk's data to path

  • args:
    • path (string) path to write to, or - for stdout
  • type: passthrough

Ex:

"write -"

join

Usage: join(path)

Joins chunks' data in their original order, writing the concatenation to path. Short for backlog 1 { sort | write path }.

  • args: see write
  • type: passthrough

Ex:

"uindex | ucp my_dir | uchecksum | join -"

cp, ucp, rclone, etc.

Every store "foo" is also available as two procs:

  • "foo" (write)
  • "ufoo" (read)

See corresponding stores

Ex:

"rclone drive:tmp"
"urclone drive:tmp"

List of dynprocs

A dynproc is similar to a function that takes a chunk as input and returns a variable number of procs to process that chunk.

stripe

Usage: stripe(min excl copier...)

Striping and N-copies duplication

Ensures there exist at least min copies of each chunk among all given copiers, creating missing ones as needed. Chunks are striped across stores by interleaving them in a Round-Robin fashion.

If chunks are grouped with group, stripe may additionally guarantee that at least excl chunks within each group are put on stores distinct from the others'. This is required to guarantee recoverability from parity: any nparity stores may be lost while retaining the ability to recompute the original data from the remaining >= ndata shards.

On consecutive runs, existing copies will be reused as much as possible while meeting the min and excl requirements, making new copies as necessary to meet them. Returns an error if not possible, whether for lack of provided stores, or not enough of them available with quota left.

Stores are filled up to their quota and a little bit over due to concurrency during writes/uploads causing imprecision in calculation. In theory, quota overage may reach up to group size × max chunk size × concurrency.
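
As a rough illustration using values appearing elsewhere in this document (group 3, the default 8MiB max chunk size, backlog 8), overage could in theory reach 3 × 8MiB × 8 = 192MiB per store.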

  • args:
    • min (int) guarantee of minimum number of copies
    • excl (int) guarantee of minimum number of exclusive chunks within a group
    • 0..n copier (quotaRes)
  • type of returned procs: passthrough
  • requires: checksum, group (for excl > 0)

Ex:

# RAID 1: make 2 copies
"stripe(2 0
	a=scp(bankmon tmp/a)
	b=rclone(drive:tmp/b)=2gib
)"

# RAID 5: ensure exclusivity for ndata shards
"parity 2 1 | group 3 | stripe(1 2
	a=scp(bankmon tmp/a)
	b=rclone(drive:tmp/b)=2gib
	c=rclone(drive2:tmp/c)
)"

mincopies

Usage: mincopies(min copier...)

N-copies duplication

Idem stripe with no guarantee of exclusivity of chunks across stores. Short for stripe(min 0 copier...).

  • args: see stripe
  • type of returned procs: see stripe
  • requires: see stripe

Ex:

"mincopies(2
	a=scp(bankmon tmp/a)
	b=rclone(drive:tmp/b)=2gib
)"

List of stores

A store represents a storage facility, local or remote. It provides a proc for writes/uploads, another for reads (or downloads) and can list existing entries, such as files or objects in buckets.

Filenames are hexadecimal SHA256 checksum hashes (64 chars). Ex: aeef70b69d4e9dc8eb95bea114c4e992831e4185ec93145c4c893b5811079bea

rclone

Usage: rclone(remote)

Cloud storage via rclone command

Note: The remote must be already configured via rclone config.

  • args:
    • remote (string) name of remote and directory in the form of "<remote>:<dir>"
  • requires: checksum

Ex:

"rclone drive:tmp/backup"

cp

Usage: cp(dir level...)

Local filesystem storage in directory dir

If levels are specified, chunks are nested within subdirectories named after the leading characters of their checksum, one subdirectory of level characters per level (see the examples below).

  • args:
    • dir (string) path to directory
    • 0..n level (int) nesting levels
  • requires: checksum

Ex:

"cp path/to/foo"

# ...writes chunks to: (relative to dir)
fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1

"cp path/to/foo 4"

# ...writes to:
fd9f/fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1

"cp path/to/foo 3 2"

# ...writes to:
fd9/fe/fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1

scp

Usage: scp(host dir level...)

Remote file system storage via SSH

Requires the following GNU-compatible commands:

  • (local) ssh for remote command execution
  • (remote) dd for streaming file transfer
  • (remote) find for listing existing files

Note: ssh and helper commands are used for file transfer instead of scp or sftp because scp only takes path arguments, which would require writing many temp files. Instead, files are streamed through dd without buffering to disk. For listing, neither sftp nor ls are used either, due to the necessary path escaping and inflexible output formatting, which would have required error-prone parsing. With ssh + find, paths and file info are passed around in a manner that eliminates ambiguity: environment variables and NUL-separated strings.
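
A sketch of the kind of commands involved (illustrative only; the local file chunk is hypothetical and the exact invocations scat uses may differ):

# write a chunk by streaming through dd on the remote:
$ ssh bankmon "dd of=/tmp/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03" < chunk

# list existing chunks, NUL-separated:
$ ssh bankmon "find /tmp -type f -print0"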

  • args:
    • host (string) first argument to ssh: [user@]hostname
    • dir (string) path to directory
    • 0..n level (int) see cp
  • requires: checksum

Ex:

"scp bankmon /tmp"
"scp bankmon /tmp 4"

Other types

Arguments to above types

Note: In pairs separated by an equal sign, such as copier=limit, spaces are not allowed around =.
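
For instance:

# valid:
"b=rclone(drive:tmp/b)=2gib"

# invalid (spaces around =):
"b = rclone(drive:tmp/b) = 2gib"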

quotaRes

Quota resource, with or without quota limit

Format: copier=max or copier

  • args:
    • copier (copier)
    • max (bytes) default = unlimited

Ex:

"a=scp(bankmon tmp/a)"
"b=rclone(drive:tmp/b)=2gib"

copier

Format: id=store

  • args:
    • id (string) used for internal book-keeping such as quota and stats computation
    • store (store)

Ex:

"foo=rclone(drive:tmp/bar)"

string

Format: sequence of non-space characters

Ex:

"path/to/file"

bytes

Format: <int><unit>

Size in bytes

Ex:

1024MiB
1GiB
1000MiB
1gb

int

Format: numeric characters

Ex:

"123"
gentoc (generates the table of contents, 00_TOC.md, from this document's headings):

#!/usr/bin/env ruby
$stdin.each_line do |line|
  # match markdown headings of level 2 and deeper ("##", "###", ...)
  line.chomp =~ /^##(#*)\s*/ or next
  # nesting level from the extra #'s; heading text is what follows the match
  level, text = $1.size, $'
  # GitHub-style anchor: lowercase, spaces to dashes, strip other punctuation
  anchor = text.downcase.tr(" ", "-").gsub(/[^\w-]/, "")
  print "%s* [%s](#%s)\n" % ["\t" * level, text, anchor]
end

Makefile rule invoking it:

00_TOC.md: 01_PROCSTRING.md
	./gentoc < $< > $@