Renter Proposal - First 15 Sprints

This document is a proposed roadmap for the renter. It contains a lot of long-term code projects, and a path to get there incrementally. It is our goal to never refactor more than a small part of the code at a time, and to divide the code cleanly so that multiple people can easily work on it in parallel.

Sprint 1: Distributed Renter Abstraction

An important feature for moving towards large enterprise customers is a distributed renter, where multiple renters use the same filesystem and the same set of contracts. This allows larger systems to be set up with multiple nodes, multiple failure points (for CDN-type setups), and the ability to perform large jobs such as uploading and downloading in parallel.

In terms of roadmap, a distributed renter is 12-18 months away. However, we need to start preparing the renter to be distributed immediately, because if we do not write the other components of the renter in anticipation of a distributed system, we will have to re-write them later. Preparing to migrate to a distributed system early allows us to migrate to it faster when we finally do decide to implement it.

Right now we will be building an abstraction that is used by the filesystem and contract system components of the renter. This system will be called the 'coordinator', and will live in the package Sia/modules/renter/coordinator. All filesystem operations and contract system operations will be proposed to the coordinator, which will check with the other renters it is coordinating with and then execute the changes that make sense. If other renters propose changes, it will coordinate and may execute those changes as well.

Initially, the coordinator will be essentially pass-through. It will not actually be a distributed system when we implement it, it'll just execute the commands that are coming in. But we will make sure that every command we get could easily be translated to a distributed and fault tolerant renter system in the future.

The reason for doing this abstraction first is because programming around distributed systems requires a lot of planning and careful design, and we need to be in that mindset when we are building out the rest of the components. It's a limitation which is going to affect all of our other decisions, so we need to be making those decisions with a distributed system in mind immediately. And again, this first step is just writing the abstraction, the implementation itself will not actually be a full distributed system.
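
To make the intent concrete, below is a minimal sketch of what the initial pass-through coordinator could look like. All names and signatures here are illustrative assumptions, not a final API.

// Illustrative sketch only; names and signatures are assumptions.
// The real package would live at Sia/modules/renter/coordinator.

// A Proposal describes a filesystem or contract-system change that a local
// subsystem would like to execute.
type Proposal struct {
	Kind    string // e.g. "filesystem/addFile", "contracts/renew"
	Payload []byte // encoded change, interpreted by the executing subsystem
}

// A Coordinator decides whether a proposed change gets executed. The
// interface is shaped so that a future implementation can consult other
// renters before approving a change.
type Coordinator interface {
	// Propose submits a change and returns once it has been executed or
	// rejected (e.g. because a remote renter made a conflicting change).
	Propose(p Proposal) error
}

// passthrough is the initial single-renter implementation: every proposal
// is executed locally and immediately.
type passthrough struct {
	execute func(Proposal) error // supplied by the filesystem / contract system
}

func (c *passthrough) Propose(p Proposal) error {
	return c.execute(p)
}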

Sprint 2: Filesystem Abstraction

The filesystem is currently intermingled with the uploading, the downloading, and the contracting of the renter. This creates messy, co-dependent code that is difficult to cleanly modify without introducing bugs. Several bugs exist in the filesystem code which are hard to test for and track down, and when discovered it's difficult to be certain that you are fixing them without introducing new bugs.

We will be creating a filesystem abstraction in package Sia/modules/renter/filesystem that will handle all of the file operations for the renter. The upload loop, download loop, and work distributor will all interact with the filesystem through this interface instead of accessing its fields directly. This makes it far easier to change and update the filesystem, and also easier to change and update the uploader, downloader, and work distributor.

The existing filesystem will be translated into this abstraction, and then later upgraded.

The abstraction will have the following methods:

+ Add Folder
+ Delete Folder
+ View Folder // contains folders and files

+ Add File
+ Delete File
+ Modify File // overwrite, append, truncate all supported
+ View File   // contains posix metadata and a list of chunks

+ View Chunk   // contains a list of sectors
+ Modify Chunk // change the redundancy settings on this chunk
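
As a rough illustration of how this list could translate into a Go interface, here is a sketch; all type names and signatures are assumptions and would be finalized during the sprint.

// Illustrative sketch of Sia/modules/renter/filesystem; names and
// signatures are assumptions.

type FolderView struct {
	Folders []string // names of subfolders
	Files   []string // names of files in this folder
}

type FileView struct {
	Mode   uint32     // posix metadata
	Size   uint64     // posix metadata
	Chunks [][16]byte // chunk UIDs, in the order they appear in the file
}

type ChunkView struct {
	Sectors []string // the sectors (by host + root) holding this chunk's pieces
}

type FileSystem interface {
	AddFolder(path string) error
	DeleteFolder(path string) error
	ViewFolder(path string) (FolderView, error)

	AddFile(path string) error
	DeleteFile(path string) error
	ModifyFile(path string, offset int64, data []byte) error // overwrite, append, truncate
	ViewFile(path string) (FileView, error)

	ViewChunk(chunk [16]byte) (ChunkView, error)
	ModifyChunk(chunk [16]byte, dataPieces, parityPieces int) error // redundancy settings
}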

Reminder: most actions that interact with the filesystem will actually be interacting with the coordinator, which will check the action against any known updates from remote renters and then access the filesystem to get return values. The filesystem on each renter can be different, and can be overlapping, but user queries into the filesystem should query the global filesystem, not just the local one.

Sprint 3: Filesharing

Once both the filesystem and the contract system have been cleanly abstracted away, the path has been paved for adding filesharing (finally). We will be adding two methods to the filesystem - export and import. Export will take all of the metadata for a file or folder and roll it into a sharable file (similar to a .torrent file). That file can then be imported into other renters.

The exporting is quite straightforward. Importing requires a little extra work. The imported files need to be scanned to see what hosts are necessary to download the file. If the importer does not have contracts with enough hosts to perform the download, the user will have to confirm the creation of some 'download-only' file contracts that can be used to fetch the file.

We will need to create extensions to the API, the CLI, and the GUI to support filesharing.

A strong proposal for the filesharing format already exists, and some variation of that proposal can be used as the standard filesharing format within Sia.

Filesharing will prefer not to share any of the archive-grade hosts for the redundancy. When sharing a file, the user accepts a risk of deduplication. If sharing only 1.2x redundancy of the file, then only 0.2x of the redundancy can possibly be deduplicated away, which means the risk is minimal. When sharing, the user will have to be careful to always share the same hosts, so that repeated sharings of the file do not increase vulnerability to deduplication. The recommendation is to share the first 1.2x shards of the file.

Sprint 3a: Filesharing Based Backup Snapshots

This feature is to be based on the filesharing format. Instead of sharing just a single file or folder, you share the entire root directory with yourself. This shared file will be good for at least 6 weeks, and can be stored redundantly on a centralized cloud or backed up to offsite machines so that Sia data can be recovered in the event of local machine failure.

More complex filesystem backup schemes are planned; this is a simple one to get full backup support completed as soon as possible for Sia.

With this feature in place, all major features required to use Sia as a production backup system are complete.

Sprint 4: Transition the Filesystem to a Database

The filesystem needs to be moved to a database. This will allow for substantially better scaling properties. We will need an upgrade process from the current filesystem to the new database system.

The database will have 2 buckets:

Object Name -> UID
	+ Can be a folder
	+ Can be a file
UID -> Object
	+ Can be a folder
	+ Can be a file
	+ Can be a full-sector chunk

The object name is a filepath.

A UID is 16 random bytes that uniquely identify an object. The UID of the
root folder will always be 0x00000000000000000000000000000001.

A folder is an object with posix metadata and a list of contents. The
contents can be files and folders, just as in posix systems. The files and
folders are listed by name. When renaming a file or folder, the object for
the parent directory needs to also be updated. The root folder cannot be
renamed.

A file contains posix metadata, Sia metadata (like health information), and
a list of chunks by UID, in the order that they appear in the file. The chunk
list will contain a length in bytes for the size of each chunk, as different
chunks can be different sizes. The file does not need to indicate whether
the chunks are full-sector, partial-sector, or full-data chunks.

Full-sector chunks are chunks that are composed of full, 4 MiB sectors.
These chunks contain metadata indicating their redundancy settings, and then
a list of hosts and sectors for each of the pieces in the chunk.
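
A rough sketch of how these objects might be laid out is below; all field names are assumptions for illustration, and the actual encoding would be settled during this sprint.

// Illustrative sketch of the database objects; field names are assumptions.

// A UID uniquely identifies an object. The root folder is always
// 0x00000000000000000000000000000001.
type UID [16]byte

// folderObject: posix metadata plus contents listed by name.
type folderObject struct {
	Mode    uint32
	ModTime int64
	Folders map[string]UID // subfolder name -> UID
	Files   map[string]UID // file name -> UID
}

// chunkRef records a chunk's UID and its length in bytes, since different
// chunks can be different sizes.
type chunkRef struct {
	Chunk  UID
	Length uint64
}

// fileObject: posix metadata, Sia metadata (like health), and the ordered
// chunk list.
type fileObject struct {
	Mode    uint32
	ModTime int64
	Health  uint8      // Sia metadata, see the file health updater sprint
	Chunks  []chunkRef // in the order they appear in the file
}

// piece identifies one erasure-coded piece of a full-sector chunk.
type piece struct {
	HostPubKey string
	SectorRoot [32]byte // Merkle root of the 4 MiB sector
}

// fullSectorChunk: redundancy settings plus the pieces of the chunk.
type fullSectorChunk struct {
	DataPieces   int
	ParityPieces int
	Pieces       []piece
}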

This system has the advantage that individual chunks can have their redundancy settings updated on the fly without the file data changing at all. The file can be easily modified (appends, changes to data in the middle, or even insertions in the middle) without having to do many disk operations. Renaming is very easy: you just have to change the file name in its metadata, and then in the folder data that houses the file.

This system will scale very well. Adding a chunk or updating a chunk in a very large file still only requires writing one page in the database.

This filesystem has been designed to be easily extendable (explained in a later sprint) to small files + small chunks.

Sprint 5: Separate 'threadedQueueRepairs' from the Repair Loop

This sprint is going to break uploading for the renter. Uploading will not work again until sprints 6, 7, and 8 have also been completed, so we should work on all of these sprints as a team in one fast, focused effort. We should all be comfortable with the designs before we get started, and even have testing ready before we get started, so that we can finish as fast as possible.

'threadedQueueRepairs' is going to be broken out of the repair loop into its own file. Instead of using the in-memory file object to pass to the upload loop, it is going to break the file into a bunch of 'chunkRepair' objects.

chunkRepair objects will be laid out as follows:

type usableHost struct {
	uses int      // number of times this host can be used within a chunk. Usually 1.
	pubkey string // the string-encoded public key of the host
}

type pieceRepair struct {
	acceptedByWorker bool // Whether a worker has accepted the piece
	workersRemaining int  // The number of workers that have not tried to repair this piece

	chunk       *chunkRepair // The chunk for this piece
	data        []byte       // The erasure-coded, encrypted piece data
	usableHosts []usableHost
}

type chunkRepair struct {
	// TODO: Fields specifying the name + location of the file + chunk
	// TODO: Fields specifying erasure coding + encryption settings/keys
	// TODO: Fields specifying how many incomplete pieces exist

	usedHosts map[string]int // number of times each host has been used in this chunk
	pieces    []pieceRepair
}

Each piece independently specifies its own set of usable hosts so that complex CDN settings can be taken into account when this object is being built. Any complex settings should be handled by the queue repairs function, instead of happening inside the actual repair logic.

The chunkRepair object is sent to the coordinator. The coordinator then has the responsibility of knowing which renter to send the chunkRepair object to, based on which renters have files locally, which renters have available resources, and which renters are controlling which contracts. If no renters are ready yet for the chunk, the coordinator will block until a renter becomes ready.

The 'data' fields of the pieceRepair objects are not filled out until after the chunk has been given to a renter. The renter will fetch the data (from disk or from the Sia network), erasure code it, encrypt it, and then fill out the pieces.

A future feature will require 'threadedQueueRepairs' to be aware of when a chunk repair has completed, so there needs to be a function or channel that can be called/used when a chunk has completed. There will need to be sanity checks and verifications that confirm the code is properly reporting back every time a chunk has completed.

Sprint 6: Coordinator Renter Selection

Initially the coordinator will only be coordinating over a single renter. The renter selection function should be kept separate in a way that makes it easy to rewrite entirely when support for multiple renters is added.

The coordinator will keep track of how much memory the single renter is using, and it will only assign a chunk to the renter when the renter has enough memory available. For this sprint, memory is all that matters. The default will be 500 MiB of memory per renter, and it should be a configurable value via renter settings in the API.

We will need to make sure that when the renter completes a chunk, it reports back to the coordinator so that the coordinator can update the amount of memory available. Enough logging and sanity checking needs to be in place that we can easily detect and resolve errors where either the renter or the coordinator is not properly handling the memory management.

The same pool of memory will be used to assign downloads when the download system gets upgraded. Before downloads can be added to the memory pool, there needs to be some way to choose between multiple potential jobs (uploads + downloads) when memory becomes available.

Edge case: if the total size of the memory pool is less than the minimum size needed to download or repair a chunk, the renter will block until all memory has been returned, and then go ahead with that chunk, ensuring progress even if the settings do not technically allow for it.
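
A minimal sketch of the coordinator's memory accounting for a single renter, including the edge case above, is shown below. Names are assumptions, and it presumes the sync package is imported.

// Illustrative sketch only; names are assumptions.
type memoryPool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	total     uint64 // defaults to 500 MiB, configurable via the API
	available uint64
}

func newMemoryPool(total uint64) *memoryPool {
	mp := &memoryPool{total: total, available: total}
	mp.cond = sync.NewCond(&mp.mu)
	return mp
}

// Request blocks until 'amount' bytes are available. If 'amount' exceeds the
// total pool size, it waits until the pool is completely idle and then
// proceeds anyway, so that progress is always possible.
func (mp *memoryPool) Request(amount uint64) {
	mp.mu.Lock()
	defer mp.mu.Unlock()
	if amount > mp.total {
		for mp.available != mp.total {
			mp.cond.Wait()
		}
		mp.available = 0
		return
	}
	for mp.available < amount {
		mp.cond.Wait()
	}
	mp.available -= amount
}

// Return is called when the renter reports that a chunk has completed.
func (mp *memoryPool) Return(amount uint64) {
	mp.mu.Lock()
	defer mp.mu.Unlock()
	mp.available += amount
	if mp.available > mp.total {
		// The oversized edge case above, or an accounting bug; log and clamp.
		mp.available = mp.total
	}
	mp.cond.Broadcast()
}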

Sprint 7: Piece Data Fetcher

The renter is now receiving 'chunkRepair' objects when it is doing repairs. The coordinator is responsible for making sure that there is enough memory for the renter to perform the upload (and a corresponding download if needed), so it is safe to asynchronously fetch, erasure code, and encrypt the piece data. Whether the data is being fetched from the Sia network or from disk, it should be done in a separate thread. The chunkRepair object should be filled out to have the piece data added, and then the completed chunkRepair object can be passed to the upload workers.
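
A sketch of how this could look is below; the helper methods (fetchChunkData, erasureCodeChunk, encryptPiece, and the reporting/queueing calls) are hypothetical names for illustration, not existing renter functions.

// Illustrative sketch only; all helper methods are hypothetical.
func (r *Renter) managedFetchAndQueueChunk(cr *chunkRepair) {
	go func() {
		// The coordinator has already reserved memory for this chunk, so it
		// is safe to expand the data here.
		data, err := r.fetchChunkData(cr) // from disk, or from the Sia network
		if err != nil {
			r.managedReportChunkFailure(cr, err)
			return
		}
		shards := r.erasureCodeChunk(cr, data)
		for i := range cr.pieces {
			cr.pieces[i].data = r.encryptPiece(cr, i, shards[i])
		}
		// Every piece now has its data; hand the chunk to the upload workers.
		r.managedQueueChunkForUpload(cr)
	}()
}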

Sprint 8: New Worker Dynamics

The new worker upload dynamics are vastly simplified compared to the existing upload dynamics, but they are still complicated, and have multiple points where inconsistency or buggy code could cause the whole component to choke. There needs to be plentiful testing, sanity checks, and progress verification to make sure that the code is running like clockwork.

Once the piece has been fully filled out, data included, it is put into a linked list of pieces that represent unfinished work. 'workersRemaining' for the pieceRepair object is set equal to the number of active workers in the renter.

Pieces are added at the tail of the linked list. Workers start at the head of the linked list, and scroll forward through the linked list one piece at a time. The linked list has a global lock that the workers must hold when walking through the linked list. When they get to the next piece, they pull the piece out of the linked list and mark 'acceptedByWorker' as true. Other workers may have pieces pointing to this piece, so its forward pointer must remain unchanged.

The worker will only accept the piece for work if it is in the usableHosts list for the piece, and also the number of 'uses' for that host is greater than the number of times the host has been used by the whole chunk. If the worker cannot accept the piece, it will decrement 'workersRemaining' for that piece and move on to the next piece. If the 'workersRemaining' value for the piece hits '0', this means that this is actually the tail of the linked list, and that all workers have unsuccessfully scrolled past the piece. The piece needs to be marked as completed in the chunk (even though no piece was actually completed), and then deleted from the linked list.

When the worker accepts a piece, it will add itself to the 'usedHosts' map of the corresponding chunk, even before it knows whether the upload will be successful. It will then try to upload the piece to its corresponding host. If the upload succeeds, it will mark the piece as completed. If the upload fails, the piece's 'workersRemaining' value is reset and the piece is put at the head of the linked list again, so that all the workers get another chance to scroll past it and try to upload it. Because the worker has already incremented its count in the chunk's 'usedHosts', it will not try this piece again (unless the piece lists this host multiple times).

When a piece is marked as 'completed' in the chunk, the chunk needs to be checked for completion. If all pieces for the chunk have been completed, the chunk needs to report back to the coordinator, indicating which pieces were successful, which hosts they ended up getting uploaded to, and which pieces failed. The coordinator then needs to report back to the repair scanner (future sprint) to indicate that the chunk repair has completed (whether or not it was fully successful), so that the repair scanner can update itself.

When a worker gets to the end of the linked list, it will wait on a sync.Cond for a broadcast signal that more pieces have been added. When more pieces do get added, they need to issue a broadcast on the sync.Cond.
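
As a sketch of the worker side of these dynamics, the loop below walks the shared list. The pieceQueue type and its methods (next, remove, pushHead, activeWorkers), the worker fields and methods (canUpload, upload, hostPubKey), and the chunk's managedMarkPieceDone are all hypothetical plumbing standing in for the linked list described above.

// Illustrative sketch only; the queue, worker, and chunk helpers are
// hypothetical stand-ins for the linked list described above.
func (w *worker) threadedProcessUploads(q *pieceQueue) {
	q.mu.Lock()
	for {
		p := q.next(w) // the next piece this worker has not yet considered
		if p == nil {
			q.cond.Wait() // broadcast fires when new pieces are added to the tail
			continue
		}
		if !w.canUpload(p) {
			// This worker's host is not usable for the piece.
			p.workersRemaining--
			if p.workersRemaining == 0 {
				// Every worker has scrolled past without taking it.
				p.chunk.managedMarkPieceDone(p, false)
				q.remove(p)
			}
			continue
		}
		// Accept the piece: reserve the host in the chunk before uploading.
		p.acceptedByWorker = true
		p.chunk.usedHosts[w.hostPubKey]++
		q.remove(p)

		q.mu.Unlock()
		err := w.upload(p) // network call, done without holding the list lock
		q.mu.Lock()

		if err != nil {
			// Give the other workers another chance at this piece.
			p.workersRemaining = q.activeWorkers()
			q.pushHead(p)
		} else {
			p.chunk.managedMarkPieceDone(p, true)
		}
	}
}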

The end of this sprint marks the completion of the breaking changes to the upload code. After this sprint is completed, the upload system should be ready for release and should be a lot more stable.

Sprint 9: File Health Updater

The current file health monitor scans through the whole filesystem every few minutes and checks for any files that have lost health. This is really inefficient, and will become even more taxing after we migrate from an all-in-memory filesystem to a database based filesystem.

The filesystem will be extended with three new fields:

+ A health rating for the file.
+ A 'last-checkup' timestamp indicating when the file's health was last monitored.
+ A field indicating which renters (if any) in the distributed cluster of renters have the file locally. A renter that has the file locally can do repairs very cheaply; if nobody has the file locally, repairs can still be performed but are more costly, and therefore will only be performed for files in a worse health state. This field is not guaranteed to be complete; it may be missing some renters.

Folders will have a health score which is the minimum of all their files and subfolders, and a 'last-checkup' timestamp which is the least recent of all their files and subfolders. A folder will only list a renter as containing it locally if that renter has all of the folder's files and subfolders locally; if a renter is missing even a single file, it is not listed.

The new file health updater will depth-first-search the folder hierarchy of the filesystem, favoring the folders that have not been checked recently, and updating the health field for the files. This will significantly reduce the amount of unnecessary scanning performed by the renter.

Health is broken into multiple grades, which are signals to the file repair scanner. The repair scanner will favor repairing files that are in the poorest health. Health scores span from 0 to 100:

+ 0: File has been lost and cannot be recovered without human intervention.
+ 1-9: File is in urgent condition and may become lost without immediate action.
+ 10-89: File has moderate redundancy, but needs repairs soon.
+ 90-99: File is not at full redundancy, but also does not need repairs soon.
+ 100: File is at full redundancy.
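
As a small illustration, a function mapping a health score to its grade might look like the sketch below; the grade names themselves are assumptions.

// Illustrative sketch; the grade names are assumptions.
func healthTier(health int) string {
	switch {
	case health <= 0:
		return "lost"     // cannot be recovered without human intervention
	case health < 10:
		return "urgent"   // may become lost without immediate action
	case health < 90:
		return "degraded" // moderate redundancy, needs repairs soon
	case health < 100:
		return "stable"   // below full redundancy, but no repairs needed soon
	default:
		return "healthy"  // full redundancy
	}
}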

The file health updater will start running at startup, and continue running until every file has been scanned recently. What counts as 'recent' is a user-configurable variable defaulting to 72 hours. Because the file health updater does not actually issue any work, it can safely run continuously until its job is complete. Upon finishing the scan, it'll sleep until the next scan is required.

Once the coordinator is extended to support multiple renters, each renter can focus on a different directory of the global filesystem, allowing the work for health scanning to be split up efficiently.

Sprint 10: File Repair Scanner

The file repair scanner will depth-first-search the filesystem, following folders that have the lowest health scores, and then it will issue work to the coordinator indicating what repair operations need to be performed. Work is issued on a per-chunk basis, meaning a single file can have a large number of repair items for the coordinator.

The health of the files is not updated when work is issued, it is updated when work is returned. The file repair scanner will need to keep in memory a list of files and folders that it has visited on the most recent iteration of repair scanning. It will not revisit a file or folder until the coordinator signals that it has finished all repair jobs submitted for that file or folder.

The repair job itself is a list of pieces that need to be uploaded to get to each health tier. Meaning, if the file is at '10' health (third tier), there will be a list of pieces that need to be repaired to reach a health >90, and a second list of pieces that need to be repaired to reach a health of 100. The coordinator will decide how many health brackets to move the file up based on the availability of resources and the number of files that are at each health bracket. In addition to the lists of pieces, the repair job will also have a list of hosts which are acceptable for repair in each bracket, sorted by how desirable they are (if a host is undesirable, it will not be listed at all). This means that the repair scanner does all the heavy lifting for things like ensuring geographic diversity, hitting minimum latency and throughput requirements, etc.
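
A rough sketch of the repair job object is below; all field names are assumptions.

// Illustrative sketch; field names are assumptions.
type repairBracket struct {
	targetHealth int   // e.g. 10, 90, or 100
	pieceIndices []int // pieces within the chunk that must be uploaded to reach targetHealth

	// Hosts acceptable for repairs in this bracket, sorted from most to
	// least desirable; undesirable hosts are omitted entirely.
	acceptableHosts []string
}

type repairJob struct {
	// TODO: fields identifying the file and chunk being repaired.

	brackets []repairBracket // one entry per health bracket the chunk is below
}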

The amount of memory consumed for a repair job is tiny, so the coordinator can afford to have hundreds or even thousands of repair jobs queued up. Having a big queue allows the coordinator to allocate resources with maximum efficiency, ensuring that the full strength of the renter cluster is active at all times while doing repairs. Though there is room for potentially hundreds or thousands of simultaneous repair jobs in the queue, it is still a finite number. If the coordinator runs out of queue space, it will block when the renter tries to submit more work. This block will halt the repair scanner until repair jobs begin completing, and ensure that the repair scanner never allocates too much work at once.

When scheduling repairs, the repair scanner will have two types of repairs. Files that are below a health of 10 will be repaired to a health of 10, and not higher. This ensures that if there are many files below a health of 10, all resources are spent getting files out of the 'urgent' tier. If all files are above 10 health, then repair jobs will take the files from 10 health to 100 health, because throughput is best and resource usage is most efficient when doing a full repair in one pass. Files that are above 90 health will only be repaired if one of the renters in the coordinator system has the file locally; otherwise the repair scanner will wait until the file is below 90 health to issue a repair.

There is some cross-object concurrency happening that needs to be carefully managed. The first is that the coordinator can block a submission of work, which is intended to freeze the entire repair scanner, but is not intended to freeze the coordinator. The second is that the repair scanner keeps a list of jobs it has submitted, and the coordinator must remember to signal when each job completes; otherwise the repair scanner will build up a large amount of memory use, and it will not go back and revisit files which may have only been partially repaired.

Sprint 11: Host Throughput And Latency Measuring

Currently, worker throughput and latency are not measured when uploading and downloading. We don't really know how fast each host is, which means we can't optimize by selecting the faster hosts for priority files.

The hostdb needs to be upgraded to support tracking host speed and throughput, and the upload and download functions for a host need to return speed and throughput values when the operation is complete, so that the database can be updated accordingly.

These values will help significantly improve the download speeds in sprint 12.

Sprint 12: Download System Upgrade

The download system will be upgraded to match the upload system, where a chunk issues a stack of work in the form of a bunch of pieces that workers can pick up.

Just like uploads, the memory required for performing downloads needs to be allocated ahead of time. Workers will fetch the pieces one-by-one. Just as upload pieces have a list of applicable workers, download pieces also have a list of applicable workers.

Because memory cannot be released until a full download is complete, a single slow worker can bottleneck a full 40 MiB of memory, which can be a big deal. Fast workers may not be able to perform downloads because slow workers are hogging all of the memory.

To mitigate this, we will make several stacks of pieces instead of just one. Each stack has a different performance tier. Initially there will just be a fast stack and a slow stack, but eventually we can expand to a larger number of stacks.

Workers are limited in which stacks they can use based on performance. The slowest 50% of workers are only allowed to take pieces from the slow stack. The fastest 50% of workers can take pieces from either the fast stack or the slow stack; however, they will only get work from the slow stack if the fast stack is empty.

Chunks are also broken into two types (and eventually more): normal chunks and priority chunks. Normal chunks will wait to be added to a stack until there are idle workers, at which point they will be added to the stack with idle workers. More often than not, this will be the fast stack. Priority chunks are always added to the fast stack, but will wait in line patiently behind normal chunks, and won't get added to a stack until all the normal chunks ahead of them have been added to a stack.
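
A sketch of how a worker could select its next download piece under this scheme is below; the pieceStack and downloadPiece types and the isFast check are hypothetical.

// Illustrative sketch; pieceStack, downloadPiece, and isFast are hypothetical.
func (w *worker) nextDownloadPiece(fast, slow *pieceStack) *downloadPiece {
	if w.isFast() {
		// The fastest 50% of workers prefer the fast stack and only fall
		// back to the slow stack when the fast stack is empty.
		if p := fast.pop(); p != nil {
			return p
		}
		return slow.pop()
	}
	// The slowest 50% of workers are restricted to the slow stack.
	return slow.pop()
}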

The reason for this more elaborate setup is to prevent fast workers with available bandwidth from being bottlenecked by slower workers who are consuming all of the memory. If there is lots of work to do, the fast workers will always be working together on it, meaning the bottleneck will be set by your median worker instead of your slowest worker. The slow workers are only passed work if there are no idle fast workers, yet there are low-priority pieces that need to be transferred.

The coordinator, when queuing downloads, is going to have to be careful about how it marks chunks as normal vs. priority. Generally, only the last 2 or 3 chunks of a file need to be marked as priority, and the other chunks can all be marked as normal. This prevents the final bits of a download from taking a long time due to a few slow hosts; only fast hosts are allowed to handle the end, which means the individual file should be available faster.

Sprint 13: Minimized Erasure Coding Sizes

Currently the erasure coding happens on 4 MiB pieces, meaning that the total amount of memory that needs to be allocated to perform erasure coding or recovery is piecesPerChunk * 4 MiB, which is a full 200 MiB for most chunks. If instead of doing the erasure coding on the full 4 MiB pieces we break pieces into 4 KiB sub-pieces and erasure code those, we can perform the operation with a peak memory usage of about 50 MiB. This means there will be more memory available for other operations, and the upload/download processes will not seem so incredibly memory intensive.
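
A sketch of the interleaved encoding is below. It assumes a Reed-Solomon encoder with an Encode([][]byte) method (for example the github.com/klauspost/reedsolomon API), that chunkData has been padded to a multiple of dataPieces * 4 KiB, and that writeSubShard is a hypothetical callback that streams each encoded sub-shard out to its destination piece.

// Illustrative sketch; assumes chunkData is padded to a multiple of
// dataPieces*subShardSize and that writeSubShard streams sub-shards out.
const subShardSize = 4096 // 4 KiB

func encodeChunkInterleaved(enc reedsolomon.Encoder, dataPieces, parityPieces int,
	chunkData []byte, writeSubShard func(piece, segment int, subShard []byte) error) error {

	shards := make([][]byte, dataPieces+parityPieces)
	for i := dataPieces; i < len(shards); i++ {
		shards[i] = make([]byte, subShardSize) // small, reused parity buffers
	}
	segmentSize := dataPieces * subShardSize
	for seg := 0; seg*segmentSize < len(chunkData); seg++ {
		// The data sub-shards are just slices into the existing chunk buffer,
		// so no extra 4 MiB pieces are ever materialized.
		for i := 0; i < dataPieces; i++ {
			start := seg*segmentSize + i*subShardSize
			shards[i] = chunkData[start : start+subShardSize]
		}
		if err := enc.Encode(shards); err != nil {
			return err
		}
		for i, s := range shards {
			if err := writeSubShard(i, seg, s); err != nil {
				return err
			}
		}
	}
	return nil
}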

Sprint 14: Global Bandwidth Ratelimiter

The global bandwidth ratelimiter (gbr) is user-configured and restricts upload and download speeds, as well as overall bandwidth caps, for a given renter machine. The coordinator is aware of the limits of all renter machines and of their progress towards those limits.

By default, the gbr will allow spurts of up to 3 seconds at full speed to gauge the full capabilities of the connection (once every few minutes), and then set the limits to 90% of those speeds. Users can set speeds in terms of percentages, and also in terms of absolute bytes per second. If the user chooses an absolute speed, the gbr will not allow spurts that test the overall connection speed.

The gbr will function as a stateful wrapper around a conn object. Instead of calling net.Dial, connections with hosts will have to use gbr.LimitedDial, which will have a timer that reads and writes data in 64kb segments, and then sleeps between reads or writes to keep the total bandwidth in line with the desired limits.
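
A sketch of the limited connection is below; the GBR type, its currentWriteLimit method, and the exact segment size are assumptions (it also presumes the net and time packages are imported and that the limit is always positive). A matching Read wrapper would follow the same pattern.

// Illustrative sketch; GBR, currentWriteLimit, and the segment size are
// assumptions.
const writeSegmentSize = 64 * 1024

type limitedConn struct {
	net.Conn
	bytesPerSecond int64 // assumed to be > 0
}

// Write sends the data in 64 KiB segments, sleeping between segments so that
// long-run throughput stays at or below bytesPerSecond.
func (lc *limitedConn) Write(b []byte) (int, error) {
	written := 0
	for written < len(b) {
		end := written + writeSegmentSize
		if end > len(b) {
			end = len(b)
		}
		n, err := lc.Conn.Write(b[written:end])
		written += n
		if err != nil {
			return written, err
		}
		// Sleep long enough that this segment averages out to the limit.
		time.Sleep(time.Duration(n) * time.Second / time.Duration(lc.bytesPerSecond))
	}
	return written, nil
}

// LimitedDial replaces net.Dial for host connections.
func (g *GBR) LimitedDial(network, addr string) (net.Conn, error) {
	conn, err := net.Dial(network, addr)
	if err != nil {
		return nil, err
	}
	return &limitedConn{Conn: conn, bytesPerSecond: g.currentWriteLimit()}, nil
}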

The gbr itself is a separate package, is created by the server, and is passed to the modules as a dependency. The same gbr can be passed to multiple modules, allowing the user to have greater control over the resource usage of siad.

Sprint 15: Extend the Filesystem Database to Support Small Chunks + Files

The filesystem database can be extended to support small chunks (and therefore small files) by adding a few extra object types to the database. This will not be done as a part of Sprint #4, but here are the object types:

+ a partial-sector chunk
+ a packed sector
+ a full-data chunk

A partial-sector chunk is a chunk that has pieces which are smaller than
4 MiB. Instead of these pieces being identified by a host+sector hash, they
are identified by a UID pointing to a packed sector, and an offset
indicating where in the packed sector the piece is.

A packed sector contains a bunch of chunk pieces from different chunks. The
pieces are stored packed together into a single sector that is uploaded to a
host. The packed sector needs a bunch of Merkle roots (up to log(n) roots
per piece) so that as pieces get added and deleted, it can easily recompute
the roots.

A full-data chunk is for chunks that are so small that the overhead for
storing them in a packed sector (about 40 bytes per piece) is close to or
greater than the size of the data. Instead of using packed sectors, the full
data is actually just stored as the chunk, and the Sia network is not used
at all. This method is typically only needed for files smaller than 8 KiB.

Finally, we need one more bucket which lists the gaps in all of the packed sectors. When a new partial-sector chunk is added to the filesystem, the gaps bucket is used to figure out if a new packed sector needs to be created, or if existing packed sectors have enough space to house the new pieces. The keys for this bucket are host ids (their pubkeys), and the values are a list of gaps in packed sectors for that host.

HostID -> Packed Sector Gaps
