Renter Proposal - First 15 Sprints

This document is a proposed roadmap for the renter. It contains a lot of long-term code projects, and a path to get there incrementally. It is our goal to never refactor more than a small part of the code at a time, and to divide the code cleanly so that multiple people can easily work on it in parallel.

Sprint 1: Distributed Renter Abstraction

An important feature for moving towards large enterprise customers is a distributed renter, where multiple renters use the same filesystem and the same set of contracts. This allows larger systems to be set up with multiple nodes, multiple failure points (for CDN-type setups), and the ability to perform large jobs such as uploading and downloading in parallel.

In terms of roadmap, a distributed renter is 12-18 months away. However, we need to start preparing the renter to be distributed immediately, because if we do not write the other components of the renter in anticipation of a distributed system, we will have to re-write them later. Preparing to migrate to a distributed system early allows us to migrate to it faster when we finally do decide to implement it.

Right now we will be building an abstraction that is used by the filesystem and contract system components of the renter. This system will be called the 'coordinator', and will live in the package Sia/modules/renter/coordinator. All filesystem operations and contract system operations will be proposed to the coordinator, which will check with the other renters it is coordinating with and then execute the changes that make sense. If other renters propose changes, it will coordinate and may execute those changes as well.

Initially, the coordinator will be essentially pass-through. It will not actually be a distributed system when we implement it, it'll just execute the commands that are coming in. But we will make sure that every command we get could easily be translated to a distributed and fault tolerant renter system in the future.

The reason for doing this abstraction first is because programming around distributed systems requires a lot of planning and careful design, and we need to be in that mindset when we are building out the rest of the components. It's a limitation which is going to affect all of our other decisions, so we need to be making those decisions with a distributed system in mind immediately. And again, this first step is just writing the abstraction, the implementation itself will not actually be a full distributed system.
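
To make the intent concrete, below is a minimal sketch of what the initial pass-through coordinator could look like. All names and signatures here are illustrative assumptions, not a final API.

// Illustrative sketch only; names and signatures are assumptions.
// The real package would live at Sia/modules/renter/coordinator.

// A Proposal describes a filesystem or contract-system change that a local
// subsystem would like to execute.
type Proposal struct {
	Kind    string // e.g. "filesystem/addFile", "contracts/renew"
	Payload []byte // encoded change, interpreted by the executing subsystem
}

// A Coordinator decides whether a proposed change gets executed. The
// interface is shaped so that a future implementation can consult other
// renters before approving a change.
type Coordinator interface {
	// Propose submits a change and returns once it has been executed or
	// rejected (e.g. because a remote renter made a conflicting change).
	Propose(p Proposal) error
}

// passthrough is the initial single-renter implementation: every proposal
// is executed locally and immediately.
type passthrough struct {
	execute func(Proposal) error // supplied by the filesystem / contract system
}

func (c *passthrough) Propose(p Proposal) error {
	return c.execute(p)
}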

Sprint 2: Filesystem Abstraction

The filesystem is currently intermingled with the uploading, the downloading, and the contracting of the renter. This creates messy, co-dependent code that is difficult to cleanly modify without introducing bugs. Several bugs exist in the filesystem code which are hard to test for and track down, and when discovered it's difficult to be certain that you are fixing them without introducing new bugs.

We will be creating a filesystem abstraction in package Sia/modules/renter/filesystem that will handle all of the file operations for the renter. The upload loop, download loop, and work distributor will all interact with the filesystem through this interface instead of accessing its fields directly. This makes it far easier to change and update the filesystem, and also easier to change and update the uploader, downloader, and work distributor.

The existing filesystem will be translated into this abstraction, and then later upgraded.

The abstraction will have the following methods:

+ Add Folder
+ Delete Folder
+ View Folder // contains folders and files

+ Add File
+ Delete File
+ Modify File // overwrite, append, truncate all supported
+ View File   // contains posix metadata and a list of chunks

+ View Chunk   // contains a list of sectors
+ Modify Chunk // change the redundancy settings on this chunk
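
As a rough illustration of how this list could translate into a Go interface, here is a sketch; all type names and signatures are assumptions and would be finalized during the sprint.

// Illustrative sketch of Sia/modules/renter/filesystem; names and
// signatures are assumptions.

type FolderView struct {
	Folders []string // names of subfolders
	Files   []string // names of files in this folder
}

type FileView struct {
	Mode   uint32     // posix metadata
	Size   uint64     // posix metadata
	Chunks [][16]byte // chunk UIDs, in the order they appear in the file
}

type ChunkView struct {
	Sectors []string // the sectors (by host + root) holding this chunk's pieces
}

type FileSystem interface {
	AddFolder(path string) error
	DeleteFolder(path string) error
	ViewFolder(path string) (FolderView, error)

	AddFile(path string) error
	DeleteFile(path string) error
	ModifyFile(path string, offset int64, data []byte) error // overwrite, append, truncate
	ViewFile(path string) (FileView, error)

	ViewChunk(chunk [16]byte) (ChunkView, error)
	ModifyChunk(chunk [16]byte, dataPieces, parityPieces int) error // redundancy settings
}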

Reminder: most actions that interact with the filesystem will actually be interacting with the coordinator, which will check the action against any known updates from remote renters and then access the filesystem to get return values. The filesystem on each renter can be different, and can be overlapping, but user queries into the filesystem should query the global filesystem, not just the local one.

Sprint 3: Filesharing

Once both the filesystem and the contract system have been cleanly abstracted away, the path has been paved for adding filesharing (finally). We will be adding two methods to the filesystem - export and import. Export will take all of the metadata for a file or folder and roll it into a sharable file (similar to a .torrent file). That file can then be imported into other renters.

The exporting is quite straightforward. Importing requires a little extra work. The imported files need to be scanned to see what hosts are necessary to download the file. If the importer does not have contracts with enough hosts to perform the download, the user will have to confirm the creation of some 'download-only' file contracts that can be used to fetch the file.

We will need to create extensions to the API, the CLI, and the GUI to support filesharing.

A strong proposal for the filesharing format already exists, and some variation of that proposal can be used as the standard filesharing format within Sia.

Filesharing will prefer not to share any of the archive-grade hosts for the redundancy. When sharing a file, the user accepts a risk of deduplication. If sharing only 1.2x redundancy of the file, then only 0.2x of the redundancy can possibly be deduplicated away, which means the risk is minimal. When sharing, the user will have to be careful to always share the same hosts, so that repeated sharings of the file do not increase vulnerability to deduplication. The recommendation is to share the first 1.2x shards of the file.

Sprint 3a: Filesharing Based Backup Snapshots

This feature is to be based on the filesharing format. Instead of sharing just a single file or folder, you share the entire root directory with yourself. This shared file will be good for at least 6 weeks, and can be stored redundantly on a centralized cloud or backed up to offsite machines so that Sia data can be recovered in the event of local machine failure.

More complex filesystem backup schemes are planned; this is a simple one to get full backup support completed as soon as possible for Sia.

With this feature in place, all major features required to use Sia as a production backup system are complete.

Sprint 4: Transition the Filesystem to a Database

The filesystem needs to be moved to a database. This will allow for substantially better scaling properties. We will need an upgrade process from the current filesystem to the new database system.

The database will have 2 buckets:

Object Name -> UID
	+ Can be a folder
	+ Can be a file
UID -> Object
	+ Can be a folder
	+ Can be a file
	+ Can be a full-sector chunk

The object name is a filepath.

A UID is 16 random bytes that uniquely identify an object. The UID of the
root folder will always be 0x00000000000000000000000000000001.

A folder is an object with posix metadata and a list of contents. The
contents can be files and folders, just as in posix systems. The files and
folders are listed by name. When renaming a file or folder, the object for
the parent directory needs to also be updated. The root folder cannot be
renamed.

A file contains posix metadata, Sia metadata (like health information), and
a list of chunks by UID, in the order that they appear in the file. The chunk
list will contain a length in bytes for the size of each chunk, as different
chunks can be different sizes. The file does not need to indicate whether
the chunks are full-sector, partial-sector, or full-data chunks.

Full-sector chunks are chunks that are composed of full, 4 MiB sectors.
These chunks contain metadata indicating their redundancy settings, and then
a list of hosts and sectors for each of the pieces in the chunk.
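
A rough sketch of how these objects might be laid out is below; all field names are assumptions for illustration, and the actual encoding would be settled during this sprint.

// Illustrative sketch of the database objects; field names are assumptions.

// A UID uniquely identifies an object. The root folder is always
// 0x00000000000000000000000000000001.
type UID [16]byte

// folderObject: posix metadata plus contents listed by name.
type folderObject struct {
	Mode    uint32
	ModTime int64
	Folders map[string]UID // subfolder name -> UID
	Files   map[string]UID // file name -> UID
}

// chunkRef records a chunk's UID and its length in bytes, since different
// chunks can be different sizes.
type chunkRef struct {
	Chunk  UID
	Length uint64
}

// fileObject: posix metadata, Sia metadata (like health), and the ordered
// chunk list.
type fileObject struct {
	Mode    uint32
	ModTime int64
	Health  uint8      // Sia metadata, see the file health updater sprint
	Chunks  []chunkRef // in the order they appear in the file
}

// piece identifies one erasure-coded piece of a full-sector chunk.
type piece struct {
	HostPubKey string
	SectorRoot [32]byte // Merkle root of the 4 MiB sector
}

// fullSectorChunk: redundancy settings plus the pieces of the chunk.
type fullSectorChunk struct {
	DataPieces   int
	ParityPieces int
	Pieces       []piece
}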

This system has the advantage that individual chunks can have their redundancy settings updated on the fly without the file data changing at all. The file can be easily modified (appends, changes to data in the middle, or even insertions in the middle) without having to do many disk operations. Renaming is very easy: you just have to change the file name in its metadata, and then in the folder data that houses the file.

This system will scale very well. Adding a chunk or updating a chunk in a very large file still only requires writing one page in the database.

This filesystem has been designed to be easily extendable (explained in a later sprint) to small files + small chunks.

Sprint 5: Separate 'threadedQueueRepairs' from the Repair Loop

This sprint is going to break uploading for the renter. Uploading will not work again until sprints 6, 7, and 8 have also been completed, so we should work on all of these sprints as a team in one fast, focused effort. We should all be comfortable with the designs before we get started, and even have testing ready before we get started, so that we can finish as fast as possible.

'threadedQueueRepairs' is going to be broken out of the repair loop into its own file. Instead of using the in-memory file object to pass to the upload loop, it is going to break the file into a bunch of 'chunkRepair' objects.

chunkRepair objects will be laid out as follows:

type usableHost struct {
	uses int      // number of times this host can be used within a chunk. Usually 1.
	pubkey string // the string-encoded public key of the host
}

type pieceRepair struct {
	acceptedByWorker bool // Whether a worker has accepted the piece
	workersRemaining int  // The number of workers that have not tried to repair this piece

	chunk       *chunkRepair // The chunk for this piece
	data        []byte       // The erasure-coded, encrypted piece data
	usableHosts []usableHost
}

type chunkRepair struct {
	// TODO: Fields specifying the name + location of the file + chunk
	// TODO: Fields specifying erasure coding + encryption settings/keys
	// TODO: Fields specifying how many incomplete pieces exist

	usedHosts map[string]int // number of times each host has been used in this chunk
	pieces    []pieceRepair
}

Each piece independently specifies its own set of usable hosts so that complex CDN settings can be taken into account when this object is being built. Any complex settings should be handled by the queue repairs function, instead of happening inside the actual repair logic.

The chunkRepair object is sent to the coordinator. The coordinator then has the responsibility of knowing which renter to send the chunkRepair object to, based on which renters have files locally, which renters have available resources, and which renters are controlling which contracts. If no renters are ready yet for the chunk, the coordinator will block until a renter becomes ready.

The 'data' fields of the pieceRepair objects are not filled out until after the chunk has been given to a renter. The renter will fetch the data (from disk or from the Sia network), erasure code it, encrypt it, and then fill out the pieces.

A future feature will require 'threadedQueueRepairs' to be aware of when a chunk repair has completed, so there needs to be a function or channel that can be called/used when a chunk has completed. There will need to be sanity checks and verifications that confirm the code is properly reporting back every time a chunk has completed.

Sprint 6: Coordinator Renter Selection

Initially the coordinator will only be coordinating over a single renter. The renter selection function should be kept separate in a way that makes it easy to rewrite entirely when support for multiple renters is added.

The coordinator will keep track of how much memory the single renter is using, and it will only assign a chunk to the renter when the renter has enough memory available. For this sprint, memory is all that matters. The default will be 500 MiB of memory per renter, and it should be a configurable value via renter settings in the API.

We will need to make sure that when the renter completes a chunk, it reports back to the coordinator so that the coordinator can update the amount of memory available. Enough logging and sanity checking needs to be in place that we can easily detect and resolve errors where either the renter or the coordinator is not properly handling the memory management.

The same pool of memory will be used to assign downloads when the download system gets upgraded. Before downloads can be added to the memory pool, there needs to be some way to choose between multiple potential jobs (uploads + downloads) when memory becomes available.

Edge case: if the total size of the memory pool is less than the minimum size needed to download or repair a chunk, the renter will block until all memory has been returned, and then go ahead with that chunk, ensuring progress even if the settings do not technically allow for it.
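
A minimal sketch of the coordinator's memory accounting for a single renter, including the edge case above, is shown below. Names are assumptions, and it presumes the sync package is imported.

// Illustrative sketch only; names are assumptions.
type memoryPool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	total     uint64 // defaults to 500 MiB, configurable via the API
	available uint64
}

func newMemoryPool(total uint64) *memoryPool {
	mp := &memoryPool{total: total, available: total}
	mp.cond = sync.NewCond(&mp.mu)
	return mp
}

// Request blocks until 'amount' bytes are available. If 'amount' exceeds the
// total pool size, it waits until the pool is completely idle and then
// proceeds anyway, so that progress is always possible.
func (mp *memoryPool) Request(amount uint64) {
	mp.mu.Lock()
	defer mp.mu.Unlock()
	if amount > mp.total {
		for mp.available != mp.total {
			mp.cond.Wait()
		}
		mp.available = 0
		return
	}
	for mp.available < amount {
		mp.cond.Wait()
	}
	mp.available -= amount
}

// Return is called when the renter reports that a chunk has completed.
func (mp *memoryPool) Return(amount uint64) {
	mp.mu.Lock()
	defer mp.mu.Unlock()
	mp.available += amount
	if mp.available > mp.total {
		// The oversized edge case above, or an accounting bug; log and clamp.
		mp.available = mp.total
	}
	mp.cond.Broadcast()
}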

Sprint 7: Piece Data Fetcher

The renter is now receiving 'chunkRepair' objects when it is doing repairs. The coordinator is responsible for making sure that there is enough memory for the renter to perform the upload (and a corresponding download if needed), so it is safe to asynchronously fetch, erasure code, and encrypt the piece data. Whether the data is being fetched from the Sia network or from disk, it should be done in a separate thread. The chunkRepair object should be filled out to have the piece data added, and then the completed chunkRepair object can be passed to the upload workers.
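
A sketch of how this could look is below; the helper methods (fetchChunkData, erasureCodeChunk, encryptPiece, and the reporting/queueing calls) are hypothetical names for illustration, not existing renter functions.

// Illustrative sketch only; all helper methods are hypothetical.
func (r *Renter) managedFetchAndQueueChunk(cr *chunkRepair) {
	go func() {
		// The coordinator has already reserved memory for this chunk, so it
		// is safe to expand the data here.
		data, err := r.fetchChunkData(cr) // from disk, or from the Sia network
		if err != nil {
			r.managedReportChunkFailure(cr, err)
			return
		}
		shards := r.erasureCodeChunk(cr, data)
		for i := range cr.pieces {
			cr.pieces[i].data = r.encryptPiece(cr, i, shards[i])
		}
		// Every piece now has its data; hand the chunk to the upload workers.
		r.managedQueueChunkForUpload(cr)
	}()
}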

Sprint 8: New Worker Dynamics

The new worker upload dynamics are vastly simplified compared to the existing upload dynamics, but they are still complicated, and have multiple points where inconsistency or buggy code could cause the whole component to choke. There needs to be plentiful testing, sanity checks, and progress verification to make sure that the code is running like clockwork.

Once the piece has been fully filled out, data included, it is put into a linked list of pieces that represent unfinished work. 'workersRemaining' for the pieceRepair object is set equal to the number of active workers in the renter.

Pieces are added at the tail of the linked list. Workers start at the head of the linked list, and scroll forward through the linked list one piece at a time. The linked list has a global lock that the workers must hold when walking through the linked list. When they get to the next piece, they pull the piece out of the linked list and mark 'acceptedByWorker' as true. Other workers may have pieces pointing to this piece, so its forward pointer must remain unchanged.

The worker will only accept the piece for work if it is in the usableHosts list for the piece, and also the number of 'uses' for that host is greater than the number of times the host has been used by the whole chunk. If the worker cannot accept the piece, it will decrement 'workersRemaining' for that piece and move on to the next piece. If the 'workersRemaining' value for the piece hits '0', this means that this is actually the tail of the linked list, and that all workers have unsuccessfully scrolled past the piece. The piece needs to be marked as completed in the chunk (even though no piece was actually completed), and then deleted from the linked list.

When the worker accepts a piece, it will add itself to the 'usedHosts' map of the corresponding chunk, even before it knows whether the upload will be successful. It will then try to upload the piece to its corresponding host. If the upload succeeds, it will mark the piece as completed. If the upload fails, the piece's 'workersRemaining' value is reset and the piece is put at the head of the linked list again, so that all the workers get another chance to scroll past it and try to upload it. Because the worker has already incremented its count in the chunk's 'usedHosts', it will not try this piece again (unless the piece lists this host multiple times).

When a piece is marked as 'completed' in the chunk, the chunk needs to be checked for completion. If all pieces for the chunk have been completed, the chunk needs to report back to the coordinator, indicating which pieces were successful, which hosts they ended up getting uploaded to, and which pieces failed. The coordinator then needs to report back to the repair scanner (future sprint) to indicate that the chunk repair has completed (whether or not it was fully successful), so that the repair scanner can update itself.

When a worker gets to the end of the linked list, it will wait on a sync.Cond for a broadcast signal that more pieces have been added. When more pieces do get added, they need to issue a broadcast on the sync.Cond.
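
As a sketch of the worker side of these dynamics, the loop below walks the shared list. The pieceQueue type and its methods (next, remove, pushHead, activeWorkers), the worker fields and methods (canUpload, upload, hostPubKey), and the chunk's managedMarkPieceDone are all hypothetical plumbing standing in for the linked list described above.

// Illustrative sketch only; the queue, worker, and chunk helpers are
// hypothetical stand-ins for the linked list described above.
func (w *worker) threadedProcessUploads(q *pieceQueue) {
	q.mu.Lock()
	for {
		p := q.next(w) // the next piece this worker has not yet considered
		if p == nil {
			q.cond.Wait() // broadcast fires when new pieces are added to the tail
			continue
		}
		if !w.canUpload(p) {
			// This worker's host is not usable for the piece.
			p.workersRemaining--
			if p.workersRemaining == 0 {
				// Every worker has scrolled past without taking it.
				p.chunk.managedMarkPieceDone(p, false)
				q.remove(p)
			}
			continue
		}
		// Accept the piece: reserve the host in the chunk before uploading.
		p.acceptedByWorker = true
		p.chunk.usedHosts[w.hostPubKey]++
		q.remove(p)

		q.mu.Unlock()
		err := w.upload(p) // network call, done without holding the list lock
		q.mu.Lock()

		if err != nil {
			// Give the other workers another chance at this piece.
			p.workersRemaining = q.activeWorkers()
			q.pushHead(p)
		} else {
			p.chunk.managedMarkPieceDone(p, true)
		}
	}
}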

The end of this sprint marks the completion of the breaking changes to the upload code. After this sprint is completed, the upload system should be ready for release and should be a lot more stable.

Sprint 9: File Health Updater

The current file health monitor scans through the whole filesystem every few minutes and checks for any files that have lost health. This is really inefficient, and will become even more taxing after we migrate from an all-in-memory filesystem to a database based filesystem.

The filesystem will be extended with three new fields:

+ A health rating for the file.
+ A 'last-checkup' timestamp indicating when the file's health was last monitored.
+ A field indicating which renters (if any) in the distributed cluster of renters have the file locally. A renter that has the file locally can do repairs very cheaply; if nobody has the file locally, repairs can still be performed but are more costly, and therefore will only be performed for files in a worse health state. This field is not guaranteed to be complete; it may be missing some renters.

Folders will have a health score which is the minimum of all their files and subfolders, and a 'last-checkup' timestamp which is the least recent of all their files and subfolders. A folder will only list a renter as containing it locally if that renter has all of the folder's files and subfolders locally; if a renter is missing even a single file, it is not listed.

The new file health updater will depth-first-search the folder hierarchy of the filesystem, favoring the folders that have not been checked recently, and updating the health field for the files. This will significantly reduce the amount of unnecessary scanning performed by the renter.

Health is broken into multiple grades, which are signals to the file repair scanner. The repair scanner will favor repairing files that are in the poorest health. Health scores span from 0 to 100:

+ 0: File has been lost and cannot be recovered without human intervention.
+ 1-9: File is in urgent condition and may become lost without immediate action.
+ 10-89: File has moderate redundancy, but needs repairs soon.
+ 90-99: File is not at full redundancy, but also does not need repairs soon.
+ 100: File is at full redundancy.
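
As a small illustration, a function mapping a health score to its grade might look like the sketch below; the grade names themselves are assumptions.

// Illustrative sketch; the grade names are assumptions.
func healthTier(health int) string {
	switch {
	case health <= 0:
		return "lost"     // cannot be recovered without human intervention
	case health < 10:
		return "urgent"   // may become lost without immediate action
	case health < 90:
		return "degraded" // moderate redundancy, needs repairs soon
	case health < 100:
		return "stable"   // below full redundancy, but no repairs needed soon
	default:
		return "healthy"  // full redundancy
	}
}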

The file health updater will start running at startup, and continue running until every file has been scanned recently. What counts as 'recent' is a user-configurable variable defaulting to 72 hours. Because the file health updater does not actually issue any work, it can safely run continuously until its job is complete. Upon finishing the scan, it'll sleep until the next scan is required.

Once the coordinator is extended to support multiple renters, each renter can focus on a different directory of the global filesystem, allowing the work for health scanning to be split up efficiently.

Sprint 10: File Repair Scanner

The file repair scanner will depth-first-search the filesystem, following folders that have the lowest health scores, and then it will issue work to the coordinator indicating what repair operations need to be performed. Work is issued on a per-chunk basis, meaning a single file can have a large number of repair items for the coordinator.

The health of the files is not updated when work is issued, it is updated when work is returned. The file repair scanner will need to keep in memory a list of files and folders that it has visited on the most recent iteration of repair scanning. It will not revisit a file or folder until the coordinator signals that it has finished all repair jobs submitted for that file or folder.

The repair job itself is a list of pieces that need to be uploaded to get to each health tier. Meaning, if the file is at '10' health (third tier), there will be a list of pieces that need to be repaired to reach a health >90, and a second list of pieces that need to be repaired to reach a health of 100. The coordinator will decide how many health brackets to move the file up based on the availability of resources and the number of files that are at each health bracket. In addition to the lists of pieces, the repair job will also have a list of hosts which are acceptable for repair in each bracket, sorted by how desirable they are (if a host is undesirable, it will not be listed at all). This means that the repair scanner does all the heavy lifting for things like ensuring geographic diversity, hitting minimum latency and throughput requirements, etc.
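
A rough sketch of the repair job object is below; all field names are assumptions.

// Illustrative sketch; field names are assumptions.
type repairBracket struct {
	targetHealth int   // e.g. 10, 90, or 100
	pieceIndices []int // pieces within the chunk that must be uploaded to reach targetHealth

	// Hosts acceptable for repairs in this bracket, sorted from most to
	// least desirable; undesirable hosts are omitted entirely.
	acceptableHosts []string
}

type repairJob struct {
	// TODO: fields identifying the file and chunk being repaired.

	brackets []repairBracket // one entry per health bracket the chunk is below
}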

The amount of memory consumed for a repair job is tiny, so the coordinator can afford to have hundreds or even thousands of repair jobs queued up. Having a big queue allows the coordinator to allocate resources with maximum efficiency, ensuring that the full strength of the renter cluster is active at all times while doing repairs. Though there is room for potentially hundreds or thousands of simultaneous repair jobs in the queue, it is still a finite number. If the coordinator runs out of queue space, it will block when the renter tries to submit more work. This block will halt the repair scanner until repair jobs begin completing, and ensure that the repair scanner never allocates too much work at once.

When scheduling repairs, the repair scanner will have two types of repairs. Files that are below a health of 10 will be repaired to a health of 10, and not higher. This ensures that if there are many files below a health of 10, all resources are spent getting files out of the 'urgent' tier. If all files are above 10 health, then repair jobs will take the files from 10 health to 100 health, because throughput is best and resource usage is most efficient when doing a full repair in one pass. Files that are above 90 health will only be repaired if one of the renters in the coordinator system has the file locally; otherwise the repair scanner will wait until the file is below 90 health to issue a repair.

There is some cross-object concurrency happening that needs to be carefully managed. The first is that the coordinator can block a submission of work, which is intended to freeze the entire repair scanner, but is not intended to freeze the coordinator. The second is that the repair scanner keeps a list of jobs it has submitted, and the coordinator must remember to signal when each job completes; otherwise the repair scanner will build up a large amount of memory use, and it will not go back and revisit files which may have only been partially repaired.

Sprint 11: Host Throughput And Latency Measuring

Currently, worker throughput and latency are not measured when uploading and downloading. We don't really know how fast each host is, which means we can't optimize by selecting the faster hosts for priority files.

The hostdb needs to be upgraded to support tracking host speed and throughput, and the upload and download functions for a host need to return speed and throughput values when the operation is complete, so that the database can be updated accordingly.

These values will help significantly improve the download speeds in sprint 12.

Sprint 12: Download System Upgrade

The download system will be upgraded to match the upload system, where a chunk issues a stack of work in the form of a bunch of pieces that workers can pick up.

Just like uploads, the memory required for performing downloads needs to be allocated ahead of time. Workers will fetch the pieces one-by-one. Just as upload pieces have a list of applicable workers, download pieces also have a list of applicable workers.

Because memory cannot be released until a full download is complete, a single slow worker can bottleneck a full 40 MiB of memory, which can be a big deal. Fast workers may not be able to perform downloads because slow workers are hogging all of the memory.

To mitigate this, we will make several stacks of pieces instead of just one. Each stack has a different performance tier. Initially there will just be a fast stack and a slow stack, but eventually we can expand to a larger number of stacks.

Workers are limited in which stacks they can use based on performance. The slowest 50% of workers are only allowed to take pieces from the slow stack. The fastest 50% of workers can take pieces from either the fast stack or the slow stack; however, they will only get work from the slow stack if the fast stack is empty.

Chunks are also broken into two types (and eventually more): normal chunks and priority chunks. Normal chunks will wait to be added to a stack until there are idle workers, at which point they will be added to the stack with idle workers. More often than not, this will be the fast stack. Priority chunks are always added to the fast stack, but will wait in line patiently behind normal chunks, and won't get added to a stack until all the normal chunks ahead of them have been added to a stack.
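
A sketch of how a worker could select its next download piece under this scheme is below; the pieceStack and downloadPiece types and the isFast check are hypothetical.

// Illustrative sketch; pieceStack, downloadPiece, and isFast are hypothetical.
func (w *worker) nextDownloadPiece(fast, slow *pieceStack) *downloadPiece {
	if w.isFast() {
		// The fastest 50% of workers prefer the fast stack and only fall
		// back to the slow stack when the fast stack is empty.
		if p := fast.pop(); p != nil {
			return p
		}
		return slow.pop()
	}
	// The slowest 50% of workers are restricted to the slow stack.
	return slow.pop()
}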

The reason for this more elaborate setup is to prevent fast workers with available bandwidth from being bottlenecked by slower workers who are consuming all of the memory. If there is lots of work to do, the fast workers will always be working together on it, meaning the bottleneck will be set by your median worker instead of your slowest worker. The slow workers are only passed work if there are no idle fast workers, yet there are low-priority pieces that need to be transferred.

The coordinator, when queuing downloads, is going to have to be careful about how it marks chunks as normal vs. priority. Generally, only the last 2 or 3 chunks of a file need to be marked as priority, and the other chunks can all be marked as normal. This prevents the final bits of a download from taking a long time due to a few slow hosts; only fast hosts are allowed to handle the end, which means the individual file should be available faster.

Sprint 13: Minimized Erasure Coding Sizes

Currently the erasure coding happens on 4 MiB pieces, meaning that the total amount of memory that needs to be allocated to perform erasure coding or recovery is piecesPerChunk * 4 MiB, which is a full 200 MiB for most chunks. If instead of doing the erasure coding on the full 4 MiB pieces we break pieces into 4 KiB sub-pieces and erasure code those, we can perform the operation with a peak memory usage of about 50 MiB. This means there will be more memory available for other operations, and the upload/download processes will not seem so incredibly memory intensive.
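
A sketch of the interleaved encoding is below. It assumes a Reed-Solomon encoder with an Encode([][]byte) method (for example the github.com/klauspost/reedsolomon API), that chunkData has been padded to a multiple of dataPieces * 4 KiB, and that writeSubShard is a hypothetical callback that streams each encoded sub-shard out to its destination piece.

// Illustrative sketch; assumes chunkData is padded to a multiple of
// dataPieces*subShardSize and that writeSubShard streams sub-shards out.
const subShardSize = 4096 // 4 KiB

func encodeChunkInterleaved(enc reedsolomon.Encoder, dataPieces, parityPieces int,
	chunkData []byte, writeSubShard func(piece, segment int, subShard []byte) error) error {

	shards := make([][]byte, dataPieces+parityPieces)
	for i := dataPieces; i < len(shards); i++ {
		shards[i] = make([]byte, subShardSize) // small, reused parity buffers
	}
	segmentSize := dataPieces * subShardSize
	for seg := 0; seg*segmentSize < len(chunkData); seg++ {
		// The data sub-shards are just slices into the existing chunk buffer,
		// so no extra 4 MiB pieces are ever materialized.
		for i := 0; i < dataPieces; i++ {
			start := seg*segmentSize + i*subShardSize
			shards[i] = chunkData[start : start+subShardSize]
		}
		if err := enc.Encode(shards); err != nil {
			return err
		}
		for i, s := range shards {
			if err := writeSubShard(i, seg, s); err != nil {
				return err
			}
		}
	}
	return nil
}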

Sprint 14: Global Bandwidth Ratelimiter

The global bandwidth ratelimiter (gbr) is user-configured and restricts upload and download speeds, as well as overall bandwidth caps, for a given renter machine. The coordinator is aware of the limits of all renter machines and of their progress towards those limits.

By default, the gbr will allow spurts of up to 3 seconds at full speed to gauge the full capabilities of the connection (once every few minutes), and then set the limits to 90% of those speeds. Users can set speeds in terms of percentages, and also in terms of absolute bytes per second. If the user chooses an absolute speed, the gbr will not allow spurts that test the overall connection speed.

The gbr will function as a stateful wrapper around a conn object. Instead of calling net.Dial, connections with hosts will have to use gbr.LimitedDial, which will have a timer that reads and writes data in 64kb segments, and then sleeps between reads or writes to keep the total bandwidth in line with the desired limits.
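
A sketch of the limited connection is below; the GBR type, its currentWriteLimit method, and the exact segment size are assumptions (it also presumes the net and time packages are imported and that the limit is always positive). A matching Read wrapper would follow the same pattern.

// Illustrative sketch; GBR, currentWriteLimit, and the segment size are
// assumptions.
const writeSegmentSize = 64 * 1024

type limitedConn struct {
	net.Conn
	bytesPerSecond int64 // assumed to be > 0
}

// Write sends the data in 64 KiB segments, sleeping between segments so that
// long-run throughput stays at or below bytesPerSecond.
func (lc *limitedConn) Write(b []byte) (int, error) {
	written := 0
	for written < len(b) {
		end := written + writeSegmentSize
		if end > len(b) {
			end = len(b)
		}
		n, err := lc.Conn.Write(b[written:end])
		written += n
		if err != nil {
			return written, err
		}
		// Sleep long enough that this segment averages out to the limit.
		time.Sleep(time.Duration(n) * time.Second / time.Duration(lc.bytesPerSecond))
	}
	return written, nil
}

// LimitedDial replaces net.Dial for host connections.
func (g *GBR) LimitedDial(network, addr string) (net.Conn, error) {
	conn, err := net.Dial(network, addr)
	if err != nil {
		return nil, err
	}
	return &limitedConn{Conn: conn, bytesPerSecond: g.currentWriteLimit()}, nil
}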

The gbr itself is a separate package, is created by the server, and is passed to the modules as a dependency. The same gbr can be passed to multiple modules, allowing the user to have greater control over the resource usage of siad.

Sprint 15: Extend the Filesystem Database to Support Small Chunks + Files

The filesystem database can be extended to support small chunks (and therefore small files) by adding a few extra object types to the database. This will not be done as a part of Sprint #4, but here are the object types:

+ a partial-sector chunk
+ a packed sector
+ a full-data chunk

A partial-sector chunk is a chunk that has pieces which are smaller than
4 MiB. Instead of these pieces being identified by a host+sector hash, they
are identified by a UID pointing to a packed sector, and an offset
indicating where in the packed sector the piece is.

A packed sector contains a bunch of chunk pieces from different chunks. The
pieces are stored packed together into a single sector that is uploaded to a
host. The packed sector needs a bunch of Merkle roots (up to log(n) roots
per piece) so that as pieces get added and deleted, it can easily recompute
the roots.

A full-data chunk is for chunks that are so small that the overhead for
storing them in a packed sector (about 40 bytes per piece) is close to or
greater than the size of the data. Instead of using packed sectors, the full
data is actually just stored as the chunk, and the Sia network is not used
at all. This method is typically only needed for files smaller than 8 KiB.

Finally, we need one more bucket which lists the gaps in all of the packed sectors. When a new partial-sector chunk is added to the filesystem, the gaps bucket is used to figure out if a new packed sector needs to be created, or if existing packed sectors have enough space to house the new pieces. The keys for this bucket are host ids (their pubkeys), and the values are a list of gaps in packed sectors for that host.

HostID -> Packed Sector Gaps
