pmrowla/remote-benchmarks.svg Secret

## remote-benchmarks.svg

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              remote-benchmarks.svg
            
          
      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## remote-optimization.md

      
    Raw
  

              remote-optimization.md
            
          
    DVC remote commands


All DVC remote commands (status -c, push, pull, gc -c, etc) share a "cloud status" step
For a given DVC versioned dataset (i.e. commit), DVC must verify whether other not each file exists in local cache and on the remote:

Files that are not in local (.dvc/cache) must be fetched (pulled) from remote
Files that are not in remote must be pushed to remote
For gc, files which exist in either local or remote (with -c) but are not in the dataset must be removed


DVC must compare list of files in local with list of files in remote and then take any relevant action per command
DVC remote.cache_exists() function - given list of files, return which ones exist in the (remote or local) cache

For local cache, testing file existence simple (i.e. just use OS stat/lstat/etc)
Potential bottleneck for all remote commands (especially for large datasets)


Querying status from remotes

Note: discussion applies to all remotes except RemoteLocal and RemoteSSH unless stated otherwise (local and ssh have their own overridden cache_exists() implementations)
For typical cloud remotes, there are two options to determine if a file exists:

Query each file individually (S3 HeadObject)

Performance depends on number of files being queried


Iterate over (potentially entire) listing of all files on the remote (S3 ListObjects/ListObjectsV2)

Results are paginated and must be fetched sequentially
Ex. S3 returns 1000 objects per call, DVC must iterate over objects [0...999], [1000...1999], etc
Performance depends on remote size (total files in remote)


Which option is potentially faster depends on the relative sizes of <files to query> and <total files on remote>

For small query set and large remote, option 1 is faster (ex. 10 files to query, remote with 1m total files)
For large query set and (relatively) small remote, option 2 is faster (ex. 1m files to query, remote with 10m total files)

Given sufficiently large remote, option 2 will be very slow even for large query sets


Remote size threshold for determining which behavior is optimal is only dependent on size of the query set

Simplified S3 example


Given 10 files to query, testing each file individually requires 10 HeadObject API calls
S3 ListObjectsV2 returns 1000 files per page (i.e. per S3 API call)

Iterating over remote requires remote_size / 1000 API calls


If the remote contains over 10 * 1000 = 10k files, querying each file individually will be faster
If the remote contains < 10k files, listing the full remote will be faster

Potential optimizations


Choose the correct behavior between options 1 and 2

Requires knowing the size of remote


Run queries in parallel

Easy for option 1
More complex for option 2 due to pagination


Reduce the number of files to query

0.91.0 (last release before remote optimizations)

no_traverse option


Undocumented remote config option for selecting cache_exists() behavior
Named after rclone --no-traverse option
Specifies which query method to use
Confusing usage
Hard coded regardless of remote size

no_traverse = true:

Default for all remotes except GDrive
Query each file individually (full remote is "not traversed")
Done in parallel
Progress bar

no_traverse = false:

Default for GDrive
Iterate over full remote listing ("traverse" the remote)
Done in sequentially in main thread
No progress bar

DVC appears to hang


Optimzation #1 - Deprecate no_traverse


DVC now automatically selects optimal cache_exists() behavior for any given remote command
If DVC can determine whether or not remote size is over or under the threshold, we can choose optimal method

Some caveats apply - relative speed of different API calls, Python list/set performance for large lists, etc


Estimating remote size in DVC

DVC (remote) cache structure:
.dvc/cache
├── 00
│   ├── 411460f7c92d2124a67ea0f4cb5f85
│   ├── 6f52e9102a8d3be2fe5614f42ba989
│   └── ...
├── 01
├── 02
├── 03
├── ...
└── ff


MD5 hashes are evenly distributed
DVC cache will be evenly distributed
Remote size can be estimated based on the size of a subset of the remote

Ex: If 00/ contains 10 files, the expected size of the remote would be roughly 256 * 10 = 2,560


DVC can fetch list of files under some prefix (00/) and use that to estimate total remote size
Size estimation can be short circuited once we hit the threshold value

Ex: If threshold for a given query is 256k, we can stop as soon as 00/ listing returns >1k files
We don't care about the final total file count in 00/


Longer prefixes can be used for remotes which support it

Ex: on S3, ListObjectsV2 can be called with any arbitrary prefix
First 3-characters from checksum can be used as prefix instead of only 2 (i.e. 00/0)
Estimated remote size would then be 4096 * len(files_in_prefix)


Certain remotes (gdrive, hdfs) only allow listing at directory level

Limited to 2-character prefixes (i.e. cache subdirectories)


Other things to note


When DVC determines full remote listing should be used:

Remote listing now queried in parallel by prefixes
Progress bar now used


gc -c also uses the same size estimation logic and displays progress

For gc, full remote file list must always be fetched (since unused files need to be removed)


End result


DVC significantly faster in general for users that would have previously benefited from using no_traverse = false, but were using the default behavior instead
DVC faster in cases where users were using no_traverse = false properly (since it is now done in parallel)
UI improvements

0.91.0 (default):

0.91.0 (no_traverse = false):

Latest:

Optimization #2 - Mini-indexes

Trust .dir files on remote

.dir files in DVC can be treated as "mini-indexes" for remotes, to significantly reduce the number of files needed to be queried
    file            checksum
.
├── data        -> 1234.dir
│   ├── bar     -> 5678
│   └── foo     -> 90ab
└── data.dvc

Old behavior:

DVC checks remote for existence of 1234.dir, 5678, 90ab

New behavior:

DVC checks remote for existence of 1234.dir

If exists, DVC trusts .dir file existence, and skips check for 5678, 90ab
If does not exist, DVC continues with check for 5678, 90ab


Potential gc -c problem

Problem:

Case where 1234.dir exists on the remote, but either 5678 or 90ab are missing
DVC would skip check for missing files, and incorrectly assume they exist

DVC would not upload missing files during push
DVC would try to download missing files during pull


Solution:

When push-ing a directory, DVC now uploads the .dir file last, after all files in that directory have been successfully uploaded

If any file contained in that directory failed to upload, the .dir file will not be pushed to the remote


When running gc -c, all .dir files are removed first, before removing any files contained in those directories

Even in event of a partial gc -c, .dir file would have been removed, ensuring that DVC does not skip checks for other files


Potential for problems resulting from users deleting objects from remote on their own still exists
Local index for .dir files


DVC now keeps a basic local index for directories which are known to be on a given remote
Sqlite3 DB in .dvc/tmp/index/<filename>.idx

Filename is SHA256 hash of remote URL


.dvc/
├── cache
├── config
├── lock
├── state
├── tmp
│   ├── index
│   │   └── 28fa66f02befae2fd5f57364b126dd498d6534180840005a0f10c03a62cc51a0.idx
│   └── rwlock
├── updater
└── updater.lock


When DVC sees that a .dir file exists on the remote, the directory (and its contents) are added to the index

Index is also updated for successfully push-ed directories


When querying a remote, DVC validates the index by verifying that all indexed .dir files still exist on the remote

If an indexed .dir file is not found on the remote (because it was removed by gc -c for example), the entire index for that remote is invalidated and cleared


cache_exists() first checks for any indexed files, and then queries the remote for remaining files
Standalone files are not indexed

End result


DVC performance significantly improved for the case where a user has a large push-ed directory, and then adds or modifies a (relatively) small number of files to that directory

    file            checksum
.
├── data        -> 4321.dir
│   ├── bar     -> 5678
│   ├── baz     -> cdef
│   └── foo     -> 90ab
└── data.dvc

Old behavior:

DVC checks for existence of 4321.dir, 5678, cdef, 90ab on remote

New behavior:

DVC checks for existence of 4321.dir on remote (does not exist)
DVC checks index for 5678, cdef 90ab

5678, 90ab exist in index (since they were in the previous version of data/)


DVC checks for 5678 on remote


(Minor) Optimization #3


DVC uses PathInfo objects to represent local filesystem paths and remote URL paths
PathInfo is slow (relative to string paths) when dealing with large #'s of paths

Performance regression from when PathInfo's were introduced


Use string paths in DVC in places where using objects would substantially affect performance

Benchmarks


S3 remote w/~1M total files, local commit containing single directory w/100k files
All times in seconds
Python 3.7, MacOS Catalina, DVC installed from pip, dual-core 3.1GHz i7 cpu


rclone usage:
rclone copy --dry-run --progress --exclude "**/**.unpacked/" .dvc/cache dvc-temp:...

dvc status -c for dataset-registry-private (S3 remote, 2M imagenet files across 2 directories)

Latest: dvc status -c  258.39s user 33.38s system 87% cpu 5:31.93 total
0.91.0 (no_traverse = false): 802.44s user 87.22s system 42% cpu 35:04.45 total

Potential future improvements


Better indexes

Handle standalone files


More protection from dangerous use cases (gc -c)
Cache structure improvements

Take advantage of prefix-based queries


De-duplication (optimize storage rather than runtime performance)