Skip to content

Instantly share code, notes, and snippets.

@pmrowla
Last active April 30, 2020 15:09
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save pmrowla/338d9645bd05df966f8aba8366cab308 to your computer and use it in GitHub Desktop.
Save pmrowla/338d9645bd05df966f8aba8366cab308 to your computer and use it in GitHub Desktop.
DVC remote optimizations
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

DVC remote commands

  • All DVC remote commands (status -c, push, pull, gc -c, etc) share a "cloud status" step
  • For a given DVC versioned dataset (i.e. commit), DVC must verify whether other not each file exists in local cache and on the remote:
    • Files that are not in local (.dvc/cache) must be fetched (pulled) from remote
    • Files that are not in remote must be pushed to remote
    • For gc, files which exist in either local or remote (with -c) but are not in the dataset must be removed
  • DVC must compare list of files in local with list of files in remote and then take any relevant action per command
  • DVC remote.cache_exists() function - given list of files, return which ones exist in the (remote or local) cache
    • For local cache, testing file existence simple (i.e. just use OS stat/lstat/etc)
    • Potential bottleneck for all remote commands (especially for large datasets)

Querying status from remotes

Note: discussion applies to all remotes except RemoteLocal and RemoteSSH unless stated otherwise (local and ssh have their own overridden cache_exists() implementations)

For typical cloud remotes, there are two options to determine if a file exists:

  1. Query each file individually (S3 HeadObject)
    • Performance depends on number of files being queried
  2. Iterate over (potentially entire) listing of all files on the remote (S3 ListObjects/ListObjectsV2)
    • Results are paginated and must be fetched sequentially
    • Ex. S3 returns 1000 objects per call, DVC must iterate over objects [0...999], [1000...1999], etc
    • Performance depends on remote size (total files in remote)

Which option is potentially faster depends on the relative sizes of <files to query> and <total files on remote>

  • For small query set and large remote, option 1 is faster (ex. 10 files to query, remote with 1m total files)
  • For large query set and (relatively) small remote, option 2 is faster (ex. 1m files to query, remote with 10m total files)
    • Given sufficiently large remote, option 2 will be very slow even for large query sets
  • Remote size threshold for determining which behavior is optimal is only dependent on size of the query set

Simplified S3 example

  • Given 10 files to query, testing each file individually requires 10 HeadObject API calls
  • S3 ListObjectsV2 returns 1000 files per page (i.e. per S3 API call)
    • Iterating over remote requires remote_size / 1000 API calls
  • If the remote contains over 10 * 1000 = 10k files, querying each file individually will be faster
  • If the remote contains < 10k files, listing the full remote will be faster

Potential optimizations

  • Choose the correct behavior between options 1 and 2
    • Requires knowing the size of remote
  • Run queries in parallel
    • Easy for option 1
    • More complex for option 2 due to pagination
  • Reduce the number of files to query

0.91.0 (last release before remote optimizations)

no_traverse option

  • Undocumented remote config option for selecting cache_exists() behavior
  • Named after rclone --no-traverse option
  • Specifies which query method to use
  • Confusing usage
  • Hard coded regardless of remote size

no_traverse = true:

  • Default for all remotes except GDrive
  • Query each file individually (full remote is "not traversed")
  • Done in parallel
  • Progress bar

no_traverse = false:

  • Default for GDrive
  • Iterate over full remote listing ("traverse" the remote)
  • Done in sequentially in main thread
  • No progress bar
    • DVC appears to hang

Optimzation #1 - Deprecate no_traverse

  • DVC now automatically selects optimal cache_exists() behavior for any given remote command
  • If DVC can determine whether or not remote size is over or under the threshold, we can choose optimal method
    • Some caveats apply - relative speed of different API calls, Python list/set performance for large lists, etc

Estimating remote size in DVC

DVC (remote) cache structure:

.dvc/cache
├── 00
│   ├── 411460f7c92d2124a67ea0f4cb5f85
│   ├── 6f52e9102a8d3be2fe5614f42ba989
│   └── ...
├── 01
├── 02
├── 03
├── ...
└── ff
  • MD5 hashes are evenly distributed
  • DVC cache will be evenly distributed
  • Remote size can be estimated based on the size of a subset of the remote
    • Ex: If 00/ contains 10 files, the expected size of the remote would be roughly 256 * 10 = 2,560
  • DVC can fetch list of files under some prefix (00/) and use that to estimate total remote size
  • Size estimation can be short circuited once we hit the threshold value
    • Ex: If threshold for a given query is 256k, we can stop as soon as 00/ listing returns >1k files
    • We don't care about the final total file count in 00/
  • Longer prefixes can be used for remotes which support it
    • Ex: on S3, ListObjectsV2 can be called with any arbitrary prefix
    • First 3-characters from checksum can be used as prefix instead of only 2 (i.e. 00/0)
    • Estimated remote size would then be 4096 * len(files_in_prefix)
  • Certain remotes (gdrive, hdfs) only allow listing at directory level
    • Limited to 2-character prefixes (i.e. cache subdirectories)

Other things to note

  • When DVC determines full remote listing should be used:
    • Remote listing now queried in parallel by prefixes
    • Progress bar now used
  • gc -c also uses the same size estimation logic and displays progress
    • For gc, full remote file list must always be fetched (since unused files need to be removed)

End result

  • DVC significantly faster in general for users that would have previously benefited from using no_traverse = false, but were using the default behavior instead
  • DVC faster in cases where users were using no_traverse = false properly (since it is now done in parallel)
  • UI improvements

0.91.0 (default):

asciicast

0.91.0 (no_traverse = false):

asciicast

Latest:

asciicast

Optimization #2 - Mini-indexes

Trust .dir files on remote

.dir files in DVC can be treated as "mini-indexes" for remotes, to significantly reduce the number of files needed to be queried

    file            checksum
.
├── data        -> 1234.dir
│   ├── bar     -> 5678
│   └── foo     -> 90ab
└── data.dvc

Old behavior:

  • DVC checks remote for existence of 1234.dir, 5678, 90ab

New behavior:

  • DVC checks remote for existence of 1234.dir
    • If exists, DVC trusts .dir file existence, and skips check for 5678, 90ab
    • If does not exist, DVC continues with check for 5678, 90ab

Potential gc -c problem

Problem:

  • Case where 1234.dir exists on the remote, but either 5678 or 90ab are missing
  • DVC would skip check for missing files, and incorrectly assume they exist
    • DVC would not upload missing files during push
    • DVC would try to download missing files during pull

Solution:

  • When push-ing a directory, DVC now uploads the .dir file last, after all files in that directory have been successfully uploaded
    • If any file contained in that directory failed to upload, the .dir file will not be pushed to the remote
  • When running gc -c, all .dir files are removed first, before removing any files contained in those directories
    • Even in event of a partial gc -c, .dir file would have been removed, ensuring that DVC does not skip checks for other files

Potential for problems resulting from users deleting objects from remote on their own still exists

Local index for .dir files

  • DVC now keeps a basic local index for directories which are known to be on a given remote
  • Sqlite3 DB in .dvc/tmp/index/<filename>.idx
    • Filename is SHA256 hash of remote URL
.dvc/
├── cache
├── config
├── lock
├── state
├── tmp
│   ├── index
│   │   └── 28fa66f02befae2fd5f57364b126dd498d6534180840005a0f10c03a62cc51a0.idx
│   └── rwlock
├── updater
└── updater.lock
  • When DVC sees that a .dir file exists on the remote, the directory (and its contents) are added to the index
    • Index is also updated for successfully push-ed directories
  • When querying a remote, DVC validates the index by verifying that all indexed .dir files still exist on the remote
    • If an indexed .dir file is not found on the remote (because it was removed by gc -c for example), the entire index for that remote is invalidated and cleared
  • cache_exists() first checks for any indexed files, and then queries the remote for remaining files
  • Standalone files are not indexed

End result

  • DVC performance significantly improved for the case where a user has a large push-ed directory, and then adds or modifies a (relatively) small number of files to that directory
    file            checksum
.
├── data        -> 4321.dir
│   ├── bar     -> 5678
│   ├── baz     -> cdef
│   └── foo     -> 90ab
└── data.dvc

Old behavior:

  • DVC checks for existence of 4321.dir, 5678, cdef, 90ab on remote

New behavior:

  • DVC checks for existence of 4321.dir on remote (does not exist)
  • DVC checks index for 5678, cdef 90ab
    • 5678, 90ab exist in index (since they were in the previous version of data/)
  • DVC checks for 5678 on remote

asciicast

(Minor) Optimization #3

  • DVC uses PathInfo objects to represent local filesystem paths and remote URL paths
  • PathInfo is slow (relative to string paths) when dealing with large #'s of paths
    • Performance regression from when PathInfo's were introduced
  • Use string paths in DVC in places where using objects would substantially affect performance

Benchmarks

  • S3 remote w/~1M total files, local commit containing single directory w/100k files
  • All times in seconds
  • Python 3.7, MacOS Catalina, DVC installed from pip, dual-core 3.1GHz i7 cpu

benchmarks

rclone usage:

rclone copy --dry-run --progress --exclude "**/**.unpacked/" .dvc/cache dvc-temp:...

dvc status -c for dataset-registry-private (S3 remote, 2M imagenet files across 2 directories)

  • Latest: dvc status -c 258.39s user 33.38s system 87% cpu 5:31.93 total
  • 0.91.0 (no_traverse = false): 802.44s user 87.22s system 42% cpu 35:04.45 total

Potential future improvements

  • Better indexes
    • Handle standalone files
  • More protection from dangerous use cases (gc -c)
  • Cache structure improvements
    • Take advantage of prefix-based queries
  • De-duplication (optimize storage rather than runtime performance)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment