- All DVC remote commands (`status -c`, `push`, `pull`, `gc -c`, etc.) share a "cloud status" step
- For a given DVC-versioned dataset (i.e. commit), DVC must verify whether or not each file exists in the local cache and on the remote:
  - Files that are not in local (`.dvc/cache`) must be fetched (pulled) from remote
  - Files that are not in remote must be pushed to remote
  - For `gc`, files which exist in either local or remote (with `-c`) but are not in the dataset must be removed
- DVC must compare the list of files in local with the list of files in remote and then take any relevant action per command
- DVC `remote.cache_exists()` function - given a list of files, return which ones exist in the (remote or local) cache
  - For local cache, testing file existence is simple (i.e. just use OS `stat`/`lstat`/etc.)
  - Potential bottleneck for all remote commands (especially for large datasets)

Note: discussion applies to all remotes except RemoteLocal and RemoteSSH unless stated otherwise (local and ssh have their own overridden `cache_exists()` implementations)
For typical cloud remotes, there are two options to determine if a file exists:

1. Query each file individually (S3 `HeadObject`)
   - Performance depends on the number of files being queried
2. Iterate over a (potentially entire) listing of all files on the remote (S3 `ListObjects`/`ListObjectsV2`)
   - Results are paginated and must be fetched sequentially
   - Ex. S3 returns 1000 objects per call, DVC must iterate over objects [0...999], [1000...1999], etc.
   - Performance depends on remote size (total files in remote)
Which option is potentially faster depends on the relative sizes of <files to query> and <total files on remote>:

- For a small query set and large remote, option 1 is faster (ex. 10 files to query, remote with 1M total files)
- For a large query set and (relatively) small remote, option 2 is faster (ex. 1M files to query, remote with 10M total files)
- Given a sufficiently large remote, option 2 will be very slow even for large query sets
- The remote size threshold for determining which behavior is optimal depends only on the size of the query set
- Given 10 files to query, testing each file individually requires 10 `HeadObject` API calls
- S3 `ListObjectsV2` returns 1000 files per page (i.e. per S3 API call)
  - Iterating over the remote requires `remote_size / 1000` API calls
- If the remote contains over `10 * 1000 = 10k` files, querying each file individually will be faster
- If the remote contains < 10k files, listing the full remote will be faster
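The threshold arithmetic above can be sketched as a small decision function. This is an illustrative sketch, not DVC's actual implementation; the names are made up:

```python
# Hypothetical sketch of the traverse-vs-query decision described above;
# names are illustrative, not DVC's actual implementation.

LIST_PAGE_SIZE = 1000  # S3 ListObjectsV2 returns up to 1000 keys per call

def should_traverse(num_queried_files, estimated_remote_size):
    """True if listing the full remote (option 2) needs fewer API calls
    than querying each file individually (option 1)."""
    individual_calls = num_queried_files  # one HeadObject per file
    listing_calls = estimated_remote_size / LIST_PAGE_SIZE
    return listing_calls < individual_calls

# With 10 files to query, traversing only wins if the remote holds < 10k files
assert should_traverse(10, 5_000)
assert not should_traverse(10, 20_000)
```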
- Choose the correct behavior between options 1 and 2
  - Requires knowing the size of the remote
- Run queries in parallel
  - Easy for option 1
  - More complex for option 2 due to pagination
- Reduce the number of files to query
- Undocumented remote config option for selecting `cache_exists()` behavior
  - Named after the rclone `--no-traverse` option
  - Specifies which query method to use
- Confusing usage
- Hard-coded regardless of remote size

`no_traverse = true`:
- Default for all remotes except GDrive
- Query each file individually (full remote is "not traversed")
- Done in parallel
- Progress bar
`no_traverse = false`:
- Default for GDrive
- Iterate over full remote listing ("traverse" the remote)
- Done sequentially in the main thread
- No progress bar
- DVC appears to hang
- DVC now automatically selects the optimal `cache_exists()` behavior for any given remote command
- If DVC can determine whether remote size is over or under the threshold, it can choose the optimal method
  - Some caveats apply - relative speed of different API calls, Python list/set performance for large lists, etc.
DVC (remote) cache structure:
.dvc/cache
├── 00
│ ├── 411460f7c92d2124a67ea0f4cb5f85
│ ├── 6f52e9102a8d3be2fe5614f42ba989
│ └── ...
├── 01
├── 02
├── 03
├── ...
└── ff
- MD5 hashes are evenly distributed
  - So the DVC cache will be evenly distributed across the cache subdirectories
- Remote size can be estimated based on the size of a subset of the remote
- Ex: If `00/` contains 10 files, the expected size of the remote would be roughly `256 * 10 = 2,560`
- DVC can fetch the list of files under some prefix (`00/`) and use that to estimate total remote size
- Size estimation can be short-circuited once we hit the threshold value
  - Ex: If the threshold for a given query is 256k, we can stop as soon as the `00/` listing returns >1k files
  - We don't care about the final total file count in `00/`
- Longer prefixes can be used for remotes which support it
  - Ex: on S3, `ListObjectsV2` can be called with any arbitrary prefix
  - The first 3 characters of the checksum can be used as the prefix instead of only 2 (i.e. `00/0`)
  - Estimated remote size would then be `4096 * len(files_in_prefix)`
- Certain remotes (gdrive, hdfs) only allow listing at the directory level
  - Limited to 2-character prefixes (i.e. cache subdirectories)
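The extrapolation described above can be written as a one-line estimator. A minimal sketch, assuming evenly distributed MD5 hashes; this is not DVC's actual code:

```python
# Illustrative sketch of the prefix-based size estimation above
# (not DVC's actual code). With evenly distributed MD5 hashes, the file
# count under one hex prefix extrapolates to the whole remote.

def estimate_remote_size(files_under_prefix, prefix_len=2):
    """A hex prefix of length n covers 1/16**n of the hash space."""
    return (16 ** prefix_len) * files_under_prefix

# 10 files under "00/" (2-char prefix) -> roughly 256 * 10 = 2,560 total
assert estimate_remote_size(10) == 2560
# 3-char prefix ("00/0") scales by 4096 per file instead
assert estimate_remote_size(10, prefix_len=3) == 40960
```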
- When DVC determines the full remote listing should be used:
  - Remote listing is now queried in parallel by prefix
  - Progress bar is now shown
- `gc -c` also uses the same size estimation logic and displays progress
  - For `gc`, the full remote file list must always be fetched (since unused files need to be removed)
- DVC is significantly faster in general for users who would previously have benefited from `no_traverse = false`, but were using the default behavior instead
- DVC is faster in cases where users were using `no_traverse = false` properly (since traversal is now done in parallel)
- UI improvements
0.91.0 (default):
0.91.0 (`no_traverse = false`):
Latest:
`.dir` files in DVC can be treated as "mini-indexes" for remotes, significantly reducing the number of files that need to be queried
file checksum
.
├── data -> 1234.dir
│ ├── bar -> 5678
│ └── foo -> 90ab
└── data.dvc
Old behavior:
- DVC checks remote for existence of `1234.dir`, `5678`, `90ab`

New behavior:
- DVC checks remote for existence of `1234.dir`
  - If it exists, DVC trusts the `.dir` file, and skips the check for `5678`, `90ab`
  - If it does not exist, DVC continues with the check for `5678`, `90ab`
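The short-circuit above can be sketched as follows. Names here are illustrative, not DVC's actual API:

```python
# Sketch of the .dir short-circuit described above (illustrative names,
# not DVC's actual API).

def files_to_check(dir_checksum, contents, remote_exists):
    """If the .dir file exists on the remote, trust it as a mini-index
    and skip checking the directory's contents; otherwise check each file."""
    if remote_exists(dir_checksum):
        return []
    return list(contents)

# .dir present on remote -> nothing else to check
assert files_to_check("1234.dir", ["5678", "90ab"], lambda c: True) == []
# .dir missing -> fall back to checking every contained file
assert files_to_check("1234.dir", ["5678", "90ab"], lambda c: False) == ["5678", "90ab"]
```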
Problem:
- Case where `1234.dir` exists on the remote, but either `5678` or `90ab` is missing
  - DVC would skip the check for the missing files, and incorrectly assume they exist
  - DVC would not upload missing files during `push`
  - DVC would try to download missing files during `pull`
Solution:
- When `push`-ing a directory, DVC now uploads the `.dir` file last, after all files in that directory have been successfully uploaded
  - If any file contained in that directory failed to upload, the `.dir` file will not be pushed to the remote
- When running `gc -c`, all `.dir` files are removed first, before removing any files contained in those directories
  - Even in the event of a partial `gc -c`, the `.dir` file will have been removed, ensuring that DVC does not skip checks for other files
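The push ordering can be sketched as below. This is a hedged sketch with made-up names; DVC's real push logic is more involved:

```python
# Hedged sketch of the upload ordering above (illustrative; DVC's real
# push logic is more involved). The .dir file is only uploaded after
# every file it references has been uploaded successfully.

def push_directory(upload, dir_checksum, file_checksums):
    """upload(checksum) -> True on success; returns True if .dir was pushed."""
    # List comprehension (not a generator) so every file upload is attempted
    if all([upload(checksum) for checksum in file_checksums]):
        upload(dir_checksum)  # pushed last, only on full success
        return True
    return False  # .dir withheld, so a partial push is never trusted later

pushed = []
assert push_directory(lambda c: pushed.append(c) or True, "1234.dir", ["5678", "90ab"])
assert pushed == ["5678", "90ab", "1234.dir"]  # contents first, .dir last
```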
Potential for problems resulting from users deleting objects from remote on their own still exists
- DVC now keeps a basic local index of directories which are known to be on a given remote
  - SQLite3 DB in `.dvc/tmp/index/<filename>.idx`
  - Filename is the SHA256 hash of the remote URL
.dvc/
├── cache
├── config
├── lock
├── state
├── tmp
│ ├── index
│ │ └── 28fa66f02befae2fd5f57364b126dd498d6534180840005a0f10c03a62cc51a0.idx
│ └── rwlock
├── updater
└── updater.lock
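Deriving the index filename from the remote URL can be sketched in a few lines (the URL below is a made-up example, not from the source):

```python
# Illustrative: deriving the per-remote index filename from the remote
# URL (SHA256 hex digest, ".idx" extension), as described above.
import hashlib

def index_path(remote_url):
    digest = hashlib.sha256(remote_url.encode("utf-8")).hexdigest()
    return f".dvc/tmp/index/{digest}.idx"

path = index_path("s3://example-bucket/dvc-cache")  # hypothetical remote URL
assert path.startswith(".dvc/tmp/index/") and path.endswith(".idx")
assert len(path.split("/")[-1]) == 64 + len(".idx")  # 64 hex chars + extension
```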
- When DVC sees that a `.dir` file exists on the remote, the directory (and its contents) are added to the index
  - Index is also updated for successfully `push`-ed directories
- When querying a remote, DVC validates the index by verifying that all indexed `.dir` files still exist on the remote
  - If an indexed `.dir` file is not found on the remote (because it was removed by `gc -c`, for example), the entire index for that remote is invalidated and cleared
- `cache_exists()` first checks for any indexed files, and then queries the remote for the remaining files
- Standalone files are not indexed
- DVC performance is significantly improved for the case where a user has a large `push`-ed directory, and then adds or modifies a (relatively) small number of files in that directory
file checksum
.
├── data -> 4321.dir
│ ├── bar -> 5678
│ ├── baz -> cdef
│ └── foo -> 90ab
└── data.dvc
Old behavior:
- DVC checks for existence of `4321.dir`, `5678`, `cdef`, `90ab` on remote
New behavior:
- DVC checks for existence of `4321.dir` on remote (does not exist)
- DVC checks index for `5678`, `cdef`, `90ab`
  - `5678`, `90ab` exist in index (since they were in the previous version of `data/`)
- DVC checks for `cdef` on remote
- DVC uses PathInfo objects to represent local filesystem paths and remote URL paths
- PathInfo is slow (relative to string paths) when dealing with large numbers of paths
  - Performance regression from when PathInfos were introduced
- Use string paths in DVC in places where using objects would substantially affect performance
- S3 remote with ~1M total files, local commit containing a single directory with 100k files
- All times in seconds
- Python 3.7, macOS Catalina, DVC installed from pip, dual-core 3.1GHz i7 CPU
rclone usage:
`rclone copy --dry-run --progress --exclude "**/**.unpacked/" .dvc/cache dvc-temp:...`
`dvc status -c` for dataset-registry-private (S3 remote, 2M imagenet files across 2 directories)
- Latest: `dvc status -c  258.39s user 33.38s system 87% cpu 5:31.93 total`
- 0.91.0 (`no_traverse = false`): `802.44s user 87.22s system 42% cpu 35:04.45 total`
- Better indexes
  - Handle standalone files
  - More protection from dangerous use cases (`gc -c`)
- Cache structure improvements
  - Take advantage of prefix-based queries
  - De-duplication (optimize storage rather than runtime performance)