@malkab
Last active August 31, 2021 10:34
DVC (Data Version Control)

A Git for Big Data. Use DVC alongside Git to version big data assets that are too large to keep in the Git repository itself.

Project homepage: https://dvc.org/

Remotes

We currently use SSH remotes. On the machine where the remote's storage folder is physically located, .dvc/config has this structure:

[core]
    remote = storage
[cache]
    local = storage
['remote "storage"']
    url = /mnt/samsung_hdd_1_5tb/dvc_storage/didactica-python_remote_sensing

However, to use the same folder from another machine that has SSH access to it, use:

[core]
    remote = euler-ssh
['remote "euler-ssh"']
    url = ssh://malkab@euler/mnt/samsung_hdd_1_5tb/dvc_storage/didactica-python_remote_sensing
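
The two configurations above point at the same folder; only the URL form differs. A minimal sketch of that relationship, assuming a hypothetical helper named ssh_remote_url (not part of DVC) that builds the SSH URL from a user, a host, and the absolute path on the storage machine:

```shell
# Hypothetical helper (not a DVC command): build the ssh:// URL for a folder
# that is a plain local path on the storage host.
ssh_remote_url() {
    # $1 = user, $2 = host, $3 = absolute path on the host
    printf 'ssh://%s@%s%s\n' "$1" "$2" "$3"
}

# Prints the URL used in the SSH configuration above.
ssh_remote_url malkab euler /mnt/samsung_hdd_1_5tb/dvc_storage/didactica-python_remote_sensing

# Inside a DVC repo, the result could be registered as the default remote:
# dvc remote add -d euler-ssh "$(ssh_remote_url malkab euler /mnt/samsung_hdd_1_5tb/dvc_storage/didactica-python_remote_sensing)"
```

The dvc remote add line is left commented out because it only works inside an initialized DVC repo.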

Usage Pattern

When choosing a remote, sharing one remote among several repos reduces the total remote size if those repos share data, but it makes it harder to move an idle repo's data to other storage media. So, when a set of repos will share a great deal of data, use the same remote for all of them; in every other case, use a separate remote per repo.

When starting a project with DVC, load the data into the repo folder and initialize both Git and DVC. Add the data to DVC; this also adds the tracked paths to .gitignore. Work with the data. When done with the project, the local copy of the DVC-tracked data can be safely deleted once it has been pushed to the remote.
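
The "delete only after pushing" rule can be enforced with a small guard. This is a rough sketch, assuming a hypothetical function name push_and_free; it relies only on dvc push returning a non-zero exit status on failure:

```shell
# Hypothetical helper: push all DVC-tracked data to the default remote and
# delete the local copy of one tracked path only if the push succeeded.
push_and_free() {
    target=$1
    if dvc push; then
        rm -rf -- "$target"
        echo "pushed; removed local copy of $target"
    else
        echo "dvc push failed; keeping $target" >&2
        return 1
    fi
}

# Usage, from the root of the repo (data is the DVC-tracked folder):
# push_and_free data
```

The .dvc file describing the data stays in Git, so the data can be restored later. If the default cache at .dvc/cache is in use, it may still hold a copy of the data after the folder is removed.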

When resuming the project, use dvc status to check whether the data is available. Unavailable data will be reported as deleted, but don't commit this change to DVC! The data was deleted only to save disk space on an idle project. A dvc pull will restore it from the remote.
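
As a rough cross-check that does not call DVC at all, the tracked paths can be read from the .dvc files and tested for existence. This sketch assumes the usual .dvc file layout, where each output is listed under outs with a path: key relative to the .dvc file; the function name missing_dvc_outputs is hypothetical, and this is not how dvc status itself works:

```shell
# Hypothetical helper: list DVC-tracked outputs that are absent from the
# working tree, by reading the "path:" entries of every .dvc file.
missing_dvc_outputs() {
    find "${1:-.}" -type f -name '*.dvc' ! -path '*/.dvc/*' |
    while IFS= read -r dvcfile; do
        dir=$(dirname "$dvcfile")
        # Output paths are relative to the folder holding the .dvc file.
        sed -n 's/^[[:space:]-]*path:[[:space:]]*//p' "$dvcfile" |
        while IFS= read -r out; do
            [ -e "$dir/$out" ] || echo "$dir/$out"
        done
    done
}

# Usage, from the root of the repo:
# missing_dvc_outputs .
```

Every path it prints is a candidate for dvc pull.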

After completely removing the Git repo from the disk, a DVC remote kept in a local folder will remain. After cloning the Git repo again, look at .dvc/config, which is committed to the repo, to find the original local folder of the DVC remote. If it's accessible, a dvc pull will fetch all the data. If not, restore the remote's folder to the disk and update .dvc/config so the project points at it again.
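
The check described above can be scripted. A minimal sketch, assuming the config holds a single remote with an absolute local path as its url; the function names remote_url and check_local_remote are hypothetical:

```shell
# Hypothetical helper: print the first "url = ..." value found in a DVC config.
remote_url() {
    sed -n 's/^[[:space:]]*url[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical helper: report whether the local remote folder is accessible.
check_local_remote() {
    url=$(remote_url "${1:-.dvc/config}")
    case "$url" in
        "")    echo "no remote url found" >&2; return 2 ;;
        *://*) echo "not a local folder: $url" >&2; return 2 ;;
    esac
    if [ -d "$url" ]; then
        echo "remote available: $url"
    else
        echo "remote missing: $url" >&2
        return 1
    fi
}

# Usage, from the root of the freshly cloned repo:
# check_local_remote .dvc/config && dvc pull
```

If the folder had to be restored to a different location, dvc remote modify storage url /new/path updates .dvc/config accordingly (storage being the remote name used in these notes).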

Recipes

# Initialise Git and DVC.
# dvc init generates some files and adds them to the staging area for the
# next commit.
git init
dvc init

# Add a default remote (-d) for DVC. Remotes are recorded in .dvc/config.
# Here a local folder is used; many other remote types are available.
# The second command also makes DVC use that folder as its local cache, so the
# cache doesn't take up additional space.
dvc remote add -d storage /mnt/samsung_hdd_1_5tb/dvc_storage
dvc config cache.local storage

# Add data for tracking with DVC.
# This will hash the files and create the cache. BEWARE!!! THE CACHE CONSUMES
# AS MUCH SPACE AS THE ORIGINAL FILES IF NOT USED IN cache.local MODE (see
# above). By default, the cache is at .dvc/cache.
# DVC-tracked content is automatically added to the .gitignore in the folder
# that contains it.
# For each folder or file added, a .dvc file is created that needs to be
# committed to Git.
dvc add data

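# Commit the .dvc file, .gitignore and DVC config to Git, then push the data
# to the DVC remote and the commits to the Git remote.
git add data.dvc .gitignore .dvc/config
git commit -m "Track data with DVC"
dvc push
git push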

# Once the repo is cloned, a pull will retrieve the big data if the remote is
# available.
git pull
dvc pull

# Check remotes, status, and DVC tracked stuff
dvc status                  # Don't mind "deleted" results here
dvc remote list
dvc list -R --dvc-only .

# Stop tracking a file with DVC. The target is the .dvc file that describes
# the asset. dvc gc -w then removes from the cache anything no longer
# referenced in the workspace.
dvc remove the/file/path.dvc
dvc gc -w