Using ZFS to Version Control Large Datasets #zfs

Imagine you are a machine learning engineer. You're dealing with large datasets.

You are about to perform a large scale change to the data. Or even a small change.

You wish you had your data under version control. You wish you had a git for data.

But the data is too big for git. So you usually just backup via a copy.

But here comes ZFS to the rescue! ZFS is copy-on-write, so a snapshot or clone is created almost instantly and only takes up extra space for the blocks that change afterwards.

You need to create a dataset first:

# assuming your root pool is called rpool
zfs create rpool/data

# if you are using legacy mount
# you also need to use this
# otherwise rpool/data is already mounted under where rpool is mounted
mkdir --parents /tmp/data && \
mount -t zfs rpool/data /tmp/data

Now you need to put your data into rpool/data. Once you have done this, you take a snapshot.

zfs snapshot rpool/data@1

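Now you can work on the data. At any time you can roll back to rpool/data@1, which discards every change made since that snapshot. This works directly while rpool/data@1 is the most recent snapshot; rolling back past newer snapshots needs zfs rollback -r, which destroys them.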

zfs rollback rpool/data@1

You can checkpoint your work by creating extra snapshots:

# do this every time you reach a milestone of changes
zfs snapshot rpool/data@2
zfs snapshot rpool/data@3
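
To see which checkpoints exist and what changed between two of them, zfs list and zfs diff can help. This is a minimal sketch: the function name show_checkpoints is hypothetical, and the dataset and snapshot names follow the examples above.

```shell
# Hypothetical helper: list the snapshots of a dataset, then show
# which files changed between two of them.
show_checkpoints() {
    dataset="$1"   # e.g. rpool/data
    from="$2"      # older snapshot name, e.g. 1
    to="$3"        # newer snapshot name, e.g. 2

    # List all snapshots under the dataset.
    zfs list -t snapshot -r "$dataset" || return 1

    # Show files added (+), removed (-), modified (M) or renamed (R)
    # between the two snapshots.
    zfs diff "${dataset}@${from}" "${dataset}@${to}"
}

# Usage: show_checkpoints rpool/data 1 2
```

Like the other commands here, this may need sudo depending on how permissions are delegated on your system.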

What if you want to work on a different variation of the same data?

You just fork!

zfs clone rpool/data@2 rpool/data2
# you may need to mount rpool/data2

You can then work on rpool/data2. If you decide that this clone should be the primary dataset, you can promote it:

zfs promote rpool/data2

Once you are done, you can destroy all clones, snapshots and datasets in that order.
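
As a rough sketch of that order, assuming rpool/data2 was cloned from rpool/data@2 and was never promoted (the function name teardown is hypothetical):

```shell
# Hypothetical helper: tear everything down in dependency order.
# Assumes rpool/data2 is a clone of rpool/data@2 and was NOT promoted.
teardown() {
    # 1. The clone, which depends on the snapshot rpool/data@2.
    zfs destroy rpool/data2 || return 1

    # 2. The snapshots.
    zfs destroy rpool/data@3 || return 1
    zfs destroy rpool/data@2 || return 1
    zfs destroy rpool/data@1 || return 1

    # 3. The dataset itself.
    zfs destroy rpool/data
}

# Usage: teardown
```

If you did promote rpool/data2, the dependency is reversed: rpool/data becomes the clone and the shared snapshots now belong to rpool/data2, so rpool/data has to be destroyed first.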

The only problem with all of this is that you need to be on a system with ZFS. You may need sudo for all of the above. You need to copy the data into the mounted dataset at the beginning, and you may also need to do some tedious mounting. There's no easy way to turn an existing directory into a ZFS dataset in place: you have to move the directory somewhere else first, then create the dataset mounted at that path, and copy the data in.
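
The move-then-create dance could look something like this. It is only a sketch: dir_to_dataset is a hypothetical name, the directory must be given as an absolute path, and the parent dataset (for example rpool/data) is assumed to exist already.

```shell
# Hypothetical helper: turn an existing directory into a ZFS dataset.
dir_to_dataset() {
    dir="$1"       # absolute path, e.g. /home/me/project/data
    dataset="$2"   # dataset to create, e.g. rpool/data/project

    # 1. Move the existing directory out of the way.
    mv "$dir" "${dir}.orig" || return 1

    # 2. Create the dataset, mounted where the directory used to be.
    zfs create -o mountpoint="$dir" "$dataset" || return 1

    # 3. Copy the contents back, preserving permissions and timestamps.
    cp -a "${dir}.orig/." "$dir/" || return 1

    # 4. ${dir}.orig is left in place; remove it by hand once you
    #    have checked the copy.
}

# Usage: dir_to_dataset /home/me/project/data rpool/data/project
```

The original is deliberately kept as a .orig directory, so you temporarily need enough free space for two copies of the data.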

Since I already have rpool and rpool/tmp, I would create an extra dataset just for these kinds of things, something like rpool/data. Then inside it I would create sub-datasets for specific workspaces, like rpool/data/satellite-imagery.

One just has to remember that dataset names are not directly mapped to filesystem paths: you can mount rpool/data/satellite-imagery anywhere.
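
For example, the mountpoint property controls where a dataset shows up, independently of its name. A minimal sketch, assuming you are not using legacy mounts (make_workspace and the paths are hypothetical):

```shell
# Hypothetical helper: create a workspace dataset and mount it
# at a path of your choosing.
make_workspace() {
    dataset="$1"   # e.g. rpool/data/satellite-imagery
    path="$2"      # e.g. /data/satellite-imagery

    # -p also creates missing parent datasets (such as rpool/data).
    zfs create -p "$dataset" || return 1

    # The mountpoint property decides where the dataset appears.
    zfs set mountpoint="$path" "$dataset" || return 1

    # Print the resulting mountpoint to confirm.
    zfs get -H -o value mountpoint "$dataset"
}

# Usage: make_workspace rpool/data/satellite-imagery /data/satellite-imagery
```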

Alternatively you could create a /data directory, and put everything there.

Then in your projects, you would symlink there. That could be one place to store all the big data things. Ultimately you are most likely keeping this in a remote object store as backups anyway.
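
The symlink step might be as simple as the following sketch. The function name link_dataset and the example paths are hypothetical.

```shell
# Hypothetical helper: give a project a local "data" entry that
# points at the shared data location.
link_dataset() {
    target="$1"    # where the data lives, e.g. /data/satellite-imagery
    project="$2"   # project directory, e.g. /home/me/projects/landcover

    # Creates $project/data as a symlink to $target.
    # Fails if $project/data already exists.
    ln -s "$target" "$project/data"
}

# Usage: link_dataset /data/satellite-imagery /home/me/projects/landcover
```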

xk2600 commented Sep 29, 2020

I have had a very similar question in the back of my mind for a long time... I don't know if you're familiar with the 'fossil-scm vs git' discussion... in short, there are two approaches to version control from an architecture perspective... do we keep fragments as a bunch of files in the operating system or use a database to provide fine-grained tracking of fragments? While I personally love fossil because of its rich feature set out of the box, on large projects it can become frustratingly slow for day-to-day tasks.

Anyways, my point is why not have it both ways? Lots of interesting things happening at the moment... OpenZFS is moving to a single codebase between FreeBSD, Illumos, and Linux (also ZFS on FUSE is fairly stable). Why not wrap a toolchain around leveraging ZFS as a source control system?

I'm sure there are some limitations/caveats that might compel one to think this is a bad idea, like block size on modern hardware being 4096B, but it would be an interesting project. ZFS in and of itself as you've pointed out in your gist provides native versioning. Plus, I can see where snapshots + zfs send/recv would be massively beneficial. The current ZoL project is also working on firming up native encryption, and I believe we are going to have full support for it in 13-current on FreeBSD. (albeit metadata and other non data information is not encrypted) Native encryption could provide native support for repos (datasets) which could host "secret" data, which has been something lacking in git, so much so some fine fellows wrote git-secret. Also, due to ZFS's hierarchical nature internally with external mount points, it would be very easy to have "sub module" like functionality in repos.

I am going to try and put something rough together (continuing where you left off) and share if/when I get something which might interest you or others...

Anyways, I was just dropping a note, as I appreciate your gist. If you're interested in such a project, feel free to message me.

(edited for clarity and to flesh out thoughts I forgot.)

CMCDragonkai (Author) commented

Yea that would be cool!

VictorieeMan commented

Great note; cheers!
