@cflewis
Last active August 23, 2017 13:34

Non-Local File Systems Should Be Supported

TL;DR

Files are all over the Internet, and Go should provide a reasonable abstraction in its stdlib to handle this.

Problem

Files are everywhere. Local disk is the obvious place. But they're also in storage buckets like Google Cloud Storage and Amazon S3. They're in code repositories like GitHub and Google Source. They're in personal storage like Dropbox and Evernote. They're on other machines or filesystems on corporate intranets. They're even in volatile memory: tests that write files are much less error-prone if they write to something like a RAMFS instead of to disk.

Go 1's os package provides an os.File struct. The implicit assumption is that local disk storage is the only thing the stdlib should care about, and that external packages should provide access to these other sources. In 2007, this would have been a reasonable approach. In 2017, files are increasingly unlikely to be local.

For the Open Source Programs Office, we have a need to scan files, whether they exist on GitHub, Google Source, or locally. We want those scanners to be agnostic to how the file is opened and read, and those scanners need to be able to dictate how they walk the file system.

Why the stdlib isn't enough

  • io.Reader doesn't carry a path with it, so you'd need to wrap it in a structure that also holds the path if you want to do any meaningful logging. You'll also need to carry along information about where the remote file repository is. You might be interested in other things, like os.Stat information. If you want to open other files from the same repository that a given file is in, you'll need a pointer to something that implements that too.
  • filepath.Walk() takes a string and only walks the local disk.
  • Not every system you interact with can present a remote file repository as a local disk using FUSE, and where it can, FUSE is a new dependency that has to be downloaded (separately from go get?).
  • os.File is a struct, so you can't simulate a remote file repository as a local disk using native Go.
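To make the first point concrete, here is a rough sketch of the kind of wrapper you end up writing around io.Reader. Everything in it (RemoteFile, Repository, newExampleFile, the example URL) is a name I made up for illustration. None of it comes from the stdlib or from Afero.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// Repository is a hypothetical interface for "something that can open
// other files from the same place". It is not part of the stdlib.
type Repository interface {
	Open(path string) (io.ReadCloser, error)
}

// RemoteFile is a hypothetical bundle of everything that an io.Reader
// alone does not carry: the path, the repository location, stat-like
// metadata, and a handle for opening sibling files.
type RemoteFile struct {
	io.Reader             // the file contents
	Path      string      // path within the repository, for logging
	RepoURL   string      // where the repository lives
	Info      os.FileInfo // optional metadata; may be nil
	Repo      Repository  // optional handle for sibling files; may be nil
}

// Describe returns a log-friendly label for the file.
func (f RemoteFile) Describe() string {
	return f.RepoURL + "/" + f.Path
}

// newExampleFile builds a RemoteFile backed by an in-memory string.
func newExampleFile() RemoteFile {
	return RemoteFile{
		Reader:  strings.NewReader("hello"),
		Path:    "README.md",
		RepoURL: "https://example.com/repo", // placeholder location
	}
}

func main() {
	f := newExampleFile()
	data, err := io.ReadAll(f) // RemoteFile satisfies io.Reader by embedding
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("%s: %d bytes\n", f.Describe(), len(data))
}
```

Every program that needs this ends up inventing its own version of RemoteFile, and none of those versions are compatible with each other.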

Current solutions force programs to be aware of the file system abstraction

As far as I know, there are two packages that deal with abstracting the file system:

  • Afero, starred 1062 times.
  • OKLog FS, entire package starred 1704 times.

At OSPO, we used Afero. We pass around an afero.Fs (file system) along with a url.URL that has a fake scheme (e.g. github://golang/go/master/README.md). The file system parses the URL and uses it to find the correct method to retrieve a file (such as cloning the Git repository to /tmp and then pointing all future accesses to that cache).
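As a minimal sketch of the URL half of that approach, the following uses only net/url and leaves Afero out. The Location type, the ParseLocation function, and the owner/repo/ref/path reading of the example URL are assumptions made for illustration. They are not the code we actually run.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Location is a hypothetical decomposition of a fake-scheme URL such as
// github://golang/go/master/README.md.
type Location struct {
	Backend string // URL scheme, used to pick the retrieval method
	Owner   string // URL host
	Repo    string // first path segment
	Ref     string // second path segment (branch or tag)
	Path    string // remainder: the file path inside the repository
}

// ParseLocation splits raw into its parts, assuming the layout
// scheme://owner/repo/ref/path. That layout is an assumption of this
// sketch, not a documented convention.
func ParseLocation(raw string) (Location, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return Location{}, err
	}
	parts := strings.SplitN(strings.TrimPrefix(u.Path, "/"), "/", 3)
	if u.Scheme == "" || u.Host == "" || len(parts) < 3 {
		return Location{}, fmt.Errorf("unsupported location %q", raw)
	}
	return Location{
		Backend: u.Scheme,
		Owner:   u.Host,
		Repo:    parts[0],
		Ref:     parts[1],
		Path:    parts[2],
	}, nil
}

func main() {
	loc, err := ParseLocation("github://golang/go/master/README.md")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("%+v\n", loc)
}
```

The Backend field is what a file system implementation would switch on to decide between, say, cloning a Git repository and reading from local disk.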

This is OK in practice, but it forces the program and any API consumers to be aware of this library in a way that is inelegant. A standard mechanism shared across all Go programs would improve readability, reliability and code sharing.

Passing a filesystem as a parameter also has non-obvious limitations. For example, it is not possible to switch between repositories within the same function invocation: the function has to use the file system it was passed. I bring this up not to discount that perhaps there was a better way to do it, but to note that such gotchas are hard to see before going far down an implementation, and that Gophers often look to the stdlib for the better way to do something.

Where this is done in Go 1.0's stdlib

This is a very similar issue to how the database/sql package talks to different drivers. Any Go program can be written almost entirely agnostic of the database driver outside of the main package. The main package can import the drivers it wants to allow, define a data source name adapter that means something to the driver, and pass around a *DB from that point on. This is fairly analogous to passing an afero.Fs. But it has all the benefits of being standardized: readability, writability, and the ability to share file system accessors trivially. Being able to simply download a new driver for a new database system in a matter of seconds without needing to rewrite code is awesome. This isn't possible for file repositories.
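For readers who haven't looked at how database/sql does this, here is a small sketch of the registration pattern. The stubDriver type and the "stub" driver name are hypothetical and exist only so the example runs without a real database. A real driver registers itself in its own package's init, and main pulls it in with a blank import.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
)

// stubDriver is a hypothetical driver that exists only to demonstrate
// registration. It refuses every connection attempt.
type stubDriver struct{}

// Open implements driver.Driver.
func (stubDriver) Open(name string) (driver.Conn, error) {
	return nil, errors.New("stub driver: connections are not implemented")
}

func init() {
	// Real drivers do this in their own package; sql.Register panics if
	// the same name is registered twice.
	sql.Register("stub", stubDriver{})
}

// openDB is the only place that names a driver. Everything downstream
// just receives a *sql.DB.
func openDB(dsn string) (*sql.DB, error) {
	return sql.Open("stub", dsn)
}

// registered reports whether a driver with the given name is available.
func registered(name string) bool {
	for _, d := range sql.Drivers() {
		if d == name {
			return true
		}
	}
	return false
}

func main() {
	db, err := openDB("stub://anything")
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer db.Close()
	fmt.Println("stub registered:", registered("stub"))
	// sql.Open does not connect; the driver is first used on Ping.
	if err := db.Ping(); err != nil {
		fmt.Println("ping:", err)
	}
}
```

Swapping databases means changing the driver name and data source name in openDB and nothing else, which is the property I'd like file repositories to have.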

Proposed Solution

I know that experience reports are supposed to be problems and not solutions, but in order to start the conversation, I'd take the approach from sql and modify it slightly. A Register() function can be used to register new file systems. os.Open is translated to something like fs.Open(driver, path string) fs.File (where fs.File is an interface, not a struct) and a new Walk(driver, root string, walkFn WalkFunc) is written that can walk these repositories.
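To give the conversation something concrete to poke at, here is a minimal sketch of that shape. It is one possible reading, not a worked-out design. The Driver interface, its List method, the in-memory memDriver, and the error return on Open are all assumptions I've added so the example is runnable.

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"strings"
)

// File is an interface here, unlike the os.File struct.
type File interface {
	io.ReadCloser
	Name() string
}

// Driver is a hypothetical contract each file system would implement.
type Driver interface {
	Open(path string) (File, error)
	List(root string) ([]string, error) // every file path under root
}

// WalkFunc is called once per file; returning an error stops the walk.
type WalkFunc func(path string, err error) error

// drivers is not guarded by a mutex; a real registry would need one.
var drivers = map[string]Driver{}

// Register makes a driver available under name, much like sql.Register.
func Register(name string, d Driver) { drivers[name] = d }

// Open opens path using the named driver.
func Open(driver, path string) (File, error) {
	d, ok := drivers[driver]
	if !ok {
		return nil, fmt.Errorf("fs: unknown driver %q", driver)
	}
	return d.Open(path)
}

// Walk calls walkFn for every file the named driver lists under root.
func Walk(driver, root string, walkFn WalkFunc) error {
	d, ok := drivers[driver]
	if !ok {
		return fmt.Errorf("fs: unknown driver %q", driver)
	}
	paths, err := d.List(root)
	if err != nil {
		return walkFn(root, err)
	}
	for _, p := range paths {
		if err := walkFn(p, nil); err != nil {
			return err
		}
	}
	return nil
}

// memDriver is a toy in-memory file system mapping path to contents.
type memDriver map[string]string

type memFile struct {
	*strings.Reader
	name string
}

func (f memFile) Name() string { return f.name }
func (memFile) Close() error   { return nil }

func (m memDriver) Open(path string) (File, error) {
	data, ok := m[path]
	if !ok {
		return nil, fmt.Errorf("mem: %s: no such file", path)
	}
	return memFile{strings.NewReader(data), path}, nil
}

func (m memDriver) List(root string) ([]string, error) {
	var out []string
	for p := range m {
		if strings.HasPrefix(p, root) {
			out = append(out, p)
		}
	}
	sort.Strings(out) // map order is random; sort for a stable walk
	return out, nil
}

func main() {
	Register("mem", memDriver{"docs/a.txt": "alpha", "src/main.go": "package main"})
	err := Walk("mem", "docs/", func(path string, err error) error {
		fmt.Println(path)
		return err
	})
	if err != nil {
		fmt.Println("walk failed:", err)
	}
}
```

A scanner written against Open and Walk never needs to know whether "mem" is RAM, a Git checkout, or a storage bucket. Switching repositories mid-function is just a different driver string.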
