@ First of all, can you tell us a bit more about yourself? What do you do, and where do you live?
I'm a computer programmer from Portland, OR. Since 2014 I've led a small grant-funded team called Dat that builds tools to bring more reproducibility to scientific data sharing and analysis. Our work is 100% open source and funded entirely through grants, and we are housed inside a not-for-profit organization called US Open Data.
One way to think about Dat is as an independent software research and development team. I'd say our main focus is to introduce ideas from the open source software world into the world of scientific computing and data, which doesn't prioritize or fund many general-purpose software tools. Our current major funder, the Alfred P. Sloan Foundation, has been doing a lot of grant funding in this area recently to try to get universities to start investing in data science tools and infrastructure, since it impacts so many academic disciplines now.
@ How are you associated with the Node.js project?
I'm not involved with the Node.js core project directly, but I am active in the community. Back in 2013 I helped start NodeSchool (http://nodeschool.io/), which has grown into an amazing community of over 130 chapters globally, with thousands of people learning and teaching Node.js using open source curriculum.
@ What are the other open source projects you are involved with?
I use Node.js for all sorts of weird things, from distributed systems to 3D graphics, and have published nearly 300 modules to npm over the last 4 years (https://www.npmjs.com/~maxogden).
@ OK, now let's talk a bit about HyperOS. What is it?
HyperOS is a distribution of TinyCore Linux that we created to support our use case of running containers on top of a version-controlled distributed filesystem. We found existing tools like Docker great at deploying your code to a server, but very difficult to use if you wanted someone to run your container on their laptop.
HyperOS came out of the Dat project, which is a dataset version control tool. Our goal with Dat is to make it easy to share datasets, big or small, over a network. We realized it would be really powerful to be able to version control the machine environment (a Linux VM) and the dataset (raw data files like CSVs) together in the same repository.
A huge problem in scientific software, and to a lesser extent open source software in general, is getting someone else's code to run on your machine. Sometimes it's easy, but sometimes it takes hours or days of debugging someone's Makefile to make it work on your system. There's a saying in science, "works on my machine", which is the typical dismissive answer you receive upon asking someone why the code to reproduce the results in their scientific paper didn't work. We're trying to address this problem, and HyperOS is one of our ideas about how to do so.
@ Who are the target users of HyperOS?
Our main audience is scientists, because that's who we are currently paid to write software for. However, we are making sure none of our tools are specific to scientific use cases. We would love to get feedback on our tools from anyone who is interested in using containers to do software dependency management (e.g. fulfilling a similar need to apt-get or Homebrew).
@ Can you give us some use cases where HyperOS makes sense?
Say you have a 2GB CSV file and a Python script that imports it into a PostgreSQL database, runs a query, and generates a PNG chart with gnuplot. You now want to package this up in a container and have a colleague reproduce your results on their laptop. You're both running Mac OS X.
Option A is to tell them to install PostgreSQL, Python and gnuplot manually, download the CSV file and run your Python script. This might not sound that hard to some, but there are so many different variables in play that could cause the entire process to fail. You might be using different versions of Python, PostgreSQL or gnuplot. The CSV URL might return a 404. Your operating system might not have a distribution of one of the software dependencies available.
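A minimal sketch of that manual flow, assuming Homebrew and using hypothetical file and URL names, shows how many of those variables are out of your control:

    brew install python postgresql gnuplot    # versions depend on what Homebrew ships that day
    curl -O https://example.org/results.csv   # breaks the moment this URL 404s
    python analysis.py                        # assumes the author's Python version and packages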
Option B is to use Docker. You could install the Docker Toolbox on your machine (currently around 175MB), which includes VirtualBox. Then you could open the Docker Quickstart Terminal app it installs, create a Dockerfile, and finally build and publish a Docker image to the Docker Hub. Your colleague would also have to install the Docker Toolbox, open the Docker Quickstart Terminal and pull your image. Once the entire image is done downloading, they can run your container in the terminal.
To get the 2GB CSV, you could either put a script in the container that uses curl to download it when the container gets run, meaning the URL where the CSV lives hopefully never 404s (a common occurrence in science), or you could include the 2GB file inside the built Docker image, meaning you'd have to rebuild the image every time the CSV changes.
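In standard Docker commands, that flow looks something like this (the image name here is hypothetical):

    docker build -t you/analysis .    # build the image from your Dockerfile
    docker push you/analysis          # publish it to the Docker Hub
    # ...then on your colleague's machine, in the Docker Quickstart Terminal:
    docker pull you/analysis
    docker run you/analysis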
We think this kind of flow involves too many complicated steps for scientists. For example, scientists have long favored flat file formats over complicated databases, even when the databases are more powerful, because at the end of the day they only care about the science -- the code itself is just a means to that end, and they aren't willing to invest in what might turn out to be technical debt.
Most people use Docker for secure cloud deployments, and it's great for that. We think containers could also be very useful for local software dependency management.
Option C is to use HyperOS. It downloads 14MB and runs a Linux VM; the whole process usually takes less than 1 minute. Then it can do what we call "live booting", where we mount a lazy virtual filesystem and spawn a container on top of it. The filesystem is managed on your host OS by a tool we wrote called HyperFS, which you can think of as a version-controlled distributed filesystem. The defining feature of HyperOS is that its actual filesystem is immutable. This is in part thanks to the way TinyCore Linux works, but also because we run 100% of user code inside HyperOS on top of virtual filesystems that are persisted into volumes on the host OS.
We don't have to download the entire filesystem in order to live boot it; we just have to get the filesystem metadata (the filenames, lengths and permissions). When Linux needs to read a file like /bin/bash, we fetch it on demand from the remote data source (which could be a single place like the Docker Hub, or a P2P network like BitTorrent). This means you only download the data you actually use in the container. Instead of downloading 600MB to run a shell script, we can live boot a container to a bash prompt with only 50MB.
Another huge difference between Docker and HyperOS is that we can do version control for containers. Since everything runs on top of our version-controlled filesystem, we can explore exciting new possibilities for containers, such as forking someone's container, installing or modifying some software and sending them a diff.
@ It seems to be using a different way to install Linux 'on' Mac OS X. Can you explain what the 'npm install linux' project is?
We are using the new Hypervisor.framework that Apple released as part of OS X Yosemite. It is an operating-system-level hypervisor that is built into Mac OS. We're using the xhyve project (https://github.com/mist64/xhyve) to interface with it, which is a C port of the bhyve hypervisor from FreeBSD.
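To give a feel for it, booting a Linux kernel with xhyve looks roughly like this (flags as in the xhyve README's example script; the kernel and initrd paths are placeholders):

    xhyve -m 1G -c 1 \
        -s 0:0,hostbridge -s 31,lpc -l com1,stdio \
        -s 2:0,virtio-net \
        -f kexec,vmlinuz64,initrd.gz,"earlyprintk=serial console=ttyS0"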
@ What's the reason behind using npm for it? Any clear advantages (any obvious ones against using a virtualized environment)?
We use npm to install our command line tools and to download the 14MB HyperOS distribution. We find npm to be an excellent choice for writing and distributing command line tools.
@ Can you talk a bit about the install process, how does it work?
When you run npm install linux -g, it downloads hypercore (https://github.com/maxogden/hypercore), which is our custom TinyCore build. It includes the base TinyCore rootfs plus OpenSSH, OpenSSL and our virtual filesystem mounting utility, hyperfused (https://github.com/mafintosh/hyperfused). It also downloads the TinyCore vmlinuz64 64-bit Linux kernel binary.
We include a 250KB xhyve binary, compiled for 64-bit Mac OS. The rest of the linux package is our command line interface, which lets you spawn hypercore + xhyve and execute commands inside the VM over SSH.
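In practice the whole flow fits in a few commands (the exact subcommand names below are illustrative and may differ between releases):

    npm install linux -g    # fetches the CLI, the bundled xhyve binary and the 14MB hypercore distribution
    linux boot              # spawn the VM through xhyve/Hypervisor.framework
    linux ssh               # open a shell inside the VM over SSH
    linux halt              # shut the VM down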
@ I am curious why you targeted OS X (though Windows support is coming). Is OS X more popular among Linux users?
We found xhyve/Hypervisor.framework to be pretty simple to work with, since it's built into the OS and has a completely programmatic API. To support Windows we need to integrate with Hyper-V, which will involve some manual setup steps for the user and potentially some PowerShell scripting on our part. We could also go the route of bundling VirtualBox, which is what Docker does, but we really want to try to keep the dependencies as simple as possible.
@ What are the core components of the project?
We took the Merkle Directed Acyclic Graph design used by Git and built a distributed filesystem on top of it. This gives us version control capabilities at the filesystem level. Since Linux containers are just filesystems (because everything in Linux is a file), we can replicate and version control containers. The last piece is a way to execute the containers, which requires running Linux. The linux module on npm is our way to do that on Mac OS and eventually Windows, using the operating system hypervisors. Linux users won't need any special hypervisor software.
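For readers unfamiliar with that design, Git's own plumbing shows the core idea: every object is stored under the hash of its contents, and directories reference their children by hash, so a single root hash pins an entire filesystem state. You can see it with real git commands:

    git init /tmp/demo && cd /tmp/demo
    echo 'hello' | git hash-object -w --stdin                   # store a blob, addressed by its SHA-1
    # ce013625030ba8dba906f756967f9e9ca394464a
    git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a    # read it back by hash

Change any file and every hash up the tree changes with it, which is what makes cheap diffs and replication possible.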
@ What are your plans to integrate HyperOS with the npm install linux project?
We're in the process of integrating HyperOS with our Dat command line tool to simplify this workflow down to a single command (e.g. "dat run"). We want to make sure HyperOS stays a standalone project in the spirit of the Unix philosophy, and Dat is just one project that uses it. We also want to use as many existing container standards as we can; e.g. we are looking into being able to run Docker containers inside HyperOS.
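The end-to-end flow we're aiming for would look something like this (these commands are hypothetical; only "dat run" comes from our current plans):

    dat clone <dataset-link>    # hypothetical: fetch the data plus the container filesystem metadata
    cd dataset
    dat run                     # live boot the container and rerun the analysis against the data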
@ Will there be any commercial offering based on the integration?
Not at this time. Luckily we are grant-funded and can hopefully continue to write open source software with future grants. If you would like to talk to us about supporting our work with grants, please contact me! It's difficult to find funding for these kinds of fundamental open source public-good utilities, and we are very interested in continuing down this path.
@ What's in there for enterprise customers, does it benefit them?
I think enterprise customers will still use tools like Docker to deploy their containers into production. I see HyperOS as more of a developer tool for local development.
@ The 'npm install linux' project is in its early stages; what are your long-term goals?
Dat is currently in beta, and our vision for 1.0 is to combine our filesystem version control with the HyperOS container runtime. We want to bring an end to the "works on my machine" excuse by providing a version control tool that can share reproducible code and data workflows between collaborators.
To get involved with the project, check out our website http://dat-data.com/ or the #dat channel on irc.freenode.net.