armandmcqueen/notebook_persistence.md

Bind Mounts in Notebook Persistence

We want to store the state of a notebook in NFS so that if the container dies, all of the vital state is persistent and minimal work is lost.
There have been a couple of different proposals for how EFS should be laid out and how bind mounts should work. It's gotten confusing for me, so I'm summarizing them here to see if I understand them correctly.
Note on terminology: "Notebook" with a capital 'N' refers to the Determined concept of a Notebook, which is an instance of JupyterLab. "notebook" with a lowercase 'n' refers to a Jupyter notebook (a .ipynb file).
In all cases, there will be a directory on EFS called /shared-data that will be bind-mounted into the JupyterLab container at /shared-data. It can be used to share datasets (or anything else) between every Notebook that runs in Determined.
Data Model 1: State is linked to a User

In this model, state such as notebook files and other files generated while using a notebook is linked to a User and not to a specific Determined Notebook. One Notebook by a user can cause another Notebook by that user to no longer work, by changing state that the second Notebook relies on. Determined knows that the state for a Notebook is somewhere in the user's EFS folder, but it has no way to know what data is needed for which Notebook.
EFS Layout

/shared-data

/user-data/userA
/user-data/userB
/user-data/userC

Option 1 - mount the user's user-data directory

In this option, the container mounts the user's user-data folder and uses that folder as the working directory.
When JupyterLab opens, the user is in the /run/determined/workdir directory, which is backed by the /user-data/userA directory on EFS. If they were to open a new Notebook, they would see the same thing.
To share data with another user, they would need to use /shared-data or manually use an external tool like cloud storage.
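To make the Option 1 layout concrete, here is a minimal sketch of the mounts a Notebook container would get. `BindMount` and `option1_mounts` are hypothetical names made up for illustration, not Determined's actual API; only the paths come from the description above.

```python
from dataclasses import dataclass

WORKDIR = "/run/determined/workdir"


@dataclass(frozen=True)
class BindMount:
    """Hypothetical record: a container path backed by a path on EFS."""
    container_path: str
    efs_path: str


def option1_mounts(username: str) -> list[BindMount]:
    """Mounts for Option 1: the user's own directory backs the working directory."""
    return [
        # Shared by every Notebook that runs in Determined.
        BindMount("/shared-data", "/shared-data"),
        # Per-user state: every Notebook this user starts sees the same files.
        BindMount(WORKDIR, f"/user-data/{username}"),
    ]


for mount in option1_mounts("userA"):
    print(f"{mount.container_path} -> {mount.efs_path}")
```

Note that nothing in the mount list identifies the Notebook, only the user, which is why two Notebooks from the same user end up sharing (and possibly clobbering) the same state.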
Questions/thoughts:


When a user creates a Notebook, we show them an example notebook. Where do we put that? /user-data/userA/notebook.ipynb?
We fill in jupyter-conf.py with container-specific information (e.g. c.NotebookApp.base_url = "/proxy/98e760cb-d0f1-4c9f-b51e-39a94d2b99c4/"). Where will that go so that it's always correct? Probably not EFS.
When EFS starts to fill up, Determined cannot help with the cleanup process as it does not know which data is linked to which Notebook. Data scientists will have to manually clean up their user-data directories.
What do we do about user-data when a user is deleted since no one else has access to that user's user-data?
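On the jupyter-conf.py question, one possibility is to render the file at container start into a directory that is not bind-mounted, so a stale base_url can never be picked up from EFS. This is only a sketch: `render_jupyter_config`, `write_container_local_config`, and the idea of a separate container-local config directory are assumptions for illustration, not how Determined currently does it.

```python
import tempfile
from pathlib import Path


def render_jupyter_config(proxy_id: str) -> str:
    """Render the container-specific settings for one Notebook container."""
    return f'c.NotebookApp.base_url = "/proxy/{proxy_id}/"\n'


def write_container_local_config(proxy_id: str, config_dir: Path) -> Path:
    """Write jupyter-conf.py into a directory that lives only in the container."""
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "jupyter-conf.py"
    config_path.write_text(render_jupyter_config(proxy_id))
    return config_path


# A temp directory stands in for a container-local path outside any bind mount.
local_dir = Path(tempfile.mkdtemp()) / "jupyter"
path = write_container_local_config("98e760cb-d0f1-4c9f-b51e-39a94d2b99c4", local_dir)
print(path.read_text(), end="")
```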

This model seems not great: sharing data across users is important for larger DS teams.
Option 2 - mount the entire user-data directory

In this option, the user can see the entire user-data directory, but has write access only to their own directory.
This model seems fine for the short term, but I don't understand this model well.
We can't have the container's home directory /run/determined/workdir be backed by the /user-data folder. Where is the user-data folder mounted? How do we ensure that when someone opens a Notebook and starts working on a new notebook that notebook file is persisted on EFS? How are you going to ensure that the user has write access to their user-data folder but not to anyone else's? The container always executes as root.
We could do two bind mounts?
/run/determined/workdir -> /user-data/userA
/user-data -> /user-data
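Here is a rough sketch of how those two mounts could behave if the /user-data mount were read-only. The read-only flag is my assumption (it is one possible answer to the write-access question above, not something that has been proposed), and `BindMount`, `option2_mounts`, and `write_persists_to_efs` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class BindMount:
    """Hypothetical record: a container path backed by a path on EFS."""
    container_path: str
    efs_path: str
    read_only: bool


def option2_mounts(username: str) -> list[BindMount]:
    return [
        # Writable: the user's own directory backs the working directory.
        BindMount("/run/determined/workdir", f"/user-data/{username}", read_only=False),
        # Assumed read-only: everyone's directories are visible but not writable.
        BindMount("/user-data", "/user-data", read_only=True),
    ]


def _covers(mount_path: str, target: PurePosixPath) -> bool:
    base = PurePosixPath(mount_path)
    return target == base or base in target.parents


def write_persists_to_efs(mounts: list[BindMount], container_path: str) -> bool:
    """True if a write to this container path would land on EFS."""
    target = PurePosixPath(container_path)
    covering = [m for m in mounts if _covers(m.container_path, target)]
    if not covering:
        return False  # Container-local path: lost when the container dies.
    # The most specific (deepest) mount wins.
    deepest = max(covering, key=lambda m: len(PurePosixPath(m.container_path).parts))
    return not deepest.read_only


mounts = option2_mounts("userA")
print(write_persists_to_efs(mounts, "/run/determined/workdir/new.ipynb"))  # True
print(write_persists_to_efs(mounts, "/user-data/userB/data.csv"))  # False
```

Under this sketch the user can only write to their own data through /run/determined/workdir. Even /user-data/userA is read-only when reached through the /user-data mount.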

You have full read access to everything anyone does on the cluster. That is probably not going to work long-term with enterprise customers, but it is consistent with our current perimeter-based approach to authZ.
Questions/thoughts:


Same questions/thoughts as Option 1
UserA might rely on data in /user-data/userB, so UserB could end up breaking UserA's code.
We will eventually need to support AuthZ and I'm not certain how we would make that work in this option.

Data Model 2: State is linked to a Notebook

In this model, instead of state being linked to a user, state is linked to a Notebook. Notebook state does not depend on who created it: Notebooks can be transferred, reproduced, forked, and deleted. Notebook state can only be changed by interacting with the Notebook; it cannot be changed by accident. Determined understands how state stored in EFS relates to Notebooks, which means that Determined can do a lot more of the heavy lifting around managing state.
EFS Layout

/shared-data

/notebook-data/notebook-1
/notebook-data/notebook-2
/notebook-data/notebook-3

Option 3 - mount the notebook directory

In this option, the container mounts a Notebook state folder and uses that as the home directory. In most cases it will create a new notebook-data directory, either from scratch or by copying an existing notebook-data directory. In some specific cases it will reuse an existing notebook-data directory.
When JupyterLab opens, the user is in the /run/determined/workdir directory, which is backed by the /notebook-data/notebook-1 directory on EFS. If they were to open a fresh Notebook, they would be put in a /run/determined/workdir backed by /notebook-data/notebook-2. If they were to fork Notebook 1, they would be put in a /run/determined/workdir backed by /notebook-data/notebook-3 (which is a copy of /notebook-data/notebook-1).
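The create and fork cases can be sketched as plain directory operations. The function names are hypothetical, and a temp directory stands in for /notebook-data on EFS. A real implementation would also have to decide who allocates Notebook IDs and what happens if a copy is interrupted halfway.

```python
import shutil
import tempfile
from pathlib import Path


def create_notebook_dir(notebook_data_root: Path, notebook_id: str) -> Path:
    """Create an empty state directory for a fresh Notebook."""
    target = notebook_data_root / notebook_id
    target.mkdir(parents=True, exist_ok=False)  # Refuse to reuse an existing ID.
    return target


def fork_notebook_dir(notebook_data_root: Path, source_id: str, new_id: str) -> Path:
    """Create a new Notebook's state as a full copy of an existing one."""
    source = notebook_data_root / source_id
    target = notebook_data_root / new_id
    shutil.copytree(source, target)  # Raises FileExistsError if target exists.
    return target


# Stand-in for /notebook-data on EFS.
root = Path(tempfile.mkdtemp())
nb1 = create_notebook_dir(root, "notebook-1")
(nb1 / "analysis.ipynb").write_text("original")

nb3 = fork_notebook_dir(root, "notebook-1", "notebook-3")
(nb3 / "analysis.ipynb").write_text("changed in the fork")

# The fork's edits do not touch the source Notebook's state.
print((nb1 / "analysis.ipynb").read_text())  # original
```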
Questions/thoughts:


IMO this more accurately models how state relates to Notebooks in the long term, particularly around reproducibility, sharing of Notebooks, and automatically managing storage/infrastructure. In the "state is linked to a User" model, I feel like we will have to rearchitect as soon as more requirements come in, such as someone wanting to add AuthZ to the product.
This has some upfront cost that will pay off longer term - since we are managing state instead of having the user manage their own user-data folder, we need to provide tools to do management tasks like cleanup.
Notebooks should be fully contained objects in our system, like Experiments, so that we can apply auth around them, fork them, reproduce them, and describe the amount of disk space each one is taking up.
In this model, Notebooks have state, but since Notebooks aren't persisted in the DB, state will outlive the Notebook and so can't be deleted through the UI. This model makes less sense when Notebooks aren't persisted to the DB. Is that not a part of Notebook persistence?
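As a sketch of the management tooling this model makes possible, the following reports disk usage per Notebook and lists state directories that no known Notebook refers to. It assumes Determined has some record of known Notebook IDs (which, per the last point, it does not have today). All names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def directory_size_bytes(directory: Path) -> int:
    """Sum the sizes of all regular files under a directory, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            file_path = Path(dirpath) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def notebook_disk_usage(notebook_data_root: Path) -> dict[str, int]:
    """Map each Notebook state directory name to its size in bytes."""
    return {
        child.name: directory_size_bytes(child)
        for child in sorted(notebook_data_root.iterdir())
        if child.is_dir()
    }


def orphaned_notebook_dirs(notebook_data_root: Path, known_ids: set[str]) -> list[str]:
    """State directories that no known Notebook refers to."""
    return sorted(
        child.name
        for child in notebook_data_root.iterdir()
        if child.is_dir() and child.name not in known_ids
    )


# Stand-in for /notebook-data on EFS.
root = Path(tempfile.mkdtemp())
(root / "notebook-1" / "outputs").mkdir(parents=True)
(root / "notebook-1" / "a.ipynb").write_bytes(b"x" * 100)
(root / "notebook-1" / "outputs" / "result.bin").write_bytes(b"x" * 50)
(root / "notebook-2").mkdir()
(root / "notebook-2" / "b.ipynb").write_bytes(b"x" * 10)

print(notebook_disk_usage(root))  # {'notebook-1': 150, 'notebook-2': 10}
print(orphaned_notebook_dirs(root, known_ids={"notebook-1"}))  # ['notebook-2']
```

In the "state is linked to a User" model there is no equivalent of `known_ids`, since nothing ties a file in a user's directory to a particular Notebook.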