Almost every notebook contains a `pd.read_csv(file_path)`, or a similar command to load data. Dealing with file paths, however, is rather troublesome: moving notebooks around becomes a problem, and the notebook suddenly has to know about the project's structure. In this notebook, we discuss a couple of approaches to handle this problem.
Starting a notebook is always easy: you just write a couple of cells, which often contain a `df.head()`. However, as the project grows (and in industry it always does), you will need to organize your folders. You will need a folder for the data and another folder for notebooks. As the EDA progresses, you will need further folders representing different subsections of the main analysis. On top of that, your project should be reproducible, so that your peers can download the code, run the scripts, and everything will work as intended (hopefully yielding the same results you had :).
So, if you have a `read_csv(relative_path_to_data)` in your notebook, moving the notebook from one folder to another will require a change in the code. This is undesirable: we would like the notebook to work regardless of its location. You could solve this by using `read_csv(absolute_path_to_data)`, but this is even worse: you will deal with paths lengthier than they need to be, and your code will probably break if you try to run it on another machine.
Let's say you have your working directory at `/system_name/project`, from which you run `jupyter lab` or `jupyter notebook`. The data directory is located at `/system_name/project/data`, and your notebooks are in `/system_name/project/notebooks`.
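Sketching that layout as a tree (the dataset and notebook names are hypothetical):

```
/system_name/project
├── data
│   └── my_dataset.csv        # hypothetical dataset
└── notebooks
    └── exploratory.ipynb     # hypothetical notebook
```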
In this notebook, we will propose two ways to solve this problem:
- Using an environment variable inside a notebook
- Using a data module
With this approach, we inform the system of the data directory's location through an environment variable. Before starting a Jupyter server, we set the variable: https://gist.github.com/d8136136b0e720d73d7bbb4c05f99a2c
If you are in the `/system_name/project` folder, you can run:
https://gist.github.com/2f7e1bf63104c5c2c4e116f244ee1c66
to achieve the same effect. This variable is now accessible to all child processes you start from your bash terminal. In your notebooks, you can do: https://gist.github.com/8879089e1cefb6797feff4fbb4ab727d
Now, the only thing your notebook needs to know is the `file_name` of the dataset. Sounds fair, right?
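The linked gists are not reproduced here, but the idea can be sketched as follows. This assumes the variable was exported under the name `DATA_DIR` and that the dataset name is hypothetical:

```python
import os
from pathlib import Path

# DATA_DIR is the variable exported before launching the Jupyter server;
# fall back to the current directory so this sketch runs anywhere.
data_dir = Path(os.environ.get("DATA_DIR", "."))

file_name = "my_dataset.csv"      # hypothetical dataset name
csv_path = data_dir / file_name   # the path you would hand to pd.read_csv
# df = pd.read_csv(csv_path)
```

The notebook itself never hard-codes the directory, only the file name.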
Another thing you can try to do is changing the working directory of the notebook itself by doing this:
https://gist.github.com/e682a8699e919454dea03d61ebb524e3
This works, but I prefer the former: the latter makes the notebook operate in a directory other than the one it is actually in, which feels slightly shady :).
Finally, it might be a bit tedious to set the environment variable every time you start a Jupyter server. You can automate this process using python-dotenv. It searches for a `.env` file, first in the local directory, and then in all of its parents; once it finds one, it loads the variables defined there. Check the project documentation if you like the idea!
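For instance, with the python-dotenv package installed, a `.env` file at the project root could declare the variable (the path below follows this article's example layout):

```
# .env — placed at the project root
DATA_DIR=/system_name/project/data
```

Calling `load_dotenv()` from the `dotenv` package at the top of the notebook then loads `DATA_DIR` into the environment.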
We used an environment variable to hold information about the project configuration, and exposed it to the notebook. But what about moving this responsibility somewhere else? We can create a module whose responsibility is to know the data directory and where the datasets are. I prefer this approach, as it makes datasets explicit symbols inside the code.
We will need a `project_package` folder to represent, well, the project's package. Inside it, we will create a `project_data.py` module:
https://gist.github.com/37740014f97552d7a202c946610d7d9b
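The linked gist is not shown here, but a minimal sketch of such a module might look like this (the `data` folder location follows this article's layout; the helper name is hypothetical):

```python
# project_package/project_data.py — a sketch; the linked gist may differ.
from pathlib import Path

# __file__ is this module's own path; one .parent gives project_package/,
# a second gives the project root, where the data folder lives.
DATA_DIR = Path(__file__).resolve().parent.parent / "data"

def dataset_path(file_name: str) -> Path:
    """Return the full path of a dataset inside the data directory."""
    return DATA_DIR / file_name
```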
We use the `__file__` dunder attribute, which holds the current file's path, and the built-in `Path` class to navigate through the directories. We make this package installable by creating a `setup.py` inside the `project` folder:
https://gist.github.com/9ae552488a5ddf89092f0ca0dfde889b
We are almost there! Now, we install the package we just created in development mode, so that changes to the package won't require a reinstall: https://gist.github.com/24825b5a10eda3fe024446cee841ea5f
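The linked gist presumably boils down to running pip in editable mode from the folder that contains `setup.py`:

```shell
# From /system_name/project, where setup.py lives
pip install -e .
```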
This should install the `project_package` package, which can then be accessed from the notebook:
https://gist.github.com/043bd44a04ec86cb5718073dffdd9fb4
This way, any notebook in any environment and location accesses the data through the same interface. If the data location changes, there's just one place we need to update: the `project_data.py` module.