Every aspect of your product should contribute to one of these 5 principles:
- Small
- Fast
- Reproducible
- Transparent
- Frictionless
- Docker
- Order the instructions (layers) of your Dockerfile from least to most frequently changed, so that as many layers as possible are served from the build cache.
- Reduce the size of your image by skipping or removing build caches and by chaining related shell commands into a single RUN instruction:
pip install --no-cache-dir
RUN apt-get update && apt-get install -y build-essential curl && rm -rf /var/lib/apt/lists/*
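- Cleanup has to happen in the same RUN instruction as the install. Removing a file or folder in a later instruction does not shrink the image, because the file still exists in the earlier layer; if anything, the extra layer makes the image slightly larger.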
- Python Environment
- Use pip, pip-tools, and venv instead of conda for fast, painless, deterministic Python builds.
- Persistence
- Structured data should live in a SQL database. Anywhere else and you're asking for problems.
- Unstructured data should live on a filesystem or in a bucket as flat files, preferably Parquet files. If you're still using CSVs, here is your sign to stop.
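One likely reason behind the advice on CSVs (an assumption on my part, the notes above do not spell it out) is that CSV carries no schema, so every value comes back as a string. A minimal sketch using only the standard library, with made-up sample rows:

```python
import csv
import io

# Hypothetical sample records with mixed types.
rows = [
    {"user_id": 1, "score": 0.5, "active": True, "note": None},
    {"user_id": 2, "score": 1.25, "active": False, "note": "a,b"},
]


def csv_round_trip(records):
    """Write records to CSV text and read them back, as a file on disk would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


restored = csv_round_trip(rows)

# Integers, floats, booleans and None all come back as plain strings.
print(restored[0])
```

A format that stores a schema alongside the data, such as Parquet, avoids this class of silent type loss.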
- Dev Environment
- Use a cloud development environment (CDE) such as Coder.
- This ensures your envs are fresh, reproducible, frictionless, and elastic.
- Reduce third-party dependencies. Use them only when they are strictly necessary.
- Do you really need dask to parallelize?
- Do you really need a redis cache?
- etc.
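As a rough illustration of the questions above, the standard library already covers simple parallelism and in-process caching. This is a sketch, not a drop-in replacement for dask or Redis; `slow_square` and `lookup` are hypothetical stand-ins for real work.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def slow_square(n: int) -> int:
    """Hypothetical unit of work, e.g. one API call or one file to process."""
    return n * n


def parallel_map(func, items, max_workers=4):
    """Run func over items concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


@lru_cache(maxsize=1024)
def lookup(key: str) -> str:
    """Hypothetical expensive lookup; results are cached in process memory."""
    return key.upper()


results = parallel_map(slow_square, range(5))

lookup("a")  # computed
lookup("a")  # served from the cache
cache_hits = lookup.cache_info().hits
```

Threads suit I/O-bound work; for CPU-bound work, `ProcessPoolExecutor` from the same module exposes the same `map` interface. If a single process and a single machine are enough, this may be all the project needs.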
- Reduce feedback loops
- CI/CD, tooling, testing, codebase complexity, build times, runtime, etc. should not get in the way of building the product. If any of these feedback loops is painful or slow, it will lengthen the project.
- Reduce third-party data feeds, i.e. always go to the source unless that is nearly impossible.
- Said another way, don't let someone else clean the data before you do. A third party is not solving the problem you are solving, so their cleaning could be your bug.
- Each third-party feed you replace is one less dependency and one less point of failure for your product.
- Bring in a dedicated DE to your project earlier than you think you need them.
- They can start working on bringing in fresh data, productionizing code, building up pipelines, etc., even before you have a finalized solution.
- DEs need to understand the DSs' need for experimentation and exploration as much as DSs need to understand the DEs' need for a robust production system and pipeline.
- Unit Test Everything (the pipelines, system, models, data ingestion, etc)
- It's a huge time commitment to do this right, but if it saves you even once it's worth it.
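A minimal sketch of what a unit test for one ingestion step might look like, using the standard library's `unittest`. The function `clean_prices` and its row format are hypothetical, invented only for illustration.

```python
import unittest


def clean_prices(rows):
    """Hypothetical ingestion step: drop rows without a price, cast price to float."""
    cleaned = []
    for row in rows:
        if row.get("price") in (None, ""):
            continue  # missing or empty price: skip the row
        cleaned.append({**row, "price": float(row["price"])})
    return cleaned


class CleanPricesTest(unittest.TestCase):
    def test_casts_price_to_float(self):
        result = clean_prices([{"sku": "a", "price": "1.50"}])
        self.assertEqual(result, [{"sku": "a", "price": 1.5}])

    def test_drops_rows_without_price(self):
        rows = [{"sku": "a", "price": ""}, {"sku": "b"}, {"sku": "c", "price": None}]
        self.assertEqual(clean_prices(rows), [])

    def test_does_not_mutate_input(self):
        rows = [{"sku": "a", "price": "2"}]
        clean_prices(rows)
        self.assertEqual(rows, [{"sku": "a", "price": "2"}])
```

Saved in a test module, this runs with `python -m unittest`. The same pattern applies to model code and pipeline steps: keep each step a plain function so it can be tested without standing up the whole system.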
- Keep idempotence in mind. Re-running an operation should not change the outcome.
- If you rerun your Airflow DAG, would your users see duplicates or changed values?
- If you rerun your model, do your predictions change?
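Both questions can be sketched in a few lines. The example below assumes a hypothetical `predictions` table keyed by `item_id` and a stand-in "model" that is just a seeded random number generator; the point is the pattern, not the specifics.

```python
import random
import sqlite3


def predict(item_ids, seed=0):
    """Stand-in model: a fixed seed makes every rerun produce the same scores."""
    rng = random.Random(seed)
    return [(item_id, rng.random()) for item_id in item_ids]


def load_predictions(conn, rows):
    """Idempotent load: the primary key means a rerun overwrites, never duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "item_id TEXT PRIMARY KEY, score REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO predictions (item_id, score) VALUES (?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")

# Simulate running the same job twice.
first = predict(["a", "b", "c"])
load_predictions(conn, first)
second = predict(["a", "b", "c"])
load_predictions(conn, second)

row_count = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
```

After two runs the table still holds three rows and the scores are unchanged. A plain `INSERT` without the key would have left six.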
- Git
- Use it for all of your code, all of the time, including DS experiments.
- Follow Trunk Based Development
- Lint and format your code to ensure consistency across developers. At a minimum, have pre-commit hooks that run ruff on the codebase: `ruff check` lints, while `ruff format` formats.
- Code
- Write code, not too much, mostly functions
- Do not treat code as sacred. If it needs to be deleted, refactored, or minimized, no one on the team should hesitate to do so.
- More than one person should understand every part of the codebase
- Single-responsibility principle
- Separate out your I/O code from your secrets management
- Separate out your business logic from your modeling code
- Separate out your data ingestion from your feature creation
- etc.
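The separations listed above might look like this in practice. It is a sketch under assumed names: the order fields, the JSON input format, and the threshold are all hypothetical.

```python
import json
from pathlib import Path


def read_orders(path):
    """I/O only: load raw records, with no cleaning and no business rules."""
    return json.loads(Path(path).read_text())


def add_features(orders):
    """Feature creation only: a pure function of its input."""
    return [{**o, "total": o["quantity"] * o["unit_price"]} for o in orders]


def flag_large_orders(orders, threshold=100.0):
    """Business logic only: decide which orders need attention."""
    return [o["order_id"] for o in orders if o["total"] > threshold]


def run(path):
    """Thin wiring layer: the only place that knows about all three steps."""
    return flag_large_orders(add_features(read_orders(path)))


# The pure steps can be exercised without touching the filesystem.
sample = [
    {"order_id": "o1", "quantity": 3, "unit_price": 50.0},
    {"order_id": "o2", "quantity": 1, "unit_price": 20.0},
]
flagged = flag_large_orders(add_features(sample))
```

Because only `read_orders` touches the disk, swapping the source (a bucket, a database) changes one function and leaves the rest testable as-is.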
- Composition over inheritance
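One way to read "composition over inheritance" in a data codebase: build a pipeline out of interchangeable parts instead of subclassing it for every variation. The `Pipeline` class below is a hypothetical illustration, not a reference design.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Pipeline:
    """Composed from a loader and a transform rather than subclassed per variant."""

    load: Callable[[], Iterable[float]]
    transform: Callable[[float], float]

    def run(self) -> list:
        return [self.transform(x) for x in self.load()]


# Swapping behaviour means passing a different function, not writing a subclass.
pipeline = Pipeline(load=lambda: [1.0, 2.0, 3.0], transform=lambda x: x * 2)
doubled = pipeline.run()

squared = Pipeline(load=pipeline.load, transform=lambda x: x ** 2).run()
```

This also fits the earlier advice to write mostly functions: the parts being composed are plain callables.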
- Profile your code (pyinstrument and scalene are great tools for the job). You will be surprised where your code is spending its time and how much influence you have on reducing that time.
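For a first look without installing anything, the standard library's `cProfile` can stand in for the tools named above (this swaps in a different profiler than pyinstrument or scalene, purely to keep the sketch dependency-free). The functions `slow_sum` and `work` are hypothetical workloads.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical hot spot: a deliberately naive loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def work():
    return [slow_sum(20_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Collect the five most expensive entries by cumulative time into a string.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists call counts and cumulative time per function, which is usually enough to see where to look first before reaching for a sampling profiler.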
- When in doubt, go read The Zen of Python
python -c "import this"
- If you're still in doubt, go read Rob Pike's 5 Rules of Programming