@aglove2189
Last active November 3, 2023 16:24
How to Build Resilient Data Products

Every aspect of your product should contribute to one of these 5 principles:

  1. Small
  2. Fast
  3. Reproducible
  4. Transparent
  5. Frictionless

A semi-comprehensive list

  1. Docker
    • Order the commands / layers of your Dockerfile from least to most frequently modified, to take advantage of layer caching as much as possible.
    • Reduce the size of your image by skipping build caches and chaining related shell commands in a single RUN instruction:
      • pip install --no-cache-dir
      • RUN apt-get update && apt-get install -y build-essential curl && rm -rf /var/lib/apt/lists/*
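      • Note that rm'ing a file/folder in a separate RUN instruction after installing something does not make the image smaller: the file still lives in the earlier layer, so clean up in the same RUN that created it.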
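To make the layer-ordering and cleanup advice concrete, here is a minimal Dockerfile sketch. The base image tag, `requirements.txt`, and `main.py` are assumptions made for illustration, not part of the original notes; adjust them to your project.

```dockerfile
# Base image changes least often, so it comes first.
# (python:3.12-slim is an assumed base image.)
FROM python:3.12-slim

# System packages change rarely. Install and clean up in ONE RUN so the
# apt package lists are never committed to a layer.
RUN apt-get update && apt-get install -y build-essential curl && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Dependencies change occasionally. Copy only the requirements file so this
# layer stays cached until requirements.txt itself changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code changes most often, so it goes last.
COPY . .

# main.py is a hypothetical entry point.
CMD ["python", "main.py"]
```

With this ordering, editing application code only invalidates the final COPY layer; the apt and pip layers are reused from cache.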
  2. Python Environment
    • Use pip + pip-tools + venv instead of conda for fast, painless, deterministic Python builds.
  3. Persistence
    • Structured data should live in a SQL database; anywhere else and you're asking for problems.
    • Unstructured data should live on a filesystem or in a bucket as flat files, preferably Parquet. If you're still using CSVs, here is your sign to stop.
  4. Dev Environment
    • Use a cloud development environment ('CDE') like Coder.
    • This keeps your envs fresh, reproducible, frictionless, and elastic.
  5. Reduce 3rd party dependencies; only use them when they are truly necessary.
    • Do you really need dask to parallelize?
    • Do you really need a redis cache?
    • etc.
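As a rough illustration of the two questions above, the standard library often covers the simple cases. This is only a sketch: `expensive_lookup`, `process`, `process_all`, and `CALLS` are hypothetical names, and an in-process cache is not a substitute for redis when several processes or machines must share the cache.

```python
import functools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical counter, present only to show that the cache is working.
CALLS = {"count": 0}


@functools.lru_cache(maxsize=1024)
def expensive_lookup(key: str) -> str:
    """Stand-in for a slow call whose result can be reused in-process."""
    CALLS["count"] += 1
    return key.upper()


def process(item: int) -> int:
    """Stand-in for one unit of independent work."""
    return item * item


def process_all(items: list[int], workers: int = 4) -> list[int]:
    """Run process() over items concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, items))
```

Threads suit I/O-bound work; for CPU-bound work, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface. If neither is enough, that is the point at which a heavier dependency starts to earn its place.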
  6. Shorten feedback loops
    • CI/CD, tooling, testing, codebase complexity, build times, runtime, etc. should not get in the way of building the product. If any of these 'feedback loops' is painful or slow, the project will take longer.
  7. Reduce 3rd party data feeds, i.e. always go to the source unless it is nearly impossible.
    • Said another way, don't let someone else clean the data before you do. A 3rd party is not solving the problem you are solving; their cleaning could be your bug.
    • Each 3rd party feed you replace is one less dependency / failure point for your product.
  8. Bring a dedicated data engineer (DE) into your project earlier than you think you need one.
    • They can start bringing in fresh data, productionizing code, building up pipelines, etc. even before you have a finalized solution.
  9. DEs need to understand data scientists' (DSs') need for experimentation and exploration as much as DSs need to understand DEs' need for a robust production system / pipeline.
  10. Unit Test Everything (the pipelines, system, models, data ingestion, etc)
    • It's a huge time commitment to do this right, but if it saves you even once it's worth it.
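A minimal sketch of what a unit test for one small pipeline step might look like, using only the standard library. `drop_duplicate_ids` is a hypothetical cleaning step invented for this example.

```python
import unittest


def drop_duplicate_ids(rows):
    """Keep the first row seen for each id, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


class DropDuplicateIdsTest(unittest.TestCase):
    def test_keeps_first_occurrence(self):
        rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
        expected = [{"id": 1, "v": "a"}, {"id": 2, "v": "c"}]
        self.assertEqual(drop_duplicate_ids(rows), expected)

    def test_empty_input(self):
        self.assertEqual(drop_duplicate_ids([]), [])
```

Saved in a test module, this runs with `python -m unittest`. Small pure functions like this one are the cheapest place to start, which makes the time commitment easier to swallow.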
  11. Keep idempotence in mind. Re-running an operation should not change the outcome.
    • If you rerun your Airflow DAG, would your users see duplicates or changed values?
    • If you rerun your model, do your predictions change?
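One way to get both properties, sketched with the standard library: key every write on a natural primary key so a rerun overwrites instead of appending, and seed any randomness explicitly. The `predictions` table, its columns, and the function names are assumptions made for this example; `INSERT OR REPLACE` is SQLite syntax, and other databases spell upserts differently.

```python
import random
import sqlite3


def write_predictions(conn, rows):
    """Upsert rows keyed by (entity_id, run_date), so reruns are safe."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS predictions (
            entity_id TEXT NOT NULL,
            run_date  TEXT NOT NULL,
            score     REAL NOT NULL,
            PRIMARY KEY (entity_id, run_date)
        )
        """
    )
    # A row with an existing primary key is replaced, not duplicated.
    conn.executemany(
        "INSERT OR REPLACE INTO predictions (entity_id, run_date, score) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def count_predictions(conn):
    """Return the number of stored prediction rows."""
    return conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]


def sample_scores(n, seed=0):
    """Seeded randomness: the same seed always yields the same scores."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
```

Calling `write_predictions` twice with the same rows leaves the table unchanged after the second call, which is the behaviour you want from a rerun DAG task.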
  12. Git
  13. Lint and format your code to ensure consistency across developers. At a minimum, have pre-commit hooks that run ruff on the codebase: ruff check lints, while ruff format formats.
  14. Code
    • Write code, not too much, mostly functions
    • Do not treat code as sacred. If it needs to be deleted, refactored, or minimized, no one on the team should hesitate to do so.
    • More than one person should understand every part of the codebase
    • Single-responsibility principle
      • Separate out your I/O code from your secrets management
      • Separate out your business logic from your modeling code
      • Separate out your data ingestion from your feature creation
      • etc.
    • Composition over inheritance
    • Profile your code (pyinstrument and scalene are great tools for the job). You will be surprised where your code is spending its time and how much influence you have on reducing that time.
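A small sketch of how "mostly functions", single responsibility, and composition over inheritance can fit together. Every name here (`read_records`, `add_total`, `is_large`, `Pipeline`) and the record fields `price` and `qty` are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


def read_records(handle) -> Iterator[dict]:
    """I/O only: parse JSON Lines from any file-like object."""
    for line in handle:
        if line.strip():
            yield json.loads(line)


def add_total(record: dict) -> dict:
    """Feature creation only: no file or network access."""
    return {**record, "total": record["price"] * record["qty"]}


def is_large(record: dict, threshold: float = 100.0) -> bool:
    """Business rule only: depends on nothing but its input."""
    return record["total"] >= threshold


@dataclass
class Pipeline:
    """Composition: the pipeline holds its steps instead of subclassing them."""

    transform: Callable[[dict], dict]
    keep: Callable[[dict], bool]

    def run(self, records: Iterable[dict]) -> list[dict]:
        return [out for out in map(self.transform, records) if self.keep(out)]
```

Because each step is a plain function, swapping the business rule or the data source means passing a different callable, e.g. `Pipeline(add_total, is_large).run(read_records(open_file))`, and each piece can be unit tested and profiled on its own.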
  15. When in doubt, go read The Zen of Python: python -c "import this"
  16. If you're still in doubt, go read Rob Pike's 5 Rules of Programming