Learning the ML Tech Stack in 2022
created: 2022-10-12T11:15:16 (UTC +01:00)
source: Liam Clancy
author: Moussa Taifi PhD
description: This post is inspired by a classic article “It’s the future” from CircleCI and a nifty post about Javascript frameworks.

How it feels to learn MLOps in 2022

This post is inspired by a classic article “It’s the future” from CircleCI and a nifty post about Javascript frameworks.


Like any ML Platform discussion, it should be taken with *a pinch of salt*:

Hey, I got this Video Categorisation Prediction model, and my TL said you could help me deploy it? I haven’t coded much production ML software lately, and I’ve heard MLOps is the way to go. You are the most up-to-date Ops person around here, right?

  • The correct title is MLOps engineer, but yeah, I can help with that. MLOps in 2022 is my thing. Canine posture detection, Kitchen Edge ML, NLP for flying IoT, Active learning for Self-driving TV remotes, anything to help. I spent the past month at ODSC, MLOps Summit, PyData, and KDD, so I can probably describe the latest technologies for building and deploying ML products.

Nice. My model takes in a set of numerical features and returns a probability of Video Category, so I just need a simple REST endpoint to respond to requests from the front-end application. I was thinking of a simple Flask app to do the prediction and return the probabilities, what do you think?
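
For reference, the kind of simple Flask app being described might look like this minimal sketch; the pickled model file and payload shape are illustrative:

```python
# Minimal Flask prediction endpoint (illustrative; model path and payload
# shape are made up for this sketch).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("video_category_model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [0.1, 3.0, 42.0]}
    features = request.get_json()["features"]
    probabilities = model.predict_proba([features])[0].tolist()
    return jsonify({"probabilities": probabilities})


if __name__ == "__main__":
    app.run(port=5000)
```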

  • Hmm, yeah, that would be too backward; no one uses Flask for predictive endpoints anymore. You should try KServe, it is 2022 after all.

Oh, OK. What’s KServe?

  • It is a Kubernetes native model serving layer that started at Google as part of the Kubeflow project, but now it is an “independent” project. It really gives ML engineers the control and performance required for serious ML predictive services by handling the transformation, prediction, and explainability of your model predictions.
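
As a rough idea of what that can look like, here is a minimal sketch of a KServe custom predictor, assuming the kserve Python SDK's Model/ModelServer interface; the model name and pickle loading are made up for the example:

```python
# Rough sketch of a KServe custom predictor (assumes the kserve Python SDK).
import pickle

from kserve import Model, ModelServer


class VideoCategoryModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.load()

    def load(self):
        # Load the trained model from local storage (illustrative).
        with open("video_category_model.pkl", "rb") as f:
            self.model = pickle.load(f)
        self.ready = True

    def predict(self, payload, headers=None):
        # KServe v1 protocol payload: {"instances": [[...feature values...], ...]}
        instances = payload["instances"]
        probabilities = self.model.predict_proba(instances).tolist()
        return {"predictions": probabilities}


if __name__ == "__main__":
    ModelServer().start([VideoCategoryModel("video-category")])
```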

That sounds like something I could use. Can I install it on my laptop?

  • Yeah, it does work on laptops; you’ll need a local Kubernetes cluster to run it, but you can spin one up with Kind. You’ll also need Elasticsearch and Kibana to handle the logging.

I can run those locally as well, right? What else?

  • The logging is going to be pretty useful for monitoring the inputs, the results, and the prediction latency. You can use Prometheus/Grafana/DataDog for the rest of the system metrics. You can also add some triggers to transfer data to a broker, to collect data around adversarial samples, outlier detection, and concept drift. Also, before I forget, if you are doing some transformations on the inputs, you’d better use Feast for the live feature store.

Feature store? What’s that?

  • A feature store is like a data warehouse, but instead of human users using it for reports and dashboards, it is optimized for ML machines that need to store and retrieve pre-calculated features. Tecton, Butterfree, and Bytehub are good choices too, but Feast has an integration with Kubeflow/KServe, so I’d go with that.

What’s wrong with our data warehouse?

  • It’s 2022. No one queries data warehouses directly for low-latency ML features. What are your latency requirements? Anything below 100ms and you are going to need to cache the features on feast-redis for quick feature “hydrations”.
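
For a sense of what that “hydration” call looks like, here is a minimal sketch of reading pre-computed features from a Feast online store; the feature view, feature names, and entity key are hypothetical:

```python
# Sketch of fetching pre-computed features from a Feast online store
# (e.g. a Redis-backed one). Feature and entity names are hypothetical.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a feature_store.yaml

features = store.get_online_features(
    features=[
        "video_stats:view_count_7d",
        "video_stats:avg_watch_time",
    ],
    entity_rows=[{"video_id": "abc123"}],
).to_dict()

print(features)
```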

Feature hydrations? Fine, so if I add my pre-calculated features to the feature store I am good to go?

  • Not yet. You’ll need a data job to update the feature store with the new unseen features you might be getting. That can be triggered ad hoc, based on events like model accuracy, or just on an hourly schedule. You can set your job up on Airflow or Kubeflow Pipelines, but remember that Kubeflow triggers only have the periodic and cron versions so far, so if you need event-based triggers you might want to use Airflow with its 2.0 API-based job triggering. What’s the plan for retraining?
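
As an example of the hourly option, here is a minimal sketch of an Airflow 2.x DAG that refreshes a Feast online store; the DAG id, repo path, and the use of materialize_incremental are assumptions for illustration:

```python
# Sketch of an hourly Airflow 2.x DAG that pushes newly arrived offline
# features into the Feast online store. Names and paths are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from feast import FeatureStore


def materialize_features():
    store = FeatureStore(repo_path="/opt/feature_repo")
    # Incrementally load any new offline feature rows into the online store.
    store.materialize_incremental(end_date=datetime.utcnow())


with DAG(
    dag_id="feature_store_refresh",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="materialize", python_callable=materialize_features)
```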

I was thinking of using a basic daily cron job to update the model and the new feature lookup tables?

  • Yes, but you might need a more robust solution. For example, if you are going to trigger continuous retraining when the accuracy metrics drop, then you need a solid DAG scheduler to avoid hitting conflicts around your training compute quotas. How are you packaging the model?

I was going to pickle the whole script and model.

  • Yeah, that’s too old school. You might as well use a model registry like the MLMD registry, and package the model itself in the ONNX format. That way, if you want to move to a different modeling library (TensorFlow, PyTorch, fastai, sklearn, XGBoost, MPI), you can do that later.
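
To make the ONNX part concrete, here is a minimal sketch of exporting a scikit-learn model with skl2onnx and loading it back with onnxruntime; the feature count and file name are made up:

```python
# Sketch: export a scikit-learn model to ONNX, then run it with onnxruntime
# without depending on scikit-learn at serving time. Names are illustrative.
import numpy as np
import onnxruntime as ort
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.ensemble import RandomForestClassifier

n_features = 10
model = RandomForestClassifier().fit(
    np.random.rand(100, n_features), np.random.randint(0, 3, 100)
)

onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, n_features]))]
)
with open("video_category_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Later, in the serving layer:
session = ort.InferenceSession("video_category_model.onnx")
inputs = {"input": np.random.rand(1, n_features).astype(np.float32)}
print(session.run(None, inputs))
```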

More tools? What’s MLMD?

  • MLMD is a platform for the machine learning lifecycle. It has some great ideas around artifact tracking, model registries, and model packaging. For the model serving app itself, you’ll have to add the conda+docker components, though I would go with conda-forge+pypi from an internal mirror. And remember that Docker Desktop is not free anymore starting January 2022, so ask the DevOps team for a Pro license.

I am getting a bit lost here. Why do I need conda if I am using docker?

  • That’s unless you wanna debug GCC compilation issues by yourself. Some of the Python libraries you are using are probably wrappers around C code, so conda helps by providing pre-compiled binaries, even though wheels are getting pretty popular for that too. Docker can help with the rest, and is a hard dependency for deploying on Kubernetes. Also keep in mind that if you go with the Kubeflow Pipelines DSL you will need a separate docker container for each step in your pipeline. You might want to check out Kale.

Kale?

  • Yes, it is an add-on to Kubeflow. Both a vegetable, and a satellite of Jupiter, the planet … Anyways, Kale has a nifty annotation system for your notebook cells that helps you generate Kubeflow pipelines without using the KF pipeline DSL directly. On another note, what about data validation? Are you using TFDV?

TFDV?

  • Yes, Tensorflow Data Validation. It’s a toolkit for calculating stats, getting the schemas, and detecting anomalies. It uses Apache Beam under the hood to run planet-scale data validation jobs. Are you familiar with TFRecords?
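
As a quick illustration of that TFDV workflow (stats, schema, anomalies) on a small Pandas DataFrame; the column names and values are invented:

```python
# Sketch of TFDV: compute stats on training data, infer a schema, then check
# a new batch for anomalies. Columns and values are illustrative.
import pandas as pd
import tensorflow_data_validation as tfdv

train_df = pd.DataFrame({"view_count": [10, 200, 35], "duration_s": [60, 600, 90]})
serving_df = pd.DataFrame({"view_count": [15, -1], "duration_s": [45, 80]})

train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(statistics=train_stats)

serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)
print(anomalies)
```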

TFRecords?

  • Yes, it is a super-optimized binary format built on Protocol Buffers to be cross-platform and cross-language while being super efficient at data serialization. To simplify things for the validation part, if you are still using Pandas then Great Expectations should do the trick as well.
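
And for the Pandas route, a minimal Great Expectations sketch using the classic ge.from_pandas API; the columns and bounds are illustrative:

```python
# Sketch of quick validation on a Pandas DataFrame with Great Expectations.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"view_count": [10, 200, 35], "duration_s": [60, 600, 90]})
ge_df = ge.from_pandas(df)

# Declare a couple of expectations on the incoming features (illustrative).
ge_df.expect_column_values_to_not_be_null("view_count")
ge_df.expect_column_values_to_be_between("duration_s", min_value=1, max_value=86400)

results = ge_df.validate()
print(results)
```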

What do you mean, “still” using Pandas? I use Pandas for almost everything.

  • That’s fine, you are not alone, but Ray, Dask, and Koalas can scale up your workloads a lot better… In terms of retraining when the performance drops, are you going to use a custom data lake for the new training datasets? You are probably using DVC for versioning the files, but if your data is scaling you might as well upgrade to LakeFS.

DVC? LakeFS?

  • Yeah, they are like Git for data, or at least an attempt at it. For smaller datasets you can stick with Data Version Control, i.e. DVC, and hook it up to your blob storage and your git repo. You can then go cherry-pick older data files that gave you better results. LakeFS is the next level up for your data lake, when the file sizes and counts go into the TBs. Are you using some streaming stack for shipping the results to your storage for observability? You might want to get familiar with the Spark+Kafka+Delta stack to ship your prediction inputs, transformed features, and outputs to your data lake and then load them up to Snowflake. That should help you merge your predictions with actuals when you get them. You can hook that up to Amundsen for data discovery once you are settled on the reporting warehouse. But back to the model serving. Are you going to run A/B tests?
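
For the DVC side of this, here is a minimal sketch of pulling a pinned dataset version through the dvc.api Python interface; the repo URL, file path, and tag are hypothetical:

```python
# Sketch of reading a specific, git-pinned version of a DVC-tracked file.
import io

import dvc.api
import pandas as pd

data = dvc.api.read(
    path="data/train.csv",
    repo="https://github.com/acme/video-category-model",  # hypothetical repo
    rev="v1.2.0",  # a git tag/commit that pinned a known-good dataset
)
train_df = pd.read_csv(io.StringIO(data))
print(train_df.shape)
```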

Yes, I was thinking I could switch between two top models once every other day to compare the live performance.

  • You can do better. Just run Istio on the K8s cluster and you can split and mirror traffic, simulate timeouts, add circuit breakers, shape traffic, inject failures, and everything in between. That’s more reliable than the switcheroo A/B scheme you are suggesting. That opens up a bunch of doors for switching to a model mesh or a fully distributed inference graph later on.

I am not sure I need all that at this point, but if it is easy to set up then I don’t mind. Realistically, I am just trying to return a probability score to the caller based on the numerical features I get. I used to just expose that as a simple Flask API from a model I built in my notebook.

  • Look, it is 2022. No one uses Flask to serve models anymore. It ends up slow, super convoluted, and unmanageable; everyone knows that. At least look into FastAPI for a modern stack; the async parts and the built-in input data validation could be especially useful if you are planning on doing the feature transforms live at prediction time. And for your notebooks you might want to move to Jupytext to add some level of human-readable version control of the source.
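
A minimal sketch of what the FastAPI version with Pydantic input validation and an async handler could look like; the request fields and model file are illustrative:

```python
# Sketch of a FastAPI prediction endpoint with Pydantic request validation.
# Run with: uvicorn app:app
import pickle
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

with open("video_category_model.pkl", "rb") as f:
    model = pickle.load(f)


class PredictionRequest(BaseModel):
    # Reject empty or missing feature vectors before they reach the model.
    features: List[float] = Field(..., min_items=1)


class PredictionResponse(BaseModel):
    probabilities: List[float]


@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest) -> PredictionResponse:
    probabilities = model.predict_proba([request.features])[0].tolist()
    return PredictionResponse(probabilities=probabilities)
```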

Jupytext, kk, I’ll take a look.

  • Are you set on the dashboarding tooling? I think you are probably using plain Matplotlib, but if you are not set on that you can build simple dashboards with Streamlit, and we can upgrade to Superset down the line. For your experimentation you could launch a TensorBoard server for model debugging if you are using Tensorflow. You can couple that with the MLMD experiment tracking for more generic metric tracking, a model registry, and model promotion.
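
As a taste of the Streamlit option, a tiny dashboard sketch; the metrics file and column names are made up:

```python
# Sketch of a minimal Streamlit dashboard for eyeballing model metrics.
# Run with: streamlit run dashboard.py
import pandas as pd
import streamlit as st

st.title("Video category model - daily metrics")

metrics = pd.read_csv("daily_metrics.csv")  # e.g. date, accuracy, avg_latency_ms

day = st.selectbox("Day", metrics["date"].unique())
st.metric("Accuracy", float(metrics.loc[metrics["date"] == day, "accuracy"].iloc[0]))

st.line_chart(metrics.set_index("date")[["accuracy", "avg_latency_ms"]])
```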

Yeah, I meant to ask about that….

  • Yeah, take a look at those. Also, what are you planning on using for CI/CD? We could use a simple CI server for running the integration tests of your serving layer. Something simple like Concourse could work for your containers, but if you want to use Tekton, it has become a bit more cloud native these days (not to be confused with Tecton, the feature store vendor from the Uber Michelangelo folks). Looks like it is getting full k8s support. For something more custom, you can look at the Continuous Machine Learning (CML) project. It helps organize the CI/CD process around PRs, with metrics displayed as PR comments to decide if a new run is worth merging to main and promoting to prod. Also, you might wanna check out Argo CD. It is K8s native and has some nice ideas around ML workflows for continuous delivery.

Right, right, CML, ArgoCD, Tekton….. I do use Git for most of my projects so I should be able to pick those up.

  • Sounds good… their docs are usually not too bad. One more question on my mind: what are you up to with labeling? Are you using some text or image data in this model? If so, you might need some help setting up Snorkel for entity recognition and extracting data from those income statements, even though I am not sure how you will integrate that with the feature store solution we pick. Current feature stores don’t play super well with unstructured data.

I remember vaguely hearing about Snorkel. I don’t need that right now, I am using the existing numerical data and labels I get from our standard data pipelines and data warehouse.

  • I see, yeah, that’s fine for now, but I hear they have their own issues upstream with the recent reorg and senior data engineer attrition. Are you subscribed to their data versioning updates? Do you have some distribution drift and outlier monitoring for the input data? Are you planning on managing that on your own? I see that Pachyderm is moving towards that quickly; we just need to see how to integrate it with the current data engineering pipeline you are consuming.

Yes, I just rely on the standard data warehouse we have that collects video information… What’s Pachyderm?

  • That’s a good question! They are aiming to be data pipeline orchestration + data versioning. I heard some impressive testimonials, but who can trust the sales folks these days? One thing you might be interested in there is their reproducibility component with its immutable data lineage. One more thing you might wanna look into is RBAC for setting permissions on the data and models, as well as model security monitoring and alerting. I suppose you have something in place for that, right?

I use my personal user creds for accessing the data right now from my notebooks, but I have asked for an application user with the right permissions. I am still waiting for the Okta integration, but right now, the client-facing app team told me they are OK with using a manually pre-shared token to access my API over https.

  • That should be enough for a couple of weeks, until the audit team hears about this. They will probably ask for rotating credentials, encryption in transit and at rest with user-managed keys, model DDOS protection with a CDN, and GDPR/CCPA support for the data that passes through your pipeline. Also, what’s your story around explainability? I don’t know much about that, but I understand anything beyond shallow decision trees is a no-no? Did you play with Google’s Model Card Toolkit + the What-If Tool? Maybe a commercial offering like Fiddler might be more accessible; they are figuring out all that contrastive explanations thing, but using them would mean sending our company PII data to a third party. Maybe they integrated some of that federated learning and privacy-aware inference stuff in their product; I’d have to check. Anyways, you’ll have to think about how customer service will be able to answer customer complaints, right?

I haven’t thought about that part yet.

  • What about resource provisioning and configuration management? Did you use Terraform+Helm before? You might need to brush up on that to get going with the infra side of things. Well, as long as you keep IaC front and center, you should be fine.

IaC?

  • Infrastructure as Code, that is. It is pretty necessary as soon as you try to scale stuff on public clouds. That’s unless you wanna play with ARM or Bicep templates, or you like editing CloudFormation templates.

Thanks a lot, I am a bit overwhelmed by all this. Besides, I think I am going to go back to my product manager to check on the timeline. I think my estimates were a bit on the lower side.

  • Yeah, that could be a lot to take in. You probably saw the MLOps Stack Canvas + the Canonical Stack for ML apps. It might help you organize your thoughts. Here is a nice picture I saw on this blog post from the AIIA folks.

Current AIIA template for building a Canonical ML Stack

AIIA?

  • The AI Infrastructure Alliance… you know things are getting real when you need an Alliance to create interoperability. It reminds me a bit of the CNCF for the “Cloud Native” folks when k8s was getting started.

No idea what the CNCF is but I think I can ignore that for now.

  • Yeah, don’t worry about that. Watching those silent wars was kind of fun, but most of them are over. You know, maybe we’ll all just end up using some end-to-end platform like AWS SageMaker / Azure ML / GCP Vertex AI? Right?

Do you think that those would be easier to use?

  • Well, it depends who you ask. Did you check with our compliance, NetOps, and DBA teams about the integration blockers for public clouds usage?

Hmm, I see what you are saying… Thanks anyways. Let me do some research on what we talked about, and I’ll come back with more questions.

Bye!

Disclaimer: The views expressed on this post are mine and do not necessarily reflect the views of my current or past employers.
