An abstraction to guide ML system design and operation. MLOps teams can use this pattern as a discipline to deploy ML pipelines to production quickly and effectively.
"The fact is ML applications introduce a new culture; their deployment and operations require new discipline and processes different from the existing DevOps practices" -- Xu
Summarized notes from this thesis
- Lack of an environment that mirrors production for data scientists. Data scientists use local machines to develop models; the environment is completely different from production, resulting in the need to re-implement from scratch for production
- Programming style conflict. Data scientists tend to develop models as a monolithic program, not following software engineering best practices. ML pipelines should provide a framework of pre-defined canonical units of operation as components, so that ML code can follow ML engineering best practices, as opposed to free-form flexibility
- System design anti-patterns. Glue code and pipeline jungles, causing integration issues. Interfaces between components—both code and data—should be made explicit and simple enough so that implementing such interfaces is easy to use for ML code authors.
ML systems have a special capacity for incurring technical debt, because they have all of the maintenance problems of traditional code plus an additional set of ML-specific issues; and it is unfortunately common for systems that incorporate machine learning methods to end up with many anti-patterns.
- Bringing ML applications to production quickly and reliably, and
- Ensuring ML applications remain operational 24x7 while meeting all functional and non-functional requirements
Principles, abstractions of reusable/repeatable paradigms, collaboration approach and guidelines for separation of concerns and team collaboration
- data collection → data cleaning → feature engineering → model training
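The four stages above can be sketched as composable functions. A minimal stdlib-only sketch; the raw records and the mean-threshold "model" are illustrative placeholders, not the thesis's actual pipeline:

```python
# Minimal sketch of the four pipeline stages as composable functions.
# The records and the mean-threshold "model" are illustrative only.

def collect_data():
    # data collection: pull raw records from a source (hard-coded here)
    return [{"hours": 1, "passed": 0}, {"hours": 5, "passed": 1},
            {"hours": None, "passed": 1}, {"hours": 4, "passed": 1}]

def clean_data(records):
    # data cleaning: drop records with missing values
    return [r for r in records if r["hours"] is not None]

def engineer_features(records):
    # feature engineering: extract (feature, label) pairs
    return [(r["hours"], r["passed"]) for r in records]

def train_model(pairs):
    # model training: learn a threshold at the mean of positive examples
    positives = [x for x, y in pairs if y == 1]
    threshold = sum(positives) / len(positives)
    return lambda x: int(x >= threshold)

model = train_model(engineer_features(clean_data(collect_data())))
print(model(6))  # → 1: six hours is above the learned threshold
```

The point of structuring the stages this way is that each one has a single, explicit input/output contract, which is what lets them later be split into pipeline components.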
- Serving ML models and meeting all the functional and non-functional system requirements
- Front Controller, Model-Serving, and Dynamic Infrastructure Platform
- user interaction logic, a client to display predictions, API endpoint, mobile front end, IoT edge device
- Composite pattern consisting of an Observer and a Trigger
- Performance metrics of ML models, and the threshold used to trigger retraining for the next generation/version of the model
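The Observer/Trigger composite can be sketched as follows; the class names and the 0.8 threshold are illustrative assumptions, not the thesis's implementation:

```python
# Sketch of the Observer/Trigger composite: the observer watches a model
# performance metric, and the trigger fires retraining when the metric
# drops below the configured threshold. Names/threshold are illustrative.

class RetrainTrigger:
    def __init__(self):
        self.fired = False

    def fire(self, metric_value):
        # in a real system this would kick off the retraining pipeline
        self.fired = True

class MetricObserver:
    def __init__(self, threshold, trigger):
        self.threshold = threshold
        self.trigger = trigger

    def observe(self, metric_value):
        # compare the live metric against the retraining threshold
        if metric_value < self.threshold:
            self.trigger.fire(metric_value)

trigger = RetrainTrigger()
observer = MetricObserver(threshold=0.8, trigger=trigger)
observer.observe(0.91)   # healthy: no retraining
observer.observe(0.74)   # degraded: fires the trigger
print(trigger.fired)     # → True
```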
- MS Connector: Define the type and format of artifacts passed, development stack, model training pipeline
- MR Connector: Define the rules of how retrained models are to be tested, versioned and released.
- SR Connector: Define the metrics for ML model performance monitoring, retraining threshold, and retraining data source and code
- SC Connector: Define the type, format and protocol of data exchange between Client application and Service entry point.
- Data scientists are typically responsible for the model layer: data collection & cleaning, feature engineering, and model training
- Client developers are responsible for developing the front end for users to access the ML model.
- MLOps engineers are responsible for building and operationalizing the infrastructure for serving models.
A common best practice of serving ML models is to expose them as RESTful APIs for the benefit of platform independence and service evolution.
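A stdlib-only sketch of that practice, covering both sides of the REST exchange in one script; the `/predict` route, payload field names, and toy model are assumptions for illustration, not the paper's implementation:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy stand-in for a trained model; a real deployment would load a
# serialized artifact (e.g. from S3) at startup.
def predict(features):
    return {"score": sum(features) / len(features)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read the JSON request body and return a JSON prediction
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# run the server on an ephemeral port in a background thread
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# client side: POST features as JSON, receive a JSON prediction
url = f"http://127.0.0.1:{server.server_address[1]}/predict"
req = urllib.request.Request(
    url,
    data=json.dumps({"features": [1, 2, 3]}).encode(),
    headers={"Content-Type": "application/json"},
)
response = json.loads(urllib.request.urlopen(req).read())
print(response)  # → {'score': 2.0}
server.shutdown()
```

Because the contract is HTTP + JSON, the model service can evolve its internals (framework, language, hosting) without breaking clients, which is the platform-independence benefit the line above refers to.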
- Design infrastructure for the Service layer based on functional and non-functional requirements
- Hosting service -- Amazon ECS provides a managed container service.
- Storage -- Amazon S3 bucket is used to store ML artifacts, Elastic Container Registry (ECR) is used to store ML container images
- Security -- IAM is used to assign permissions to access AWS services, Security Group is used to control inbound and outbound traffic
- Auto Scaling -- Auto Scaling Group manages the auto scaling
- CloudWatch -- AWS CloudWatch is used to monitor the ML infrastructure
- Load Balancing -- Elastic Load Balancer is used to balance network traffic and provides the service URL
- Front Controller -- RESTful API
- Design all the connectors that interface with Service
- The best practice for integrating components is through modularization, well-defined interfaces, separation of concerns, and design by contract. (See paper for code implementation)
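One way to read "design by contract" here is that the Service layer codes against an abstract interface, and any model implementation honoring it can be swapped in. A sketch with illustrative names (not the paper's code):

```python
from abc import ABC, abstractmethod

# The Service layer depends only on this contract, never on a concrete
# model class, so model implementations can be swapped freely.
class ModelContract(ABC):
    """Contract every servable model must satisfy."""

    @abstractmethod
    def predict(self, features: list) -> float:
        ...

class MeanModel(ModelContract):
    # trivial implementation honoring the contract
    def predict(self, features):
        return sum(features) / len(features)

def serve(model: ModelContract, features):
    # Service-side code: calls only what the contract guarantees
    return {"prediction": model.predict(features)}

print(serve(MeanModel(), [2.0, 4.0]))  # → {'prediction': 3.0}
```

The abstract base class makes the interface explicit and simple, which is exactly the remedy the glue-code/pipeline-jungle discussion above calls for.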
- MS Connector
- Convert to Microservices to take advantage of container technology
- Define protocol for microservice-enabled code
- Containerization
- Containerization offers the benefits of isolation, portability, agility, scalability, and fast deployment. It also raises new challenges: each service runs in its own process and communicates with other processes using protocols such as HTTP or AMQP
- Put each microservice in a separate container and use HTTP or AMQP to exchange parameters between the services
- SC Connector
- Determine the protocol between Front Controller and client endpoint.
- Since the ML model is exposed as a RESTful API, the client invokes the service by sending HTTP requests and receives responses in JSON format.
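The SC connector's data contract boils down to how the request and response bodies are shaped. A sketch of the client-side serialization; the field names `features` and `prediction` are assumptions for illustration:

```python
import json

# Client-side view of the SC connector contract: serialize the request
# to JSON, parse the JSON response from the Front Controller.
# Field names ("features", "prediction") are illustrative assumptions.

def build_request(features):
    # HTTP request body the client would POST to the REST endpoint
    return json.dumps({"features": features})

def parse_response(body):
    # JSON response body returned by the Front Controller
    return json.loads(body)["prediction"]

print(build_request([0.2, 0.8]))            # → {"features": [0.2, 0.8]}
print(parse_response('{"prediction": 1}'))  # → 1
```

Pinning down these field names and types in the connector is what lets the client team and the MLOps team work independently.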
- MR Connector
- Define the rules of how retrained models are to be tested, versioned and released
- When a new model version is produced and pushed to the GitHub repository, it can trigger the CI/CD process before the model is deployed to the cloud
- SR Connector
- How to initiate retraining, and what parameters to dynamically set at runtime.
- MS Connector
- Design retraining pipeline
Good abstraction of, and insight into, ML system structure that helps teams quickly identify their tasks, roles, and responsibilities
- Lack of prototype stack mirroring production environment: With separation of concerns, data scientists can rely on MLOps to build the prototype stack that mirrors the production environment
- Programming style conflict: Guidelines for teams to follow best practices: create well-defined interfaces, design by contract, and apply modularization
- System design anti-patterns: Test more often, get rid of experimental code and dead code paths, and deal with technical debt and anti-pattern practice quickly to reduce integration issues.
- Reduced time and difficulty to deploy ML models to production
- Capability to scale up/down horizontally and automatically.
- Live model monitoring, tracking and retraining
With ML finding its way into all facets of software development, it is critical to use design patterns to create reliable and scalable systems. The MSC/R design pattern serves this purpose.
Gained value from these notes? Leave a comment sharing what you took away from it!