
@kpn-advanced-analytics
Last active June 17, 2023 17:45
What is Model Factory?

As today’s companies strive to become more data driven, reliable analytics and data science have become essential to staying competitive and keeping costs under control. Because of this, most mid-sized to large companies have created their own analytics or data science teams, or roles, focused on producing, maintaining and scoring models. These models are essentially pieces of code that use raw data to produce insights or strategies that other teams and management rely on. Because most analytics and data science teams create and tune these models by hand, this work is more an R&D discipline than a bookkeeping one. For this reason, like most R&D products, the models need to be made ready to run in a stable, reliable and auditable fashion as quickly as possible.

What is Model Factory?

Model Factory is a framework that helps analytics and data science teams go from a development model to a stable, reliable and auditable production model faster and with less guesswork. A model factory is not a single software product but multiple software products linked together in a fashion that supports a recommended workflow. The framework was created to take the ad-hoc work out of adding code, features and processes to models to account for:

  • Version control of models
  • Automated testing
  • Input data checks
  • Auditability
  • Monitoring
  • Notifications
  • Scheduling

The speed at which a company’s data science team can develop, test and automate models is a large factor in how fast the company can innovate and respond to the market. Model Factory was created to automate and speed up mundane tasks like those mentioned above, allowing data scientists to focus on finding and utilizing new insights. The framework can also help newer teams formulate internal standards for measuring performance and releasing new model versions.

Framework

The framework architecture is designed to be flexible to allow companies to use software and servers that are already available and are supported by their IT departments.

model_factory_architecture.png

The framework consists of four major components, each discussed in detail below.

Version control

Data scientists often have to collaborate on a project, and constantly exchanging files with each other is not only inconvenient but also an easy way to lose work. And if the latest version of the code breaks, you can spend hours reverting to the previous working version. Version control systems are an excellent way to solve all of these problems. Most developers have worked with some sort of version control system, but data scientists may find it a foreign concept. Benefits of placing your projects under version control include the ability to:

  • revert to previous versions of code when something breaks;
  • restore accidentally deleted scripts;
  • track code changes;
  • maintain multiple versions of a project (branching);
  • create triggers for certain events, such as new commits.

We find the concept of feature branching particularly interesting and think it may change the whole way data scientists work. Typically, you have two main branches: development and master (the production branch). At the same time, data scientists can work simultaneously in different feature branches: one data scientist is adding extra variable transformations to the model (in feature branch 1), while another is adapting a different machine learning algorithm (in feature branch 3). In the figure below, feature 1 goes first to development and then to production (V2), while feature 3 is still being worked on. In the meantime, the production model breaks and is fixed in a hot-fix branch, resulting in version V2.1 in production. The described process then continues.

git_branching.png

You will find more on branching in the following blog.

Orchestrator

When a model is in production, it is very important to have a reliable scheduling tool that can pull the latest version of the production model from a version control system, send notifications in case something breaks, and create a simple report about each model run. Jenkins, Luigi, Airflow and other workflow management platforms can be used as an orchestrator that satisfies all these conditions. Let’s take Jenkins as an example.

A Jenkins job configuration consists of multiple building blocks.

Source code management

Jenkins allows you to select your Git/Mercurial repository and the branch (development/master), so going from development to production is a matter of merging the development branch into the master branch. Because the production Jenkins job listens to the master branch, the next run will therefore use the newest version of the model.

Most data science teams have a separate development Jenkins job for each of their models. These jobs are usually triggered by hand or on every commit to run automated tests on the last committed version of the model.

SCM_jenkins.png

Build

The latest version of the code in the repository is pulled by Jenkins and can be used in a build step. The simplest build step lets Jenkins run a shell script that executes the last checked-out model. Because this is a shell, it can also do much more complex things, for example running a script on a remote server.

build_jenkins.png
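As an illustration, the shell build step might do nothing more than call a small Python entry script. The sketch below is hypothetical — the script name, the `--env` flag and the `score_model` body are assumptions for illustration, not part of Model Factory:

```python
# run_model.py -- hypothetical entry script a Jenkins shell build step
# could invoke as: python run_model.py --env prod
import argparse

def score_model(env):
    # Placeholder for the real scoring logic: load input data, load the
    # checked-out model, score, and write results to storage.
    print(f"Scoring model in {env} environment")
    return True

def main(argv=None):
    parser = argparse.ArgumentParser(description="Run one model scoring job")
    parser.add_argument("--env", choices=["dev", "prod"], default="dev")
    args = parser.parse_args(argv)
    # Returning a non-zero exit code (via sys.exit(main())) is what would
    # mark the Jenkins build as failed and trigger its notifications.
    return 0 if score_model(args.env) else 1

exit_code = main(["--env", "prod"])
```

Keeping the entry point this thin means the Jenkins job stays identical across models; only the repository it checks out differs.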

Post-build Actions

An example of a post-build action is creating a report containing information about the model run. In the "Publish HTML reports" plugin it is possible to specify the location where the report is created (the report itself is generated in the build steps).

PBA1_jenkins.png
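For instance, a build step could generate the HTML file that the plugin later publishes. A minimal sketch, in which the file name, report layout and metric names are all illustrative assumptions:

```python
# Sketch of a build step writing the HTML run report that the
# "Publish HTML reports" plugin then picks up from the workspace.
from datetime import datetime, timezone

def build_run_report(session_id, status, metrics):
    # Render a tiny self-contained HTML page summarizing one model run.
    rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in metrics.items())
    return (
        "<html><body>"
        f"<h1>Model run {session_id}</h1>"
        f"<p>Status: {status} at {datetime.now(timezone.utc).isoformat()}</p>"
        f"<table>{rows}</table>"
        "</body></html>"
    )

# The build step writes the file to the location configured in the plugin:
with open("report.html", "w") as f:
    f.write(build_run_report("s-001", "SUCCESS", {"auc": 0.91}))
```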

Another example of a post-build action is a Slack/email notification. After the job finishes, the Jenkins server can notify users of the run’s end status.

PBA2_jenkins.png

Here you can read more about configuring a Jenkins job.

Compute engine

A Jenkins job triggers a computational process on a compute engine. In general, any analytical service can be used as a compute engine within Model Factory: a big machine with Python or R installed, or a Spark or Aster cluster for computations in a parallel fashion.

Every computational process or job has a unique identifier (we call it a session id), which is used to store the results of the process in a unified way. For example, model scores and model statistics such as variable importance and model quality metrics can all be stored under the same identifier for a particular run. Because the model id can be retrieved for every session id, it is possible to look back in history and compare models from different time periods.
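The session-id idea can be sketched in a few lines. The names below (`SessionStore`, `model_id`) are illustrative, not the API of our actual packages:

```python
# Minimal sketch: every run gets a unique session id, and all results of
# that run are stored under it, so runs can be compared later.
import uuid

class SessionStore:
    def __init__(self):
        self.records = []

    def new_session(self, model_id):
        # One session id per computational run, linked to a model id.
        session_id = uuid.uuid4().hex
        self.records.append({"session_id": session_id, "model_id": model_id})
        return session_id

    def log_metric(self, session_id, name, value):
        # Scores and statistics are stored under the run's session id.
        self.records.append({"session_id": session_id, "metric": name, "value": value})

    def history(self, model_id):
        # All sessions for one model, enabling comparison across time periods.
        return [r for r in self.records if r.get("model_id") == model_id]

store = SessionStore()
sid = store.new_session(model_id="churn_v2")
store.log_metric(sid, "auc", 0.91)
```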

Storage

When operating a data science or analytics team it is important to define standard metrics or KPIs to measure a model’s stability, performance and input data quality. Metrics can be, for example, variable importance, false positives or any other commonly used measure. These standard metrics make every iteration of a model comparable to older versions and make it easier to definitively determine whether an iteration is better. The metrics are meant to quantify the improvement and quality of a result in a manner agreed upon by the team, and they should be calculated in exactly the same way to be comparable with those of previous versions.

Because the metrics have been agreed upon beforehand, boilerplate code or functions to calculate them can be created, packaged and reused by all team members. This not only ensures that the metrics are calculated in the same manner (even across different models) but also speeds up the development process. Standardizing the metric code discourages calculating metrics in a non-agreed-upon manner, which avoids misinterpretation of results.
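Such shared boilerplate can be as simple as a module every model imports. A sketch with illustrative function names (not taken from our packages):

```python
# Shared metrics module: every model computes the agreed-upon KPIs the
# same way, so iterations are directly comparable.

def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN) for binary labels encoded as 0/1."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def standard_metrics(y_true, y_pred):
    # Every model run reports this same dictionary of metrics.
    return {
        "accuracy": accuracy(y_true, y_pred),
        "false_positive_rate": false_positive_rate(y_true, y_pred),
    }
```

For example, `standard_metrics([1, 0, 0, 1], [1, 1, 0, 1])` yields an accuracy of 0.75 and a false positive rate of 0.5.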

We at KPN Advanced Analytics have developed an R and a Python package that we use to track our model metrics. These packages use a database table structure to store the metrics and scores. The packages and table structure are the result of months of tweaks to account for various use cases. The R package is still in beta and the Python package in alpha, but both are actively used by our teams in our production system.

You can find more information on our packages here.

Model Factory uses a relational database as its primary storage (e.g., Teradata, PostgreSQL, MySQL). However, models in Model Factory can use input data from any type of data store (e.g., Hadoop, file store, web).

Currently Model Factory can only store its results in a relational database. Tables in the database use session id and model id as keys. The models themselves can be stored as a blob type in a table, and model summaries, variable importance statistics and model quality information (test results) can be stored in the corresponding tables. Below is an example of a table structure that can be used within Model Factory; the structure can be adapted to business needs.

data_structure.png
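A minimal sketch of this kind of table structure, using sqlite3 purely for illustration; the table and column names are assumptions, not Model Factory's exact schema:

```python
# Illustrative schema: session_id and model_id as keys, the serialized
# model stored as a blob, and metrics stored per session.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model_session (
    session_id TEXT PRIMARY KEY,
    model_id   TEXT NOT NULL,
    run_ts     TEXT NOT NULL
);
CREATE TABLE model_blob (
    model_id TEXT PRIMARY KEY,
    model    BLOB NOT NULL
);
CREATE TABLE model_metric (
    session_id TEXT NOT NULL,
    metric     TEXT NOT NULL,
    value      REAL NOT NULL,
    PRIMARY KEY (session_id, metric)
);
""")

conn.execute("INSERT INTO model_session VALUES (?, ?, ?)",
             ("s-001", "churn_v2", "2023-06-17T17:45:00"))
conn.execute("INSERT INTO model_metric VALUES (?, ?, ?)", ("s-001", "auc", 0.91))

# Comparing a model's metrics across runs is a join on session_id:
rows = conn.execute("""
SELECT s.model_id, m.metric, m.value
FROM model_session s JOIN model_metric m USING (session_id)
""").fetchall()
```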
