Deploying a Machine Learning Model in Production

Supervised Learning Pipeline

Here is a general end-to-end pipeline for a data science project.

1. Define Business Objective & Criteria for Success

  • Experimental Design
    • Identify the business/product objective
    • Identify & hypothesize goals and criteria for success
    • Create a set of questions for identifying correct data set
    • Define which machine learning evaluation metric will be used to quantify the quality of predictions
    • Identify data sources, time window of data collected, data formats, data dictionary, features, target & evaluation metric

2. Data Acquisition

  • Define what data we need & how much, where it lives, and what format it's in, then load the dataset
  • Import data from local or remote data sources & determine the most appropriate tools to work with the data
    • Pandas has functions for common open source data formats, including database connectors for MySQL & PostgreSQL
    • Use Spark for Big Data
  • Gather/Read any documentation available for the data (schema, data dictionary)
  • Load and pre-process the data into a representation which is ready for model training
    • If the data is available in an open source data format (JSON, CSV, XML, EXCEL), you'll be able to leverage open source tools
    • If the data is available in a closed source format (e.g. fixed-width rows), then you will need to develop a parser to format the data into appropriate columns
    • Ensure correct data types are inferred or assigned on import
    • Look at the values. Ensure they make sense in the context of each column
    • Look for missing/empty values
    • For categorical fields, what are the unique values in the field?
    • For numeric fields, are all values numbers?
    • Split out a validation dataset (see the sketch below)
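
A minimal pandas sketch of these loading & sanity checks, assuming a hypothetical flights.csv with columns such as carrier, fare & departure_time:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset & column names, for illustration only.
df = pd.read_csv("flights.csv", parse_dates=["departure_time"])

# Confirm the inferred data types & eyeball the values in context.
print(df.dtypes)
print(df.head())

# Missing/empty values per column.
print(df.isna().sum())

# Unique values for a categorical field; summary stats for a numeric one.
print(df["carrier"].unique())
print(df["fare"].describe())

# Split out a validation set before any modeling.
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)
```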

3. Exploratory Data Analysis

  • Gather insights by using exploratory methods, descriptive & inferential statistics
    • Find the median, mode, standard deviation, min, max & mean for each column. Do these make sense in the context of the column?
    • Do financial values have reasonable upper bounds?
    • Univariate feature distributions (to observe stability & other patterns of a given feature like skew)
    • Feature & target correlations
    • Target analysis (plot of feature vs target)
    • Are there any outliers?
    • Do the column values seem to follow a normal distribution? Uniform? Exponential (i.e. long tail)? If exponential, taking log(X) may be beneficial for linear regression (see the sketch below).
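
Continuing the sketch above, a few of these exploratory checks (the column names, including the delay_minutes target, remain hypothetical):

```python
import numpy as np

# Mean, std dev, min, max & quartiles for every numeric column.
print(train_df.describe())

# Median, mode & skew for a single column.
print(train_df["fare"].median(), train_df["fare"].mode()[0], train_df["fare"].skew())

# Feature/target correlations (assuming delay_minutes is the target).
print(train_df.corr(numeric_only=True)["delay_minutes"].sort_values())

# Long-tailed financial values: log1p handles zeros gracefully.
train_df["log_fare"] = np.log1p(train_df["fare"])
```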

4. Feature Engineering

  • Perform feature scaling / normalization
  • Inject domain knowledge (structure) into the data by adding new columns or modifying existing ones
    • Linear combinations of two or more features (ratios or other arithmetic variations)
    • Adding new columns for day of year, hour of day from a datetime column
  • Convert categorical data into numerical values using one-hot encoding
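
A short sketch of these transformations, using the same hypothetical columns as above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Calendar features derived from a datetime column.
train_df["day_of_year"] = train_df["departure_time"].dt.dayofyear
train_df["hour_of_day"] = train_df["departure_time"].dt.hour

# A ratio of two existing features (hypothetical columns).
train_df["fare_per_mile"] = train_df["fare"] / train_df["distance"]

# One-hot encode a categorical column.
train_df = pd.get_dummies(train_df, columns=["carrier"])

# Scale numeric features; fit the scaler on training data only.
numeric_cols = ["fare", "distance", "fare_per_mile"]
train_df[numeric_cols] = StandardScaler().fit_transform(train_df[numeric_cols])
```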

5. Feature Selection

  • Drop highly correlated features (see correlation section above)
  • PCA
  • Recursive Feature Elimination (RFE)
  • Regularization using LASSO (features whose coefficients shrink to zero can be dropped)
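
A sketch of each approach, assuming the hypothetical frame from earlier has at least five numeric features and delay_minutes as the target:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression

X = train_df.select_dtypes("number").drop(columns=["delay_minutes"])
y = train_df["delay_minutes"]

# Drop one feature from each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# PCA: project onto the top principal components instead.
X_pca = PCA(n_components=5).fit_transform(X)

# Recursive Feature Elimination with a simple linear model.
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
print(X.columns[rfe.support_])

# LASSO: features whose coefficients shrink to zero can be dropped.
lasso = Lasso(alpha=0.1).fit(X, y)
print(X.columns[lasso.coef_ != 0])
```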

6. Select, build & evaluate the model

  • Establish a baseline model for comparison
  • Spot Check & Compare Algorithms
  • Run a spot check of single-model performance & tune the top 3 best-performing learners
    • Evaluate Algorithms with Standardization
    • Improve accuracy
  • You may generally find ensemble methods (such as bagging, boosting & gradient boosting) quite useful
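
A spot-check sketch in scikit-learn, continuing with X & y from the feature-selection sketch: a DummyRegressor serves as the baseline, and standardization happens inside a pipeline so cross-validation folds don't leak statistics:

```python
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

models = {
    "baseline": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(random_state=42),
    "boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    # Standardize inside the pipeline so CV folds don't leak statistics.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.3f}")
```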

7. Refine the model (Hyper-parameter tuning)

  • Use grid search (e.g. scikit-learn's GridSearchCV) to search & tune hyper-parameters (see the sketch below)
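
A minimal GridSearchCV sketch, continuing with X & y from above; the parameter grid is illustrative, not a recommendation:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```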

8. Finalize Model (use all training data and confirm using validation dataset)

  • Save model binary along with model training results
  • Predictions on validation dataset
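
A sketch of finalizing & persisting the model with joblib, assuming valid_df has been run through the same feature-engineering steps as train_df:

```python
import joblib

# Refit the tuned model on all of the training data.
final_model = search.best_estimator_.fit(X, y)

# Persist the model binary together with its training results.
joblib.dump({"model": final_model, "cv_mae": -search.best_score_}, "model.joblib")

# Confirm generalization on the held-out validation set.
preds = final_model.predict(valid_df[X.columns])
```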

9. Communicate the results

  • Summarize findings with narrative, storytelling techniques
  • Present limitations, assumptions of your analysis
  • Identify follow-up problems and questions for future analysis

Variables Not to Be Used for Training an ML Model

Not all variables available in the dataset should be used during training. Here is a list of questions to help you figure out which variables to exclude:

  1. Is the variable available at the time of inference (i.e. when making a production prediction)? You'll first want to know when you'll be making a prediction. For example, do you know whether a plane will arrive late prior to it taking off?

  2. Is the variable legal to use? In some regulated industries, certain variables are illegal to use for predictive modeling; personally identifiable information (PII) is one such example.

  3. How likely is the variable to be available in production? Determine a threshold for how often you expect a variable to be populated at inference time, and remove variables whose availability falls below that threshold (see the sketch below).
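
A toy sketch of that availability check; the prod_df records & field names are hypothetical:

```python
import pandas as pd

# Hypothetical sample of recent production records at inference time.
prod_df = pd.DataFrame({
    "scheduled_departure": ["08:00", "09:15", "11:30", "13:45"],
    "actual_departure":    [None,    None,    None,    "13:50"],
})

# Fraction of records in which each field is populated.
availability = prod_df.notna().mean()

# Exclude fields whose availability falls below the chosen threshold.
MIN_AVAILABILITY = 0.95
to_drop = availability[availability < MIN_AVAILABILITY].index.tolist()
print("exclude from training:", to_drop)  # ['actual_departure']
```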
