Here is a general end-to-end pipeline for a data science project.
1. Define Business Objective & Criteria for Success
- Experimental Design
- Identify the business/product objective
- Identify & hypothesize goals and criteria for success
- Create a set of questions for identifying correct data set
- Define which machine learning evaluation metric will be used to quantify quality of predictions
- Identify data sources, time window of data collected, data formats, data dictionary, features, target & evaluation metric
2. Data Acquisition
- Define what/how much data we need, where it lives, what format it's in & load dataset
- Import data from a local or remote data source & determine the most appropriate tools to work with the data
- Pandas has readers for common open data formats, as well as database connectors for MySQL & PostgreSQL
- Use Spark for Big Data
- Gather/Read any documentation available for the data (schema, data dictionary)
- Load and pre-process the data into a representation which is ready for model training
- If the data is available in an open format (JSON, CSV, XML, Excel), you'll be able to leverage open source tools
- If the data is available in a closed or proprietary format (e.g. fixed-width rows), then you will need to develop a parser to split the data into appropriate columns
- Ensure correct data types are assigned (e.g. dates parsed as datetimes, not left as strings)
- Look at the values. Ensure they make sense in the context of each column
- Look for missing/empty values
- For categorical fields, what are the unique values in the field?
- For numeric fields, are all values numbers?
- Split-out validation dataset
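The acquisition checks above can be sketched with pandas and scikit-learn. The dataset here is hypothetical stand-in data; in practice you would load a real file with `pd.read_csv`, `pd.read_json`, `pd.read_excel`, etc.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset standing in for a loaded file; in practice use
# pd.read_csv("..."), pd.read_json("..."), a SQL connector, etc.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 36],
    "segment": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "churned": [0, 1, 0, 1, 0, 0, 1, 0],
})

# Basic sanity checks on types and values
print(df.dtypes)               # are numeric fields actually numeric?
print(df["segment"].unique())  # unique values of a categorical field
print(df.isna().sum())         # missing/empty values per column

# Hold out a validation set before any modelling
train, valid = train_test_split(df, test_size=0.25, random_state=42)
```

Splitting out the validation set this early keeps it untouched by any decisions made during exploration and training.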
3. Exploratory Data Analysis
- Gather insights by using exploratory methods, descriptive & inferential statistics
- Find the mean, median, mode, std dev, min & max for each column. Do these make sense in the context of the column?
- Do financial values have reasonable upper bounds?
- Univariate feature distributions (to observe stability & other patterns of a given feature like skew)
- Feature & target correlations
- Target analysis (plot of feature vs target)
- Are there any outliers?
- Do the column values seem to follow a normal distribution? Uniform? Exponential (i.e. long tail)? If exponential, taking log(X) may be beneficial for linear regression.
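A minimal sketch of these EDA checks, using synthetic data (a long-tailed "income" column and a roughly normal "score" column are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical long-tailed "income" feature and a roughly normal "score"
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),
    "score": rng.normal(loc=50, scale=10, size=1000),
})

# Descriptive statistics: do min/max/mean/std make sense per column?
print(df.describe())

# Strong positive skew suggests a long tail, where log(X)
# can help linear models
print(df.skew())

df["log_income"] = np.log(df["income"])
print(df["log_income"].skew())  # much closer to 0 after the transform
```

Histograms (`df.hist()`) and a correlation matrix (`df.corr()`) cover the univariate-distribution and feature/target-correlation bullets the same way.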
4. Feature Engineering
- Perform feature scaling / normalization
- Inject domain knowledge (structure) into the data by adding or modifying existing columns
- Linear combinations of two or more features (ratios or other arithmetic variations)
- Adding new columns for day of year, hour of day from a datetime column
- Convert categorical data into numerical values using one-hot encoding
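The feature-engineering steps above can be sketched in pandas and scikit-learn; the columns here are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a datetime and a categorical column
df = pd.DataFrame({
    "ts": pd.to_datetime(["2023-01-05 08:30", "2023-06-20 17:45",
                          "2023-11-11 23:10", "2023-03-02 04:05"]),
    "color": ["red", "blue", "red", "green"],
    "amount": [10.0, 250.0, 40.0, 5.0],
})

# Domain-knowledge columns derived from the datetime
df["day_of_year"] = df["ts"].dt.dayofyear
df["hour_of_day"] = df["ts"].dt.hour

# One-hot encode the categorical field
df = pd.get_dummies(df, columns=["color"])

# Scale the numeric feature to zero mean / unit variance
df["amount_scaled"] = StandardScaler().fit_transform(df[["amount"]]).ravel()
```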
5. Feature Selection
- Drop highly correlated features (see correlation section above)
- PCA
- Recursive Feature Elimination
- Regularization methods such as LASSO (the L1 penalty drives weak feature coefficients to zero)
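Recursive Feature Elimination and LASSO can both be demonstrated on synthetic data (the data and parameters here are illustrative, not a recipe):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Recursive Feature Elimination: repeatedly fit, then drop the weakest feature
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print("RFE keeps features:", np.where(rfe.support_)[0])

# LASSO: the L1 penalty shrinks uninformative coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("LASSO non-zero coefficients:", np.flatnonzero(lasso.coef_))
```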
6. Select, build & evaluate the model
- Establish a baseline model for comparison
- Spot Check & Compare Algorithms
- Run a spot check of single-model performance & tune the three best-performing learners
- Evaluate Algorithms with Standardization
- Improve accuracy
- You may generally find ensemble methods (such as bagging, boosting & gradient boosting) to be quite useful
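A sketch of the baseline-plus-spot-check pattern, assuming scikit-learn and one of its bundled datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A trivial baseline plus a few candidate learners, each in a pipeline
# with standardization so scale-sensitive models are compared fairly
models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(max_iter=1000)),
    "gboost": GradientBoostingClassifier(random_state=0),
}

scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Any learner that cannot beat the dummy baseline is not worth tuning further.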
7. Refine the model (Hyper-parameter tuning)
- Use grid search (e.g. scikit-learn's GridSearchCV) to search & tune hyper-parameters
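A minimal grid-search sketch with scikit-learn; the model and the parameter grid are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Exhaustively try every combination in the grid with cross-validation
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

For large grids, `RandomizedSearchCV` trades exhaustiveness for speed.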
8. Finalize Model (use all training data and confirm using validation dataset)
- Save model binary along with model training results
- Predictions on validation dataset
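Finalizing might look like the following sketch, with joblib (shipped alongside scikit-learn) persisting the model binary; the file name is hypothetical:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit on all training data, then confirm on the held-out validation set
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_valid, y_valid))

# Persist the model binary (hypothetical path) for later reuse
joblib.dump(model, "model.joblib")
reloaded = joblib.load("model.joblib")
```

Saving the training results (metrics, parameters, data version) next to the binary makes the model reproducible later.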
9. Communicate the results
- Summarize findings with narrative & storytelling techniques
- Present limitations, assumptions of your analysis
- Identify follow-up problems and questions for future analysis