Data science process with checklists

data science checklist


Step 1


Analyse the project's goals/objectives

  • check whether the goal is to predict, classify, or cluster the data

Data loading

  • check data sources:

    • if file:
      • check the file extension
      • check the file size/dimensions (whether it fits in memory)
      • use an appropriate method for reading/loading the data from the file
    • if URL:
      • check that the URL is reachable (health check)
      • download the URL contents to disk/memory
      • then follow the file steps above
    • if streaming:
      • store batches of data
  • check for available metadata (sometimes there's information about the data)
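
A minimal sketch of the loading step, assuming pandas and a local CSV file; the path and the in-memory size threshold are placeholders:

```python
import os
import pandas as pd

path = "data.csv"  # hypothetical input file

ext = os.path.splitext(path)[1].lower()   # check the file extension
size_mb = os.path.getsize(path) / 1e6     # check if it fits in memory

if ext != ".csv":
    raise ValueError(f"unexpected extension: {ext}")

if size_mb < 500:  # illustrative memory threshold
    df = pd.read_csv(path)
else:
    # Read in batches; in a real pipeline each chunk would be
    # processed or stored incrementally rather than concatenated.
    chunks = pd.read_csv(path, chunksize=100_000)
    df = pd.concat(chunks, ignore_index=True)
```

For a URL source, the same steps apply after first downloading the contents to disk or memory (e.g., with `requests`).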

Exploratory data analysis

  • visualize a data sample from the train/test sets
  • check the data types
  • determine the independent and dependent variables in the dataset
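
A quick EDA sketch with pandas, continuing from the loaded `df`; the target column name `label` is an assumption:

```python
print(df.sample(5))                # visualize a data sample
print(df.dtypes)                   # check the data types
print(df.describe(include="all"))  # summary statistics

# Split column names into dependent and independent variables.
target = "label"                   # hypothetical dependent variable
features = [c for c in df.columns if c != target]
```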

Data preprocessing/cleaning/wrangling

  • check if data is clean
    • check for missing values
      • if it has missing values:
        • if the % of missing values is below 10%, delete the affected rows
        • if the % of missing values is above 10%, delete the feature
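
A sketch of this heuristic, using the 10% threshold from the checklist:

```python
# Fraction of missing values per column.
missing = df.isna().mean()

# Features with more than 10% missing values: drop the whole column.
df = df.drop(columns=missing[missing > 0.10].index)

# The remaining columns have few missing values: drop the affected rows.
df = df.dropna()
```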

Data modeling

  • Model selection:
    • select metrics for evaluation
    • select a random forest as the first algorithm to optimize
    • define the model's hyperparameters (keep the set small at this stage)
    • train/optimize models using cross validation
    • select best model
    • evaluate model for under- or overfitting effects
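
A first-pass baseline sketched with scikit-learn, continuing from the variables above (it assumes the features are already numeric at this stage):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = df[features], df[target]

# Small, fixed hyperparameter set for a first pass.
model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)

# 5-fold cross-validation; accuracy is the chosen evaluation metric here.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A large gap between training and cross-validation scores points to overfitting; uniformly low scores point to underfitting.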

Solution analysis

  • check the model accuracy and set it as the minimal baseline for the dataset
  • check which features contributed most to the model
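
Continuing the sketch, the random forest's feature importances give a first view of which features contributed most:

```python
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head(10))
```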

Step 2


Data preprocessing/cleaning/wrangling

  • check the data format

    • check if data has a valid range of values (e.g., ages above 100 are usually unlikely)
    • check the data type (numerical or categorical)
      • if categorical:
        • check the % of unique values vs. total values
          • if low, the feature is categorical and needs no further processing
          • if high, the feature may need further analysis and processing
      • if numerical:
        • check if it is continuous or categorical (% of unique values vs. total values)
          • if continuous:
            • check for homoscedasticity
            • check for normality (skewness/kurtosis)
            • check for linearity
          • if categorical:
            • convert to a categorical type
  • preprocess data (see the pipeline sketch after this list)

    • if numerical, scale the data by subtracting the mean and dividing by the standard deviation
    • if categorical, convert to one-hot encoding (with n-1 categories)
  • check if data is clean

    • check for missing values
      • if it has missing values:
        • try multiple imputation methods (e.g., train several random forest models and evaluate their accuracy)
        • otherwise, delete the affected rows or features
    • check for outliers
      • univariate
      • bivariate
      • multivariate
  • feature engineering

    • check for correlations between features
    • evaluate factor analysis
    • evaluate dimensionality reduction
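
A preprocessing sketch with scikit-learn covering the scaling, one-hot encoding with n-1 categories (`drop="first"`), imputation, and correlation checks above; the imputation strategies are illustrative choices:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Check for correlations between numerical features.
print(X[num_cols].corr())

preprocess = ColumnTransformer([
    # Numerical: impute, then subtract the mean and divide by the std.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    # Categorical: impute, then one-hot encode with n-1 categories.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(drop="first"))]), cat_cols),
])
```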

Data modeling

  • Model selection:
    • select metrics for evaluation
    • select algorithms to optimize
    • define model's hyperparameters
    • train/optimize models using cross validation
    • evaluate trained models for under- or overfitting effects
    • select best model(s)
  • Model train
    • train/optimize best model(s) using the full training data
    • (optional) train an ensemble of models
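
A model-selection sketch with scikit-learn, reusing the `preprocess` transformer from above; the candidate algorithms and hyperparameter grids are illustrative, not prescribed by the checklist:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Candidate algorithms with deliberately small hyperparameter grids.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(random_state=0), {"clf__max_depth": [4, 8, None]}),
    "gbm": (GradientBoostingClassifier(random_state=0), {"clf__learning_rate": [0.05, 0.1]}),
}

best = None
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("prep", preprocess), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(name, search.best_score_)
    if best is None or search.best_score_ > best.best_score_:
        best = search

# GridSearchCV refits the best estimator on the full training data (refit=True).
final_model = best.best_estimator_
```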

Step 3


Inference

  • Apply the trained model to new data
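
Finally, a sketch of applying the selected pipeline to unseen data (the file name is a placeholder; the new data must contain the same feature columns):

```python
new_data = pd.read_csv("new_data.csv")  # hypothetical unseen data
predictions = final_model.predict(new_data)
```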

Overall process


  1. data load
     • import data
  2. data preprocess
     • clean
     • transform
     • normalize
  3. data modeling
     • model selection
     • hyperparameter selection
     • model training
     • select the best model
     • model ensemble (optional)
  4. inference
     • evaluate the model on unseen data
