Data science process with checklists

data science checklist


Step 1


Analyse the project's goals/objectives

  • check whether the goal is to predict, classify, or cluster the data

Data loading

  • check data sources:

    • if file:
      • check the file extension
      • check the file size/dimensions (whether it fits in memory)
      • use an appropriate method for reading/loading the data from the file
    • if URL:
      • check that the URL is reachable (health check)
      • download the URL contents to disk/memory
      • then follow the file steps above
    • if streaming:
      • store batches of data
  • check for available metadata (sometimes there's information about the data)
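
A minimal sketch of the loading step, assuming pandas and a local CSV file; the path and the in-memory size threshold are placeholders:

```python
import os
import pandas as pd

path = "data.csv"  # hypothetical input file

ext = os.path.splitext(path)[1].lower()   # check the file extension
size_mb = os.path.getsize(path) / 1e6     # check if it fits in memory

if ext != ".csv":
    raise ValueError(f"unexpected extension: {ext}")

if size_mb < 500:  # illustrative memory threshold
    df = pd.read_csv(path)
else:
    # Read in batches; in a real pipeline each chunk would be
    # processed or stored incrementally rather than concatenated.
    chunks = pd.read_csv(path, chunksize=100_000)
    df = pd.concat(chunks, ignore_index=True)
```

For a URL source, the same steps apply after first downloading the contents to disk or memory (e.g., with `requests`).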

Exploratory data analysis

  • visualize a data sample from the train/test sets
  • check the data types
  • determine the independent and dependent variables in the dataset
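
A quick EDA sketch with pandas, continuing from the loaded `df`; the target column name `label` is an assumption:

```python
print(df.sample(5))                # visualize a data sample
print(df.dtypes)                   # check the data types
print(df.describe(include="all"))  # summary statistics

# Split column names into dependent and independent variables.
target = "label"                   # hypothetical dependent variable
features = [c for c in df.columns if c != target]
```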

Data preprocessing/cleaning/wrangling

  • check if data is clean
    • check for missing values
      • if it has missing values:
        • if the % of missing values is below 10%, delete the affected rows
        • if the % of missing values is above 10%, delete the feature
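
A sketch of this heuristic, using the 10% threshold from the checklist:

```python
# Fraction of missing values per column.
missing = df.isna().mean()

# Features with more than 10% missing values: drop the whole column.
df = df.drop(columns=missing[missing > 0.10].index)

# The remaining columns have few missing values: drop the affected rows.
df = df.dropna()
```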

Data modeling

  • Model selection:
    • select metrics for evaluation
    • select a random forest as the first algorithm to optimize
    • define the model's hyperparameters (keep the set small at this stage)
    • train/optimize models using cross validation
    • select best model
    • evaluate model for under- or overfitting effects
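
A first-pass baseline sketched with scikit-learn, continuing from the variables above (it assumes the features are already numeric at this stage):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = df[features], df[target]

# Small, fixed hyperparameter set for a first pass.
model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)

# 5-fold cross-validation; accuracy is the chosen evaluation metric here.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A large gap between training and cross-validation scores points to overfitting; uniformly low scores point to underfitting.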

Solution analysis

  • check the model accuracy and set it as the minimal baseline for the dataset
  • check which features contributed most to the model
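
Continuing the sketch, the random forest's feature importances give a first view of which features contributed most:

```python
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head(10))
```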

Step 2


Data preprocessing/cleaning/wrangling

  • check the data format

    • check if data has a valid range of values (e.g., ages above 100 are usually unlikely)
    • check the data type (numerical or categorical)
      • if categorical:
        • check the % of unique values vs. total values
          • if low, the feature is categorical and needs no further processing
          • if high, the feature may need further analysis and processing
      • if numerical:
        • check if it is continuous or categorical (% of unique values vs. total values)
          • if continuous:
            • check for homoscedasticity
            • check for normality (skewness/kurtosis)
            • check for linearity
          • if categorical:
            • convert to a categorical type
  • preprocess data (see the pipeline sketch after this list)

    • if numerical, scale the data by subtracting the mean and dividing by the standard deviation
    • if categorical, convert to one-hot encoding (with n-1 categories)
  • check if data is clean

    • check for missing values
      • if it has missing values:
        • try multiple imputation methods (e.g., train several random forest models and evaluate their accuracy)
        • otherwise, delete the affected rows or features
    • check for outliers
      • univariate
      • bivariate
      • multivariate
  • feature engineering

    • check for correlations between features
    • evaluate factor analysis
    • evaluate dimensionality reduction
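
A preprocessing sketch with scikit-learn covering the scaling, one-hot encoding with n-1 categories (`drop="first"`), imputation, and correlation checks above; the imputation strategies are illustrative choices:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Check for correlations between numerical features.
print(X[num_cols].corr())

preprocess = ColumnTransformer([
    # Numerical: impute, then subtract the mean and divide by the std.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    # Categorical: impute, then one-hot encode with n-1 categories.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(drop="first"))]), cat_cols),
])
```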

Data modeling

  • Model selection:
    • select metrics for evaluation
    • select algorithms to optimize
    • define model's hyperparameters
    • train/optimize models using cross validation
    • evaluate trained models for under- or overfitting effects
    • select best model(s)
  • Model train
    • train/optimize best model(s) using the full training data
    • (optional) train an ensemble of models
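
A model-selection sketch with scikit-learn, reusing the `preprocess` transformer from above; the candidate algorithms and hyperparameter grids are illustrative, not prescribed by the checklist:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Candidate algorithms with deliberately small hyperparameter grids.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(random_state=0), {"clf__max_depth": [4, 8, None]}),
    "gbm": (GradientBoostingClassifier(random_state=0), {"clf__learning_rate": [0.05, 0.1]}),
}

best = None
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("prep", preprocess), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(name, search.best_score_)
    if best is None or search.best_score_ > best.best_score_:
        best = search

# GridSearchCV refits the best estimator on the full training data (refit=True).
final_model = best.best_estimator_
```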

Step 3


Inference

  • Apply the trained model to new data
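
Finally, a sketch of applying the selected pipeline to unseen data (the file name is a placeholder; the new data must contain the same feature columns):

```python
new_data = pd.read_csv("new_data.csv")  # hypothetical unseen data
predictions = final_model.predict(new_data)
```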

Overall process


  1. data load
     • import data
  2. data preprocess
     • clean
     • transform
     • normalize
  3. data modeling
     • model selection
     • hyperparameter selection
     • model training
     • select the best model
     • model ensemble (optional)
  4. inference
     • evaluate the model on unseen data
