Here is a general end-to-end pipeline for a data science project.
1. Define Business Objective & Criteria for Success
- Experimental Design
- Identify the business/product objective
- Identify & hypothesize goals and criteria for success
- Create a set of questions for identifying correct data set
- Define which machine learning evaluation metric will be used to quantify quality of predictions
- Identify data sources, time window of data collected, data formats, data dictionary, features, target & evaluation metric
2. Data Acquisition
- Define what/how much data we need, where it lives, what format it's in & load dataset
- Import data from a local or remote data source & determine the most appropriate tools to work with the data
- Pandas has readers for common open data formats, as well as database connectors for MySQL & PostgreSQL
- Use Spark for Big Data
- Gather/Read any documentation available for the data (schema, data dictionary)
- Load and pre-process the data into a representation which is ready for model training
- If the data is available in an open format (JSON, CSV, XML, Excel), you'll be able to leverage open source tools
- If the data is available in a closed or proprietary format (e.g. fixed-width rows), then you will need to develop a parser to split the data into appropriate columns
- Ensure correct data types are assigned (e.g. dates parsed as datetimes, not left as strings)
- Look at the values. Ensure they make sense in the context of each column
- Look for missing/empty values
- For categorical fields, what are the unique values in the field?
- For numeric fields, are all values numbers?
- Split-out validation dataset
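The acquisition checks above can be sketched with pandas and scikit-learn. The dataset here is hypothetical stand-in data; in practice you would load a real file with `pd.read_csv`, `pd.read_json`, `pd.read_excel`, etc.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset standing in for a loaded file; in practice use
# pd.read_csv("..."), pd.read_json("..."), a SQL connector, etc.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 36],
    "segment": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "churned": [0, 1, 0, 1, 0, 0, 1, 0],
})

# Basic sanity checks on types and values
print(df.dtypes)               # are numeric fields actually numeric?
print(df["segment"].unique())  # unique values of a categorical field
print(df.isna().sum())         # missing/empty values per column

# Hold out a validation set before any modelling
train, valid = train_test_split(df, test_size=0.25, random_state=42)
```

Splitting out the validation set this early keeps it untouched by any decisions made during exploration and training.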
3. Exploratory Data Analysis
- Gather insights by using exploratory methods, descriptive & inferential statistics
- Find the mean, median, mode, std dev, min & max for each column. Do these make sense in the context of the column?
- Do financial values have reasonable upper bounds?
- Univariate feature distributions (to observe stability & other patterns of a given feature like skew)
- Feature & target correlations
- Target analysis (plot of feature vs target)
- Are there any outliers?
- Do the column values seem to follow a normal distribution? Uniform? Exponential (i.e. long tail)? If exponential, taking log(X) may be beneficial for linear regression.
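A minimal sketch of these EDA checks, using synthetic data (a long-tailed "income" column and a roughly normal "score" column are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical long-tailed "income" feature and a roughly normal "score"
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),
    "score": rng.normal(loc=50, scale=10, size=1000),
})

# Descriptive statistics: do min/max/mean/std make sense per column?
print(df.describe())

# Strong positive skew suggests a long tail, where log(X)
# can help linear models
print(df.skew())

df["log_income"] = np.log(df["income"])
print(df["log_income"].skew())  # much closer to 0 after the transform
```

Histograms (`df.hist()`) and a correlation matrix (`df.corr()`) cover the univariate-distribution and feature/target-correlation bullets the same way.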
4. Feature Engineering
- Perform feature scaling / normalization
- Inject domain knowledge (structure) into the data by adding or modifying existing columns
- Linear combinations of two or more features (ratios or other arithmetic variations)
- Adding new columns for day of year, hour of day from a datetime column
- Convert categorical data into numerical values using one-hot encoding
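The feature-engineering steps above can be sketched in pandas and scikit-learn; the columns here are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a datetime and a categorical column
df = pd.DataFrame({
    "ts": pd.to_datetime(["2023-01-05 08:30", "2023-06-20 17:45",
                          "2023-11-11 23:10", "2023-03-02 04:05"]),
    "color": ["red", "blue", "red", "green"],
    "amount": [10.0, 250.0, 40.0, 5.0],
})

# Domain-knowledge columns derived from the datetime
df["day_of_year"] = df["ts"].dt.dayofyear
df["hour_of_day"] = df["ts"].dt.hour

# One-hot encode the categorical field
df = pd.get_dummies(df, columns=["color"])

# Scale the numeric feature to zero mean / unit variance
df["amount_scaled"] = StandardScaler().fit_transform(df[["amount"]]).ravel()
```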
5. Feature Selection
- Drop highly correlated features (see correlation section above)
- PCA
- Recursive Feature Elimination
- Regularization methods such as LASSO (the L1 penalty drives weak feature coefficients to zero)
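Recursive Feature Elimination and LASSO can both be demonstrated on synthetic data (the data and parameters here are illustrative, not a recipe):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Recursive Feature Elimination: repeatedly fit, then drop the weakest feature
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print("RFE keeps features:", np.where(rfe.support_)[0])

# LASSO: the L1 penalty shrinks uninformative coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("LASSO non-zero coefficients:", np.flatnonzero(lasso.coef_))
```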
6. Select, build & evaluate the model
- Establish a baseline model for comparison
- Spot Check & Compare Algorithms
- Run a spot check of single-model performance & tune the three best-performing learners
- Evaluate Algorithms with Standardization
- Improve accuracy
- You may generally find ensemble methods (such as bagging, boosting & gradient boosting) to be quite useful
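A sketch of the baseline-plus-spot-check pattern, assuming scikit-learn and one of its bundled datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A trivial baseline plus a few candidate learners, each in a pipeline
# with standardization so scale-sensitive models are compared fairly
models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(max_iter=1000)),
    "gboost": GradientBoostingClassifier(random_state=0),
}

scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Any learner that cannot beat the dummy baseline is not worth tuning further.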
7. Refine the model (Hyper-parameter tuning)
- Use grid search (e.g. scikit-learn's GridSearchCV) to search & tune hyper-parameters
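A minimal grid-search sketch with scikit-learn; the model and the parameter grid are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Exhaustively try every combination in the grid with cross-validation
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

For large grids, `RandomizedSearchCV` trades exhaustiveness for speed.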
8. Finalize Model (use all training data and confirm using validation dataset)
- Save model binary along with model training results
- Predictions on validation dataset
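Finalizing might look like the following sketch, with joblib (shipped alongside scikit-learn) persisting the model binary; the file name is hypothetical:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit on all training data, then confirm on the held-out validation set
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_valid, y_valid))

# Persist the model binary (hypothetical path) for later reuse
joblib.dump(model, "model.joblib")
reloaded = joblib.load("model.joblib")
```

Saving the training results (metrics, parameters, data version) next to the binary makes the model reproducible later.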
9. Communicate the results
- Summarize findings with narrative & storytelling techniques
- Present limitations, assumptions of your analysis
- Identify follow-up problems and questions for future analysis