Master Machine Learning with scikit-learn
Part 1
Chapter 1: Introduction
- 1.1 Course overview
- 1.2 scikit-learn vs Deep Learning
- 1.3 Prerequisite skills
- 1.4 Course setup and software versions
- 1.5 Course outline
- 1.6 Course datasets
- 1.7 Meet your instructor
Chapter 2: Review of the Machine Learning workflow
- 2.1 Loading and exploring a dataset
- 2.2 Building and evaluating a model
- 2.3 Using the model to make predictions
- 2.4 Q&A: How do I adapt this workflow to a regression problem?
- 2.5 Q&A: How do I adapt this workflow to a multiclass problem?
- 2.6 Q&A: Why should I select a Series for the target?
- 2.7 Q&A: How do I add the model's predictions to a DataFrame?
- 2.8 Q&A: How do I determine the confidence level of each prediction?
- 2.9 Q&A: How do I check the accuracy of the model's predictions?
- 2.10 Q&A: What do the "solver" and "random_state" parameters do?
- 2.11 Q&A: How do I show all of the model parameters?
- 2.12 Q&A: Should I shuffle the samples when using cross-validation?
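The basic workflow this chapter reviews can be sketched in a few lines. This is a minimal illustration on made-up toy data, not the course dataset, using the fit/predict/predict_proba/cross_val_score pattern from scikit-learn's standard API:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# toy data standing in for the course dataset
df = pd.DataFrame({
    "feat1": [1, 2, 3, 4, 5, 6, 7, 8],
    "feat2": [8, 7, 6, 5, 4, 3, 2, 1],
    "label": [0, 0, 0, 0, 1, 1, 1, 1],
})
X = df[["feat1", "feat2"]]   # feature matrix (DataFrame)
y = df["label"]              # target as a Series (see 2.6)

model = LogisticRegression(solver="liblinear", random_state=1)  # see 2.10
model.fit(X, y)
preds = model.predict(X)             # class predictions
probs = model.predict_proba(X)       # per-class confidence (see 2.8)
scores = cross_val_score(model, X, y, cv=2, scoring="accuracy")  # see 2.9
```

The same skeleton adapts to regression (2.4) by swapping in a regressor and a regression metric, and to multiclass problems (2.5) without any code changes.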
Chapter 3: Encoding categorical features
- 3.1 Introduction to one-hot encoding
- 3.2 Transformer methods: fit, transform, fit_transform
- 3.3 One-hot encoding of multiple features
- 3.4 Q&A: When should I use transform instead of fit_transform?
- 3.5 Q&A: What happens if the testing data includes a new category?
- 3.6 Q&A: Should I drop one of the one-hot encoded categories?
- 3.7 Q&A: How do I encode an ordinal feature?
- 3.8 Q&A: What's the difference between OrdinalEncoder and LabelEncoder?
- 3.9 Q&A: Should I encode numeric features as ordinal features?
Chapter 4: Improving your workflow with ColumnTransformer and Pipeline
- 4.1 Preprocessing features with ColumnTransformer
- 4.2 Chaining steps with Pipeline
- 4.3 Using the Pipeline to make predictions
- 4.4 Q&A: How do I drop some columns and passthrough others?
- 4.5 Q&A: How do I transform the unspecified columns?
- 4.6 Q&A: How do I select columns from a NumPy array?
- 4.7 Q&A: How do I select columns by data type?
- 4.8 Q&A: How do I select columns by column name pattern?
- 4.9 Q&A: Should I use ColumnTransformer or make_column_transformer?
- 4.10 Q&A: Should I use Pipeline or make_pipeline?
- 4.11 Q&A: How do I examine the steps of a Pipeline?
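The pairing this chapter builds can be sketched as a `ColumnTransformer` for column-wise preprocessing chained into a `Pipeline` with a model. A minimal sketch on toy data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "shape": ["square", "circle", "circle", "square"],
    "size": [1.0, 2.0, 3.0, 4.0],
    "label": [0, 1, 1, 0],
})

# encode "shape", pass "size" through unchanged (see 4.4)
ct = make_column_transformer(
    (OneHotEncoder(), ["shape"]),
    remainder="passthrough",
)
pipe = make_pipeline(ct, LogisticRegression())
pipe.fit(df[["shape", "size"]], df["label"])
preds = pipe.predict(df[["shape", "size"]])
steps = list(pipe.named_steps)  # examine the Pipeline's steps (see 4.11)
```

`make_column_transformer` and `make_pipeline` are the shorthand constructors compared in 4.9 and 4.10; they auto-generate step names from the class names.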
Chapter 5: Workflow review #1
- 5.1 Recap of our workflow
- 5.2 Comparing ColumnTransformer and Pipeline
- 5.3 Creating a Pipeline diagram
Part 2
Chapter 6: Encoding text data
- 6.1 Vectorizing text
- 6.2 Including text data in the model
- 6.3 Q&A: Why is the document-term matrix stored as a sparse matrix?
- 6.4 Q&A: What happens if the testing data includes new words?
- 6.5 Q&A: How do I vectorize multiple columns of text?
- 6.6 Q&A: Should I one-hot encode or vectorize categorical features?
Chapter 7: Handling missing values
- 7.1 Introduction to missing values
- 7.2 Three ways to handle missing values
- 7.3 Missing value imputation
- 7.4 Using "missingness" as a feature
- 7.5 Q&A: How do I perform multivariate imputation?
- 7.6 Q&A: What are the best practices for missing value imputation?
- 7.7 Q&A: What's the difference between ColumnTransformer and FeatureUnion?
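Two of the strategies this chapter covers, imputation (7.3) and "missingness" as a feature (7.4), can be combined in a single `SimpleImputer` via `add_indicator`. A toy sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# mean imputation plus binary missing-indicator columns
imp = SimpleImputer(strategy="mean", add_indicator=True)
Xt = imp.fit_transform(X)
# first two columns: imputed values; last two: indicators,
# one per input column that contained missing values
```

For the multivariate approach asked about in 7.5, scikit-learn offers `IterativeImputer`, which models each feature as a function of the others.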
Chapter 8: Fixing common workflow problems
- 8.1 Two new problems
- 8.2 Problem 1: Missing values in a categorical feature
- 8.3 Problem 2: Missing values in the new data
- 8.4 Q&A: How do I see the feature names output by the ColumnTransformer?
- 8.5 Q&A: Why did we create a Pipeline inside of the ColumnTransformer?
- 8.6 Q&A: Which imputation strategy should I use with categorical features?
- 8.7 Q&A: Should I impute missing values before all other transformations?
- 8.8 Q&A: What methods can I use with a Pipeline?
Chapter 9: Workflow review #2
- 9.1 Recap of our workflow
- 9.2 Comparing ColumnTransformer and Pipeline
- 9.3 Why not use pandas for transformations?
- 9.4 Preventing data leakage
Part 3
Chapter 10: Evaluating and tuning a Pipeline
- 10.1 Evaluating a Pipeline with cross-validation
- 10.2 Tuning a Pipeline with grid search
- 10.3 Tuning the model
- 10.4 Tuning the transformers
- 10.5 Using the best Pipeline to make predictions
- 10.6 Q&A: How do I save the best Pipeline for future use?
- 10.7 Q&A: How do I speed up a grid search?
- 10.8 Q&A: How do I tune a Pipeline with randomized search?
- 10.9 Q&A: What's the target accuracy we are trying to achieve?
- 10.10 Q&A: Is it okay that our model includes thousands of features?
- 10.11 Q&A: How do I examine the coefficients of a Pipeline?
- 10.12 Q&A: Should I split the dataset before tuning the Pipeline?
- 10.13 Q&A: What is regularization?
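Tuning a Pipeline with grid search, as this chapter does, relies on the `stepname__parameter` naming convention to reach inside any step. A minimal sketch on toy data:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [6, 5, 4, 3, 2, 1]})
y = pd.Series([0, 0, 0, 1, 1, 1])

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# "logisticregression__C" targets the C parameter of the model step
params = {"logisticregression__C": [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, params, cv=3, scoring="accuracy")
grid.fit(X, y)                       # cross-validates every combination
best_C = grid.best_params_["logisticregression__C"]
```

The same parameter-naming convention lets you tune transformer parameters (10.4), and `RandomizedSearchCV` (10.8) accepts the same grid as a distribution to sample from. The fitted `best_estimator_` can be saved for later prediction (10.5, 10.6).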
Chapter 11: Comparing linear and non-linear models
- 11.1 Trying a random forest model
- 11.2 Tuning random forests with randomized search
- 11.3 Further tuning with grid search
- 11.4 Q&A: How do I tune two models with a single grid search?
- 11.5 Q&A: How do I tune two models with a single randomized search?
Chapter 12: Ensembling multiple models
- 12.1 Introduction to ensembling
- 12.2 Ensembling logistic regression and random forests
- 12.3 Combining predicted probabilities
- 12.4 Combining class predictions
- 12.5 Choosing a voting strategy
- 12.6 Tuning an ensemble with grid search
- 12.7 Q&A: When should I use ensembling?
- 12.8 Q&A: How do I apply different weights to the models in an ensemble?
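The ensembling approach this chapter builds, combining logistic regression and random forests, can be sketched with `VotingClassifier` on toy data:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

vc = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(n_estimators=10, random_state=1)),
    ],
    voting="soft",    # average predicted probabilities (see 12.3);
                      # "hard" would combine class predictions (see 12.4)
    weights=[2, 1],   # weight the models differently (see 12.8)
)
vc.fit(X, y)
preds = vc.predict(X)
probs = vc.predict_proba(X)  # the weighted average of the members' probabilities
```

Soft voting requires every member to implement `predict_proba`, which is one factor in the voting-strategy choice covered in 12.5.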
Chapter 13: Feature selection
- 13.1 Introduction to feature selection
- 13.2 Intrinsic methods: L1 regularization
- 13.3 Filter methods: Statistical test-based scoring
- 13.4 Filter methods: Model-based scoring
- 13.5 Filter methods: Summary
- 13.6 Wrapper methods: Recursive feature elimination
- 13.7 Q&A: How do I see which features were selected?
- 13.8 Q&A: Are the selected features the "most important" features?
- 13.9 Q&A: Is it okay for feature selection to remove one-hot encoded categories?
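A filter method with statistical test-based scoring (13.3) can be sketched with `SelectKBest` and the chi-squared test; `get_support` answers the question in 13.7. A toy example where only the first feature carries signal:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(20, 4))   # chi2 requires non-negative values
y = (X[:, 0] > 5).astype(int)           # only feature 0 is informative

selector = SelectKBest(chi2, k=1)       # keep the single highest-scoring feature
Xt = selector.fit_transform(X, y)
mask = selector.get_support()           # which features were kept (see 13.7)
```

Wrapper methods such as recursive feature elimination (13.6) use the same transformer interface, so any of these selectors can drop into a Pipeline step.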
Chapter 14: Feature standardization
- 14.1 Standardizing numerical features
- 14.2 Standardizing all features
- 14.3 Q&A: How do I see what scaling was applied to each feature?
- 14.4 Q&A: How do I turn off feature standardization within a grid search?
- 14.5 Q&A: Which models benefit from standardization?
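Standardization as covered here can be sketched with `StandardScaler`, which rescales each feature to zero mean and unit variance; its fitted attributes answer the question in 14.3. A toy example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
Xt = scaler.fit_transform(X)

# learned per-feature parameters (see 14.3)
means = scaler.mean_
scales = scaler.scale_
```

Distance- and gradient-based models (the topic of 14.5) are the usual beneficiaries; tree-based models are generally insensitive to this rescaling.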
Chapter 15: Feature engineering with custom transformers
- 15.1 Why not use pandas for feature engineering?
- 15.2 Transformer 1: Rounding numerical values
- 15.3 Transformer 2: Clipping numerical values
- 15.4 Transformer 3: Extracting string values
- 15.5 Rules for transformer functions
- 15.6 Transformer 4: Combining two features
- 15.7 Revising the transformers
- 15.8 Q&A: How do I fix incorrect data types within a Pipeline?
- 15.9 Q&A: How do I create features from datetime data?
- 15.10 Q&A: How do I create feature interactions?
- 15.11 Q&A: How do I save a Pipeline with custom transformers?
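A custom transformer in the style this chapter develops can be sketched with `FunctionTransformer` wrapping a plain function. This toy example mirrors the clipping transformer of 15.3:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def clip_values(X):
    # per the rules in 15.5: accept a 2-D array-like, return a 2-D result
    return np.clip(X, 1, 10)

clipper = FunctionTransformer(clip_values)
Xt = clipper.fit_transform(np.array([[0.5], [5.0], [50.0]]))
```

Because the transformer is just a named function, it slots into a ColumnTransformer or Pipeline like any built-in; saving such a Pipeline (15.11) requires the function to be importable when the Pipeline is reloaded.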
Chapter 16: Workflow review #3
- 16.1 Recap of our workflow
- 16.2 What's the role of pandas?
Part 4
Chapter 17: High-cardinality categorical features
- 17.1 Recap of nominal and ordinal features
- 17.2 Preparing the census dataset
- 17.3 Setting up the encoders
- 17.4 Encoding nominal features for a linear model
- 17.5 Encoding nominal features for a non-linear model
- 17.6 Combining the encodings
- 17.7 Best practices for encoding
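One common pairing this chapter's structure suggests is a wide one-hot encoding for linear models versus a compact ordinal encoding for tree-based models. A toy sketch of setting up both encoders, assuming scikit-learn 0.24 or newer for `handle_unknown="use_encoded_value"`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"city": ["NYC", "LA", "SF", "NYC", "LA"]})

# wide, sparse representation suited to linear models
ohe = OneHotEncoder(handle_unknown="ignore")
wide = ohe.fit_transform(df)

# single-column representation that tree models can split on;
# unseen categories at transform time map to -1
oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
narrow = oe.fit_transform(df)
unseen = oe.transform(pd.DataFrame({"city": ["Boston"]}))
```

With high-cardinality features, the width difference between the two representations becomes the central trade-off.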
Chapter 18: Class imbalance
- 18.1 Introduction to class imbalance
- 18.2 Preparing the mammography dataset
- 18.3 Evaluating a model with train/test split
- 18.4 Exploring the results with a confusion matrix
- 18.5 Calculating rates from a confusion matrix
- 18.6 Using AUC as the evaluation metric
- 18.7 Cost-sensitive learning
- 18.8 Tuning the decision threshold
Chapter 19: Class imbalance walkthrough
- 19.1 Best practices for class imbalance
- 19.2 Step 1: Splitting the dataset
- 19.3 Step 2: Optimizing the model on the training set
- 19.4 Step 3: Evaluating the model on the testing set
- 19.5 Step 4: Tuning the decision threshold
- 19.6 Step 5: Retraining the model and making predictions
- 19.7 Q&A: Should I use an ROC curve or a precision-recall curve?
- 19.8 Q&A: Can I use a different metric such as F1 score?
- 19.9 Q&A: Should I use resampling to fix class imbalance?
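Several of the techniques from Chapters 18 and 19 can be sketched together: cost-sensitive learning via `class_weight` (18.7), AUC as the metric (18.6), a confusion matrix (18.4), and a manually lowered decision threshold (18.8, 19.5). A toy example on synthetic imbalanced data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
n = 200
y = (rng.random(n) < 0.1).astype(int)            # roughly 10% positive class
X = y.reshape(-1, 1) * 2.0 + rng.normal(size=(n, 1))

# cost-sensitive learning: weight errors on the rare class more heavily
model = LogisticRegression(class_weight="balanced")
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]
auc = roc_auc_score(y, probs)                    # threshold-independent metric

# lower the default 0.5 threshold to catch more positives
threshold = 0.3
preds = (probs >= threshold).astype(int)
cm = confusion_matrix(y, preds)                  # rows: true class, cols: predicted
```

In practice the threshold should be chosen on held-out data, which is exactly the five-step discipline Chapter 19 walks through.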
Chapter 20: Going further
- 20.1 Q&A: How do I read the scikit-learn documentation?
- 20.2 Q&A: How do I stay up-to-date with new scikit-learn features?
- 20.3 Q&A: How do I improve my Machine Learning skills?
- 20.4 Q&A: How do I learn Deep Learning?