@justmarkham
Last active June 23, 2023 14:09
Data School course: "Master Machine Learning with scikit-learn" (launching in late 2023!)

Master Machine Learning with scikit-learn

Part 1

Chapter 1: Introduction

  • 1.1 Course overview
  • 1.2 scikit-learn vs Deep Learning
  • 1.3 Prerequisite skills
  • 1.4 Course setup and software versions
  • 1.5 Course outline
  • 1.6 Course datasets
  • 1.7 Meet your instructor

Chapter 2: Review of the Machine Learning workflow

  • 2.1 Loading and exploring a dataset
  • 2.2 Building and evaluating a model
  • 2.3 Using the model to make predictions
  • 2.4 Q&A: How do I adapt this workflow to a regression problem?
  • 2.5 Q&A: How do I adapt this workflow to a multiclass problem?
  • 2.6 Q&A: Why should I select a Series for the target?
  • 2.7 Q&A: How do I add the model's predictions to a DataFrame?
  • 2.8 Q&A: How do I determine the confidence level of each prediction?
  • 2.9 Q&A: How do I check the accuracy of the model's predictions?
  • 2.10 Q&A: What do the "solver" and "random_state" parameters do?
  • 2.11 Q&A: How do I show all of the model parameters?
  • 2.12 Q&A: Should I shuffle the samples when using cross-validation?
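
The workflow this chapter reviews can be sketched in a few lines. This is a minimal illustration using the built-in iris dataset as a stand-in (the course uses its own datasets), with the "solver" and "random_state" parameters mentioned in 2.10 set explicitly:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load and explore a dataset (iris stands in for the course's datasets)
X, y = load_iris(return_X_y=True)

# Build and evaluate a model with cross-validation
model = LogisticRegression(solver='lbfgs', random_state=1, max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Fit on all of the data, then use the model to make predictions
model.fit(X, y)
predictions = model.predict(X[:3])
```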

Chapter 3: Encoding categorical features

  • 3.1 Introduction to one-hot encoding
  • 3.2 Transformer methods: fit, transform, fit_transform
  • 3.3 One-hot encoding of multiple features
  • 3.4 Q&A: When should I use transform instead of fit_transform?
  • 3.5 Q&A: What happens if the testing data includes a new category?
  • 3.6 Q&A: Should I drop one of the one-hot encoded categories?
  • 3.7 Q&A: How do I encode an ordinal feature?
  • 3.8 Q&A: What's the difference between OrdinalEncoder and LabelEncoder?
  • 3.9 Q&A: Should I encode numeric features as ordinal features?

Chapter 4: Improving your workflow with ColumnTransformer and Pipeline

  • 4.1 Preprocessing features with ColumnTransformer
  • 4.2 Chaining steps with Pipeline
  • 4.3 Using the Pipeline to make predictions
  • 4.4 Q&A: How do I drop some columns and passthrough others?
  • 4.5 Q&A: How do I transform the unspecified columns?
  • 4.6 Q&A: How do I select columns from a NumPy array?
  • 4.7 Q&A: How do I select columns by data type?
  • 4.8 Q&A: How do I select columns by column name pattern?
  • 4.9 Q&A: Should I use ColumnTransformer or make_column_transformer?
  • 4.10 Q&A: Should I use Pipeline or make_pipeline?
  • 4.11 Q&A: How do I examine the steps of a Pipeline?
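
The ColumnTransformer and Pipeline pattern covered above can be sketched as follows, using hypothetical data and the `make_column_transformer`/`make_pipeline` helpers discussed in 4.9 and 4.10:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mixed-type data: one categorical and one numerical feature
X = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                  'Fare': [7.25, 71.28, 8.05, 53.10]})
y = [0, 1, 1, 1]

# One-hot encode 'Sex', and pass the remaining columns through unchanged
ct = make_column_transformer(
    (OneHotEncoder(), ['Sex']),
    remainder='passthrough')

# Chain the preprocessing and the model into a single Pipeline
pipe = make_pipeline(ct, LogisticRegression())
pipe.fit(X, y)
preds = pipe.predict(X)
```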

Chapter 5: Workflow review #1

  • 5.1 Recap of our workflow
  • 5.2 Comparing ColumnTransformer and Pipeline
  • 5.3 Creating a Pipeline diagram

Part 2

Chapter 6: Encoding text data

  • 6.1 Vectorizing text
  • 6.2 Including text data in the model
  • 6.3 Q&A: Why is the document-term matrix stored as a sparse matrix?
  • 6.4 Q&A: What happens if the testing data includes new words?
  • 6.5 Q&A: How do I vectorize multiple columns of text?
  • 6.6 Q&A: Should I one-hot encode or vectorize categorical features?

Chapter 7: Handling missing values

  • 7.1 Introduction to missing values
  • 7.2 Three ways to handle missing values
  • 7.3 Missing value imputation
  • 7.4 Using "missingness" as a feature
  • 7.5 Q&A: How do I perform multivariate imputation?
  • 7.6 Q&A: What are the best practices for missing value imputation?
  • 7.7 Q&A: What's the difference between ColumnTransformer and FeatureUnion?
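
Missing value imputation (7.3) can be sketched with SimpleImputer on a small hypothetical array; each missing value is replaced with its column's mean:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace missing values with the mean of each column
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
```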

Chapter 8: Fixing common workflow problems

  • 8.1 Two new problems
  • 8.2 Problem 1: Missing values in a categorical feature
  • 8.3 Problem 2: Missing values in the new data
  • 8.4 Q&A: How do I see the feature names output by the ColumnTransformer?
  • 8.5 Q&A: Why did we create a Pipeline inside of the ColumnTransformer?
  • 8.6 Q&A: Which imputation strategy should I use with categorical features?
  • 8.7 Q&A: Should I impute missing values before all other transformations?
  • 8.8 Q&A: What methods can I use with a Pipeline?

Chapter 9: Workflow review #2

  • 9.1 Recap of our workflow
  • 9.2 Comparing ColumnTransformer and Pipeline
  • 9.3 Why not use pandas for transformations?
  • 9.4 Preventing data leakage

Part 3

Chapter 10: Evaluating and tuning a Pipeline

  • 10.1 Evaluating a Pipeline with cross-validation
  • 10.2 Tuning a Pipeline with grid search
  • 10.3 Tuning the model
  • 10.4 Tuning the transformers
  • 10.5 Using the best Pipeline to make predictions
  • 10.6 Q&A: How do I save the best Pipeline for future use?
  • 10.7 Q&A: How do I speed up a grid search?
  • 10.8 Q&A: How do I tune a Pipeline with randomized search?
  • 10.9 Q&A: What's the target accuracy we are trying to achieve?
  • 10.10 Q&A: Is it okay that our model includes thousands of features?
  • 10.11 Q&A: How do I examine the coefficients of a Pipeline?
  • 10.12 Q&A: Should I split the dataset before tuning the Pipeline?
  • 10.13 Q&A: What is regularization?
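
Tuning a Pipeline with grid search (10.2) can be sketched like this; step names created by `make_pipeline` are the lowercased class names, which is why the parameter key uses the `logisticregression__` prefix:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Tune the model's regularization parameter C across the whole Pipeline
params = {'logisticregression__C': [0.1, 1, 10]}
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
```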

Chapter 11: Comparing linear and non-linear models

  • 11.1 Trying a random forest model
  • 11.2 Tuning random forests with randomized search
  • 11.3 Further tuning with grid search
  • 11.4 Q&A: How do I tune two models with a single grid search?
  • 11.5 Q&A: How do I tune two models with a single randomized search?

Chapter 12: Ensembling multiple models

  • 12.1 Introduction to ensembling
  • 12.2 Ensembling logistic regression and random forests
  • 12.3 Combining predicted probabilities
  • 12.4 Combining class predictions
  • 12.5 Choosing a voting strategy
  • 12.6 Tuning an ensemble with grid search
  • 12.7 Q&A: When should I use ensembling?
  • 12.8 Q&A: How do I apply different weights to the models in an ensemble?
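
Ensembling logistic regression and random forests (12.2) can be sketched with VotingClassifier; `voting='soft'` averages the models' predicted probabilities, as in 12.3:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Soft voting averages the predicted probabilities of both models
vc = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=1))],
    voting='soft')
vc.fit(X, y)
preds = vc.predict(X)
```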

Chapter 13: Feature selection

  • 13.1 Introduction to feature selection
  • 13.2 Intrinsic methods: L1 regularization
  • 13.3 Filter methods: Statistical test-based scoring
  • 13.4 Filter methods: Model-based scoring
  • 13.5 Filter methods: Summary
  • 13.6 Wrapper methods: Recursive feature elimination
  • 13.7 Q&A: How do I see which features were selected?
  • 13.8 Q&A: Are the selected features the "most important" features?
  • 13.9 Q&A: Is it okay for feature selection to remove one-hot encoded categories?
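
Recursive feature elimination (13.6) can be sketched on the iris dataset; `support_` answers the question in 13.7 about which features were selected:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively eliminate features until only the 2 strongest of 4 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_selected = rfe.fit_transform(X, y)
```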

Chapter 14: Feature standardization

  • 14.1 Standardizing numerical features
  • 14.2 Standardizing all features
  • 14.3 Q&A: How do I see what scaling was applied to each feature?
  • 14.4 Q&A: How do I turn off feature standardization within a grid search?
  • 14.5 Q&A: Which models benefit from standardization?
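
Standardizing numerical features (14.1) can be sketched with StandardScaler on hypothetical data; each column is rescaled to mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# After scaling, each column has mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```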

Chapter 15: Feature engineering with custom transformers

  • 15.1 Why not use pandas for feature engineering?
  • 15.2 Transformer 1: Rounding numerical values
  • 15.3 Transformer 2: Clipping numerical values
  • 15.4 Transformer 3: Extracting string values
  • 15.5 Rules for transformer functions
  • 15.6 Transformer 4: Combining two features
  • 15.7 Revising the transformers
  • 15.8 Q&A: How do I fix incorrect data types within a Pipeline?
  • 15.9 Q&A: How do I create features from datetime data?
  • 15.10 Q&A: How do I create feature interactions?
  • 15.11 Q&A: How do I save a Pipeline with custom transformers?
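
A custom transformer for clipping numerical values (15.3) can be sketched with FunctionTransformer; per the rules in 15.5, the transformer function accepts and returns a 2D array-like:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Transformer function: must accept and return a 2D array-like
def clip_values(X):
    return np.clip(X, a_min=0, a_max=10)

# Wrap the function so it can be used inside a Pipeline
clipper = FunctionTransformer(clip_values)
result = clipper.fit_transform(np.array([[-5.0], [3.0], [99.0]]))
```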

Chapter 16: Workflow review #3

  • 16.1 Recap of our workflow
  • 16.2 What's the role of pandas?

Part 4

Chapter 17: High-cardinality categorical features

  • 17.1 Recap of nominal and ordinal features
  • 17.2 Preparing the census dataset
  • 17.3 Setting up the encoders
  • 17.4 Encoding nominal features for a linear model
  • 17.5 Encoding nominal features for a non-linear model
  • 17.6 Combining the encodings
  • 17.7 Best practices for encoding

Chapter 18: Class imbalance

  • 18.1 Introduction to class imbalance
  • 18.2 Preparing the mammography dataset
  • 18.3 Evaluating a model with train/test split
  • 18.4 Exploring the results with a confusion matrix
  • 18.5 Calculating rates from a confusion matrix
  • 18.6 Using AUC as the evaluation metric
  • 18.7 Cost-sensitive learning
  • 18.8 Tuning the decision threshold
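
Cost-sensitive learning (18.7) and AUC evaluation (18.6) can be sketched with a hypothetical imbalanced dataset generated by `make_classification` (the course uses the mammography dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 95% negative, 5% positive
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1)

# Cost-sensitive learning: weight classes inversely to their frequency
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# AUC is computed from predicted probabilities, not class predictions
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```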

Chapter 19: Class imbalance walkthrough

  • 19.1 Best practices for class imbalance
  • 19.2 Step 1: Splitting the dataset
  • 19.3 Step 2: Optimizing the model on the training set
  • 19.4 Step 3: Evaluating the model on the testing set
  • 19.5 Step 4: Tuning the decision threshold
  • 19.6 Step 5: Retraining the model and making predictions
  • 19.7 Q&A: Should I use an ROC curve or a precision-recall curve?
  • 19.8 Q&A: Can I use a different metric such as F1 score?
  • 19.9 Q&A: Should I use resampling to fix class imbalance?

Chapter 20: Going further

  • 20.1 Q&A: How do I read the scikit-learn documentation?
  • 20.2 Q&A: How do I stay up-to-date with new scikit-learn features?
  • 20.3 Q&A: How do I improve my Machine Learning skills?
  • 20.4 Q&A: How do I learn Deep Learning?