@justmarkham
Last active June 23, 2023 14:09
Data School course: "Master Machine Learning with scikit-learn" (launching in late 2023!)

Master Machine Learning with scikit-learn

Part 1

Chapter 1: Introduction

  • 1.1 Course overview
  • 1.2 scikit-learn vs Deep Learning
  • 1.3 Prerequisite skills
  • 1.4 Course setup and software versions
  • 1.5 Course outline
  • 1.6 Course datasets
  • 1.7 Meet your instructor

Chapter 2: Review of the Machine Learning workflow

  • 2.1 Loading and exploring a dataset
  • 2.2 Building and evaluating a model
  • 2.3 Using the model to make predictions
  • 2.4 Q&A: How do I adapt this workflow to a regression problem?
  • 2.5 Q&A: How do I adapt this workflow to a multiclass problem?
  • 2.6 Q&A: Why should I select a Series for the target?
  • 2.7 Q&A: How do I add the model's predictions to a DataFrame?
  • 2.8 Q&A: How do I determine the confidence level of each prediction?
  • 2.9 Q&A: How do I check the accuracy of the model's predictions?
  • 2.10 Q&A: What do the "solver" and "random_state" parameters do?
  • 2.11 Q&A: How do I show all of the model parameters?
  • 2.12 Q&A: Should I shuffle the samples when using cross-validation?
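
The workflow this chapter reviews can be sketched in a few lines. This is a minimal illustration using the built-in iris dataset as a stand-in (the course uses its own datasets), with the "solver" and "random_state" parameters mentioned in 2.10 set explicitly:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load and explore a dataset (iris stands in for the course's datasets)
X, y = load_iris(return_X_y=True)

# Build and evaluate a model with cross-validation
model = LogisticRegression(solver='lbfgs', random_state=1, max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Fit on all of the data, then use the model to make predictions
model.fit(X, y)
predictions = model.predict(X[:3])
```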

Chapter 3: Encoding categorical features

  • 3.1 Introduction to one-hot encoding
  • 3.2 Transformer methods: fit, transform, fit_transform
  • 3.3 One-hot encoding of multiple features
  • 3.4 Q&A: When should I use transform instead of fit_transform?
  • 3.5 Q&A: What happens if the testing data includes a new category?
  • 3.6 Q&A: Should I drop one of the one-hot encoded categories?
  • 3.7 Q&A: How do I encode an ordinal feature?
  • 3.8 Q&A: What's the difference between OrdinalEncoder and LabelEncoder?
  • 3.9 Q&A: Should I encode numeric features as ordinal features?

Chapter 4: Improving your workflow with ColumnTransformer and Pipeline

  • 4.1 Preprocessing features with ColumnTransformer
  • 4.2 Chaining steps with Pipeline
  • 4.3 Using the Pipeline to make predictions
  • 4.4 Q&A: How do I drop some columns and passthrough others?
  • 4.5 Q&A: How do I transform the unspecified columns?
  • 4.6 Q&A: How do I select columns from a NumPy array?
  • 4.7 Q&A: How do I select columns by data type?
  • 4.8 Q&A: How do I select columns by column name pattern?
  • 4.9 Q&A: Should I use ColumnTransformer or make_column_transformer?
  • 4.10 Q&A: Should I use Pipeline or make_pipeline?
  • 4.11 Q&A: How do I examine the steps of a Pipeline?
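
The ColumnTransformer and Pipeline pattern covered above can be sketched as follows, using hypothetical data and the `make_column_transformer`/`make_pipeline` helpers discussed in 4.9 and 4.10:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mixed-type data: one categorical and one numerical feature
X = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                  'Fare': [7.25, 71.28, 8.05, 53.10]})
y = [0, 1, 1, 1]

# One-hot encode 'Sex', and pass the remaining columns through unchanged
ct = make_column_transformer(
    (OneHotEncoder(), ['Sex']),
    remainder='passthrough')

# Chain the preprocessing and the model into a single Pipeline
pipe = make_pipeline(ct, LogisticRegression())
pipe.fit(X, y)
preds = pipe.predict(X)
```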

Chapter 5: Workflow review #1

  • 5.1 Recap of our workflow
  • 5.2 Comparing ColumnTransformer and Pipeline
  • 5.3 Creating a Pipeline diagram

Part 2

Chapter 6: Encoding text data

  • 6.1 Vectorizing text
  • 6.2 Including text data in the model
  • 6.3 Q&A: Why is the document-term matrix stored as a sparse matrix?
  • 6.4 Q&A: What happens if the testing data includes new words?
  • 6.5 Q&A: How do I vectorize multiple columns of text?
  • 6.6 Q&A: Should I one-hot encode or vectorize categorical features?

Chapter 7: Handling missing values

  • 7.1 Introduction to missing values
  • 7.2 Three ways to handle missing values
  • 7.3 Missing value imputation
  • 7.4 Using "missingness" as a feature
  • 7.5 Q&A: How do I perform multivariate imputation?
  • 7.6 Q&A: What are the best practices for missing value imputation?
  • 7.7 Q&A: What's the difference between ColumnTransformer and FeatureUnion?
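
Missing value imputation (7.3) can be sketched with SimpleImputer on a small hypothetical array; each missing value is replaced with its column's mean:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace missing values with the mean of each column
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
```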

Chapter 8: Fixing common workflow problems

  • 8.1 Two new problems
  • 8.2 Problem 1: Missing values in a categorical feature
  • 8.3 Problem 2: Missing values in the new data
  • 8.4 Q&A: How do I see the feature names output by the ColumnTransformer?
  • 8.5 Q&A: Why did we create a Pipeline inside of the ColumnTransformer?
  • 8.6 Q&A: Which imputation strategy should I use with categorical features?
  • 8.7 Q&A: Should I impute missing values before all other transformations?
  • 8.8 Q&A: What methods can I use with a Pipeline?

Chapter 9: Workflow review #2

  • 9.1 Recap of our workflow
  • 9.2 Comparing ColumnTransformer and Pipeline
  • 9.3 Why not use pandas for transformations?
  • 9.4 Preventing data leakage

Part 3

Chapter 10: Evaluating and tuning a Pipeline

  • 10.1 Evaluating a Pipeline with cross-validation
  • 10.2 Tuning a Pipeline with grid search
  • 10.3 Tuning the model
  • 10.4 Tuning the transformers
  • 10.5 Using the best Pipeline to make predictions
  • 10.6 Q&A: How do I save the best Pipeline for future use?
  • 10.7 Q&A: How do I speed up a grid search?
  • 10.8 Q&A: How do I tune a Pipeline with randomized search?
  • 10.9 Q&A: What's the target accuracy we are trying to achieve?
  • 10.10 Q&A: Is it okay that our model includes thousands of features?
  • 10.11 Q&A: How do I examine the coefficients of a Pipeline?
  • 10.12 Q&A: Should I split the dataset before tuning the Pipeline?
  • 10.13 Q&A: What is regularization?
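
Tuning a Pipeline with grid search (10.2) can be sketched like this; step names created by `make_pipeline` are the lowercased class names, which is why the parameter key uses the `logisticregression__` prefix:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Tune the model's regularization parameter C across the whole Pipeline
params = {'logisticregression__C': [0.1, 1, 10]}
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
```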

Chapter 11: Comparing linear and non-linear models

  • 11.1 Trying a random forest model
  • 11.2 Tuning random forests with randomized search
  • 11.3 Further tuning with grid search
  • 11.4 Q&A: How do I tune two models with a single grid search?
  • 11.5 Q&A: How do I tune two models with a single randomized search?

Chapter 12: Ensembling multiple models

  • 12.1 Introduction to ensembling
  • 12.2 Ensembling logistic regression and random forests
  • 12.3 Combining predicted probabilities
  • 12.4 Combining class predictions
  • 12.5 Choosing a voting strategy
  • 12.6 Tuning an ensemble with grid search
  • 12.7 Q&A: When should I use ensembling?
  • 12.8 Q&A: How do I apply different weights to the models in an ensemble?
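
Ensembling logistic regression and random forests (12.2) can be sketched with VotingClassifier; `voting='soft'` averages the models' predicted probabilities, as in 12.3:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Soft voting averages the predicted probabilities of both models
vc = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=1))],
    voting='soft')
vc.fit(X, y)
preds = vc.predict(X)
```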

Chapter 13: Feature selection

  • 13.1 Introduction to feature selection
  • 13.2 Intrinsic methods: L1 regularization
  • 13.3 Filter methods: Statistical test-based scoring
  • 13.4 Filter methods: Model-based scoring
  • 13.5 Filter methods: Summary
  • 13.6 Wrapper methods: Recursive feature elimination
  • 13.7 Q&A: How do I see which features were selected?
  • 13.8 Q&A: Are the selected features the "most important" features?
  • 13.9 Q&A: Is it okay for feature selection to remove one-hot encoded categories?
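
Recursive feature elimination (13.6) can be sketched on the iris dataset; `support_` answers the question in 13.7 about which features were selected:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively eliminate features until only the 2 strongest of 4 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_selected = rfe.fit_transform(X, y)
```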

Chapter 14: Feature standardization

  • 14.1 Standardizing numerical features
  • 14.2 Standardizing all features
  • 14.3 Q&A: How do I see what scaling was applied to each feature?
  • 14.4 Q&A: How do I turn off feature standardization within a grid search?
  • 14.5 Q&A: Which models benefit from standardization?
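
Standardizing numerical features (14.1) can be sketched with StandardScaler on hypothetical data; each column is rescaled to mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# After scaling, each column has mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```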

Chapter 15: Feature engineering with custom transformers

  • 15.1 Why not use pandas for feature engineering?
  • 15.2 Transformer 1: Rounding numerical values
  • 15.3 Transformer 2: Clipping numerical values
  • 15.4 Transformer 3: Extracting string values
  • 15.5 Rules for transformer functions
  • 15.6 Transformer 4: Combining two features
  • 15.7 Revising the transformers
  • 15.8 Q&A: How do I fix incorrect data types within a Pipeline?
  • 15.9 Q&A: How do I create features from datetime data?
  • 15.10 Q&A: How do I create feature interactions?
  • 15.11 Q&A: How do I save a Pipeline with custom transformers?
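
A custom transformer for clipping numerical values (15.3) can be sketched with FunctionTransformer; per the rules in 15.5, the transformer function accepts and returns a 2D array-like:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Transformer function: must accept and return a 2D array-like
def clip_values(X):
    return np.clip(X, a_min=0, a_max=10)

# Wrap the function so it can be used inside a Pipeline
clipper = FunctionTransformer(clip_values)
result = clipper.fit_transform(np.array([[-5.0], [3.0], [99.0]]))
```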

Chapter 16: Workflow review #3

  • 16.1 Recap of our workflow
  • 16.2 What's the role of pandas?

Part 4

Chapter 17: High-cardinality categorical features

  • 17.1 Recap of nominal and ordinal features
  • 17.2 Preparing the census dataset
  • 17.3 Setting up the encoders
  • 17.4 Encoding nominal features for a linear model
  • 17.5 Encoding nominal features for a non-linear model
  • 17.6 Combining the encodings
  • 17.7 Best practices for encoding

Chapter 18: Class imbalance

  • 18.1 Introduction to class imbalance
  • 18.2 Preparing the mammography dataset
  • 18.3 Evaluating a model with train/test split
  • 18.4 Exploring the results with a confusion matrix
  • 18.5 Calculating rates from a confusion matrix
  • 18.6 Using AUC as the evaluation metric
  • 18.7 Cost-sensitive learning
  • 18.8 Tuning the decision threshold
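
Cost-sensitive learning (18.7) and AUC evaluation (18.6) can be sketched with a hypothetical imbalanced dataset generated by `make_classification` (the course uses the mammography dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 95% negative, 5% positive
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1)

# Cost-sensitive learning: weight classes inversely to their frequency
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# AUC is computed from predicted probabilities, not class predictions
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```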

Chapter 19: Class imbalance walkthrough

  • 19.1 Best practices for class imbalance
  • 19.2 Step 1: Splitting the dataset
  • 19.3 Step 2: Optimizing the model on the training set
  • 19.4 Step 3: Evaluating the model on the testing set
  • 19.5 Step 4: Tuning the decision threshold
  • 19.6 Step 5: Retraining the model and making predictions
  • 19.7 Q&A: Should I use an ROC curve or a precision-recall curve?
  • 19.8 Q&A: Can I use a different metric such as F1 score?
  • 19.9 Q&A: Should I use resampling to fix class imbalance?

Chapter 20: Going further

  • 20.1 Q&A: How do I read the scikit-learn documentation?
  • 20.2 Q&A: How do I stay up-to-date with new scikit-learn features?
  • 20.3 Q&A: How do I improve my Machine Learning skills?
  • 20.4 Q&A: How do I learn Deep Learning?