Starting from a history of machine learning, we discuss why neural networks today perform so well in a variety of data science problems. We then discuss how to set up a supervised learning problem and find a good solution using gradient descent. This involves creating datasets that permit generalization; we talk about methods of doing so in a repeatable way that supports experimentation.
Objectives:
- Identify why deep learning is currently popular
- Optimize and evaluate models using loss functions and performance metrics
- Mitigate common problems that arise in machine learning
- Create repeatable and scalable training, evaluation, and test datasets
Labs and demos:
- Training Data Analyst
- Creating Repeatable Dataset Splits in BigQuery
- Exploring and Creating ML Datasets
- Why deep learning is currently popular
- A common problem is a lack of generalization: the model does well on training data but poorly on new data
- Qwiklabs
- Python notebooks in the Cloud
- Unsupervised and supervised learning are two main types of ML algorithms
- Supervised learning implies the data is already labeled
- Regression and Classification are supervised ML model types
- Use regression for predicting continuous label values
- Use classification for predicting categorical label values
- Inclusive ML
- Avoid creating or reinforcing unfair bias
- Short History of ML
- Linear regression
- Perceptron was a computational model of a neuron
- Neural networks combine layers of perceptrons; they are more powerful but harder to train
- Decision trees are easy for humans to interpret
- Support vector machines build maximum-margin decision boundaries; with kernels, those boundaries can be nonlinear
- SVMs maximize the margin between two classes
- Random forests, bagging, and boosting
- Deep neural networks
ML models are mathematical functions with parameters and hyper-parameters
- Linear models have two types of parameters: bias and weights
- When an analytical solution is no longer an option, you use gradient descent
- Quantify model performance using loss functions
- Mean Squared Error
- Root Mean Squared Error
- Use cross-entropy loss for classification problems
- Use loss functions as the basis for an algorithm called gradient descent
- Search for a minimum by descending the gradient
- Optimize gradient descent to be as efficient as possible
- Small step sizes can take a very long time to converge
- Large step sizes may never converge to the true minimum
- A correct and constant step size can be difficult to find
- Troubleshooting a Loss Curve
- Calculating the derivative on fewer data points
- Mini-batching reduces cost while preserving quality
- Checking loss with reduced frequency
- TensorFlow Playground
- Um, What Is a Neural Network?
- Non-linear data
- And more
- Hidden Layers
- Batching
- Loss curve troubleshooting
- Use performance metrics to make business decisions
- Advanced optimization techniques aim to shorten training time and keep models from getting stuck in local minima
- Data weighting, oversampling and synthetic data creation aim to remove inappropriate minima from the search space altogether
- Performance metrics change the way we think about the results of our search by aligning them more closely with what we actually care about
- Assess if your model is overfitting
- Does the model generalize to new data?
- Beware of overfitting as you increase model complexity
- Gauge when to stop model training
- Create repeatable training, evaluation, and test datasets
- Split the dataset and experiment with models
- Evaluate the final model with independent test data
- Evaluate the final model with cross-validation
- Sampling
- Establish performance benchmarks
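The loss-function and gradient-descent items above can be made concrete with a small sketch. It is a minimal illustration using NumPy, not code from the course: the function names, the synthetic data, and the default learning rate, batch size, and epoch count are all assumptions chosen for the example. It fits a one-feature linear model (one weight, one bias) by mini-batch gradient descent on mean squared error, and includes a binary cross-entropy helper for the classification case.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the label."""
    return float(np.sqrt(mse(y_true, y_pred)))


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy loss for labels in {0, 1} and predicted probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def fit_linear_sgd(x, y, learning_rate=0.1, batch_size=32, epochs=200, seed=0):
    """Fit y ~ weight * x + bias by mini-batch gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    weight, bias = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = weight * x[idx] + bias - y[idx]
            # Gradient of the batch MSE with respect to each parameter
            grad_w = 2.0 * np.mean(error * x[idx])
            grad_b = 2.0 * np.mean(error)
            # Step against the gradient; learning_rate is the step size
            weight -= learning_rate * grad_w
            bias -= learning_rate * grad_b
    return weight, bias


# Synthetic data, made up for illustration: y = 3x + 1 plus small noise.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1.0, 1.0, size=500)
y = 3.0 * x + 1.0 + data_rng.normal(0.0, 0.1, size=500)

weight, bias = fit_linear_sgd(x, y)
print("weight:", weight, "bias:", bias, "rmse:", rmse(y, weight * x + bias))
```

Changing `learning_rate` in this sketch shows the step-size trade-off from the list: a much smaller value needs many more epochs to reach the same loss, while a much larger one can overshoot and diverge.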
Lab: Creating Repeatable Dataset Splits in BigQuery
# Baseline split with RAND(): the train/eval assignment changes on every run,
# so this split is not repeatable.
train_and_eval_rand = """
#standardSQL
WITH
  alldata AS (
    SELECT
      IF (RAND() < 0.8, 'train', 'eval') AS dataset,
      arrival_delay,
      departure_delay
    FROM
      `bigquery-samples.airline_ontime_data.flights`
    WHERE
      departure_airport = 'DEN'
      AND arrival_airport = 'LAX' ),
  training AS (
    SELECT
      SAFE_DIVIDE( SUM(arrival_delay * departure_delay) , SUM(departure_delay * departure_delay)) AS alpha
    FROM
      alldata
    WHERE
      dataset = 'train' )
SELECT
  MAX(alpha) AS alpha,
  dataset,
  SQRT(AVG((arrival_delay - alpha * departure_delay)*(arrival_delay - alpha * departure_delay))) AS rmse,
  COUNT(arrival_delay) AS num_flights
FROM
  alldata,
  training
GROUP BY
  dataset
"""
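The query above assigns rows with RAND(), so a rerun produces a different split and different metrics. A common way to make the split repeatable is to hash a stable column and bucket on the hash value; BigQuery provides FARM_FINGERPRINT for this purpose. The sketch below illustrates the same idea in plain Python under stated assumptions: it uses `hashlib.md5` rather than FarmHash, the `date` key and the sample rows are hypothetical, and the helper names are invented for the example. It mirrors the query by fitting `alpha` on the training rows and reporting RMSE per split.

```python
import hashlib
import math
from datetime import date, timedelta


def assign_split(key, train_fraction=0.8):
    """Deterministically map a key to 'train' or 'eval' using a hash bucket."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "train" if bucket < train_fraction * 100 else "eval"


def fit_alpha(rows):
    """Least-squares slope through the origin: sum(x*y) / sum(x*x)."""
    num = sum(r["arrival_delay"] * r["departure_delay"] for r in rows)
    den = sum(r["departure_delay"] ** 2 for r in rows)
    return num / den if den else None


def split_rmse(rows, alpha):
    """RMSE of the prediction alpha * departure_delay on the given rows."""
    squared = [(r["arrival_delay"] - alpha * r["departure_delay"]) ** 2 for r in rows]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical flights, one per day; real rows would come from the table.
flights = []
for i in range(365):
    departure_delay = float(i % 30)
    flights.append({
        "date": (date(2008, 1, 1) + timedelta(days=i)).isoformat(),
        "departure_delay": departure_delay,
        "arrival_delay": 0.9 * departure_delay,
    })

# Hash on the date so every flight from the same day lands in the same split.
splits = {"train": [], "eval": []}
for row in flights:
    splits[assign_split(row["date"])].append(row)

alpha = fit_alpha(splits["train"])
for name, rows in splits.items():
    if rows and alpha is not None:
        print(name, len(rows), split_rmse(rows, alpha))
```

Because the assignment depends only on the key, rerunning the script (or adding new rows later) never moves an existing row from one split to the other.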
Lab: Exploring and Creating ML Datasets
import numpy as np
from google.cloud import bigquery


def to_csv(df, filename):
    outdf = df.copy(deep=False)
    outdf.loc[:, 'key'] = np.arange(0, len(outdf))  # row number as key
    # Reorder columns so that the target is the first column
    cols = outdf.columns.tolist()
    cols.remove('fare_amount')
    cols.insert(0, 'fare_amount')
    print(cols)  # new order of columns
    outdf = outdf[cols]
    # Write data rows only: no header line and no index column
    outdf.to_csv(filename, header=False, index_label=False, index=False)
    print("Wrote {} to {}".format(len(outdf), filename))


# create_query is not defined in this excerpt; it is assumed to return
# the SQL text for the given phase.
for phase in ['train', 'valid', 'test']:
    query = create_query(phase, 100000)
    df = bigquery.Client().query(query).to_dataframe()
    to_csv(df, 'taxi-{}.csv'.format(phase))