misho-kr/Launching into Machine Learning.md

## Launching into Machine Learning.md

      
          
    Launching into Machine Learning

Starting from a history of machine learning, we discuss why neural networks today perform so well in a variety of data science problems. We then discuss how to set up a supervised learning problem and find a good solution using gradient descent. This involves creating datasets that permit generalization; we talk about methods of doing so in a repeatable way that supports experimentation.
Objectives:

Identify why deep learning is currently popular
Optimize and evaluate models using loss functions and performance metrics
Mitigate common problems that arise in machine learning
Create repeatable and scalable training, evaluation, and test datasets

TensorFlow Playground
Labs and demos:

Training Data Analyst
Creating Repeatable Dataset Splits in BigQuery
Exploring and Creating ML Datasets

Introduction


Why deep learning is currently popular
A common problem is a lack of generalization: the model does well on training data but poorly on new data
Qwiklabs

Practical ML


Python notebooks in the Cloud
Supervised and unsupervised learning are two main types of ML algorithms

Supervised learning implies the data is already labeled
Regression and Classification are supervised ML model types

Use regression for predicting continuous label values
Use classification for predicting categorical label values


Inclusive ML

Avoid creating or reinforcing unfair bias


Short History of ML

Linear regression
Perceptron was a computational model of a neuron
Neural networks combine layers of perceptrons; they are more powerful but harder to train
Decision trees are easy for humans to interpret
Support vector machines build maximum-margin decision boundaries; with kernels they become nonlinear models that separate classes in a higher-dimensional space

SVMs maximize the margin between two classes


Random forests, bagging, and boosting
Deep neural networks


Optimization

ML models are mathematical functions with parameters and hyperparameters


Linear models have two types of parameters: bias and weight

When an analytical solution is no longer an option, you use gradient descent
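
For a single-feature linear model, the analytical solution mentioned above can be written in closed form. A minimal sketch in plain Python; the function name `fit_line` and the toy data are illustrative assumptions, not part of the course material.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y ~ weight * x + bias for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # weight = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    weight = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y)
    bias = mean_y - weight * mean_x
    return weight, bias


# Toy data generated from y = 2x + 1 (hypothetical values)
weight, bias = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With many features or a non-linear model, this kind of direct formula is either expensive or unavailable, which is where gradient descent comes in.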


Quantify model performance using loss functions

Mean Squared Error
Root Mean Squared Error
Use cross-entropy loss for classification problems
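
The three loss functions listed above can be sketched in a few lines of plain Python. The function names and the `eps` clipping value are assumptions made for illustration; the cross-entropy shown is the binary form, with labels in {0, 1} and predictions given as probabilities.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: same units as the label."""
    return math.sqrt(mse(y_true, y_pred))


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average cross-entropy for binary labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        # Clip probabilities away from 0 and 1 so log() stays finite
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)
```

For example, `rmse([0, 0], [3, 3])` is 3.0, and a classifier that always predicts 0.5 has a cross-entropy of ln 2 regardless of the labels.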


Use loss functions as the basis for an algorithm called gradient descent

Search for a minimum by descending the gradient
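
A minimal sketch of descending the gradient on a one-parameter loss. The loss f(w) = (w - 3)^2, the starting point, and the names `gradient_descent` and `grad_f` are all assumptions chosen to keep the example small.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient, returning every visited point."""
    w = start
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        path.append(w)
    return path


# Example loss: f(w) = (w - 3)^2, with gradient 2 * (w - 3) and minimum at w = 3
def grad_f(w):
    return 2.0 * (w - 3.0)


path = gradient_descent(grad_f, start=0.0, learning_rate=0.1, steps=100)
final_w = path[-1]
```

For this particular function, rerunning with a much smaller `learning_rate` (for example 0.001) leaves `final_w` well short of 3 after the same number of steps, while a value above 1.0 makes each step overshoot further than the last.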


Optimize gradient descent to be as efficient as possible

Small step sizes can take a very long time to converge
Large step sizes may never converge to the true minimum
A correct and constant step size can be difficult to find
Troubleshooting a Loss Curve
Calculating the derivative on fewer data points

Mini-batching reduces cost while preserving quality
Checking loss with reduced frequency
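
The mini-batching idea above can be sketched as follows: each update estimates the gradient from a small random subset of the data rather than the full dataset. This is a plain-Python illustration under assumed values (`learning_rate`, `batch_size`, `epochs`, and the toy data are all hypothetical), not the course's implementation.

```python
import random


def minibatch_sgd(xs, ys, learning_rate=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ weight * x + bias with mini-batch gradient descent on MSE."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    weight, bias = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of MSE estimated from this batch only
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                error = (weight * xs[i] + bias) - ys[i]
                grad_w += 2.0 * error * xs[i]
                grad_b += 2.0 * error
            weight -= learning_rate * grad_w / len(batch)
            bias -= learning_rate * grad_b / len(batch)
    return weight, bias


xs = [i / 10.0 for i in range(20)]  # 0.0 .. 1.9
ys = [2.0 * x + 1.0 for x in xs]    # noiseless y = 2x + 1
weight, bias = minibatch_sgd(xs, ys)
```

Each update here touches 4 points instead of 20, so it is cheaper, at the price of a noisier gradient estimate.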


TensorFlow Playground

Um, What Is a Neural Network?
Non-linear data
And more
Hidden Layers
Batching
Loss curve troubleshooting


Use performance metrics to make business decisions

Advanced optimization techniques aim to improve training time and help models avoid getting stuck in local minima
Data weighting, oversampling and synthetic data creation aim to remove inappropriate minima from the search space altogether
Performance metrics change the way we think about the results of our search by aligning them more closely with what we actually care about
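
As an illustration of a performance metric that differs from the training loss, the sketch below computes precision and recall for a binary classifier. These notes do not name specific metrics, so the choice of precision and recall, the function name, and the toy labels are assumptions.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or present
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # hypothetical model decisions
precision, recall = precision_recall(y_true, y_pred)
```

Which of the two matters more is a business decision: precision penalizes false alarms, recall penalizes misses.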


Generalization and Sampling


Assess if your model is overfitting

Does the model generalize to new data?
Beware of overfitting as you increase model complexity
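
One common way to act on the overfitting check above is to track loss on held-out validation data and stop once it no longer improves. The sketch below assumes per-epoch losses are already recorded; the `patience` parameter, the function name, and the loss values are hypothetical.

```python
def best_epoch(val_losses, patience=2):
    """Return the epoch with the lowest validation loss, scanning until the
    loss has failed to improve for `patience` consecutive epochs."""
    best_index = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_index = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_index


# Training loss keeps falling while validation loss turns upward: overfitting
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12]
val_losses = [1.05, 0.80, 0.65, 0.60, 0.63, 0.70, 0.82]
gaps = [v - t for t, v in zip(train_losses, val_losses)]
stop_at = best_epoch(val_losses)
```

A widening gap between validation and training loss is the signal that added complexity or extra epochs are fitting noise rather than structure.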


Gauge when to stop model training
Create repeatable training, evaluation, and test datasets

Split the dataset and experiment with models
Evaluate the final model with independent test data
Evaluate the final model with cross-validation
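
Cross-validation, mentioned above, rotates which slice of the data is held out so every example is used for validation exactly once. A minimal sketch of the index bookkeeping; the function name `k_fold_indices` is an assumption, and a real workflow would usually shuffle the indices first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Averaging the validation metric across folds gives a steadier estimate than a single split, which helps most when the dataset is small.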


Sampling
Establish performance benchmarks
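
The "repeatable" requirement above is usually met by hashing a stable column and bucketing on the hash, instead of drawing a fresh random number per row. In BigQuery the same idea is commonly written with `FARM_FINGERPRINT` and `MOD`. The sketch below uses `hashlib.md5` as a stand-in so it runs locally; the function name, the 80/20 bucket sizes, and the date keys are assumptions.

```python
import hashlib


def assign_split(key, train_buckets=8, total_buckets=10):
    """Deterministically map a key (for example a date string) to 'train' or 'eval'."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total_buckets
    return "train" if bucket < train_buckets else "eval"


# Hypothetical keys: one per day in a month
dates = ["2008-05-{:02d}".format(day) for day in range(1, 31)]
first_run = [assign_split(d) for d in dates]
second_run = [assign_split(d) for d in dates]  # identical to first_run
```

Because the assignment depends only on the key, rerunning an experiment sees exactly the same training and evaluation rows, and all rows sharing a key land in the same split.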

Lab: Creating Repeatable Dataset Splits in BigQuery
train_and_eval_rand = """
#standardSQL
WITH
  alldata AS (
  SELECT
    IF (RAND() < 0.8,
      'train',
      'eval') AS dataset,
    arrival_delay,
    departure_delay
  FROM
    `bigquery-samples.airline_ontime_data.flights`
  WHERE
    departure_airport = 'DEN'
    AND arrival_airport = 'LAX' ),
  training AS (
  SELECT
    SAFE_DIVIDE( SUM(arrival_delay * departure_delay) , SUM(departure_delay * departure_delay)) AS alpha
  FROM
    alldata
  WHERE
    dataset = 'train' )
SELECT
  MAX(alpha) AS alpha,
  dataset,
  SQRT(AVG((arrival_delay - alpha * departure_delay)*(arrival_delay - alpha * departure_delay))) AS rmse,
  COUNT(arrival_delay) AS num_flights
FROM
  alldata,
  training
GROUP BY
  dataset
"""
Lab: Exploring and Creating ML Datasets
import numpy as np
from google.cloud import bigquery

def to_csv(df, filename):
  # Work on a copy of the frame so the extra 'key' column is added to the copy
  outdf = df.copy(deep = False)
  outdf.loc[:, 'key'] = np.arange(0, len(outdf)) # rownumber as key
  # Reorder columns so that target is first column
  cols = outdf.columns.tolist()
  cols.remove('fare_amount')
  cols.insert(0, 'fare_amount')
  print(cols)  # new order of columns
  outdf = outdf[cols]
  # Write data rows only: no header line and no index column
  outdf.to_csv(filename, header = False, index_label = False, index = False)
  print("Wrote {} to {}".format(len(outdf), filename))

# create_query is a helper from the lab notebook that is not shown in these notes
for phase in ['train', 'valid', 'test']:
  query = create_query(phase, 100000)
  df = bigquery.Client().query(query).to_dataframe()
  to_csv(df, 'taxi-{}.csv'.format(phase))
Summary