Summary of "Launching into Machine Learning" from Coursera.Org

Starting from a history of machine learning, we discuss why neural networks today perform so well in a variety of data science problems. We then discuss how to set up a supervised learning problem and find a good solution using gradient descent. This involves creating datasets that permit generalization; we talk about methods of doing so in a repeatable way that supports experimentation.

Objectives:

  • Identify why deep learning is currently popular
  • Optimize and evaluate models using loss functions and performance metrics
  • Mitigate common problems that arise in machine learning
  • Create repeatable and scalable training, evaluation, and test datasets

Labs and demos:

  • TensorFlow Playground

Introduction

  • Why deep learning is currently popular
  • A common problem is a lack of generalization
  • Qwiklabs

Practical ML

  • Python notebooks in the Cloud
  • Unsupervised and supervised learning are the two types of ML algorithms
    • Supervised learning implies the data is already labeled
    • Regression and Classification are supervised ML model types (a sketch follows this list)
      • Use regression for predicting continuous label values
      • Use classification for predicting categorical label values
  • Inclusive ML
    • Avoid creating or reinforcing unfair bias
  • Short History of ML
    • Linear regression
    • Perceptron was a computational model of a neuron
    • Neural networks combine layers of perceptrons; they are more powerful but harder to train
    • Decision trees are easy for humans to interpret
    • Support vector machines are nonlinear models that build maximum-margin boundaries in high-dimensional space
      • SVMs maximize the margin between the two classes
    • Random forests, bagging, and boosting
    • Deep neural networks
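
A minimal sketch of the regression-versus-classification distinction referenced above, using scikit-learn as an illustrative stand-in for the course's TensorFlow models (the data here is made up):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one input feature

# Regression: continuous label (e.g. a price)
y_cont = np.array([1.1, 1.9, 3.2, 3.9])
print(LinearRegression().fit(X, y_cont).predict([[2.5]]))

# Classification: categorical label (e.g. spam / not spam)
y_cat = np.array([0, 0, 1, 1])
print(LogisticRegression().fit(X, y_cat).predict([[2.5]]))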

Optimization

ML models are mathematical functions with parameters and hyper-parameters
  • Linear models have two types of parameters: Bias and weight
    • When an analytical solution is no longer an option, you use gradient descent
  • Quantify model performance using loss functions (see the loss-function sketch after this list)
    • Mean Squared Error
    • Root Mean Squared Error
    • Use cross-entropy loss for classification problems
  • Use loss functions as the basis for an algorithm called gradient descent
    • Search for a minimum by descending the gradient (see the mini-batch sketch after this list)
  • Optimize gradient descent to be as efficient as possible
    • Small step sizes can take a very long time to converge
    • Large step sizes may never converge to the true minimum
    • A correct and constant step size can be difficult to find
    • Troubleshooting a Loss Curve
    • Calculating the derivative on fewer data points
      • Mini-batching reduces cost while preserving quality
    • Checking loss with reduced frequency
  • TensorFlow Playground
  • Use performance metrics to make business decisions
    • Advanced optimization techniques aim to improve training time and keep models from getting trapped in local minima
    • Data weighting, oversampling and synthetic data creation aim to remove inappropriate minima from the search space altogether
    • Performance metrics change the way we think about the results of our search by aligning them more closely with what we actually care about
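
The loss functions named above are simple to state in code. A minimal NumPy sketch (the function names are ours, not from the course labs):

import numpy as np

def mse(y_true, y_pred):
  # Mean Squared Error: the average squared residual
  return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
  # Root Mean Squared Error: same units as the label
  return np.sqrt(mse(y_true, y_pred))

def cross_entropy(y_true, p_pred, eps=1e-12):
  # Binary cross-entropy for labels in {0, 1} and predicted probabilities
  p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
  return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))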
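
And a sketch of mini-batch gradient descent for a one-feature linear model under MSE loss; the learning rate and batch size are exactly the hyper-parameters whose trade-offs are listed above. This is illustrative, not the course's TensorFlow implementation:

import numpy as np

def fit_linear(x, y, lr=0.01, batch_size=32, epochs=100, seed=0):
  rng = np.random.default_rng(seed)
  w, b = 0.0, 0.0                     # the weight and bias parameters
  for _ in range(epochs):
    order = rng.permutation(len(x))   # reshuffle examples each epoch
    for start in range(0, len(x), batch_size):
      batch = order[start:start + batch_size]
      err = (w * x[batch] + b) - y[batch]  # residuals on the mini-batch
      # MSE gradients: dL/dw = 2*mean(err*x), dL/db = 2*mean(err)
      w -= lr * 2 * np.mean(err * x[batch])
      b -= lr * 2 * np.mean(err)
  return w, b

x = np.linspace(0, 1, 200)
y = 3 * x + 1 + np.random.default_rng(1).normal(0, 0.1, 200)
w, b = fit_linear(x, y)  # should recover w ≈ 3, b ≈ 1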

Generalization and Sampling

  • Assess if your model is overfitting
    • Does the model generalize to new data?
    • Beware of overfitting as you increase model complexity
  • Gauge when to stop model training
  • Create repeatable training, evaluation, and test datasets (a hash-based sketch follows this list)
    • Split the dataset and experiment with models
    • Evaluate the final model with independent test data
    • Alternatively, evaluate with cross-validation when data is too limited to hold out a test set
  • Sampling
  • Establish performance benchmarks
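
One way to make a split repeatable, sketched here with pandas: hash a stable key column into buckets instead of drawing random numbers, so a given row lands in the same split on every run. The column name `key` and the 80/10/10 fractions below are assumptions for illustration:

import hashlib
import pandas as pd

def assign_split(key):
  # Hash the key into a stable bucket in [0, 10); same key -> same bucket on every run
  bucket = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % 10
  if bucket < 8:
    return 'train'  # 80%
  if bucket == 8:
    return 'valid'  # 10%
  return 'test'     # 10%

df = pd.DataFrame({'key': range(1000), 'label': 0.0})
df['dataset'] = df['key'].map(assign_split)
print(df['dataset'].value_counts())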

Lab: Creating Repeatable Dataset Splits in BigQuery

train_and_eval_rand = """
#standardSQL
WITH
  alldata AS (
  SELECT
    IF (RAND() < 0.8,
      'train',
      'eval') AS dataset,
    arrival_delay,
    departure_delay
  FROM
    `bigquery-samples.airline_ontime_data.flights`
  WHERE
    departure_airport = 'DEN'
    AND arrival_airport = 'LAX' ),
  training AS (
  SELECT
    SAFE_DIVIDE( SUM(arrival_delay * departure_delay) , SUM(departure_delay * departure_delay)) AS alpha
  FROM
    alldata
  WHERE
    dataset = 'train' )
SELECT
  MAX(alpha) AS alpha,
  dataset,
  SQRT(AVG((arrival_delay - alpha * departure_delay)*(arrival_delay - alpha * departure_delay))) AS rmse,
  COUNT(arrival_delay) AS num_flights
FROM
  alldata,
  training
GROUP BY
  dataset
"""

Lab: Exploring and Creating ML Datasets

import numpy as np
from google.cloud import bigquery

def to_csv(df, filename):
  outdf = df.copy(deep=False)
  outdf.loc[:, 'key'] = np.arange(0, len(outdf))  # row number as key
  # Reorder columns so that the target is the first column
  cols = outdf.columns.tolist()
  cols.remove('fare_amount')
  cols.insert(0, 'fare_amount')
  print(cols)  # new order of columns
  outdf = outdf[cols]
  outdf.to_csv(filename, header=False, index=False)
  print("Wrote {} to {}".format(len(outdf), filename))

for phase in ['train', 'valid', 'test']:
  query = create_query(phase, 100000)  # create_query is defined earlier in the lab notebook
  df = bigquery.Client().query(query).to_dataframe()
  to_csv(df, 'taxi-{}.csv'.format(phase))
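
To establish a performance benchmark against these files, the simplest possible model predicts a constant; a sketch that predicts the mean training fare, which any real model should beat:

import numpy as np
import pandas as pd

# to_csv() above writes fare_amount as the first column, with no header row
train = pd.read_csv('taxi-train.csv', header=None)
valid = pd.read_csv('taxi-valid.csv', header=None)

mean_fare = train[0].mean()  # column 0 is fare_amount
rmse = np.sqrt(np.mean((valid[0] - mean_fare) ** 2))
print('benchmark RMSE: {:.2f}'.format(rmse))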

Summary
