Amazon Machine Learning Spike

AML Spike (Daniel Deng, 20160726 - 20160728)

Target Audience

This document records a spike to explore Amazon Machine Learning. It assumes that the reader has some basic knowledge of:

Machine Learning in general

General steps involved to train a model

  • Get enough samples from different data sources
  • Normalise the data and turn it into a proper dataset
  • (For supervised learning) Specify an algorithm, e.g. Linear Regression
  • Split the dataset into two parts, e.g. 80% vs 20%: one for training and the other for validation
  • Train the model on the training set and validate it against the validation set
  • Use the model to predict results with new input parameters

In most cases, the most time-consuming part is collecting and normalising the data.
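
To make these steps concrete, here is a minimal sketch using Python and scikit-learn, assuming an already-collected cars.csv whose columns are hypothetical:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Load the (already collected and normalised) dataset; cars.csv is hypothetical
    data = pd.read_csv("cars.csv")
    X = pd.get_dummies(data.drop(columns=["price"]))  # one-hot encode categorical features
    y = data["price"]

    # Split the dataset into two parts (80% vs 20%) for training and validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

    # Specify an algorithm (Linear Regression) and train the model
    model = LinearRegression().fit(X_train, y_train)

    # Use the model to predict results with new input parameters
    predictions = model.predict(X_val)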

Measuring the quality of a trained model

  • For models that predict a category, quality is measured by the percentage of correct predictions (accuracy)
  • For models that predict a number, quality is measured by the RMSE (root-mean-square error)
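
Both measures are trivial to compute by hand, e.g. with numpy (a sketch with hypothetical values):

    import numpy as np

    # For numeric predictions: RMSE (root-mean-square error)
    y_true = np.array([22000, 18500, 31000])   # hypothetical actual prices
    y_pred = np.array([21000, 19000, 29500])   # hypothetical model predictions
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    # For categorical predictions: percentage of correct predictions
    labels_true = np.array(["Manual", "Auto", "Auto"])
    labels_pred = np.array(["Manual", "Auto", "Manual"])
    accuracy = np.mean(labels_true == labels_pred) * 100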

Amazon Machine Learning (AML) Specific

Steps involved

  • Collecting and normalising the dataset is done outside AML
  • Upload your dataset to S3, Redshift or RDS
  • Create a datasource in AML from one of the above sources
  • Create and train a model in AML with the datasource you created
  • Use the console to run some trial predictions
  • Optionally bind the model to an HTTP endpoint if you are happy with it
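
Once a model is bound to a real-time endpoint, a trial prediction can also be made programmatically. A rough boto3 sketch, where the model ID and endpoint URL are hypothetical:

    import boto3

    client = boto3.client("machinelearning")

    # All record values are passed as strings, even numeric features
    response = client.predict(
        MLModelId="ml-carPriceModel",  # hypothetical model ID
        Record={
            "make": "Toyota",
            "model": "Camry",
            "badge": "Altise",
            "year": "2014",
            "kilometers": "20000",
            "transmission": "Manual",
            "state": "NSW",
        },
        PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
    )
    print(response["Prediction"]["predictedValue"])  # the predicted price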

Automation

  • There are many ways to automate the data collection process, e.g. crawling a website
  • The creation of the datasource and model can be done in the AWS console, with awscli, or with SDKs like boto3
  • Although you can automate these steps, you still need to specify the JSON schema of your datasource
    • This is probably not worth the effort for a one-off training run
    • It will be worth it if you want to continuously retrain the model with on-going datasets
  • You can automate the creation of the endpoint as well
  • For more details on automation see the awscli doc and the boto3 doc
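
A rough boto3 sketch of the datasource, model and endpoint steps (all IDs, names and S3 locations are hypothetical; the schema file format is shown later in this document):

    import boto3

    client = boto3.client("machinelearning")

    # Create a datasource from a CSV already uploaded to S3, with its JSON schema alongside
    client.create_data_source_from_s3(
        DataSourceId="ds-carsales",  # hypothetical ID
        DataSourceName="carsales 120k ads",
        DataSpec={
            "DataLocationS3": "s3://my-bucket/cars.csv",
            "DataSchemaLocationS3": "s3://my-bucket/cars.csv.schema",
        },
        ComputeStatistics=True,  # required for datasources used in training
    )

    # Create and train a regression model from that datasource
    client.create_ml_model(
        MLModelId="ml-carPriceModel",
        MLModelName="car price model",
        MLModelType="REGRESSION",  # the target (price) is numeric
        TrainingDataSourceId="ds-carsales",
    )

    # Bind the trained model to a real-time HTTP endpoint
    client.create_realtime_endpoint(MLModelId="ml-carPriceModel")

awscli exposes the same operations under the aws machinelearning command group, so the chain above can equally be scripted in a shell.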

Pros

  • Managed service, no hassle with GPU EC2 instances etc.
  • Capability of training simple models without having to write a single line of code
  • Fast training speed - finished learning from 120,000 samples in 2 minutes
  • Trivial to bind a model to an endpoint
  • Batch prediction capability (see the sketch after this list)
  • In most cases cheaper than training on GPU EC2 instances with tools like TensorFlow
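
Batch prediction is likewise a single API call in boto3 (a sketch; all IDs and the output location are hypothetical):

    import boto3

    client = boto3.client("machinelearning")

    # Run the trained model over a whole datasource and write the results to S3
    client.create_batch_prediction(
        BatchPredictionId="bp-carsales",  # hypothetical ID
        BatchPredictionName="carsales batch prediction",
        MLModelId="ml-carPriceModel",
        BatchPredictionDataSourceId="ds-carsales-unlabelled",
        OutputUri="s3://my-bucket/predictions/",
    )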

Cons

  • No option to export the trained model, i.e. the model can only be used within AWS
  • Very limited training result stats (only RMSE)
  • Options to manipulate training behaviour are quite limited
  • (Same as API Gateway) Potential bill shock as a result of a DDoS attack after binding to an endpoint

Is it easy to learn?

  • You can easily get started and train a model in minutes by following the guide
  • To become very good at machine learning, you need to be equipped with adequate mathematics and statistics knowledge
  • The relevant knowledge covers a wide range, including but not limited to L0/L1/L2 normalisation, RMSE etc.

Models trained in this spike

I've scraped 120,000 ads from carsales on 20160726. The following parameters have been chosen as datasource features:

Feature        Type          Input / Output   Example
make           Categorical   Input            Toyota
model          Categorical   Input            Camry
badge          Categorical   Input            Altise
year           Categorical   Input            2014
kilometers     Numeric       Input            20000
transmission   Categorical   Input            Manual
state          Categorical   Input            NSW
price          Numeric       Output           22000
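
For reference, the AML data schema corresponding to this feature table would look roughly like the following (a sketch; details such as the header flag are assumptions about how the CSV was laid out):

    import json

    # AML data schema matching the features above; "price" is the prediction target
    schema = {
        "version": "1.0",
        "targetFieldName": "price",
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,  # assumption about the CSV layout
        "attributes": [
            {"fieldName": "make",         "fieldType": "CATEGORICAL"},
            {"fieldName": "model",        "fieldType": "CATEGORICAL"},
            {"fieldName": "badge",        "fieldType": "CATEGORICAL"},
            {"fieldName": "year",         "fieldType": "CATEGORICAL"},
            {"fieldName": "kilometers",   "fieldType": "NUMERIC"},
            {"fieldName": "transmission", "fieldType": "CATEGORICAL"},
            {"fieldName": "state",        "fieldType": "CATEGORICAL"},
            {"fieldName": "price",        "fieldType": "NUMERIC"},
        ],
    }
    print(json.dumps(schema, indent=2))  # upload this to S3, e.g. as cars.csv.schema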

The model is trained to predict the car price from the above input parameters. I've trained multiple models, each with either the complete set or a subset of the 120,000 samples.

Datasource Scope            # Samples   RMSE (smaller is better)
Full Dataset (all brands)   113,195     $12,595.750
Toyota                      15,211      $5,835.131
Toyota-Camry                1,897       $1,872.218
Toyota-Camry-Altise         873         $1,884.205

  • The results are quite good: all models were able to predict the selling prices with reasonable accuracy.
  • The accuracy gets MUCH better as you remove more variables from the input parameter list.
  • All models could figure out that the price difference between Automatic and Manual is a few thousand dollars.
  • All models could figure out that NSW has slightly higher prices compared to QLD.
  • Treating year as Categorical gives better results than treating it as Numeric or converting it to an age.
  • It is hard to tell the exact accuracy because advertisers' strategies and car conditions vary from case to case.

Takeaways

  • Supervised training requires domain knowledge (in this case: assuming a linear distribution, knowing how having multiple makes / models may affect your prediction accuracy, etc.)
  • Having a larger sample size does NOT necessarily result in a more accurate model
  • Removing input parameters (by adding more constraints) might increase the model accuracy by quite a bit
  • It is a good idea not to be too greedy: think about removing unnecessary variables before you start training
  • In most cases, the most time-consuming step is collecting and normalising the datasets
  • AML is generally very fast, at least at the scale of up to 120,000 samples
  • AML does supervised training, i.e. you need to specify a model algorithm in advance
  • AML is probably best suited for proofs of concept or for finding out how relevant a particular feature is
  • We would need more sophisticated tools that require coding to build complex, fine-tuned models

Related links of this spike
