Amazon Machine Learning Spike

AML Spike (Daniel Deng, 20160726 - 20160728)

Target Audience

This document records a spike to explore Amazon Machine Learning. It assumes that the reader has some basic knowledge of:

Machine Learning in general

General steps involved to train a model

  • Get enough samples from different data sources
  • Normalise the data and turn it into a proper dataset
  • (For supervised learning) Specify an algorithm, e.g. Linear Regression
  • Split the dataset into two parts, e.g. 80% vs 20%: one for training and the other for validation
  • Train the model on the training set and validate it against the validation set
  • Use the model to predict results with new input parameters

In most cases, the most time-consuming part is collecting and normalising the data.
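
To make these steps concrete, here is a minimal sketch using Python and scikit-learn, assuming an already-collected cars.csv whose columns are hypothetical:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Load the (already collected and normalised) dataset; cars.csv is hypothetical
    data = pd.read_csv("cars.csv")
    X = pd.get_dummies(data.drop(columns=["price"]))  # one-hot encode categorical features
    y = data["price"]

    # Split the dataset into two parts (80% vs 20%) for training and validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

    # Specify an algorithm (Linear Regression) and train the model
    model = LinearRegression().fit(X_train, y_train)

    # Use the model to predict results with new input parameters
    predictions = model.predict(X_val)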

Measuring the quality of a trained model

  • For models that predict a category, quality is measured by the percentage of correct predictions (accuracy)
  • For models that predict a number, quality is measured by the RMSE (root-mean-square error)
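
Both measures are trivial to compute by hand, e.g. with numpy (a sketch with hypothetical values):

    import numpy as np

    # For numeric predictions: RMSE (root-mean-square error)
    y_true = np.array([22000, 18500, 31000])   # hypothetical actual prices
    y_pred = np.array([21000, 19000, 29500])   # hypothetical model predictions
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    # For categorical predictions: percentage of correct predictions
    labels_true = np.array(["Manual", "Auto", "Auto"])
    labels_pred = np.array(["Manual", "Auto", "Manual"])
    accuracy = np.mean(labels_true == labels_pred) * 100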

Amazon Machine Learning (AML) Specific

Steps involved

  • Collecting and normalising the dataset is done outside AML
  • Upload your dataset to S3, Redshift or RDS
  • Create a datasource in AML from one of the above sources
  • Create and train a model in AML with the datasource you created
  • Use the console to run some trial predictions
  • Optionally bind the model to an HTTP endpoint if you are happy with it
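
Once a model is bound to a real-time endpoint, a trial prediction can also be made programmatically. A rough boto3 sketch, where the model ID and endpoint URL are hypothetical:

    import boto3

    client = boto3.client("machinelearning")

    # All record values are passed as strings, even numeric features
    response = client.predict(
        MLModelId="ml-carPriceModel",  # hypothetical model ID
        Record={
            "make": "Toyota",
            "model": "Camry",
            "badge": "Altise",
            "year": "2014",
            "kilometers": "20000",
            "transmission": "Manual",
            "state": "NSW",
        },
        PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
    )
    print(response["Prediction"]["predictedValue"])  # the predicted price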

Automation

  • There are many ways to automate the data collection process, e.g. crawling a website
  • The creation of the datasource and model can be done in the AWS console, with awscli, or with SDKs like boto3
  • Although you can automate these steps, you still need to specify the JSON schema of your datasource
    • This is probably not worth the effort for a one-off training run
    • It will be worth it if you want to continuously retrain the model with on-going datasets
  • You can automate the creation of the endpoint as well
  • For more details on automation see the awscli doc and the boto3 doc
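
A rough boto3 sketch of the datasource, model and endpoint steps (all IDs, names and S3 locations are hypothetical; the schema file format is shown later in this document):

    import boto3

    client = boto3.client("machinelearning")

    # Create a datasource from a CSV already uploaded to S3, with its JSON schema alongside
    client.create_data_source_from_s3(
        DataSourceId="ds-carsales",  # hypothetical ID
        DataSourceName="carsales 120k ads",
        DataSpec={
            "DataLocationS3": "s3://my-bucket/cars.csv",
            "DataSchemaLocationS3": "s3://my-bucket/cars.csv.schema",
        },
        ComputeStatistics=True,  # required for datasources used in training
    )

    # Create and train a regression model from that datasource
    client.create_ml_model(
        MLModelId="ml-carPriceModel",
        MLModelName="car price model",
        MLModelType="REGRESSION",  # the target (price) is numeric
        TrainingDataSourceId="ds-carsales",
    )

    # Bind the trained model to a real-time HTTP endpoint
    client.create_realtime_endpoint(MLModelId="ml-carPriceModel")

awscli exposes the same operations under the aws machinelearning command group, so the chain above can equally be scripted in a shell.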

Pros

  • Managed service, no hassle with GPU EC2 instances etc.
  • Capability of training simple models without having to write a single line of code
  • Fast training speed - finished learning from 120,000 samples in 2 minutes
  • Trivial to bind a model to an endpoint
  • Batch prediction capability (see the sketch after this list)
  • In most cases cheaper than training on GPU EC2 instances with tools like TensorFlow
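
Batch prediction is likewise a single API call in boto3 (a sketch; all IDs and the output location are hypothetical):

    import boto3

    client = boto3.client("machinelearning")

    # Run the trained model over a whole datasource and write the results to S3
    client.create_batch_prediction(
        BatchPredictionId="bp-carsales",  # hypothetical ID
        BatchPredictionName="carsales batch prediction",
        MLModelId="ml-carPriceModel",
        BatchPredictionDataSourceId="ds-carsales-unlabelled",
        OutputUri="s3://my-bucket/predictions/",
    )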

Cons

  • No option to export the trained model, i.e. the model can only be used within AWS
  • Very limited training result stats (only RMSE)
  • Options to manipulate training behaviour are quite limited
  • (Same as API Gateway) Potential bill shock as a result of a DDoS attack after binding to an endpoint

Is it easy to learn?

  • You can easily get started and train a model in minutes by following the guide
  • To become very good at machine learning, you need to be equipped with adequate mathematics and statistics knowledge
  • The relevant knowledge covers a wide range, including but not limited to L0/L1/L2 normalisation, RMSE etc.

Models trained in this spike

I've scraped 120,000 ads from carsales on 20160726. The following parameters have been chosen as datasource features:

Feature        Type          Input / Output   Example
make           Categorical   Input            Toyota
model          Categorical   Input            Camry
badge          Categorical   Input            Altise
year           Categorical   Input            2014
kilometers     Numeric       Input            20000
transmission   Categorical   Input            Manual
state          Categorical   Input            NSW
price          Numeric       Output           22000
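
For reference, the AML data schema corresponding to this feature table would look roughly like the following (a sketch; details such as the header flag are assumptions about how the CSV was laid out):

    import json

    # AML data schema matching the features above; "price" is the prediction target
    schema = {
        "version": "1.0",
        "targetFieldName": "price",
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,  # assumption about the CSV layout
        "attributes": [
            {"fieldName": "make",         "fieldType": "CATEGORICAL"},
            {"fieldName": "model",        "fieldType": "CATEGORICAL"},
            {"fieldName": "badge",        "fieldType": "CATEGORICAL"},
            {"fieldName": "year",         "fieldType": "CATEGORICAL"},
            {"fieldName": "kilometers",   "fieldType": "NUMERIC"},
            {"fieldName": "transmission", "fieldType": "CATEGORICAL"},
            {"fieldName": "state",        "fieldType": "CATEGORICAL"},
            {"fieldName": "price",        "fieldType": "NUMERIC"},
        ],
    }
    print(json.dumps(schema, indent=2))  # upload this to S3, e.g. as cars.csv.schema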

The model is trained to predict the car price from the above input parameters. I've trained multiple models, each with either the complete set or a subset of the 120,000 samples.

Datasource Scope            # Samples   RMSE (smaller is better)
Full Dataset (all brands)   113,195     $12,595.750
Toyota                      15,211      $5,835.131
Toyota-Camry                1,897       $1,872.218
Toyota-Camry-Altise         873         $1,884.205

  • The results are quite good: all models were able to predict the selling prices with reasonable accuracy.
  • The accuracy gets MUCH better as you remove more variables from the input parameter list.
  • All models could figure out that the price difference between Automatic and Manual is a few thousand dollars.
  • All models could figure out that NSW has slightly higher prices compared to QLD.
  • Treating year as Categorical gives better results than treating it as Numeric or converting it to an age.
  • It is hard to tell the exact accuracy because advertisers' strategies and car conditions vary from case to case.

Takeaways

  • Supervised training requires domain knowledge (in this case: assuming a linear distribution, knowing how having multiple makes / models may affect your prediction accuracy, etc.)
  • Having a larger sample size does NOT necessarily result in a more accurate model
  • Removing input parameters (by adding more constraints) might increase the model accuracy by quite a bit
  • It is a good idea not to be too greedy: think about removing unnecessary variables before you start training
  • In most cases, the most time-consuming step is collecting and normalising the datasets
  • AML is generally very fast, at least at the scale of up to 120,000 samples
  • AML does supervised training, i.e. you need to specify a model algorithm in advance
  • AML is probably best suited for proofs of concept or for finding out how relevant a particular feature is
  • We would need more sophisticated tools that require coding to build complex, fine-tuned models

Related links of this spike
