This document is a spike to explore Amazon Machine Learning (AML). It assumes that the reader has some basic knowledge of:
- Machine Learning
- What Amazon Machine Learning does
A typical machine learning workflow looks like this:
- Get enough samples from different data sources
- Normalise the data and turn it into a proper dataset
- (For supervised learning) Specify an algorithm, e.g. Linear Regression
- Split the dataset into two parts, e.g. 80% vs 20%, one for training and the other for validation
- Train the model against the training part and validate it against the rest
- Use the model to predict results with new input parameters
The most time-consuming part is usually collecting and normalising the data.
- For models that predict a category, we measure the quality by the percentage of correct predictions
- For models that predict a number, we measure the quality by `RMSE` (root-mean-square error), as sketched below
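To make `RMSE` concrete, here is a minimal Python sketch (the example prices are made up):

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error: lower means predictions sit closer to the actual values."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    )

# Two hypothetical predicted car prices vs. the actual sale prices
print(rmse([22000, 18500], [21000, 19000]))  # ~790.57
```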
- Collecting and normalising the dataset is done outside `AML`
- Upload your dataset to either `S3`, `RedShift` or `RDS`
- Create a `datasource` in `AML` from one of the above sources
- Create and train a `model` in `AML` with the `datasource` you created
- Use the console to do some trial predictions
- Potentially bind to an HTTP endpoint if you are happy with the model
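As a hedged sketch of those last two steps with `boto3`, using the car-price features from the experiment later in this document (the model ID is hypothetical, and the endpoint takes a few minutes to become ready before `predict` will succeed):

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Bind the trained model to a real-time HTTP endpoint.
endpoint = ml.create_realtime_endpoint(MLModelId="ml-car-price")  # hypothetical ID
endpoint_url = endpoint["RealtimeEndpointInfo"]["EndpointUrl"]

# Predict with new input parameters; AML expects every record value as a string.
result = ml.predict(
    MLModelId="ml-car-price",
    Record={
        "make": "Toyota", "model": "Camry", "badge": "Altise",
        "year": "2014", "kilometers": "20000",
        "transmission": "Manual", "state": "NSW",
    },
    PredictEndpoint=endpoint_url,
)
print(result["Prediction"]["predictedValue"])  # predicted price (regression model)
```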
- There are many ways to automate the data collection process, such as crawling a website
- The creation of a `datasource` and a `model` can be done either in the AWS console, `awscli` or SDKs like `boto3`
- Although you can automate the above step, you need to specify the JSON schema of your datasource (see the sketch after this list)
- It's probably not helping a lot if you only want a one-off training of a `model`
- It will be worth it if you want to continuously train the `model` with on-going `dataset`s
- You can automate the creation of the endpoint as well
- For more details on automation see the `awscli` doc and `boto3` doc
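A minimal `boto3` sketch of that automated path (all IDs, names and the S3 path are hypothetical, and the exact schema keys should be double-checked against the AML data schema documentation):

```python
import json
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# The JSON schema of the datasource -- this is the part you must write yourself.
schema = {
    "version": "1.0",
    "targetFieldName": "price",  # the output variable the model will predict
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"fieldName": "make", "fieldType": "CATEGORICAL"},
        {"fieldName": "model", "fieldType": "CATEGORICAL"},
        {"fieldName": "badge", "fieldType": "CATEGORICAL"},
        {"fieldName": "year", "fieldType": "CATEGORICAL"},
        {"fieldName": "kilometers", "fieldType": "NUMERIC"},
        {"fieldName": "transmission", "fieldType": "CATEGORICAL"},
        {"fieldName": "state", "fieldType": "CATEGORICAL"},
        {"fieldName": "price", "fieldType": "NUMERIC"},
    ],
}

# Create a datasource from a CSV already uploaded to S3 (hypothetical path).
ml.create_data_source_from_s3(
    DataSourceId="ds-car-ads",
    DataSourceName="car ads 2017-07-26",
    DataSpec={
        "DataLocationS3": "s3://my-bucket/car-ads.csv",
        "DataSchema": json.dumps(schema),
    },
    ComputeStatistics=True,  # statistics are required for training datasources
)

# Create and train a regression model against that datasource.
ml.create_ml_model(
    MLModelId="ml-car-price",
    MLModelName="car price regression",
    MLModelType="REGRESSION",
    TrainingDataSourceId="ds-car-ads",
)
```

Both calls return immediately; AML trains asynchronously, so poll `get_ml_model` until the status becomes `COMPLETED` before predicting.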
- Managed service: no hassle of dealing with GPU EC2 instances etc.
- Capability of training simple models without having to write a single line of code
- Fast training speed: finished learning 120,000 samples in 2 minutes
- Trivial to bind a model to an endpoint
- Batch prediction capability (see the sketch after this list)
- In most cases cheaper than training on GPU EC2 with tools like `TensorFlow`
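To expand on the batch prediction point, a hedged `boto3` sketch (IDs and the output path are hypothetical): it scores every record in a datasource in one asynchronous job and writes the results to S3 as CSV.

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Score a whole datasource of new, unlabelled ads in one job.
ml.create_batch_prediction(
    BatchPredictionId="bp-car-ads",
    BatchPredictionName="car ads batch scoring",
    MLModelId="ml-car-price",                      # hypothetical model ID
    BatchPredictionDataSourceId="ds-car-ads-new",  # datasource without the price column
    OutputUri="s3://my-bucket/predictions/",       # hypothetical output location
)
```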
- No option to export the trained model, i.e. the model can only be used within AWS
- Very limited training result stats (only `RMSE`)
- Options to manipulate training behaviour are quite limited
- (Same as `API Gateway`) Potential bill shock as a result of a `DDoS` attack after binding to an endpoint
- You can easily get started and train a model in minutes by following the guide
- To be very good at machine learning you need to be equipped with adequate mathematics and statistics knowledge
- Relevant knowledge covers a wide range, including but not limited to `L0, L1, L2 Normalisation`, `RMSE` etc.
I've scraped 120,000 ads from carsales on 2017-07-26. The following parameters have been chosen as datasource features:
| Feature | Type | Input / Output | Example |
|---|---|---|---|
| make | Categorical | Input | Toyota |
| model | Categorical | Input | Camry |
| badge | Categorical | Input | Altise |
| year | Categorical | Input | 2014 |
| kilometers | Numeric | Input | 20000 |
| transmission | Categorical | Input | Manual |
| state | Categorical | Input | NSW |
| price | Numeric | Output | 22000 |
The model is trained to predict the car price from the above input parameters. I've trained multiple models with either the complete set of 120,000 samples or a subset of it.
| Datasource Scope | # Samples | RMSE (smaller is better) |
|---|---|---|
| Full Dataset (all brands) | 113,195 | $12,595.750 |
| Toyota | 15,211 | $5,835.131 |
| Toyota-Camry | 1,897 | $1,872.218 |
| Toyota-Camry-Altise | 873 | $1,884.205 |
- The results are quite good; all models were able to predict the sale prices with reasonable accuracy
- The accuracy is MUCH better when you remove more variables from the input parameter list (i.e. when you narrow the datasource scope)
- All models could figure out that the price difference between `Automatic` and `Manual` is a few thousand dollars
- All models could figure out that `NSW` has a slightly higher price compared to `QLD`
- When `year` is regarded as `Categorical`, the result is better than regarding it as `Numeric` or converting it to `age`
- It is hard to tell the exact accuracy because advertisers' strategies and car conditions vary from case to case
- Supervised training requires domain knowledge (in this case: assuming a linear distribution, knowing how having multiple makes / models may affect your prediction accuracy, etc.)
- Having a larger sample size does NOT necessarily result in a more accurate model
- Removing input parameters (by having more constraints) might increase the model accuracy by quite a bit
- It is a good idea not to be too greedy and think about removing unnecessary variables before you start training
- The most time-consuming step would be collecting and normalising `dataset`s in most cases
- `AML` is generally very fast, at least at the scale of up to 120,000 samples
- `AML` does supervised training, i.e. you need to specify a model algorithm in advance
- `AML` is probably suitable for proof of concept or finding out the level of relevance of a particular feature
- We need to use more sophisticated tools that require coding if we need to build complex and fine-tuned models