Language-agnostic instructions for Digits Recognizer machine learning dojo
This dojo is directly inspired by the Digit Recognizer competition from Kaggle.com:
http://www.kaggle.com/c/digit-recognizer
The datasets below are simply shorter versions of the training dataset from Kaggle.
The dataset
*************
Two datasets can be downloaded here:
1) a training set of 5,000 examples
http://brandewinder.blob.core.windows.net/public/trainingsample.csv
2) a validation set of 500 examples; this dataset is supplied so that
you can evaluate the performance of your classification model on
"fresh" data that hasn't been used to construct the classifier.
http://brandewinder.blob.core.windows.net/public/validationsample.csv
The files are CSV files; the first line contains column labels, and
each subsequent row represents a scanned hand-written digit:
* the first element is the actual digit (0 to 9)
* the next 784 elements are the 28 x 28 pixels of the image, flattened
into a single vector. Each pixel is a grayscale value from 0 (blank background) to 255 (full ink);
in Kaggle's description of the data, higher values mean darker pixels.
The full training dataset (42,000 examples) is available at http://www.kaggle.com/c/digit-recognizer.
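As an illustration of the file layout described above, here is a minimal Python sketch that turns CSV rows into (label, pixels) pairs. The function name read_examples, the header names and the tiny 3-pixel sample are assumptions made for this example; the real files have 784 pixel columns per row.

```python
import csv
import io

def read_examples(lines):
    """Parse CSV rows into (label, pixels) pairs, skipping the header row."""
    reader = csv.reader(lines)
    next(reader)  # the first line contains column labels, not data
    examples = []
    for row in reader:
        if not row:
            continue  # ignore blank lines, if any
        label = int(row[0])                         # first element: the actual digit
        pixels = [int(value) for value in row[1:]]  # remaining elements: the pixels
        examples.append((label, pixels))
    return examples

# Tiny in-memory stand-in for a real file (3 pixels per row instead of 784).
sample = io.StringIO("label,pixel0,pixel1,pixel2\n7,0,128,255\n3,255,0,64\n")
examples = read_examples(sample)
print(examples)
```

With a downloaded file, the same function should work on an open file handle, for example: with open("trainingsample.csv", newline="") as f: training = read_examples(f)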
Naive KNN (K Nearest Neighbors) algorithm
********************************************
Naive KNN in Pseudo Code:
Given a Target = an image with unknown Label (28x28 Pixels),
Given a Training Set of Examples = images with known Label (Label:Actual Digit, Observation: 28x28 Pixels),
Given a Distance function between Observations, measuring how "similar" two images are,
For every Example, compute Distance between Example and Target,
Find the Neighbors of the Target = the K Examples with smallest distance to Target,
Return the most frequent Label among the Neighbors
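The pseudo code above could be sketched in Python roughly as follows. This is one possible reading, not a reference implementation: the names distance and knn_classify, the default k=3, and the use of a sum of squared differences as the Distance function are all assumptions made for the example.

```python
from collections import Counter

def distance(a, b):
    """Sum of squared pixel differences (assumed Distance function)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(training, target, k=3):
    """Return the most frequent label among the k examples closest to target.

    training is a list of (label, pixels) pairs; target is a list of pixels.
    """
    # Sort all examples by distance to the target and keep the k closest.
    neighbors = sorted(training, key=lambda ex: distance(ex[1], target))[:k]
    # Count the labels of the neighbors and return the most common one.
    votes = Counter(label for label, _ in neighbors)
    return votes.most_common(1)[0][0]

# Toy data with 2 "pixels" per image instead of 784.
training = [(0, [0, 0]), (0, [1, 0]), (1, [10, 10]), (1, [11, 10]), (1, [10, 11])]
print(knn_classify(training, [1, 1], k=3))
print(knn_classify(training, [9, 9], k=3))
```

Sorting the whole training set for every target is the "naive" part: it is simple to reason about, but it scans all examples on each call.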
Simplest thing that could possibly work: 1-Nearest Neighbor ("Closest Neighbor")
From the training set,
Find the closest example to the Target,
Return its Label
Suggested plan of attack
**************************
- Read the training set into a collection of "examples", each with a Label (the actual digit) and its pixels.
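- Start with the Euclidean Distance to measure similarity between images:
X = [ x1; x2; .. xn ]
Y = [ y1; y2; .. yn ]
Dist(X, Y) = (x1-y1)^2 + (x2-y2)^2 + .. + (xn-yn)^2
(strictly speaking this is the squared Euclidean Distance; taking the square root
does not change which neighbor is closest, so it can be skipped)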
- Build a "Closest Neighbor" classifier, a function which, given an unlabeled image as input,
returns the Label of the closest neighbor from the Training Set
- Check what proportion of the Validation Set gets properly classified using that model
... now find ways to improve the model :)
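The plan above, from the distance function through the validation check, might look like this in Python. It is a sketch under stated assumptions: examples are (label, pixels) pairs, the names squared_distance, closest_neighbor and accuracy are made up for illustration, and the toy data uses 3 pixels per image rather than 784.

```python
def squared_distance(xs, ys):
    """Dist(X, Y) as the sum of squared differences between pixels."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys))

def closest_neighbor(training, pixels):
    """Return the label of the training example closest to pixels."""
    label, _ = min(training, key=lambda ex: squared_distance(ex[1], pixels))
    return label

def accuracy(training, validation):
    """Proportion of validation examples whose label is predicted correctly."""
    correct = sum(
        1 for label, pixels in validation
        if closest_neighbor(training, pixels) == label
    )
    return correct / len(validation)

# Toy stand-ins for the training and validation sets.
training = [(0, [0, 0, 0]), (1, [255, 255, 255])]
validation = [
    (0, [10, 0, 5]),
    (1, [250, 240, 255]),
    (1, [100, 100, 100]),  # closer to the "0" example, so it is misclassified
    (0, [20, 30, 10]),
]
print(accuracy(training, validation))
```

Keeping the evaluation separate from the classifier makes it easy to swap in a different distance or a larger K and compare the resulting proportions.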
More information
******************
More on the KNN algorithm:
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
Discussion on algorithm approaches and improvements:
http://www.kaggle.com/c/digit-recognizer
Slides:
http://www.slideshare.net/mathias-brandewinder/fsharp-and-machine-learning-dojo