Skip to content

Instantly share code, notes, and snippets.

What would you like to do?
Language-agnostic instructions for Digits Recognizer machine learning dojo
This dojo is directly inspired by the Digit Recognizer competition from
The datasets below are simply shorter versions of the training dataset from Kaggle.
The dataset
2 datasets can be downloaded here:
1) a training set of 5,000 examples
2) a validation set of 500 examples; this dataset is supplied so that
you can evaluate the performance of your classification model on
"fresh" data that hasn't been used to construct the classifier.
The files are CSV files; the first line contains column labels, and
each subsequent row represents a scanned hand-written digit:
* the first element is the actual digit (0 to 9)
* the next 784 elements are the 28 x 28 pixels of the image, flattened
into a single vector. Each pixel is gray scale, from 0 (pure black) to 255 (pure white).
The full 50,000 training examples dataset is available at
Naive KNN (K Nearest Neighbors) algorithm
Naive KNN in Pseudo Code:
Given a Target = an image with unknown Label (28x28 Pixels),
Given a Training Set of Examples = images with known Label (Label:Actual Digit, Observation: 28x28 Pixels),
Given a Distance function between Observations, measuring how "similar" two images are,
For every Example, compute Distance between Example and Target,
Find the Neighbors of the Target = the K Examples with smallest distance to Target,
Return the most frequent Label among the Neighbors
Simplest thing that could possibly work: 1-Nearest Neighbor ("Closest Neighbor")
From the training set,
Find the closest example to the Target,
Return its Label
Suggested plan of attack
- Read the training set into a collection of "examples", with a Label (the actual number) and their pixels.
- Start with the Euclidean Distance to measure similarity between images:
X = [ x1; x2; .. xn ]
Y = [ y1; y2; .. yn ]
Dist(X, Y) = (x1-y1)^2 + (x2-y2)^2 .. + (xn-yn)^2
- Build a "Closest Neighbor" classifier, a function which, given an unlabeled image as input,
returns the Label of the closest neighbor from the Training Set
- Check what proportion of the Validation Set gets properly classified using that model
... now find ways to improve the model :)
More information
More on the KNN algorithm:
Discussion on algorithm approaches and improvements
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment