Skip to content

Instantly share code, notes, and snippets.

@ayulockin
Created April 13, 2019 05:41
Show Gist options
  • Save ayulockin/7fd2c39cb15baf2d2815b35fce114063 to your computer and use it in GitHub Desktop.
Save ayulockin/7fd2c39cb15baf2d2815b35fce114063 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Utility to prepare get MNIST dataset as csv\n",
"\n",
"### Instructions:\n",
"\n",
"- Download dataset from [here](http://yann.lecun.com/exdb/mnist/).\n",
"- Extract them into `MNIST_dataset/data` dir.\n",
"- Set the path to the files if required.\n",
"\n",
"Hurray you have the csv dataset. Now enjoy!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gzip\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### About MNIST Hand Written Dataset\n",
"\n",
"- train-images-idx3-ubyte: training set images<br>\n",
"- train-labels-idx1-ubyte: training set labels<br>\n",
"- t10k-images-idx3-ubyte: test set images<br>\n",
"- t10k-labels-idx1-ubyte: test set labels<br>\n",
"\n",
"The training set contains 60000 examples, and the test set 10000 examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path_to_dataset = os.getcwd()+'/MNIST_dataset/data'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"files = os.listdir(path_to_dataset)\n",
"files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Helper function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def MNISTtocsv(path, img_file, label_file, out_csv, n):\n",
" img = open(path+'/'+img_file, \"rb\")\n",
" label = open(path+'/'+label_file, \"rb\")\n",
" out = open(out_csv, \"w\")\n",
"\n",
" img.read(16)\n",
" label.read(8)\n",
" images = []\n",
"\n",
" for i in range(n):\n",
" image = [ord(label.read(1))]\n",
" for j in range(28*28):\n",
" image.append(ord(img.read(1)))\n",
" images.append(image)\n",
"\n",
" for image in images:\n",
" out.write(\",\".join(str(pix) for pix in image)+\"\\n\")\n",
" \n",
" img.close()\n",
" label.close()\n",
" out.close()\n",
"\n",
"MNISTtocsv(path_to_dataset, files[0], files[1], 'mnist_test.csv',10000)\n",
"MNISTtocsv(path_to_dataset, files[2], files[3], 'mnist_train.csv', 60000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@sayakpaul
Copy link

sayakpaul commented Apr 13, 2019

Beautifully structured @ayulockin. Clean code. Consider plotting some of the images. It will give a good sense of the data to the viewers.

@ayulockin
Copy link
Author

Thank you..surely will plot some..point taken 100%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment