@adpoe
Last active December 24, 2016 22:56
SVM and SIFT Nature Conservatory Kaggle Entry
{
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0,
"cells": [
{
"cell_type": "markdown",
"source": "# Code for a Support Vector Machine Entry with Scikit-Learn ~ LB 1.6 (Benchmark)",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Why do this?\nConvolutional Neural Networks are demonstrated to be highly effective at solving vision problems. Why use an SVM?\n\n - Excellent way to learn and understand the problem\n - Better results than you'd expect\n - Great way to learn how to use the SVM library in SKLearn (that's why I did it)\n - Will learn some handy tricks in opencv (especially regarding feature extraction), along the way\n\nPersonally, I think the CNN's (when well optimized) will perform better, but there's a lot to learn from this process. Particularly with regard to feature extraction. One thing I'm wondering is whether we could gain performance improvements by feeding extracted features to our CNNs instead of the raw image. Something to experiment with and explore.",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Overview of the process\nThe process for using an SVM for this problem is a little different than using a ConvNet.\n\nMost importantly, we will need to extract features from the imagery ourselves, rather than letting the CNN do it. So we get to pull back the curtain a bit, and see what's happening behind the scenes (if you'll forgive the metaphor).\n\n**The Steps We'll Take**\n - Load Data\n - Extract Features\n - Feed Features into SVM Model\n - Fit the SVM\n - Predict with our SVM\n\nFrom there, we can iterate and improve the model. Just as with ConvNets, if we send better data into the model, we will get better results. ",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Loading data\nThe first step in this process is loading data. Here is a script to get our images loaded into memory.",
"metadata": {}
},
{
"cell_type": "code",
"source": "# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load in \n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\nfrom sklearn import svm\nimport cv2\nimport csv\nimport os\nfrom os import listdir\nfrom os.path import isfile, join\n\n# Input data files are available in the \"../input/\" directory.\n# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n\nfrom subprocess import check_output\nprint(check_output([\"ls\", \"../input\"]).decode(\"utf8\"))\n\n# Any results you write to the current directory are saved as output.\nmydirvalues = [d for d in os.listdir(os.path.dirname(os.path.abspath(__file__)))]\nprint(mydirvalues)\nonlyfiles = [f for f in listdir(\"../input/train/\") if isfile(join(\"../input/train/\", f))]\nprint(onlyfiles)\n\ndir_names = [d for d in listdir(\"../input/train/\") if not isfile(join(\"../input/train/\", d))]\nprint(dir_names)\n\nfile_paths = {}\nclass_num = 0\nfor d in dir_names:\n fnames = [f for f in listdir(\"../input/train/\"+d+\"/\") if isfile(join(\"../input/train/\"+d+\"/\", f))]\n print(fnames)\n file_paths[(d, class_num, \"../input/train/\"+d+\"/\")] = fnames\n class_num += 1",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Feature Extraction\nThis is probably the most important, and the most interesting part of the process--it's where we have the most options. \n\nWe need to extract features from the image in some way.\n\n### Here are some good options\n 1. **SIFT (Scale Invariant Feature Transform)** - This is what I chose, but it doesn't work on the Kaggle servers. You'll need to run code for this on your computer. It yields an Nx128 set of features, each extracted from detected keypoints in the image. The problem with SIFT, however, is that it is NOT open source. So, technically you need to pay royalties to its inventors in commercial applications.\n 2. **ORB** - OpenCV's open source alternative to SIFT. I have code for using ORB in the section below, as well. But the code may need to be restructured from Nx128 to NxM, depending on how many dimensions ORB gives you back for each feature. More on ORB: http://www.willowgarage.com/sites/default/files/orb_final.pdf \n 3. **HoG (Histogram of Gradients)** - Another good option, this often used to find both people and animals in photos. More info: https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients \n\nThe point here is: once features are extracted, we have essentially taken the the \"most important\" or \"distinctive\" keypoints from each image and mapped them into a higher dimensional space via our feature extraction method.\n\nWe then feed our higher dimensional information vector into the SVM, along with a label, and we train the SVM on these data in our high-dimensional space. The SVM learns what boundaries mark distinctions between each class type, and so: when we extract the features from a test image--using the same method--the hope is that the description vector takes us to an area in the feature space that is common to vectors taken from images of the same class-type. 
\n\nWhether you use an SVM for this problem or not, I do believe that some of these feature extraction techniques could be useful, if fed into a neural network, rather than just a raw image. ",
"metadata": {}
},
{
"cell_type": "code",
"source": " # General steps:\n # Extract feature from each file as HOG or similar... or SIFT... or Similar...\n # map each to feature space... and train some kind of classifier on that. SVM is a good choice.\n # do the same for each feature in test set...\n training_data = np.array([])\n training_labels = np.array([])\n\n for key in file_paths:\n category = key[1]\n directory_path = key[2]\n file_list = file_paths[key]\n\n # shuffle this list, so we get random examples\n np.random.shuffle(file_list)\n \n # Stop early, while testing, so it doesn't take FOR-EV-ER (FOR-EV-ER)\n i = 0\n\n # read in the file and get its SIFT features\n for fname in file_list:\n fpath = directory_path + fname\n print(fpath)\n print(\"Category = \" + str(category))\n # extract features!\n gray = cv2.imread(fpath,0)\n gray = cv2.resize(gray, (400, 250)) # resize so we're always comparing same-sized images\n # Could also make images larger/smaller\n # to tune for greater accuracy / more speedd\n \n \"\"\" My Choice: SIFT (Scale Invariant Feature Transform)\"\"\"\n # However, this does not work on the Kaggle server\n # because it's in a separate package in the opencv version used on the Kaggle server.\n # This is a very robust method however, worth trying when it's reasonable to do so. 
\n detector = cv2.SIFT()\n kp1, des1 = detector.detectAndCompute(gray, None)\n \n \"\"\" Another option that will work on Kaggle server is ORB\"\"\"\n # find the keypoints with ORB\n #kp = cv2.orb.detect(img,None)\n # compute the descriptors with ORB\n #kp1, des1 = cv2.orb.compute(img, kp)\n \n \"\"\" Histogram of Gradients - often used to for detected people/animals in photos\"\"\"\n # Havent' tried this one in the SVM yet, but here's how to get the HoG, using openCV\n # hog = cv2.HOGDescriptor()\n #img = cv2.imread(sample)\n # h = hog.compute(im)\n\n # This is to make sure we have at least 100 keypoints to analyze\n # could also duplicate a few features if needed to hit a higher value\n if len(kp1) < 100:\n continue\n \n # transform the data to float and shuffle all keypoints\n # so we get a random sampling from each image\n des1 = des1.astype(np.float64)\n np.random.shuffle(des1)\n des1 = des1[0:100,:] # trim vector so all are same size\n vector_data = des1.reshape(1,12800) \n list_data = vector_data.tolist()\n\n # We need to concatenate ont the full list of features extracted from each image\n if len(training_data) == 0:\n training_data = np.append(training_data, vector_data)\n training_data = training_data.reshape(1,12800)\n else:\n training_data = np.concatenate((training_data, vector_data), axis=0)\n \n training_labels = np.append(training_labels,category)\n\n # early stop\n i += 1\n if i > 50:\n break",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "# Fit the SVM\nNow comes the training step. May take a few minutes to fit the SVM itself.",
"metadata": {}
},
{
"cell_type": "code",
"source": " # Alright! Now we've got features extracted and labels\n X = training_data\n y = training_labels\n y = y.reshape(y.shape[0],)\n\n # Create and fit the SVM\n # Fitting should take a few minutes\n clf = svm.SVC(kernel='linear', C = 1.0, probability=True)\n clf.fit(X,y)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "# Make a Prediction\nMake a prediction on an example fish, just to test it out. Should be LAG --> or class #3, if you get it right. Picked LAG because it is very distinctive to human eye.",
"metadata": {}
},
{
"cell_type": "code",
"source": " # Now, extract one of the images and predict it\n gray = cv2.imread('../inputtest_stg1/img_00071.jpg', 0) # Correct is LAG --> Class 3\n kp1, des1 = detector.detectAndCompute(gray, None)\n\n des1 = des1[0:100, :] # trim vector so all are same size\n vector_data = des1.reshape(1, 12800)\n\n print(\"Linear SVM Prediction:\")\n print(clf.predict(vector_data)) # prints highest probability class, only\n print(clf.predict_proba(vector_data)) # shows all probabilities for each class. \n # need this for the competition",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Save the SVM for later Use\nThis code can be use to save (and load) your SVM for later, to avoid re-doing expensive computations. I've commented it out because we can't save onto the Kaggle server. But would encourage using it on your own computer. It's a timesaver.",
"metadata": {}
},
{
"cell_type": "code",
"source": " # save SVM model\n # joblib.dump(clf, 'filename.pkl')\n # to load SVM model, use: clf = joblib.load('filename.pkl')",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Predict the whole Data Set\nMake a prediction for all data in the prediction set.",
"metadata": {}
},
{
"cell_type": "code",
"source": " # early stoppage...\n # only do 10\n i = 0\n for f in fnames:\n file_name = \"test_stg1/\" + f\n print(\"---Evaluating File at: \" + file_name)\n gray = cv2.imread(file_name, 0) # Correct is LAG --> Class 3\n gray = cv2.resize(gray, (400, 250)) # resize so we're always comparing same-sized images\n kp1, des1 = detector.detectAndCompute(gray, None)\n\n # ensure we have at least 100 keypoints to analyze\n if len(kp1) < 100:\n # and duplicate some points if necessary\n current_len = len(kp1)\n vectors_needed = 100 - current_len\n repeated_vectors = des1[0:vectors_needed, :]\n # concatenate repeats onto des1\n while len(des1) < 100:\n des1 = np.concatenate((des1, repeated_vectors), axis=0)\n # duplicate data just so we can run the model.\n des1[current_len:100, :] = des1[0:vectors_needed, :]\n\n np.random.shuffle(des1) # shuffle the vector so we get a representative sample\n des1 = des1[0:100, :] # trim vector so all are same size\n vector_data = des1.reshape(1, 12800)\n print(\"Linear SVM Prediction:\")\n print(clf.predict(vector_data))\n svm_prediction = clf.predict_proba(vector_data)\n print(svm_prediction)\n \n # format list for csv output\n csv_output_list = []\n csv_output_list.append(f)\n for elem in svm_prediction: \n for value in elem:\n csv_output_list.append(value)\n\n # append filename to make sure we have right format to write to csv\n print(\"CSV Output List Formatted:\")\n print(csv_output_list)\n\n # and append this file to the output_list (of lists)\n prediction_output_list.append(csv_output_list)\n\n # Uncomment to stop early\n if i > 10:\n break\n i += 1",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Format CSV for Output",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": " # Write to csv\n print(prediction_output_list[0:5])\n \"\"\" Uncomment to write to your CSV. Can't do this on Kaggle server directly.\n try:\n with open(\"sift_and_svm_submission.csv\", \"wb\") as f:\n writer = csv.writer(f)\n headers = ['image', 'ALB', 'BET', 'DOL', 'LAG', 'NoF', 'OTHER', 'SHARK', 'YFT']\n writer.writerow(headers)\n writer.writerows(prediction_output_list)\n finally:\n f.close()\n \"\"\"",
"execution_count": null,
"outputs": [],
"metadata": {}
}
]
}