{
"cells": [
{
"metadata": {
"toc": "true"
},
"cell_type": "markdown",
"source": "# Table of Contents\n <p><div class=\"lev1\"><a href=\"#SC-2---Can-We-Distinguish-&quot;Close&quot;-Model-Variants?\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>SC-2 - Can We Distinguish \"Close\" Model Variants?</a></div><div class=\"lev2\"><a href=\"#Feature-Engineering\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Feature Engineering</a></div><div class=\"lev2\"><a href=\"#Classifier\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Classifier</a></div><div class=\"lev2\"><a href=\"#Finding-Optimal-Hyperparameters\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>Finding Optimal Hyperparameters</a></div><div class=\"lev2\"><a href=\"#Observations\"><span class=\"toc-item-num\">1.4&nbsp;&nbsp;</span>Observations</a></div>"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# SC-2 - Can We Distinguish \"Close\" Model Variants? #"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "In this second experiment in seriation classification, the question is whether we can distinguish variants of the same basic structural model which differ in their details. In this case, I look at \"lineage\" models with four variants: lineages that split early and evolve for a longer period, split late and have more time as a single unified lineage, lineages that coalesce early and evolve for a longer period, and late coalescence with a longer period of separate evolution. These models are clearly interesting from an archaeological perspective, but since distinguishing them visually in seriations may rely upon additional information to orient things, they may be topologically equivalent. Thus, my expectation in this experiment is very low classification performance, close to chance. \n\nI use the same approach as sc-1: a gradient boosted classifier with the Laplacian eigenvalues as features. "
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "import numpy as np\nimport networkx as nx\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport cPickle as pickle\nfrom copy import deepcopy\nfrom sklearn.utils import shuffle\n\n%matplotlib inline\nplt.style.use(\"fivethirtyeight\")\nsns.set()",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Couldn't import dot_parser, loading of dot files will not be possible.\n"
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "all_graphs = pickle.load(open(\"train-cont-graphs.pkl\",'r'))\nall_labels = pickle.load(open(\"train-cont-labels.pkl\",'r'))",
"execution_count": 2,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "In addition to needing a train/test split, we need to ensure reasonable class balance. A simple approach to this is simply to shuffle both lists before taking a random sample without replacement."
},
{
"metadata": {
"collapsed": true,
"trusted": false
},
"cell_type": "code",
"source": "def train_test_split(graph_list, label_list, test_fraction=0.20):\n \"\"\"\n Randomly splits a set of graphs and labels into training and testing data sets. We need a custom function\n because the dataset isn't a numeric matrix, but a list of NetworkX Graph objects. In case there is class \n structure (i.e., we filled the arrays first with instances of one class, then another class...) we consistently\n shuffle both lists.\n \"\"\"\n \n graph_list, label_list = shuffle(graph_list, label_list)\n \n rand_ix = np.random.random_integers(0, len(graph_list), size=int(len(graph_list) * test_fraction))\n print \"random indices: %s\" % rand_ix\n \n test_graphs = []\n test_labels = []\n \n train_graphs = []\n train_labels = []\n \n # first copy the chosen test values, without deleting anything since that would alter the indices\n for ix in rand_ix:\n test_graphs.append(graph_list[ix])\n test_labels.append(label_list[ix])\n \n # now copy the indices that are NOT in the test index list\n for ix in range(0, len(graph_list)):\n if ix in rand_ix:\n continue\n train_graphs.append(graph_list[ix])\n train_labels.append(label_list[ix])\n \n return (train_graphs, train_labels, test_graphs, test_labels)",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Feature Engineering ##"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The goal here is to construct a standard training and test data matrix of numeric values, which will contain the sorted Laplacian eigenvalues of the graphs in each data set. One feature will thus represent the largest eigenvalue for each graph, a second feature will represent the second largest eigenvalue, and so on. \n\nWe do not necessarily assume that all of the graphs have the same number of vertices, although if there are marked differences, we would need to handle missing data for those graphs which had many fewer eigenvalues (or restrict our slice of the spectrum to the smallest number of eigenvalues present)."
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "train_graphs, train_labels, test_graphs, test_labels = train_test_split(all_graphs, all_labels, test_fraction=0.1)\nprint \"train size: %s\" % len(train_graphs)\nprint \"test size: %s\" % len(test_graphs)",
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "random indices: [164 188 191 40 48 11 87 119 105 171 156 47 164 158 185 50 139 32\n 101 158]\ntrain size: 182\ntest size: 20\n"
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "def graphs_to_eigenvalue_matrix(graph_list, num_eigenvalues = None):\n \"\"\"\n Given a list of NetworkX graphs, returns a numeric matrix where rows represent graphs, \n and columns represent the reverse sorted eigenvalues of the Laplacian matrix for each graph,\n possibly trimmed to only use the num_eigenvalues largest values. If num_eigenvalues is \n unspecified, all eigenvalues are used.\n \"\"\" \n # we either use all of the eigenvalues, or the number requested (and zero-pad if needed)\n if num_eigenvalues is None:\n ev_used = n\n else:\n ev_used = num_eigenvalues\n \n data_mat = np.zeros((len(graph_list),ev_used))\n \n for ix in range(0, len(graph_list)):\n spectrum = sorted(nx.spectrum.laplacian_spectrum(graph_list[ix], weight=None), reverse=True)\n # if the spectrum is shorter than the number of eigenvalues used (due to multiplicity), zero pad the result\n if len(spectrum) < ev_used:\n spectrum = np.lib.pad(spectrum, (0,ev_used-len(spectrum)), 'constant', constant_values=(0,0))\n data_mat[ix,:] = spectrum[0:ev_used]\n \n return data_mat\n ",
"execution_count": 5,
"outputs": []
},
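{
"metadata": {},
"cell_type": "markdown",
"source": "As a quick illustration of what these features look like (a toy example added here, not part of the original analysis), we can take the reverse-sorted Laplacian spectrum of a small path graph, and see how a row of the feature matrix is zero-padded when the graph has fewer vertices than the requested number of eigenvalues."
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "# illustrative only: reverse-sorted Laplacian spectrum of a 5-vertex path graph\ntoy = nx.path_graph(5)\nprint sorted(nx.spectrum.laplacian_spectrum(toy, weight=None), reverse=True)\n# its single-row feature matrix, zero-padded out to 8 columns\nprint graphs_to_eigenvalue_matrix([toy], num_eigenvalues=8)",
"execution_count": null,
"outputs": []
},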
{
"metadata": {},
"cell_type": "markdown",
"source": "## Classifier ##"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We're going to be using a gradient boosted classifier, which has some of best accuracy of any of the standard classifier methods. Ultimately we'll figure out the best hyperparameters using cross-validation, but first we just want to see whether the approach gets us anywhere in the right ballpark -- remember, we can 80% accuracy with just eigenvalue distance, so we have to be in that neighborhood or higher to be worth the effort of switching to a more complex model. "
},
{
"metadata": {
"collapsed": true,
"trusted": false
},
"cell_type": "code",
"source": "from sklearn.ensemble import GradientBoostingClassifier\nfrom sklearn.metrics import accuracy_score, classification_report, confusion_matrix",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "train_matrix = graphs_to_eigenvalue_matrix(train_graphs, num_eigenvalues=20)\ntest_matrix = graphs_to_eigenvalue_matrix(test_graphs, num_eigenvalues=20)\nprint train_matrix.shape\nprint test_matrix.shape",
"execution_count": 7,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "(182, 20)\n(20, 20)\n"
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "clf = GradientBoostingClassifier(n_estimators = 250)\nclf.fit(train_matrix, train_labels)",
"execution_count": 8,
"outputs": [
{
"execution_count": 8,
"output_type": "execute_result",
"data": {
"text/plain": "GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',\n max_depth=3, max_features=None, max_leaf_nodes=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=250,\n random_state=None, subsample=1.0, verbose=0,\n warm_start=False)"
},
"metadata": {}
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "pred_label = clf.predict(test_matrix)",
"execution_count": 9,
"outputs": []
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "cm = confusion_matrix(test_labels, pred_label)\ncmdf = pd.DataFrame(cm)\ncmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)\ncmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)\n\nprint cmdf\nprint classification_report(test_labels, pred_label)\nprint \"Accuracy on test: %0.3f\" % accuracy_score(test_labels, pred_label)",
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": " predicted 0 predicted 1 predicted 2 predicted 3\nactual 0 3 2 0 0\nactual 1 1 1 0 0\nactual 2 0 0 2 2\nactual 3 0 0 4 5\n precision recall f1-score support\n\n 0 0.75 0.60 0.67 5\n 1 0.33 0.50 0.40 2\n 2 0.33 0.50 0.40 4\n 3 0.71 0.56 0.63 9\n\navg / total 0.61 0.55 0.57 20\n\nAccuracy on test: 0.550\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Overall, the accuracy is low, but interestingly, there is a pattern. We never mistake seriations which have an \"early\" event from those with a \"late\" event, but we have trouble telling a early split from an early coalescence, and trouble telling a late split from a late coalescence. This is a slightly weird result, actually."
},
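{
"metadata": {},
"cell_type": "markdown",
"source": "To make that block structure easier to see (an illustrative addition, using the confusion matrix computed above), we can plot it as a heatmap: the early/late separation should appear as two blocks on the diagonal."
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "# illustrative only: plot the confusion matrix as a heatmap to show the early/late block structure\nsns.heatmap(cm, annot=True, fmt='d', cmap='Blues')\nplt.xlabel('predicted class')\nplt.ylabel('actual class')",
"execution_count": null,
"outputs": []
},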
{
"metadata": {},
"cell_type": "markdown",
"source": "## Finding Optimal Hyperparameters ##"
},
{
"metadata": {
"collapsed": true,
"trusted": false
},
"cell_type": "code",
"source": "from sklearn.pipeline import Pipeline\nfrom sklearn.grid_search import GridSearchCV",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "pipeline = Pipeline([\n ('clf', GradientBoostingClassifier())\n ])\n\nparams = {\n 'clf__learning_rate': [5.0,2.0,1.0, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005],\n 'clf__n_estimators': [10,25,50,100,250,500,1000]\n}\n\ngrid_search = GridSearchCV(pipeline, params, cv=5, n_jobs = -1, verbose = 1)",
"execution_count": 12,
"outputs": []
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "grid_search.fit(train_matrix, train_labels)",
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Fitting 5 folds for each of 70 candidates, totalling 350 fits\n"
},
{
"output_type": "stream",
"name": "stderr",
"text": "[Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 0.1s\n[Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 4.8s\n[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 23.5s\n[Parallel(n_jobs=-1)]: Done 350 out of 350 | elapsed: 48.6s finished\n"
},
{
"execution_count": 13,
"output_type": "execute_result",
"data": {
"text/plain": "GridSearchCV(cv=5, error_score='raise',\n estimator=Pipeline(steps=[('clf', GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',\n max_depth=3, max_features=None, max_leaf_nodes=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=100,\n random_state=None, subsample=1.0, verbose=0,\n warm_start=False))]),\n fit_params={}, iid=True, loss_func=None, n_jobs=-1,\n param_grid={'clf__learning_rate': [5.0, 2.0, 1.0, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005], 'clf__n_estimators': [10, 25, 50, 100, 250, 500, 1000]},\n pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,\n verbose=1)"
},
"metadata": {}
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "print(\"Best score: %0.3f\" % grid_search.best_score_)\nprint(\"Best parameters:\")\nbest_params = grid_search.best_estimator_.get_params()\nfor param in sorted(params.keys()):\n print(\"param: %s: %r\" % (param, best_params[param]))",
"execution_count": 14,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Best score: 0.593\nBest parameters:\nparam: clf__learning_rate: 1.0\nparam: clf__n_estimators: 50\n"
}
]
},
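{
"metadata": {},
"cell_type": "markdown",
"source": "It is also worth glancing at how sensitive the cross-validated score is to the hyperparameters (an illustrative addition): grid_search.grid_scores_ holds the mean CV score for every parameter combination tried."
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "# illustrative only: the ten best parameter combinations by mean cross-validation score\nfor score_tuple in sorted(grid_search.grid_scores_, key=lambda s: s.mean_validation_score, reverse=True)[:10]:\n    print score_tuple",
"execution_count": null,
"outputs": []
},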
{
"metadata": {
"collapsed": true,
"trusted": false
},
"cell_type": "code",
"source": "pred_label = grid_search.predict(test_matrix)",
"execution_count": 15,
"outputs": []
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "cm = confusion_matrix(test_labels, pred_label)\ncmdf = pd.DataFrame(cm)\ncmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)\ncmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)\n\nprint cmdf\nprint classification_report(test_labels, pred_label)\nprint \"Accuracy on test: %0.3f\" % accuracy_score(test_labels, pred_label)",
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": " predicted 0 predicted 1 predicted 2 predicted 3\nactual 0 3 2 0 0\nactual 1 1 1 0 0\nactual 2 0 0 3 1\nactual 3 0 0 3 6\n precision recall f1-score support\n\n 0 0.75 0.60 0.67 5\n 1 0.33 0.50 0.40 2\n 2 0.50 0.75 0.60 4\n 3 0.86 0.67 0.75 9\n\navg / total 0.71 0.65 0.66 20\n\nAccuracy on test: 0.650\n"
}
]
},
{
"metadata": {
"collapsed": true
},
"cell_type": "markdown",
"source": "## Observations ##"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The results are actually more encouraging than I expected, but also stranger in terms of the ability to tell early lineage events from late lineage events, but to have significant difficulty telling splitting from coalescence. If anything would be distinguishable, I'd have thought early split/late coalescence or early coalescence/late split might be distinguishable. I need to look at more of the actual seriation graphs and see why this might be the case. "
},
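{
"metadata": {},
"cell_type": "markdown",
"source": "As a first step in that direction (illustrative only, reusing test_graphs, test_labels, and pred_label from above), we can draw a few of the test seriation graphs alongside their actual and predicted labels."
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "# illustrative only: draw the first four test graphs with their actual and predicted labels\nfig, axes = plt.subplots(1, 4, figsize=(16, 4))\nfor ax, g, actual, pred in zip(axes, test_graphs, test_labels, pred_label):\n    nx.draw(g, ax=ax, node_size=25)\n    ax.set_title(\"actual %s / predicted %s\" % (actual, pred))",
"execution_count": null,
"outputs": []
}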
],
"metadata": {
"kernelspec": {
"name": "python2",
"display_name": "Python 2",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10",
"file_extension": ".py",
"codemirror_mode": {
"version": 2,
"name": "ipython"
}
},
"latex_envs": {
"eqNumInitial": 0,
"eqLabelWithNumbers": true,
"current_citInitial": 1,
"cite_by": "apalike",
"bibliofile": "biblio.bib"
},
"toc": {
"toc_threshold": 6,
"toc_number_sections": true,
"toc_cell": true,
"toc_window_display": true
},
"gist": {
"id": "",
"data": {
"description": "sc-2/sc-2-classification.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}