{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Install and load required packages"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Nothing to be done\n",
"INFO: Nothing to be done\n"
]
}
],
"source": [
"Pkg.add(\"Dataframes\")\n",
"#Pkg.add(\"Optim\") # For L-BFGS <https://github.com/JuliaOpt/Optim.jl#basic-api-introduction>\n",
"Pkg.add(\"NLopt\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"using DataArrays, DataFrames\n",
"using NLopt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO: Load the original Adult dataset CSV and preprocess it right here\n",
"- one-hot encoding"
]
},
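{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal, hypothetical sketch of the one-hot encoding step. It assumes a DataFrame `df_raw` holding the raw Adult CSV with string-valued categorical columns (missing values already recoded as the string \"missing\"); it is *not* the script actually used to produce `adult.processed.data` and `adult.processed.test`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hypothetical one-hot encoding sketch (not the preprocessing actually used).\n",
"# Assumes df_raw is a DataFrame whose categorical columns hold strings, with\n",
"# missing values already recoded as the literal string \"missing\".\n",
"function one_hot(df::DataFrame, col::Symbol)\n",
"    out = DataFrame()\n",
"    for lev in unique(df[col])\n",
"        newcol = symbol(string(col, \"_\", lev))\n",
"        out[newcol] = Int[v == lev ? 1 : 0 for v in df[col]]\n",
"    end\n",
"    return out\n",
"end\n",
"\n",
"# Usage sketch:\n",
"# df_encoded = hcat(one_hot(df_raw, :workclass), one_hot(df_raw, :education))"
]
},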
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Load the \"Adult\" dataset\n",
"The Adult dataset is from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/).\n",
"I have preprocessed the data. In categorica data missing data is handled as just another category."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"15-element Array{Symbol,1}:\n",
" :age \n",
" :workclass \n",
" :fnlwgt \n",
" :education \n",
" :education_num \n",
" :marital_status\n",
" :occupation \n",
" :relationship \n",
" :race \n",
" :sex \n",
" :capital_gain \n",
" :capital_loss \n",
" :hours_per_week\n",
" :native_country\n",
" :classification"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#df_data = readtable(\"data/adult.data\", nastrings = [\"\", \"NA\", \"?\"]);\n",
"#df_test = readtable(\"data/adult.test\", nastrings = [\"\", \"NA\", \"?\"]);\n",
"df_data = readtable(\"data/adult.processed.data\");\n",
"df_test = readtable(\"data/adult.processed.test\");\n",
"names(df_data)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"3-element DataArrays.DataArray{Int64,1}:\n",
" 25\n",
" 38\n",
" 28"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test[ 1:3, :age ] # Selecting subset of rows, by column name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Set parameters\n",
"$K$ will be set to the number of prototypes"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"13"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Size or original matrix -1 for target column and -1 for sensitive column\n",
"sensitive_column_name = :sex # Sensitive column is \"gender\" as in the paper\n",
"classification_column_name = :classification\n",
"K = size(df_data,2) - 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Auxiliary helper functions"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"partition (generic function with 6 methods)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# A nice Unicode named summation function\n",
"∑(from::Integer, to::Integer, inner::Function) = sum(inner, colon(from,to))\n",
"# Test:\n",
"#f(x, k) = x*k\n",
"#∑(1, 3, (k) -> f(1, k))\n",
"\n",
"# Partition a matrix two, according to given indices or indicator vector.\n",
"# Your matrices need to be column-major, as this is the Julia memory layout.\n",
"function partition{T<:Integer}(x::Vector, indices::Vector{T})\n",
" return x[ indices ], x[ setdiff(1:length(X), indices) ]\n",
"end\n",
"function partition{T<:Integer}(X::Matrix, indices::Vector{T})\n",
" return X[ :, indices ], X[ :, setdiff(1:size(X,2), indices) ]\n",
"end\n",
"function partition{T<:Integer,U<:Any}(X::SharedArray{U,2}, indices::Vector{T})\n",
" return X[ :, indices ], X[ :, setdiff(1:size(X,2), indices) ]\n",
"end\n",
"function partition{T<:Bool}(x::Vector, indicator::Vector{T})\n",
" return partition(x, find(indicator))\n",
"end\n",
"function partition{T<:Bool}(X::Matrix, indicator::Vector{T})\n",
" return partition(X, find(indicator))\n",
"end\n",
"function partition{T<:Bool,U<:Any}(X::SharedArray{U,2}, indicator::Vector{T})\n",
" return partition(X, find(indicator))\n",
"end\n",
"# Test:\n",
"# @which partition([:first, :second, :third, :fourth], [true, false, true, false])\n",
"# @which partition([:first, :second, :third, :fourth], [1,3])"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# Code for the model\n",
"\n",
"## Definitions\n",
"We will differ a bit from the definitions in the paper. Notably in the definition of the random variables $Z$. The definitions we use are: TODO MAKE THESE MATCH THE PAPER, THEN TELL WHAT IS CHANGED\n",
"- $\\mathbf{X}$ denotes the entire data set, a $(N \\times D)$ matrix. The rows of the matrix are the feature vectors $\\mathbf{x}_n$ representing attributes of an individual. $\\mathbf{X}$ contains neither the classification information (target) column, nor the sensitive column.\n",
"- $S$ is a binary random variable representing whether or not a given individual is a member of the \"protected set\". For the user of the algorithm, this is a decision that is done before running the algorithm.\n",
"- $\\mathbf{X}_{train}$ denotes the training set.\n",
"- $\\mathbf{X}_{test}$ denotes the test set.\n",
"- $\\mathbf{X}^+$ denotes the subset of individuals that are members of the \"protected group\" i.e. individuals for whom $S=1$. Similarly $\\mathbf{X}^-$ denotes the subset of individuals for whom $S=0$.\n",
"- Define $\\mathbf{X}_{train}^+$, $\\mathbf{X}_{train}^-$, $\\mathbf{X}_{test}^+$ and $\\mathbf{X}_{test}^-$ similarly as above.\n",
"- $d$ is a distance measure on $\\mathbf{X}$ (e.g. euclidean distance).\n",
"- $Z$ is a random integer from the set $\\left\\{1,\\dots,K\\right\\}$.\n",
"- $Y$ is a binary random variable representing the classification decision (we consider binary classification only).\n",
"\n",
"Let $Z$ be a random integer from the set $\\left\\{1,\\dots,K\\right\\}$. Now we can denote the probability that a datapoint $\\mathbf{x}$ maps to a particular prototype $k$ with $\\mathbb{P}(Z=k \\mid \\mathbf{x})$ i.e. given the datapoint $\\mathbf{x}$, the probability that $Z$, the index of the prototype, is $k$.\n",
"\n",
"## Definitions in code\n",
"\n",
"From the definitions above, we map some to code and also define additional stuff.\n",
"\n",
"Julia is Column-Majored, so our matrices will be altered accordingly.\n",
"\n",
"- $\\mathbf{X}$ is just `𝐗`, but $(D \\times N)$ instead of $(N \\times D)$.\n",
"- $\\mathbf{X}_{train}$ is `𝐗train`, $\\mathbf{X}_{test}$ is `𝐗test`\n",
"- Grouped versions are `𝐗⁺train`, `𝐗⁻train`, `𝐗⁺test`, and `𝐗⁻test` respectively\n",
"- $S$ maps to different vectors `S_<someset>` containing the sensitive column for `<someset>`, e.g. `S_𝐗`\n",
"- $d$ is defined as a lambda function `d`\n",
"- $Z$ is replaced by $\\mathbf{Z}$, a matrix of probability vectors $\\mathbf{z}$\n",
"- $Y$ TODO\n",
"\n",
"Additionally, denote the tuple of prototypes $\\mathbf{V} = \\left(\\mathbf{v}_1,...,\\mathbf{v}_K\\right)$. Since a single prototype $\\mathbf{v}_k$ is a vector of length $D$, $\\mathbf{V}$ can be expressed as a ($D \\times K$) matrix. This is our optimization variable."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done."
]
}
],
"source": [
"# Symbols: 𝐗 ⁺ ⁻ ∑ 𝐕 𝐱\n",
"\n",
"# Sensitive indices for training and test data\n",
"S_𝐗train = convert(Array{Bool}, df_data[ sensitive_column_name ])\n",
"S_𝐗test = convert(Array{Bool}, df_test[ sensitive_column_name ])\n",
"\n",
"# Classification vectors for training and test data\n",
"𝐲train = convert(Array, df_data[ classification_column_name ])\n",
"𝐲test = convert(Array, df_test[ classification_column_name ])\n",
"\n",
"# Drop sensitive and classification columns\n",
"idxs_left = setdiff(names(df_data), [sensitive_column_name, classification_column_name])\n",
"𝐗train = transpose(convert(Matrix, df_data[idxs_left]))\n",
"𝐗test = transpose(convert(Matrix, df_test[idxs_left]))\n",
"\n",
"# Standardize the features with mean 0 variance 1\n",
"# TODO: tell why, exponentiation goes quickly out of hand, e^-800 is already NaN on Float64\n",
"train_mean = mean(𝐗train,2)[:]\n",
"train_var = var(𝐗train,2)[:]\n",
"𝐗train = 𝐗train .- train_mean # substraction of vector\n",
"𝐗train = 𝐗train ./ train_var # division by vector\n",
"𝐗test = 𝐗test .- train_mean # same mean as in training, on purpose\n",
"𝐗test = 𝐗test ./ train_var # same variance as in training, on purpose\n",
"\n",
"#𝐗test .- train_mean\n",
"\n",
"# Reconstruct full dataset\n",
"𝐗 = hcat(𝐗train, 𝐗test)\n",
"S_𝐗 = vcat(S_𝐗train, S_𝐗test)\n",
"𝐲 = vcat(𝐲train, 𝐲test)\n",
"\n",
"# Dimensions\n",
"D = size(𝐗, 1)\n",
"N = size(𝐗, 2)\n",
"Ntrain = size(𝐗train, 2)\n",
"Ntest = size(𝐗test, 2)\n",
"\n",
"### Distance function\n",
"# Lambda\n",
"d = (𝐚::Vector, 𝐛::Vector) -> vecnorm(𝐚 - 𝐛) # Euclidean distance\n",
"# Non-lambda is slightly faster for calculations, but has to be defined\n",
"# for all processes with @everywhere\n",
"@everywhere de(𝐚::Vector{Float64}, 𝐛::Vector{Float64}) = vecnorm(𝐚 - 𝐛)\n",
"\n",
"### Optimization variables\n",
"# Main optimization variable, matrix holding the prototype vectors.\n",
"# Initialized to random Float64 matrix, normal distribution 0 mean 1 variance\n",
"𝐕 = randn(D, K)\n",
"# Weights or \"prototype label predictions\" (probabilities)\n",
"𝐰 = rand(K) # floats in [0,1)\n",
"\n",
"### Hyperparameters\n",
"A = Dict(:z=> 1000, :x=> 0.0001, :y=> 0.1)\n",
"\n",
"print(\"Done.\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1.9488022835672973,-5.088141178187719,1.9488022835672973,-5.088141178187719,3.100483403138496,-2.3361137936564247)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maximum(𝐗train), minimum(𝐗train), maximum(𝐗test), minimum(𝐗test), maximum(𝐕), minimum(𝐕)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Divide training and test sets to groups according to whether the individuals are \"protected\" or not."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"(\"Training:\",(13,21790),(13,10771),\"Test:\",(13,10860),(13,5421))"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(𝐗⁺train, 𝐗⁻train) = partition(𝐗train, S_𝐗train)\n",
"(𝐗⁺test, 𝐗⁻test) = partition(𝐗test, S_𝐗test)\n",
"\"Training:\",size(𝐗⁺train), size(𝐗⁻train), \"Test:\", size(𝐗⁺test), size(𝐗⁻test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mapping $\\mathbf{X} \\rightarrow \\mathbf{Z}$\n",
"Now we can define a mapping from the original dataset $\\mathbf{X}$ to probabilities via the *softmax function*. [Wikipedia](https://en.wikipedia.org/wiki/Softmax_function):\n",
"> Softmax function \"squashes\" a $K$-dimensional vector $\\mathbf{z}$ of arbitrary real values to a $K$-dimensional vector $\\sigma(\\mathbf{z})$ of real values in the range $[0, 1]$ that add up to 1.\n",
"\n",
"Most notably `softmax` returns a probability vector. We will define a modified version that maps a $D$-dimensional vector $\\mathbf{x}$ to a $K$-dimensional vector $\\sigma(\\mathbf{x})$, i.e. the mapping won't necessarily preserve the dimensionality of $\\mathbf{x}$.\n",
"\n",
"Also from [Wikipedia](https://en.wikipedia.org/wiki/Multinomial_logistic_regression):\n",
"> $$\\operatorname{softmax}(k,x_1,\\ldots,x_n) = \\frac{e^{x_k}}{\\sum_{i=1}^n e^{x_i}}$$\n",
"> is referred to as the [*softmax function*](https://en.wikipedia.org/wiki/Softmax_function). The reason is that the effect of exponentiating the values $x_1,\\ldots,x_n$ is to exaggerate the differences between them. As a result, $\\operatorname{softmax}(k,x_1,\\ldots,x_n)$ will return a value close to 0 whenever $x_k$ is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a weighted average that behaves as a smooth function (which can be conveniently differentiated, etc.) and which approximates the [indicator function](https://en.wikipedia.org/wiki/Indicator_function).\n",
"\n",
"So we define, as in the paper equation (2), $$\\mathbb{P}(Z=k \\mid \\mathbf{x}) = \\frac{e^{-d(\\mathbf{x}, \\mathbf{v}_k)}}{\\sum_{j=1}^K e^{-d(\\mathbf{x}, \\mathbf{v}_j)}}$$\n",
"where\n",
"- $\\mathbb{P}(Z=k \\mid \\mathbf{x})$ is described in [definitions](#Definitions)\n",
"- $\\mathbf{x}$ is the datapoint\n",
"- $\\mathbf{v}_k$ is a vector associated with the $k$th prototype\n",
"- $d$ is a distance measure between $\\mathbf{x}$ and $\\mathbf{v}_k$ (e.g. the euclidean distance)\n",
"\n",
"This means that since we have replaced $x_k$ in the softmax with the negative distance between $\\mathbf{x}$ and prototype $\\mathbf{v}_k$, the softmax returns a value close to 0 whenever the distance from $\\mathbf{x}$ to the prototype $\\mathbf{v}_k$ is significantly higher than $\\min_{j\\in\\{1,\\dots,K\\}}\\, d(\\mathbf{x}, \\mathbf{v}_i)$, and close to 1 when applied to the minimum value.\n",
"\n",
"\"Mapping from X to Z\" in the paper means mapping the vector $\\mathbf{x}$ to a probability vector $\\mathbf{z}$ of length $K$ via the softmax function. These probability vectors are then used directly for training the classifier."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**In code** this means that $\\mathbb{P}(Z=k \\mid \\mathbf{x})$ is represented by a function taking\n",
"- the data point $\\mathbf{x}$ which is a `Vector` of length $D$\n",
"- $(D \\times K)$ `Matrix` of prototypes $\\mathbf{V}$ containing all $K$ prototypes $\\mathbf{v}_k$ i.e. each prototype is a `Vector` (remember, Column-Major order on matrices)\n",
"- a distance measure on $\\mathbf{X}$\n",
"\n",
"and returning\n",
"- a `Vector` $\\mathbf{z}$ of length $K$ representing a [probability vector](https://en.wikipedia.org/wiki/Probability_vector), where each value $z_k$ of the probability vector $\\mathbf{z}$ tells how probable it is that $\\mathbf{x}$ maps to $\\mathbf{v}_k$. Since $\\mathbf{z}$ is a probability vector, $\\sum_{i=k}^K z_k = 1$.\n",
"\n",
"We will name this function `softmax` and define it as follows:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"softmax_euclidean_par (generic function with 1 method)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# TODO: explain, Give in 𝐗, get 𝐙; give in 𝐱, get 𝐳\n",
"# TODO: clean up duplicate code\n",
"\n",
"### FOR VECTORS\n",
"\n",
"# This is same as Eq (2) in paper.\n",
"function softmax_dist{T<:Number}(𝐱::Vector{T}, 𝐕::Matrix{T}, distanceMeasure::Function)\n",
" K = size(𝐕, 2)\n",
" res = Vector{Float64}(K)\n",
" denominator = Float64(0.0)\n",
" # Use one loop to calculate both numerator and denominator\n",
" for k in 1:K\n",
" res[k] = exp(- distanceMeasure(𝐱, 𝐕[:,k]) )\n",
" denominator += res[k]\n",
" end\n",
" denom = inv(denominator)\n",
" res .* denom\n",
"end\n",
"\n",
"function softmax_euclidean{T<:Number}(𝐱::Vector{T}, 𝐕::Matrix{T})\n",
" K = size(𝐕, 2)\n",
" res = Vector{Float64}(K)\n",
" denominator = Float64(0.0)\n",
" # Use one loop to calculate both numerator and denominator\n",
" for k in 1:K\n",
" res[k] = exp(- vecnorm(𝐱 - 𝐕[:,k]) )\n",
" denominator += res[k]\n",
" end\n",
" denom = inv(denominator)\n",
" res .* denom\n",
"end\n",
"\n",
"### FOR MATRICES\n",
"\n",
"function softmax_dist{T<:Number}(𝐗::Matrix{T}, 𝐕::Matrix{T}, distanceMeasure::Function)\n",
" N = size(𝐗, 2)\n",
" K = size(𝐕, 2)\n",
" # Preallocate result matrix, no need to zero it\n",
" res = Matrix{Float64}(K, N)\n",
" for n in 1:N\n",
" res[:,n] = softmax_dist(𝐗[:,n], 𝐕, distanceMeasure)\n",
" end\n",
" res\n",
"end\n",
"\n",
"# Parallel version\n",
"function softmax_dist_par{T<:Number}(𝐗::Matrix{T}, 𝐕::Matrix{T}, distanceMeasure::Function)\n",
" nprocs()==CPU_CORES || addprocs(CPU_CORES-1) \n",
" N = size(𝐗, 2)\n",
" K = size(𝐕, 2)\n",
" # Preallocate result matrix, no need to zero it\n",
" res = SharedArray(Float64, (K, N))\n",
" @sync @parallel for n in 1:N\n",
" for k in 1:K\n",
" res[k,n] = exp(- distanceMeasure(𝐗[:,n], 𝐕[:,k]) )\n",
" end\n",
" res[:,n] = res[:,n] .* inv(sum(res[:,n]))\n",
" end\n",
" res\n",
"end\n",
"\n",
"function softmax_euclidean{T<:Number}(𝐗::Matrix{T}, 𝐕::Matrix{T})\n",
" N = size(𝐗, 2)\n",
" K = size(𝐕, 2)\n",
" # Preallocate result matrix, no need to zero it\n",
" res = Matrix{Float64}(K, N)\n",
" for n in 1:N\n",
" denominator = Float64(0.0)\n",
" # Use one loop to calculate both numerator and denominator\n",
" for k in 1:K\n",
" res[k,n] = exp(- vecnorm(𝐗[:,n] - 𝐕[:,k]) )\n",
" denominator += res[k,n]\n",
" end\n",
" res[:,n] = res[:,n] .* inv(denominator)\n",
" end\n",
" res\n",
"end\n",
"\n",
"# Parallel version\n",
"function softmax_euclidean_par{T<:Number}(𝐗::Matrix{T}, 𝐕::Matrix{T})\n",
" nprocs()==CPU_CORES || addprocs(CPU_CORES-1)\n",
" N = size(𝐗, 2)\n",
" K = size(𝐕, 2)\n",
" # Preallocate result matrix, no need to zero it\n",
" res = SharedArray(Float64, (K, N))\n",
" @sync @parallel for n in 1:N\n",
" for k in 1:K\n",
" res[k,n] = exp(- vecnorm(𝐗[:,n] - 𝐕[:,k]) )\n",
" end\n",
" res[:,n] = res[:,n] .* inv(sum(res[:,n]))\n",
" end\n",
" res\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# For development\n",
"𝐙pre = softmax_euclidean_par(𝐗train, 𝐕)\n",
"𝐙pre⁺ = softmax_euclidean_par(𝐗⁺train, 𝐕)\n",
"𝐙pre⁻ = softmax_euclidean_par(𝐗⁻train, 𝐕);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This approach is akin to using a funky [multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) to \"predict the prototype\" (category) where a data point $\\mathbf{x}$ maps to.\n",
"Wikipedia:\n",
"> These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items which cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. Multinomial logit regression is a particular solution to the classification problem that assumes that a linear combination of the observed features and some problem-specific parameters can be used to determine the probability of each particular outcome of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parts of the optimization objective\n",
"\n",
"### $L_z$ &mdash; statistical parity\n",
"In code, we will denote $L_z$ with function `Lz`."
]
},
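{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, the quantity computed below (Eqs. (6)-(7) of the paper, as implemented here) is\n",
"$$M_k^{\\pm} = \\frac{1}{\\left|\\mathbf{X}^{\\pm}\\right|} \\sum_{\\mathbf{x}_n \\in \\mathbf{X}^{\\pm}} \\mathbb{P}(Z=k \\mid \\mathbf{x}_n), \\qquad L_z = \\sum_{k=1}^{K} \\left| M_k^{+} - M_k^{-} \\right|$$\n",
"i.e. the mean prototype-membership probabilities are computed separately for the protected and non-protected groups, and $L_z$ sums their absolute differences over the prototypes."
]
},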
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Lz (generic function with 3 methods)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### \"NAIVE\" VERSION FOR UNDERSTANDING THE IMPLEMENTATION\n",
"\n",
"function LzNaive{T<:Number}(𝐗⁺::Matrix{T}, 𝐗⁻::Matrix{T}, 𝐕::Matrix{T}, dist::Function)\n",
" # Operate on matrices and take mean from sample dimension N\n",
" meanp = mean( softmax_dist_par(𝐗⁺, 𝐕, dist), 2 ) # Eq (6)\n",
" meann = mean( softmax_dist_par(𝐗⁻, 𝐕, dist), 2 ) # Similarly for M_k^-\n",
" sum(abs(meanp - meann)) # Eq (7), sum is from k=1 to K\n",
"end\n",
"\n",
"### VERSIONS TAKING IN THE DATA SET AND PROTOTYPES\n",
"# Optionally a distance measure function can be passed as an argument\n",
"\n",
"function Lz{T<:Number}(𝐗⁺::Matrix{T}, 𝐗⁻::Matrix{T}, 𝐕::Matrix{T})\n",
" return Lz(sdata(softmax_euclidean_par(𝐗⁺, 𝐕)), sdata(softmax_euclidean_par(𝐗⁻, 𝐕)))\n",
"end\n",
"\n",
"function Lz{T<:Number}(𝐗⁺::Matrix{T}, 𝐗⁻::Matrix{T}, 𝐕::Matrix{T}, dist::Function)\n",
" return Lz(sdata(softmax_dist_par(𝐗⁺, 𝐕, dist)), sdata(softmax_dist_par(𝐗⁻, 𝐕, dist)))\n",
"end\n",
"\n",
"### VERSION FOR PRECALCULATED 𝐙⁺ and 𝐙⁻\n",
"# Note we have only a version for matrices. This is because during performance\n",
"# testing I noticed that\n",
"#\n",
"# ZZZp = sdata(𝐙shared⁺)\n",
"# ZZZn = sdata(𝐙shared⁻)\n",
"# Lz(ZZZp, ZZZn)\n",
"#\n",
"# is faster than\n",
"#\n",
"# Lz(𝐙shared⁺, 𝐙shared⁻)\n",
"#\n",
"# So use the Matrix version always and if necessary lift the matrices out\n",
"# of the SharedArray with sdata().\n",
"#\n",
"# TODO: is there way to make a faster parallel version?\n",
"\n",
"function Lz{T<:Number}(𝐙⁺::Matrix{T}, 𝐙⁻::Matrix{T})\n",
" # Operate on matrices and take mean from sample dimension N\n",
" meanp = mean( 𝐙⁺, 2 ) # Eq (6)\n",
" meann = mean( 𝐙⁻, 2 ) # Similarly for M_k^-\n",
" sum(abs(meanp - meann)) # Eq (7), sum is from k=1 to K\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.11900938943793998"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Test\n",
"LZtrain = Lz(𝐗⁺train, 𝐗⁻train, 𝐕)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $L_x$ &mdash; information loss\n",
"In code, we will denote $L_x$ with function `Lx`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Symbols: 𝐗 ⁺ ⁻ ∑ 𝐕 𝐱 𝐲"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note .* elementwise multiplication of softmax_dist() and V, there is no \\cdot in the paper in Eq (9), dot product would return a scalar."
]
},
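{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, the reconstruction loss computed below (Eqs. (8)-(9) of the paper, as implemented here) is\n",
"$$\\hat{\\mathbf{x}}_n = \\sum_{k=1}^{K} \\mathbb{P}(Z=k \\mid \\mathbf{x}_n)\\, \\mathbf{v}_k, \\qquad L_x = \\sum_{n=1}^{N} (\\mathbf{x}_n - \\hat{\\mathbf{x}}_n) \\cdot (\\mathbf{x}_n - \\hat{\\mathbf{x}}_n)$$\n",
"i.e. each datapoint is reconstructed as a probability-weighted combination of the prototypes, and $L_x$ is the total squared reconstruction error."
]
},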
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Lx (generic function with 3 methods)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### NAIVE VERSION FOR UNDERSTANDING THE IMPLEMENTATION\n",
"\n",
"function LxNaive{T<:Number,U<:Number}(𝐗::Matrix{T}, 𝐕::Matrix{U}, dist::Function)\n",
" D = size(𝐗, 1)\n",
" N = size(𝐗, 2)\n",
" K = size(𝐕, 2)\n",
" 𝐗hat = zeros(Float64, (D,N))\n",
" sum = Float64(0.0)\n",
" for n in 1:N\n",
" for k in 1:K # Eq (9)\n",
" 𝐗hat[:,n] = 𝐗hat[:,n] + (softmax_dist(𝐗[:,n], 𝐕, dist) .* 𝐕[:,k])\n",
" end\n",
" sum += (𝐗[:,n] - 𝐗hat[:,n]) ⋅ (𝐗[:,n] - 𝐗hat[:,n]) # Eq (8)\n",
" end\n",
" return sum, 𝐗hat\n",
"end\n",
"\n",
"### VERSIONS TAKING IN THE DATA SET AND PROTOTYPES\n",
"# Optionally a distance measure function can be passed as an argument\n",
"\n",
"function Lx{T<:Number,U<:Number}(𝐗::Matrix{T}, 𝐕::Matrix{U})\n",
" return Lx(softmax_euclidean_par(𝐗, 𝐕), 𝐗, 𝐕)\n",
"end\n",
"\n",
"function Lx{T<:Number,U<:Number}(𝐗::Matrix{T}, 𝐕::Matrix{U}, dist::Function)\n",
" return Lx(softmax_dist_par(𝐗, 𝐕, dist), 𝐗, 𝐕)\n",
"end\n",
"\n",
"### VERSION FOR PRECALCULATED 𝐙\n",
"\n",
"function Lx{T<:Number}(𝐙::SharedArray{T,2}, 𝐗::Matrix{T}, 𝐕::Matrix{T})\n",
" D = size(𝐕, 1)\n",
" K = size(𝐕, 2)\n",
" N = size(𝐙, 2)\n",
" # Keep 𝐙 as SharedArray, will be faster than taking sdata() when fed to the following @parallel loop\n",
" sum = @parallel (+) for n in 1:N # Eq (8)\n",
" 𝐱hat_n = zeros(Float64, D)\n",
" for k in 1:K # Eq (9)\n",
" 𝐱hat_n = 𝐱hat_n + (𝐙[:,n] .* 𝐕[:,k]) # We are constructing a vector of length D\n",
" end\n",
" (𝐗[:,n] - 𝐱hat_n) ⋅ (𝐗[:,n] - 𝐱hat_n) # \"simple squared error\"\n",
" end\n",
" return sum\n",
"end\n",
"\n",
"# TODO: test if making V into sharedarray increases performance, probably not since V is usually small"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"139035.65252282936"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Test\n",
"LXtrain = Lx(𝐗train, 𝐕)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $L_y$ &mdash; prediction accuracy\n",
"In code, we will denote $L_y$ with function `Ly`.\n",
"\n",
"Essentially here we are letting the optimization pick both the prototypes (i.e. feature vectors) and their predictions (i.e. labels), and the predictions don't have to be discrete 0 and 1, but can be from the range $[0,1]$ and thus themselves can be viewed as probabilities. E.g. let's say that for prototype $v_k$ it's prediction $w_k = 0.82$, then \"there is a 82% chance prototype $\\mathbf{v}_k$ gets label 1\""
]
},
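{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, the prediction loss computed below (Eqs. (10)-(11) of the paper, as implemented here) is\n",
"$$\\hat{y}_n = \\sum_{k=1}^{K} \\mathbb{P}(Z=k \\mid \\mathbf{x}_n)\\, w_k, \\qquad L_y = \\sum_{n=1}^{N} -y_n \\log \\hat{y}_n - (1 - y_n) \\log (1 - \\hat{y}_n)$$\n",
"i.e. each prediction is the probability-weighted combination of the prototype predictions $w_k$, and $L_y$ is the usual cross-entropy (log loss)."
]
},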
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Ly (generic function with 3 methods)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### NAIVE VERSION TO HELP UNDERSTAND THE IMPLEMENTATION\n",
"\n",
"function LyNaive{T1<:Number,T2<:Number,T3<:Number}(\n",
" 𝐗::Matrix{T1}, 𝐕::Matrix{T1}, 𝐲::Vector{T2}, 𝐰::Vector{T3}, dist::Function\n",
" )\n",
" D = size(𝐗, 1)\n",
" N = size(𝐗, 2)\n",
" K = size(𝐕, 2)\n",
" 𝐲hat = zeros(Float64, N)\n",
" sum = Float64(0.0)\n",
" # Replace 𝐲hat in Eq (10) with Eq (11), then you get this for loop\n",
" for n in 1:N\n",
" 𝐙_n = softmax_dist(𝐗[:,n], 𝐕, dist) # Vector of length K\n",
" for k in 1:K\n",
" 𝐲hat[n] = 𝐲hat[n] + (𝐙_n[k] * 𝐰[k])\n",
" end\n",
" # The following line could be replaced with\n",
" # if 𝐲[n] == 1\n",
" # sum -= log(𝐲hat[n])\n",
" # else # 𝐲[n] == 0\n",
" # sum -= log(1 - 𝐲hat[n])\n",
" # end\n",
" sum += -𝐲[n] * log(𝐲hat[n]) - (1 - 𝐲[n]) * log(1 - 𝐲hat[n])\n",
" end\n",
" #return sum, 𝐲hat\n",
" return sum\n",
"end\n",
"\n",
"### VERSIONS TAKING IN THE DATA SET AND PROTOTYPES\n",
"# Optionally a distance measure function can be passed as an argument\n",
"\n",
"function Ly{T1<:Number,T2<:Number,T3<:Number}(\n",
" 𝐗::Matrix{T1}, 𝐕::Matrix{T1}, 𝐲::Vector{T2}, 𝐰::Vector{T3}\n",
" )\n",
" # 𝐙 = softmax_euclidean_par(𝐗, 𝐕)\n",
" return Ly(softmax_euclidean_par(𝐗, 𝐕), 𝐲, 𝐰)\n",
"end\n",
"\n",
"function Ly{T1<:Number,T2<:Number,T3<:Number}(\n",
" 𝐗::Matrix{T1}, 𝐕::Matrix{T1}, 𝐲::Vector{T2}, 𝐰::Vector{T3}, dist::Function\n",
" )\n",
" # 𝐙 = softmax_dist_par(𝐗, 𝐕, dist)\n",
" return Ly(softmax_dist_par(𝐗, 𝐕, dist), 𝐲, 𝐰)\n",
"end\n",
"\n",
"### VERSION FOR PRECALCULATED 𝐙\n",
"\n",
"# function Ly{T1<:Number,T2<:Number,T3<:Number}(\n",
"# 𝐙::Matrix{T1}, 𝐲::Vector{T2}, 𝐰::Vector{T3}\n",
"# )\n",
"# # Copy to shared memory\n",
"# 𝐙shared = convert(SharedArray{T1, 2}, 𝐙)\n",
"# return Ly(𝐙shared, 𝐲, 𝐰)\n",
"# end\n",
"\n",
"function Ly{T1<:Number,T2<:Number,T3<:Number}(\n",
" 𝐙::SharedArray{T1,2}, 𝐲::Vector{T2}, 𝐰::Vector{T3}\n",
" )\n",
" N = size(𝐙, 2)\n",
" # Keep 𝐙 as SharedArray, will be faster than taking sdata() when fed to the following @parallel loop\n",
" sum = @parallel (+) for n in 1:N # Eq (10)\n",
" yhat_n = 𝐙[:,n] ⋅ 𝐰 # Eq (11)\n",
" - 𝐲[n] * log(yhat_n) - (1 - 𝐲[n]) * log(1 - yhat_n)\n",
" end\n",
" return sum\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"23914.776443370003"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Test\n",
"LYtrain = Ly(𝐗train, 𝐕, 𝐲train, 𝐰)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TODO Alphas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optimization objective function"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"objective_pre (generic function with 1 method)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Overall objective function\n",
"objective_euclidean(𝐗, 𝐗⁺, 𝐗⁻, 𝐕, 𝐲, 𝐰, A) = A[:z]*Lz(𝐗⁺, 𝐗⁻, 𝐕) + A[:x]*Lx(𝐗, 𝐕) + A[:y]*Ly(𝐗, 𝐕, 𝐲, 𝐰)\n",
"objective_dist(𝐗, 𝐗⁺, 𝐗⁻, 𝐕, 𝐲, 𝐰, A, dist) = A[:z]*Lz(𝐗⁺, 𝐗⁻, 𝐕, dist) + A[:x]*Lx(𝐗, 𝐕, dist) + A[:y]*Ly(𝐗, 𝐕, 𝐲, 𝐰, dist)\n",
"function objective_pre(𝐗::Matrix, S::Vector{Bool}, 𝐕::Matrix, 𝐲::Vector, 𝐰::Vector, A::Dict)\n",
" # Calculate 𝐙 and partitions once\n",
" 𝐙 = softmax_euclidean_par(𝐗, 𝐕)\n",
" (𝐙⁺, 𝐙⁻) = partition(𝐙, S)\n",
" # Use functions that accept precalculated 𝐙\n",
" return A[:z]*Lz(𝐙⁺, 𝐙⁻) + A[:x]*Lx(𝐙, 𝐗, 𝐕) + A[:y]*Ly(𝐙, 𝐲, 𝐰)\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2524.3905990272233"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"objective_euclidean(𝐗train, 𝐗⁺train, 𝐗⁻train, 𝐕, 𝐲train, 𝐰, A)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2524.3905990272233"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"objective_pre(𝐗train, S_𝐗train, 𝐕, 𝐲train, 𝐰, A)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# @time for i in 1:10\n",
"# objective_euclidean(𝐗train, 𝐗⁺train, 𝐗⁻train, 𝐕, 𝐲train, 𝐰, A)\n",
"# end\n",
"# @time for i in 1:10\n",
"# objective_pre(𝐗train, S_𝐗train, 𝐕, 𝐲train, 𝐰, A)\n",
"# end\n",
"#4.119981 seconds (1.24 M allocations: 95.438 MB, 0.42% gc time)\n",
"#2.420592 seconds (599.02 k allocations: 87.681 MB, 0.72% gc time)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1320.035987782403"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"objective_euclidean(𝐗test, 𝐗⁺test, 𝐗⁻test, 𝐕, 𝐲test, 𝐰, A)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"myfunc (generic function with 2 methods)"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count = 0 # keep track of # function evaluations\n",
"function myfunc(Params::Vector, grad::Vector)\n",
" All = reshape(Params, (D, K+1))\n",
" my_𝐕 = All[:,1:K]\n",
" my_𝐰 = All[:,K+1]\n",
" # TODO Alphas\n",
" global 𝐗train, S_𝐗train, 𝐲train, A\n",
" val = objective_pre(𝐗train, S_𝐗train, my_𝐕, 𝐲train, my_𝐰, A)\n",
" \n",
" global count\n",
" count::Int += 1\n",
" if count % 100 == 0\n",
" println(\"f_$count(...) = $val\")\n",
" end\n",
" return val\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"f_1(...) = 3793.735108509088\n"
]
},
{
"data": {
"text/plain": [
"3793.735108509088"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"myfunc(hcat(𝐕, 𝐰), [])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test optimization run"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"f_100(...) = 3498.2998577857156\n",
"f_200(...) = 3417.2079460503082\n",
"f_300(...) = 3172.0276400993926\n",
"Got 3073.780749562011 after 300 iterations (returned MAXEVAL_REACHED)\n"
]
}
],
"source": [
"count = 0 # keep track of # function evaluations\n",
"# TODO: need to constrain \n",
"#opt = Opt(:LD_LBFGS, K*(K+1))\n",
"opt = Opt(:LN_NELDERMEAD, K*(K+1))\n",
"#lower_bounds!(opt, [-Inf, 0.])\n",
"#xtol_rel!(opt, 1e-4)\n",
"maxeval!(opt, 300) # TODO rounds number increase\n",
"maxtime!(opt, 240) # 60 seconds\n",
"min_objective!(opt, myfunc)\n",
"(minf,minx,ret) = optimize(opt, vec(hcat(𝐕, 𝐰)))\n",
"println(\"Got $minf after $count iterations (returned $ret)\")"
]
},
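{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch (not part of the original run), the optimized parameter vector `minx` returned above could be evaluated on the test set roughly as follows. The 0.5 decision threshold and the discrimination measure used below are assumptions, not necessarily the evaluation protocol of the paper."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: evaluate the optimized representation on the test set.\n",
"# Assumes minx from the NLopt run above; the 0.5 threshold is an assumption.\n",
"All_opt = reshape(minx, (D, K+1))\n",
"𝐕opt = All_opt[:, 1:K]    # optimized prototypes\n",
"𝐰opt = All_opt[:, K+1]    # optimized prototype label predictions\n",
"𝐙test = sdata(softmax_euclidean_par(𝐗test, 𝐕opt))\n",
"𝐲hat = vec(𝐙test' * 𝐰opt) # Eq (11): each prediction is 𝐳_n ⋅ 𝐰\n",
"pred = map(p -> p >= 0.5 ? 1 : 0, 𝐲hat)\n",
"accuracy = mean(pred .== 𝐲test)\n",
"# Discrimination: difference in positive-prediction rates between the groups\n",
"(pred⁺, pred⁻) = partition(pred, S_𝐗test)\n",
"discrimination = abs(mean(pred⁺) - mean(pred⁻))\n",
"accuracy, discrimination"
]
},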
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-2.3361137936564247"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minimum(𝐕)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "LoadError",
"evalue": "LoadError: syntax: extra token \")\" after end of expression\nwhile loading In[25], in expression starting on line 22",
"output_type": "error",
"traceback": [
"LoadError: syntax: extra token \")\" after end of expression\nwhile loading In[25], in expression starting on line 22",
""
]
}
],
"source": [
"let\n",
" global objective_wrap\n",
" round = 0\n",
" prevres = 0.0\n",
" function objective_wrap(Params::Vector{Float64})\n",
" if round > 1\n",
" return prevres\n",
" end\n",
" \n",
" All = reshape(Params, (D, K+1))\n",
" my_𝐕 = All[:,1:K]\n",
" my_𝐰 = All[:,K+1]\n",
" # TODO Alphas\n",
" prevres = objective_pre(𝐗train, S_𝐗train, my_𝐕, 𝐲train, my_𝐰, A)\n",
" end\n",
"end\n",
"\n",
"# Call L-BFGS\n",
"@elapsed result = #optimize(\n",
" objective_wrap,\n",
" vec(hcat(𝐕, 𝐰)), # hcat returns a concatenated matrix, vec makes it a vector\n",
" method = :l_bfgs)#,\n",
" #iterations = 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optimization, running the algorithm\n",
"Hyperparameters for the objective function.\n",
"In the paper they use grid search to find the parameters. The sets defined here are the same as in the paper."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Dict{Symbol,Any} with 3 entries:\n",
" :y => 0.1\n",
" :z => 100\n",
" :x => 0.01"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Sets of hyperparameters as in paper, for grid search\n",
"gridA = Dict(:z => Set([0.1, 0.5, 1.0, 5.0, 10.0]), :x => Set([0, 0.01]), :y => Set([0.1, 0.5, 1.0, 5.0, 10.0]))\n",
"# An example of selected hyperparameters, for development\n",
"#A = Dict(:z => 0.01, :x => 0.5, :y => 1.0)\n",
"A = Dict(:z => 100, :x => 0.01, :y => 0.1)"
]
},
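{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a hypothetical sketch of how the grid search over `gridA` could be wired up with the NLopt setup above. It reuses `myfunc` (which reads the global `A`) and only records the final training objective and parameters per combination; the paper selects hyperparameters on held-out data instead, so treat this purely as scaffolding."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hypothetical grid-search scaffolding over the hyperparameter sets in gridA.\n",
"# Records the final training objective and parameters per combination;\n",
"# model selection should be done on held-out data, as in the paper.\n",
"function run_optimization(A_try::Dict)\n",
"    global A = A_try # myfunc reads the global A\n",
"    opt = Opt(:LN_NELDERMEAD, D*(K+1))\n",
"    maxeval!(opt, 300)\n",
"    min_objective!(opt, myfunc)\n",
"    (minf, minx, ret) = optimize(opt, vec(hcat(𝐕, 𝐰)))\n",
"    return minf, minx\n",
"end\n",
"\n",
"results = Dict()\n",
"for az in gridA[:z], ax in gridA[:x], ay in gridA[:y]\n",
"    A_try = Dict(:z => az, :x => ax, :y => ay)\n",
"    results[(az, ax, ay)] = run_optimization(A_try)\n",
"end\n",
"results"
]
},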
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TODO\n",
"- Overall process with pictures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Problems/Cons/Notes:\n",
"- Let's say that there is a column/feature \"Religion\" in the dataset.\n",
"- Now this paper says we can only say that \"Is a member of protected group\" or \"Is not a member of protected group\".\n",
"- You have to decide what is the \"protected\" group, and what is the \"normal/non protected\" group. You have to decide based on some external criteria who are discriminated against and who are not.\n",
"- Let's say we have a dataset with a feature \"Religion\" and we have 5 different religions represented.\n",
"- Now we have to choose which ones are protected and which ones are not.\n",
"- The problem of course is that some might be in general discriminated against more than others. There is not necessary even split between the different groups that are discriminated against.\n",
"\n",
"- What we would like to say is that \"Religion\" is a sensitive feature, and we should not infer _anything_ from it, regardless what it is.\n",
"\n",
"- Does running the algo multiple times help, changing the binary classification each time? Can we extend it so that $S \\in {1,...,C}$ where $C$ is the number of categories in the sensitive column.\n",
" - We can extend, just split $L_z$ to multiple cases and the optimization is done to all of them. There will be $c = \\frac{(C-1)C}{2} \\approx O(C^2)$ pairs. Whether this is computationally still feasible is another question. In the objective function $L_z$ is replaced by $A_{z_1} \\cdot L_{z_1} + A_{z_2} \\cdot L_{z_2} + \\dots + A_{z_c} \\cdot L_{z_c}$.\n",
"\n",
"- On the current case where $S \\in \\left\\{0,1\\right\\}$ once we have set for which rows $S=1$ and $S=0$, we can flip them around without changing anything. This is because we are using statistical parity. This means that from the algorithm's perspective saying that group0 is non-protected and groups 1..4 are protected is the same as saying group1 is protected and other non-protected."
]
},
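{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of the multi-group extension of $L_z$ described above: it assumes group membership is given as an integer vector $g$ with $g_n \\in \\left\\{1,\\dots,C\\right\\}$ and a precomputed $\\mathbf{Z}$ (e.g. `sdata()` of the output of `softmax_euclidean_par`), and it simply sums the pairwise statistical-parity terms, i.e. the per-pair weights $A_{z_i}$ are all set to 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: statistical parity over C groups instead of 2 (all pair weights = 1).\n",
"# Assumes g is an integer group-membership vector with values in 1:C and\n",
"# 𝐙 is a precomputed (K x N) probability matrix.\n",
"function Lz_multi{T<:Number,I<:Integer}(𝐙::Matrix{T}, g::Vector{I}, C::Integer)\n",
"    means = [ mean(𝐙[:, find(g .== c)], 2) for c in 1:C ]\n",
"    total = 0.0\n",
"    for c1 in 1:C-1, c2 in c1+1:C\n",
"        total += sum(abs(means[c1] - means[c2])) # pairwise Eq (7)\n",
"    end\n",
"    return total\n",
"end"
]
},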
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 0.4.0",
"language": "julia",
"name": "julia-0.4"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "0.4.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}