{
"metadata": {
"name": "Partial Application in Python"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Partial Application in Python\n",
"-----------------------------\n",
"\n",
"Ben Van Dyke, February 2014"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from __future__ import division\n",
"from __future__ import print_function\n",
"\n",
"import numpy as np\n",
"import functools"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Similarity functions\n",
"# Cosine similarity\n",
"def cosine_sim(a, b):\n",
"    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n",
"\n",
"# Dice's coefficient\n",
"def dice_sim(a, b):\n",
"    return 2 * np.sum(a * b) / (np.sum(np.square(a)) + np.sum(np.square(b)))\n",
"\n",
"# Jaccard's coefficient\n",
"def jaccard_sim(a, b):\n",
"    return np.dot(a, b) / (np.sum(np.square(a)) + np.sum(np.square(b)) - np.sum(a * b))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
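{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a hypothetical example, not part of the original analysis), each of these measures should return 1.0 for a vector compared with itself, and cosine similarity should also return 1.0 for any positive scalar multiple of a vector:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hypothetical toy vectors; b is a scalar multiple of a\n",
"a = np.array([1.0, 2.0, 0.0])\n",
"b = np.array([2.0, 4.0, 0.0])\n",
"print(cosine_sim(a, b))   # parallel vectors -> 1.0\n",
"print(dice_sim(a, a))     # identical vectors -> 1.0\n",
"print(jaccard_sim(a, a))  # identical vectors -> 1.0"
],
"language": "python",
"metadata": {},
"outputs": []
},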
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Term Frequency and query data\n",
"tf = np.genfromtxt('tf.csv', skip_header=1, delimiter=',')\n",
"query = np.array([2, 1, 1, 0, 2, 0, 3, 0])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The term-frequency matrix is represented with row vectors modeling documents and column vectors modeling terms. Each entry is the number of occurrences of a term in the corresponding document. Next, the term frequencies are transformed into new weights using the inverse document frequency (IDF). This weighting reduces the influence of terms that occur in many documents and increases the weight of terms that appear often in only a small number of documents."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Convert term-frequency matrix to TFxIDF\n",
"# Total number of documents\n",
"N = np.shape(tf)[0]\n",
"\n",
"# Document frequency of each term\n",
"n = np.sum(tf > 0, axis=0)\n",
"\n",
"# Inverse document frequency\n",
"idf = np.log2(N/n)\n",
"\n",
"# Multiply original TF matrix by the IDF vector (broadcast across rows)\n",
"tfidf = tf * idf\n",
"\n",
"# Apply IDF weighting to the query vector\n",
"query_idf = query * idf"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
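{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the effect of the IDF weighting, consider a small hypothetical matrix (not drawn from tf.csv): a term appearing in one of four documents gets weight log2(4/1) = 2, while a term appearing in all four documents is zeroed out entirely."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hypothetical 4-document, 2-term matrix\n",
"toy_tf = np.array([[1, 2], [0, 3], [0, 1], [0, 4]])\n",
"toy_idf = np.log2(4 / np.sum(toy_tf > 0, axis=0))\n",
"print(toy_idf)  # term 1: log2(4/1) = 2.0; term 2: log2(4/4) = 0.0"
],
"language": "python",
"metadata": {},
"outputs": []
},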
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Partially applies the transformed query vector to any of the\n",
"# similarity functions defined above\n",
"def sim_func(f):\n",
"    return functools.partial(f, query_idf)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
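{
"cell_type": "markdown",
"metadata": {},
"source": [
"For readers new to functools.partial, a minimal standalone illustration (independent of the retrieval example): partially applying a two-argument function fixes its first argument and returns a one-argument function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hypothetical example of partial application on its own\n",
"def add(x, y):\n",
"    return x + y\n",
"\n",
"add_five = functools.partial(add, 5)\n",
"print(add_five(3))  # 5 + 3 -> 8"
],
"language": "python",
"metadata": {},
"outputs": []
},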
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the similarity function has been partially applied with the query vector, the resulting function can be applied to the TFxIDF matrix. Documents are represented as row vectors, so the function is applied along the rows (axis 1), comparing the query vector's weights to each document vector's weights."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Cosine scores and document rankings\n",
"cos_scores = np.apply_along_axis(sim_func(cosine_sim), 1, tfidf)\n",
"print(\"Cosine similarity scores:\")\n",
"print(np.round(cos_scores,2))\n",
"print()\n",
"print(\"Document retrieval rankings:\")\n",
"print(np.argsort(cos_scores)[::-1] + 1)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Cosine similarity scores:\n",
"[ 0.37 0.94 0.78 0.45 0.15 0.41 0.54 0.2 0.5 0.21]\n",
"\n",
"Document retrieval rankings:\n",
"[ 2 3 7 9 4 6 1 10 8 5]\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Document 2 has the highest similarity score to the query vector and is the top-ranked result. Documents 3, 7, and 9 are the only other documents with similarity scores >= 0.5.\n",
"\n",
"To repeat the analysis with a different similarity function, the only change required is the function passed to np.apply_along_axis."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Jaccard scores and document rankings\n",
"jaccard_scores = np.apply_along_axis(sim_func(jaccard_sim), 1, tfidf)\n",
"print(\"Jaccard similarity scores:\")\n",
"print(np.round(jaccard_scores,2))\n",
"print()\n",
"print(\"Document retrieval rankings:\")\n",
"print(np.argsort(jaccard_scores)[::-1] + 1)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Jaccard similarity scores:\n",
"[ 0.22 0.56 0.44 0.26 0.08 0.25 0.28 0.11 0.3 0.11]\n",
"\n",
"Document retrieval rankings:\n",
"[ 2 3 9 7 4 6 1 10 8 5]\n"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this particular query and corpus, the Jaccard similarity results closely track the cosine similarity results.\n",
"\n",
"In this example, partial application reduces repetitive code, and with a larger corpus it would allow efficient comparison of multiple similarity metrics. Functional programming concepts are often demonstrated with exceedingly simple toy examples; this example aims to stay simple and understandable while demonstrating their utility in an actual application."
]
}
],
"metadata": {}
}
]
}