pilipolio/PlayingWIthTheTfIdf.ipynb

## PlayingWIthTheTfIdf.ipynb
{
 "metadata": {
  "name": "PlayingWithTheTdIdf"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Wikipedia's [page](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [Levente's tutorial p14](https://drive.google.com/a/we7.com/?tab=mo#folders/0BwsmmX4SxUXiS1EycEY3a3VZcU0):\n",
      "\n",
      "TFIDF: term frequency\u2013inverse document frequency\n",
      "\n",
      "$ \\text{tf-idf}_ {ij} = \\text{tf}_ {ij} \\times \\text{idf}_i $\n",
      "\n",
      "$ \\text{tf}_ {ij} = n_ {ij} / \\sum_k n_ {kj} $, where $ n_ {ij} $ is the number of occurrences of the $i$th term $ t_i $ in the $j$th document $ d_j $.\n",
      "NB: the denominator is the total number of terms in document $ d_j $, which normalizes for the length of the document.\n",
      "\n",
      "$$ \\text{idf}_i = \\frac{\\log |D|}{1 + | \\{d:t_i \\in d\\} | }$$\n",
      "\n",
      "where $ |D| $ is the number of documents in the corpus and the denominator is one plus the number of documents in which the term $t_i$ appears. Note that the usual definition takes the logarithm of the whole ratio, $ \\log \\frac{|D|}{1 + | \\{d:t_i \\in d\\} |} $; the variant above divides $ \\log |D| $ by the document count instead.\n",
      "\n",
      "LinkedIn _skills and expertise_ of [Levente](http://hu.linkedin.com/in/toroklev), [Krishna](http://uk.linkedin.com/in/krishnajrao), [Barak](http://uk.linkedin.com/in/barakschiller), [me](http://www.linkedin.com/pub/allain-guillaume/2/233/5ba) and [Miklos](http://uk.linkedin.com/in/miklosparrag) as of the 28th of May 2013:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "people_to_skills = documents_to_terms = {\n",
      "    'Levente': ['Machine Learning', 'Data Mining', 'C++', 'Algorithms', 'Recommender Systems', 'Octave', 'Java'],\n",
      "    'Krishna': ['Java', 'Python', 'C#', 'Hibernate', 'XML', 'Software Engineering', 'Agile', 'TDD',\n",
      "                 'Object Oriented Design', 'Software Development', 'SQL'],\n",
      "    'Barak': ['Java', 'OOP', 'Eclipse', 'Python', 'Multithreading', 'Embedded Systems', 'Software Engineering', 'SQL',\n",
      "                'Agile'],\n",
      "    'Guillaume':['Statistics', 'C#', 'Data Mining', 'Machine Learning', 'Algorithms', 'Python', 'Applied Mathematics'],\n",
      "    'Miklos':['Agile', 'Software Development', 'Software Engineering', 'Object Oriented Design', 'Scrum',\n",
      "                'XML', 'Python', 'Java']\n",
      "}"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 42
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import itertools\n",
      "\n",
      "documents = sorted(documents_to_terms)\n",
      "n_D = len(documents)\n",
      "all_terms = list(itertools.chain.from_iterable(documents_to_terms.values()))\n",
      "terms = sorted(set(all_terms))\n",
      "n_T = len(terms)\n",
      "\n",
      "# Density: share of the n_T x n_D term-document matrix that is non-zero\n",
      "print('{} unique terms from {} documents with a total of {} terms (density = {}%)'.format(\n",
      "    n_T, n_D, len(all_terms), 100 * len(all_terms) // (n_D * n_T)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "24 unique terms from 5 documents with a total of 42 terms (density = 35%)\n"
       ]
      }
     ],
     "prompt_number": 43
    },
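    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "# Boolean term-by-document matrix: tf[i, j] is True when term i appears in document j\n",
      "tf = np.array([[t in documents_to_terms[d] for d in documents] for t in terms])\n",
      "tf[0:4,:]"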
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 44,
       "text": [
        "array([[ True, False,  True, False,  True],\n",
        "       [False,  True, False,  True, False],\n",
        "       [False,  True, False, False, False],\n",
        "       [False,  True,  True, False, False]], dtype=bool)"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from operator import itemgetter\n",
      "\n",
      "# idf as defined above: log|D| / (1 + number of documents containing the term)\n",
      "idf = np.log(n_D) / (1 + tf.sum(axis=1))\n",
      "print(sorted(zip(terms, idf), key=itemgetter(1)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[('Java', 0.32188758248682003), ('Python', 0.32188758248682003), ('Agile', 0.40235947810852507), ('Software Engineering', 0.40235947810852507), ('Algorithms', 0.53647930414470013), ('C#', 0.53647930414470013), ('Data Mining', 0.53647930414470013), ('Machine Learning', 0.53647930414470013), ('Object Oriented Design', 0.53647930414470013), ('SQL', 0.53647930414470013), ('Software Development', 0.53647930414470013), ('XML', 0.53647930414470013), ('Applied Mathematics', 0.80471895621705014), ('C++', 0.80471895621705014), ('Eclipse', 0.80471895621705014), ('Embedded Systems', 0.80471895621705014), ('Hibernate', 0.80471895621705014), ('Multithreading', 0.80471895621705014), ('OOP', 0.80471895621705014), ('Octave', 0.80471895621705014), ('Recommender Systems', 0.80471895621705014), ('Scrum', 0.80471895621705014), ('Statistics', 0.80471895621705014), ('TDD', 0.80471895621705014)]\n"
       ]
      }
     ],
     "prompt_number": 45
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "documents"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 46,
       "text": [
        "['Barak', 'Guillaume', 'Krishna', 'Levente', 'Miklos']"
       ]
      }
     ],
     "prompt_number": 46
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "terms"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
       {
        "output_type": "pyout",
        "prompt_number": 34,
        "text": [
         "['Agile',\n",
         " 'Algorithms',\n",
         " 'Applied Mathematics',\n",
         " 'C#',\n",
         " 'C++',\n",
        " 'Data Mining',\n",
        " 'Eclipse',\n",
        " 'Embedded Systems',\n",
        " 'Hibernate',\n",
        " 'Java',\n",
        " 'Machine Learning',\n",
        " 'Multithreading',\n",
        " 'OOP',\n",
        " 'Object Oriented Design',\n",
        " 'Octave',\n",
        " 'Python',\n",
        " 'Recommender Systems',\n",
        " 'SQL',\n",
        " 'Scrum',\n",
        " 'Software Development',\n",
        " 'Software Engineering',\n",
        " 'Statistics',\n",
        " 'TDD',\n",
        " 'XML']"
       ]
      }
     ],
     "prompt_number": 34
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}
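
The notebook defines tf-idf as the product of tf and idf but stops after computing idf. The following is a minimal sketch of the remaining step, not the author's code. It assumes the same `documents_to_terms` data and keeps the notebook's idf variant, $\log |D| / (1 + \text{document count})$, rather than the more common log-of-ratio form. The names `n`, `tfidf` and `top_terms` are introduced here purely for illustration.

```python
import numpy as np

# Same data as the notebook's documents_to_terms cell.
documents_to_terms = {
    'Levente': ['Machine Learning', 'Data Mining', 'C++', 'Algorithms',
                'Recommender Systems', 'Octave', 'Java'],
    'Krishna': ['Java', 'Python', 'C#', 'Hibernate', 'XML', 'Software Engineering',
                'Agile', 'TDD', 'Object Oriented Design', 'Software Development', 'SQL'],
    'Barak': ['Java', 'OOP', 'Eclipse', 'Python', 'Multithreading', 'Embedded Systems',
              'Software Engineering', 'SQL', 'Agile'],
    'Guillaume': ['Statistics', 'C#', 'Data Mining', 'Machine Learning', 'Algorithms',
                  'Python', 'Applied Mathematics'],
    'Miklos': ['Agile', 'Software Development', 'Software Engineering',
               'Object Oriented Design', 'Scrum', 'XML', 'Python', 'Java'],
}

documents = sorted(documents_to_terms)
terms = sorted({t for skills in documents_to_terms.values() for t in skills})

# n[i, j]: number of times term i occurs in document j (0 or 1 for this data).
n = np.array([[documents_to_terms[d].count(t) for d in documents] for t in terms],
             dtype=float)

# tf[i, j] = n[i, j] / sum_k n[k, j]: each column is divided by its document length.
tf = n / n.sum(axis=0)

# idf variant used in the notebook: log|D| / (1 + number of documents containing the term).
idf = np.log(len(documents)) / (1 + (n > 0).sum(axis=1))

# tf-idf[i, j] = tf[i, j] * idf[i]; idf is broadcast across the document columns.
tfidf = tf * idf[:, np.newaxis]


def top_terms(person, k=3):
    """Return the k terms with the highest tf-idf score for one person (hypothetical helper)."""
    j = documents.index(person)
    order = np.argsort(-tfidf[:, j], kind='stable')  # descending score, ties keep term order
    return [terms[i] for i in order[:k]]


print(top_terms('Levente'))
```

Because every skill appears at most once per person, tf is constant within a document, so the ranking inside one profile is driven entirely by idf: skills that nobody else lists come out on top.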