
@pilipolio
Created May 28, 2013 21:28
{
"metadata": {
"name": "PlayingWithTheTdIdf"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wikipedia's [page](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [Levente's tutorial p14](https://drive.google.com/a/we7.com/?tab=mo#folders/0BwsmmX4SxUXiS1EycEY3a3VZcU0) :\n",
"\n",
"TFIDF: term frequency\u2013inverse document frequency\n",
"\n",
"$ \\text{tf-idf}_ {ij} = \\text{tf}_ {ij} \\times \\text{idf}_i $\n",
"\n",
"$ \\text{tf}_ {ij} = n_ {ij} / \\sum_j n_ {ij} $ , where $ t_i $ refers to the $i$th term, $ d_j $ denotes the $j$th document.\n",
"NB. : normalizing for the length of the document.\n",
"\n",
"$$ \\text{idf}_i = \\frac{\\log |D|}{1 + | \\{d:t_i \\in d\\} | }$$\n",
"\n",
"where $ |D| $ is the number of documents in the corpus and the denumerator is the number of documents in which the term $t_j$ appeared.\n",
"\n",
"Linked-in's _skills and expertises_ of [Levente](http://hu.linkedin.com/in/toroklev), [Krishna](http://uk.linkedin.com/in/krishnajrao), [Barak](http://uk.linkedin.com/in/barakschiller), [me](http://www.linkedin.com/pub/allain-guillaume/2/233/5ba) and [Miklos](http://uk.linkedin.com/in/miklosparrag) the 28th of May 2013:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"people_to_skills = documents_to_terms = {\n",
" 'Levente': ['Machine Learning', 'Data Mining', 'C++', 'Algorithms', 'Recommender Systems', 'Octave', 'Java'],\n",
" 'Krishna': ['Java', 'Python', 'C#', 'Hibernate', 'XML', 'Software Engineering', 'Agile', 'TDD',\n",
" 'Object Oriented Design', 'Software Development', 'SQL'],\n",
" 'Barak': ['Java', 'OOP', 'Eclipse', 'Python', 'Multithreading', 'Embedded Systems', 'Software Engineering', 'SQL',\n",
" 'Agile'],\n",
" 'Guillaume':['Statistics', 'C#', 'Data Mining', 'Machine Learning', 'Algorithms', 'Python', 'Applied Mathematics'],\n",
" 'Miklos':['Agile', 'Software Development', 'Software Engineering', 'Object Oriented Design', 'Scrum',\n",
" 'XML', 'Python', 'Java']\n",
"}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"documents = sorted([d for d in documents_to_terms.keys()])\n",
"n_D = len(documents)\n",
"import itertools\n",
"all_terms = list(itertools.chain(*[doc_terms for doc_terms in documents_to_terms.values()]))\n",
"terms = sorted(set(all_terms))\n",
"n_T = len(terms)\n",
"\n",
"print '{} unique terms from {} documents with a total of {} terms (sparsity = {}%)'.format(\n",
" n_T, n_D, len(all_terms), 100 * len(all_terms) / (n_D * n_T))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"24 unique terms from 5 documents with a total of 42 terms (sparsity = 35%)\n"
]
}
],
"prompt_number": 43
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tf = np.array([[t in documents_to_terms[d] for d in documents] for t in terms]) \n",
"tf[0:4,:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 44,
"text": [
"array([[ True, False, True, False, True],\n",
" [False, True, False, True, False],\n",
" [False, True, False, False, False],\n",
" [False, True, True, False, False]], dtype=bool)"
]
}
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"idf = np.log(n_D) / (1 + np.sum(tf==1,axis=1))\n",
"from operator import itemgetter\n",
"print sorted(zip(terms, idf), key=itemgetter(1))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[('Java', 0.32188758248682003), ('Python', 0.32188758248682003), ('Agile', 0.40235947810852507), ('Software Engineering', 0.40235947810852507), ('Algorithms', 0.53647930414470013), ('C#', 0.53647930414470013), ('Data Mining', 0.53647930414470013), ('Machine Learning', 0.53647930414470013), ('Object Oriented Design', 0.53647930414470013), ('SQL', 0.53647930414470013), ('Software Development', 0.53647930414470013), ('XML', 0.53647930414470013), ('Applied Mathematics', 0.80471895621705014), ('C++', 0.80471895621705014), ('Eclipse', 0.80471895621705014), ('Embedded Systems', 0.80471895621705014), ('Hibernate', 0.80471895621705014), ('Multithreading', 0.80471895621705014), ('OOP', 0.80471895621705014), ('Octave', 0.80471895621705014), ('Recommender Systems', 0.80471895621705014), ('Scrum', 0.80471895621705014), ('Statistics', 0.80471895621705014), ('TDD', 0.80471895621705014)]\n"
]
}
],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"documents"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 46,
"text": [
"['Barak', 'Guillaume', 'Krishna', 'Levente', 'Miklos']"
]
}
],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"terms"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 34,
"text": [
"['Agile',\n",
" 'Algorithms',\n",
" 'Applied Mathematics',\n",
" 'C#',\n",
" 'C++',\n",
" 'Data Analysis',\n",
" 'Data Mining',\n",
" 'Eclipse',\n",
" 'Embedded Systems',\n",
" 'Hibernate',\n",
" 'Java',\n",
" 'Machine Learning',\n",
" 'Multithreading',\n",
" 'OOP',\n",
" 'Object Oriented Design',\n",
" 'Octave',\n",
" 'Python',\n",
" 'Recommender Systems',\n",
" 'SQL',\n",
" 'Scrum',\n",
" 'Software Development',\n",
" 'Software Engineering',\n",
" 'Statistics',\n",
" 'TDD',\n",
" 'XML']"
]
}
],
"prompt_number": 34
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
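
The notebook defines tf-idf as the product of tf and idf but stops after computing idf. As a rough sketch of that last step, the snippet below combines the two using only the standard library. It assumes the length-normalized tf from the markdown cell and the notebook's own idf variant, log|D| / (1 + document frequency); the function name `tf_idf` and the variables `doc_freq` and `scores` are illustrative and do not appear in the notebook.

```python
from __future__ import division, print_function

import math

# Same data as the notebook's first code cell.
people_to_skills = documents_to_terms = {
    'Levente': ['Machine Learning', 'Data Mining', 'C++', 'Algorithms',
                'Recommender Systems', 'Octave', 'Java'],
    'Krishna': ['Java', 'Python', 'C#', 'Hibernate', 'XML', 'Software Engineering',
                'Agile', 'TDD', 'Object Oriented Design', 'Software Development', 'SQL'],
    'Barak': ['Java', 'OOP', 'Eclipse', 'Python', 'Multithreading', 'Embedded Systems',
              'Software Engineering', 'SQL', 'Agile'],
    'Guillaume': ['Statistics', 'C#', 'Data Mining', 'Machine Learning', 'Algorithms',
                  'Python', 'Applied Mathematics'],
    'Miklos': ['Agile', 'Software Development', 'Software Engineering',
               'Object Oriented Design', 'Scrum', 'XML', 'Python', 'Java'],
}


def tf_idf(documents_to_terms):
    """Return {document: {term: score}} (hypothetical helper, not from the notebook)."""
    n_docs = len(documents_to_terms)

    # Document frequency: number of documents that contain each term.
    doc_freq = {}
    for doc_terms in documents_to_terms.values():
        for term in set(doc_terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # idf variant used in the notebook: log|D| / (1 + document frequency).
    idf = {term: math.log(n_docs) / (1 + df) for term, df in doc_freq.items()}

    scores = {}
    for doc, doc_terms in documents_to_terms.items():
        n_terms = len(doc_terms)
        scores[doc] = {}
        for term in set(doc_terms):
            tf = doc_terms.count(term) / n_terms  # normalized by document length
            scores[doc][term] = tf * idf[term]
    return scores


scores = tf_idf(documents_to_terms)
for doc in sorted(scores):
    # Highest-scoring skills first; ties are broken alphabetically.
    top = sorted(scores[doc].items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    print(doc, [term for term, _ in top])
```

Because every skill is listed at most once per person, tf reduces to 1 / (number of skills listed), so within one person the ranking is driven entirely by idf: skills shared by nobody else come first.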