jestinjoy/gist:a1b04a1427d499b884439a6edb941f38

## gistfile1.txt
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TF-IDF\n",
    "\n",
    "Documentation: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction\n",
    "\n",
    "The examples assume two documents, each containing a single sentence.\n",
    "\n",
    "*Doc1: This is a sample*\n",
    "\n",
    "*Doc2: This is another example*\n",
    "\n",
    "The mapping from each term to its column index in the matrix can be printed using the *vocabulary_* attribute. The default tokenizer drops single-character tokens, so the word *a* does not get a column."
   ]
  },
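  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following example calculates tf-idf from a hand-entered matrix of term counts using **TfidfTransformer**. It sets `smooth_idf=False`, so its values differ from the **TfidfVectorizer** output further down, which uses the default `smooth_idf=True`."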
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.45329466,  0.45329466,  0.76749457,  0.        ,  0.        ],\n",
       "       [ 0.35959372,  0.35959372,  0.        ,  0.6088451 ,  0.6088451 ]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "transformer = TfidfTransformer(smooth_idf=False)\n",
    "\n",
    "counts = [[1,1,1,0,0],\n",
    "          [1,1,0,1,1]]\n",
    "\n",
    "tfidf = transformer.fit_transform(counts)\n",
    "tfidf.toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following example calculates tf-idf directly from the raw sentences using **TfidfVectorizer**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.          0.          0.50154891  0.70490949  0.50154891]\n",
      " [ 0.57615236  0.57615236  0.40993715  0.          0.40993715]]\n",
      "{u'this': 4, u'sample': 3, u'is': 2, u'example': 1, u'another': 0}\n"
     ]
    }
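   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "sent = [\"This is a sample\", \"This is another example\"]\n",
    "# min_df=1 is the default and keeps every term; an integer min_df of 0 may be rejected by newer scikit-learn releases\n",
    "tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)\n",
    "tfidf_matrix = tf.fit_transform(sent)\n",
    "# print(...) with a single argument behaves the same on Python 2 and Python 3\n",
    "print(tfidf_matrix.toarray())\n",
    "print(tf.vocabulary_)"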
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following example prints only the raw term counts, which are used to calculate the tf-idf values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 0 1 1 1]\n",
      " [1 1 1 0 1]]\n",
      "{u'this': 4, u'sample': 3, u'is': 2, u'example': 1, u'another': 0}\n"
     ]
    }
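   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "# min_df=1 is the default and keeps every term; an integer min_df of 0 may be rejected by newer scikit-learn releases\n",
    "vectorizer = CountVectorizer(min_df=1)\n",
    "corpus = [\n",
    "    'This is a sample',\n",
    "    'This is another example'\n",
    "]\n",
    "X = vectorizer.fit_transform(corpus)\n",
    "# Rows are documents, columns follow the indices in vocabulary_\n",
    "print(X.toarray())\n",
    "print(vectorizer.vocabulary_)"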
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
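
The two tf-idf outputs in the notebook differ because the first run passes `smooth_idf=False` while the second uses the default smoothing. The sketch below recomputes both by hand with NumPy, assuming scikit-learn's documented formulas: idf = ln(n / df) + 1 without smoothing, idf = ln((1 + n) / (1 + df)) + 1 with smoothing, followed by l2 normalization of each row. The helper name `tfidf_by_hand` is hypothetical and not part of scikit-learn, and the code assumes every term occurs in at least one document and every document contains at least one term.

```python
import numpy as np


def tfidf_by_hand(counts, smooth_idf=True):
    """Hypothetical helper: tf-idf from raw counts, then l2-normalize each row."""
    counts = np.asarray(counts, dtype=float)
    n_docs = float(counts.shape[0])
    # Document frequency: number of documents in which each term occurs.
    df = (counts > 0).sum(axis=0).astype(float)
    if smooth_idf:
        # Smoothed idf, as if one extra document contained every term.
        idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    else:
        # Unsmoothed idf; assumes df > 0 for every column.
        idf = np.log(n_docs / df) + 1.0
    weighted = counts * idf
    # l2 norm of each row; assumes no all-zero rows.
    norms = np.sqrt((weighted ** 2).sum(axis=1, keepdims=True))
    return weighted / norms


# Count matrix typed in by hand for the TfidfTransformer cell.
transformer_counts = [[1, 1, 1, 0, 0],
                      [1, 1, 0, 1, 1]]
# Count matrix printed by the CountVectorizer cell.
vectorizer_counts = [[0, 0, 1, 1, 1],
                     [1, 1, 1, 0, 1]]

unsmoothed = tfidf_by_hand(transformer_counts, smooth_idf=False)
smoothed = tfidf_by_hand(vectorizer_counts, smooth_idf=True)

print(unsmoothed)
print(smoothed)
```

If the assumed formulas match the installed scikit-learn version, `unsmoothed` should agree with the TfidfTransformer output and `smoothed` with the TfidfVectorizer output, up to rounding.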