vijayanandrp/tutorial_with_solutions.md

## tutorial_with_solutions.md

      
    Tutorial: Machine Learning with Text in scikit-learn

Agenda


1. Model building in scikit-learn (refresher)
2. Representing text as numerical data
3. Reading a text-based dataset into pandas
4. Vectorizing our dataset
5. Building and evaluating a model
6. Comparing models
7. Examining a model for further insight
8. Practicing this workflow on another dataset
9. Tuning the vectorizer (discussion)

# for Python 2: use print only as a function
from __future__ import print_function
Part 1: Model building in scikit-learn (refresher)

# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
"Features" are also known as predictors, inputs, or attributes. The "response" is also known as the target, label, or output.
# check the shapes of X and y
print(X.shape)
print(y.shape)
(150L, 4L)
(150L,)

"Observations" are also known as samples, instances, or records.
# examine the first 5 rows of the feature matrix (including the feature names)
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
# examine the response vector
print(y)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

In order to build a model, the features must be numeric, and every observation must have the same features in the same order.
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])
array([1])
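
As a small sketch of that requirement (not part of the original notebook), the snippet below refits the same default `KNeighborsClassifier` and checks that an observation with only three values is rejected. The names `new_observation` and `wrong_shape_rejected` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier()
knn.fit(iris.data, iris.target)

# a new observation must be 2-D: one row, four feature columns
new_observation = [[3, 5, 4, 2]]
prediction = knn.predict(new_observation)
print(iris.target_names[prediction])

# an observation with the wrong number of features is rejected
try:
    knn.predict([[3, 5, 4]])
    wrong_shape_rejected = False
except ValueError:
    wrong_shape_rejected = True
print(wrong_shape_rejected)
```

The order of the values matters as much as their number: the model has no way to detect that two features were swapped.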

Part 2: Representing text as numerical data

# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
From the scikit-learn documentation:

Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.

We will use CountVectorizer to "convert text into a matrix of token counts":
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

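# examine the fitted vocabulary
# note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there (here and below)
vect.get_feature_names()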
[u'cab', u'call', u'me', u'please', u'tonight', u'you']

# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<type 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)

# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())


| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |
From the scikit-learn documentation:

In this scheme, features and samples are defined as follows:


Each individual token occurrence frequency (normalized or not) is treated as a feature.
The vector of all the token frequencies for a given document is considered a multivariate sample.


A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.


We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
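
To make the tokenize-and-count idea concrete, here is a minimal pure-Python sketch of a bag-of-words representation. It is only an approximation of what `CountVectorizer` does with its default settings (lowercasing plus the default `token_pattern`), and the helper name `tokenize` is an assumption made for illustration.

```python
import re
from collections import Counter

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# the default token pattern keeps words of two or more characters,
# so the one-letter token "a" is dropped
token_pattern = re.compile(r'(?u)\b\w\w+\b')

def tokenize(document):
    """Lowercase a document and split it into tokens."""
    return token_pattern.findall(document.lower())

# vocabulary: the sorted set of all tokens seen in the training documents
vocabulary = sorted({token for doc in simple_train for token in tokenize(doc)})

# document-term matrix: one row per document, one column per vocabulary token
rows = []
for doc in simple_train:
    counts = Counter(tokenize(doc))
    rows.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Word order is discarded: each row records only how often every vocabulary token occurs in that document.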

# check the type of the document-term matrix
type(simple_train_dtm)
scipy.sparse.csr.csr_matrix

# examine the sparse matrix contents
print(simple_train_dtm)
  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2

From the scikit-learn documentation:

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).


For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.


In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
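
The following sketch, assuming NumPy and SciPy are available, rebuilds the small document-term matrix from above as a Compressed Sparse Row matrix and prints the three arrays that CSR actually stores. The variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 1, 0, 0, 1, 1],
                  [1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 2, 0, 0]])
sparse = csr_matrix(dense)

print(sparse.nnz)      # number of stored (nonzero) elements
print(sparse.data)     # the nonzero values, row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # where each row starts and ends within data/indices

# fraction of cells that are actually stored
stored_fraction = sparse.nnz / float(dense.size)
print(stored_fraction)
```

In this toy matrix half of the cells are nonzero, so the saving is small. With thousands of vocabulary columns and short messages, the stored fraction drops sharply.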

# example text for model testing
simple_test = ["please don't call me"]
In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]], dtype=int64)

# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())


| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |
Summary:

vect.fit(train) learns the vocabulary of the training data
vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
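
The last point can be checked directly. This sketch refits the vectorizer on the three training messages and uses `build_analyzer()` to show how the test message is tokenized. The names `test_tokens` and `unseen_tokens` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

vect = CountVectorizer()
vect.fit(simple_train)

# the analyzer shows how the test message is tokenized
analyzer = vect.build_analyzer()
test_tokens = analyzer(simple_test[0])
print(test_tokens)

# tokens that were not seen during fit have no column, so they are dropped
unseen_tokens = [token for token in test_tokens if token not in vect.vocabulary_]
print(unseen_tokens)

# the test matrix keeps the six columns learned from the training data
simple_test_dtm = vect.transform(simple_test)
print(simple_test_dtm.shape)
print(simple_test_dtm.toarray())
```

Because "don" never appeared in the training data, it contributes nothing to the test row.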

Part 3: Reading a text-based dataset into pandas

# read file into pandas using a relative path
path = 'data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])
# alternative: read file into pandas from a URL
# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
# sms = pd.read_table(url, header=None, names=['label', 'message'])
# examine the shape
sms.shape
(5572, 2)

# examine the first 10 rows
sms.head(10)


| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |
# examine the class distribution
sms.label.value_counts()
ham     4825
spam     747
Name: label, dtype: int64

# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
# check that the conversion worked
sms.head(10)


| | label | message | label_num |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | ham | Even my brother is not like to speak with me. ... | 0 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | spam | WINNER!! As a valued network customer you have... | 1 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 1 |
# how to define X and y (from the iris data) for use with a MODEL
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)
(150L, 4L)
(150L,)

# how to define X and y (from the SMS data) for use with COUNTVECTORIZER
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)
(5572L,)
(5572L,)

# split X and y into training and testing sets
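# note: use sklearn.cross_validation instead of sklearn.model_selection for scikit-learn 0.17 and earlier
from sklearn.model_selection import train_test_split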
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(4179L,)
(1393L,)
(4179L,)
(1393L,)
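
The split sizes above follow from the default 75%/25% split. The sketch below reproduces them on stand-in data with the same number of rows, assuming scikit-learn 0.18 or later (where `train_test_split` lives in `sklearn.model_selection`), and shows that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data with the same number of rows as the SMS dataset
X = np.arange(5572)
y = X % 2

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape, X_test.shape)

# repeating the split with the same random_state gives identical sets
X_train_again, X_test_again, _, _ = train_test_split(X, y, random_state=1)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```

Splitting before vectorizing matters: the vocabulary is then learned from the training messages only, which mimics how the model will meet unseen text.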

Part 4: Vectorizing our dataset

# instantiate the vectorizer
vect = CountVectorizer()
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
# examine the document-term matrix
X_train_dtm
<4179x7456 sparse matrix of type '<type 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
<1393x7456 sparse matrix of type '<type 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

Part 5: Building and evaluating a model

We will use multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)
Wall time: 3 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
0.98851399856424982

# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
array([[1203,    5],
       [  11,  174]])
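
The accuracy reported above can be recomputed from the four cells of the confusion matrix. This is a plain-arithmetic sketch; the extra quantities (sensitivity, specificity, and the accuracy of always predicting ham) are not computed in the original notebook and are included only for context.

```python
# confusion matrix reported above: rows are true classes, columns are predictions
true_negatives, false_positives = 1203, 5
false_negatives, true_positives = 11, 174

total = true_negatives + false_positives + false_negatives + true_positives

# share of all test messages that were classified correctly
accuracy = (true_negatives + true_positives) / float(total)
# share of spam messages that were caught
sensitivity = true_positives / float(true_positives + false_negatives)
# share of ham messages that were left alone
specificity = true_negatives / float(true_negatives + false_positives)
# accuracy of a baseline that always predicts ham
null_accuracy = (true_negatives + false_positives) / float(total)

print(accuracy, sensitivity, specificity, null_accuracy)
```

Because ham dominates the test set, the always-ham baseline is already fairly accurate, so the per-class rates are more informative than accuracy alone.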

# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]
574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]
3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - “It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object
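
The comparisons `y_test < y_pred_class` and `y_test > y_pred_class` work because the labels are 0 and 1. A more explicit version of the same selection is sketched below on a tiny hypothetical test set (the messages, index values, and predictions are made up for illustration).

```python
import numpy as np
import pandas as pd

# hypothetical test messages with a non-default index, as after train_test_split
X_test = pd.Series(['msg a', 'msg b', 'msg c', 'msg d'], index=[10, 42, 7, 99])
y_test = pd.Series([0, 1, 0, 1], index=X_test.index)
y_pred_class = np.array([1, 1, 0, 0])

# false positives: truly ham (0) but predicted spam (1)
false_positives = X_test[(y_test == 0) & (y_pred_class == 1)]
# false negatives: truly spam (1) but predicted ham (0)
false_negatives = X_test[(y_test == 1) & (y_pred_class == 0)]

# the shorthand used above selects the same rows
shortcut = X_test[y_test < y_pred_class]

print(false_positives)
print(false_negatives)
```

The explicit masks keep working if the labels are ever encoded differently, whereas the inequality shorthand depends on spam being the larger number.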

# example false negative
X_test[3132]
"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,
         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])

# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.98664310005369604
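
AUC depends only on how the predicted probabilities rank the messages, which is why it remains usable even when the probabilities themselves are poorly calibrated. The sketch below uses four made-up scores to show this: AUC equals the fraction of (spam, ham) pairs in which the spam message gets the higher score, and an order-preserving transformation leaves it unchanged.

```python
from sklearn.metrics import roc_auc_score

# made-up labels and scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# fraction of (positive, negative) pairs in which the positive scores higher
positives = [s for s, label in zip(scores, y_true) if label == 1]
negatives = [s for s, label in zip(scores, y_true) if label == 0]
pairs = [(p, n) for p in positives for n in negatives]
pairwise_auc = sum(p > n for p, n in pairs) / float(len(pairs))

auc = roc_auc_score(y_true, scores)

# squaring the scores changes their values but not their order
auc_squared = roc_auc_score(y_true, [s ** 2 for s in scores])

print(pairwise_auc, auc, auc_squared)
```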

Part 6: Comparing models

We will compare multinomial Naive Bayes with logistic regression:

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
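
As a minimal illustration of the logistic function mentioned in the quote, the sketch below maps a few real-valued scores to probabilities. In a fitted binary model the score `z` would be the intercept plus the weighted sum of the token counts; here the values of `z` are arbitrary.

```python
import math

def logistic(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# arbitrary scores, for illustration only
for z in (-4, -2, 0, 2, 4):
    print(z, round(logistic(z), 4))
```

A score of zero corresponds to a probability of 0.5, which is the default threshold between the two classes.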

# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
Wall time: 39 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([ 0.01269556,  0.00347183,  0.00616517, ...,  0.03354907,
        0.99725053,  0.00157706])

# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.9877961234745154

# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.99368176123143015

Part 7: Examining a model for further insight

We will examine our trained Naive Bayes model to calculate the approximate "spamminess" of each token.
# store the vocabulary of X_train
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)
7456

# examine the first 50 tokens
print(X_train_tokens[0:50])
[u'00', u'000', u'008704050406', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090201529', u'07090298926', u'07123456789', u'07732584351', u'07734396839', u'07742676969', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07808247860', u'07808726822', u'07815296484', u'07821230901', u'07880867867', u'0789xxxxxxx', u'07946746291', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705']

# examine the last 50 tokens
print(X_train_tokens[-50:])
[u'yer', u'yes', u'yest', u'yesterday', u'yet', u'yetunde', u'yijue', u'ym', u'ymca', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'youdoing', u'youi', u'youphone', u'your', u'youre', u'yourjob', u'yours', u'yourself', u'youwanna', u'yowifes', u'yoyyooo', u'yr', u'yrs', u'ything', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'zac', u'zaher', u'zealand', u'zebra', u'zed', u'zeros', u'zhong', u'zindgi', u'zoe', u'zoom', u'zouk', u'zyada', u'\xe8n', u'\u3028ud']

# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_
array([[  0.,   0.,   0., ...,   1.,   1.,   1.],
       [  5.,  23.,   2., ...,   0.,   0.,   0.]])

# rows represent classes, columns represent tokens
nb.feature_count_.shape
(2L, 7456L)
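
To see where these counts come from, the sketch below trains on four hypothetical messages (the texts and labels are made up) and checks that each row of `feature_count_` equals the column sums of the document-term matrix for that class.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical messages and labels (0 = ham, 1 = spam), for illustration only
messages = ['call you tonight', 'call me a cab', 'claim your prize', 'prize call now']
labels = np.array([0, 0, 1, 1])

vect = CountVectorizer()
dtm = vect.fit_transform(messages)
nb = MultinomialNB().fit(dtm, labels)

# summing the document-term matrix rows of each class reproduces feature_count_
dense = dtm.toarray()
ham_token_count = dense[labels == 0].sum(axis=0)
spam_token_count = dense[labels == 1].sum(axis=0)

print(nb.feature_count_.shape)
print(np.array_equal(nb.feature_count_[0], ham_token_count))
print(np.array_equal(nb.feature_count_[1], spam_token_count))
print(nb.class_count_)
```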

# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
ham_token_count
array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count
array([  5.,  23.,   2., ...,   0.,   0.,   0.])

# create a DataFrame of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.head()


| token | ham | spam |
|---|---|---|
| 00 | 0 | 5 |
| 000 | 0 | 23 |
| 008704050406 | 0 | 2 |
| 0121 | 0 | 1 |
| 01223585236 | 0 | 1 |
# examine 5 random DataFrame rows
tokens.sample(5, random_state=6)


| token | ham | spam |
|---|---|---|
| very | 64 | 2 |
| nasty | 1 | 1 |
| villa | 0 | 1 |
| beloved | 1 | 0 |
| textoperator | 0 | 2 |
# Naive Bayes counts the number of observations in each class
nb.class_count_
array([ 3617.,   562.])

Before we can calculate the "spamminess" of each token, we need to avoid dividing by zero and account for the class imbalance.
# add 1 to ham and spam counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)


| token | ham | spam |
|---|---|---|
| very | 65 | 3 |
| nasty | 2 | 2 |
| villa | 1 | 2 |
| beloved | 2 | 1 |
| textoperator | 1 | 3 |
# convert the ham and spam counts into frequencies (divide by the number of messages in each class)
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)


| token | ham | spam |
|---|---|---|
| very | 0.017971 | 0.005338 |
| nasty | 0.000553 | 0.003559 |
| villa | 0.000276 | 0.003559 |
| beloved | 0.000553 | 0.001779 |
| textoperator | 0.000276 | 0.005338 |
# calculate the ratio of spam-to-ham for each token
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)


| token | ham | spam | spam_ratio |
|---|---|---|---|
| very | 0.017971 | 0.005338 | 0.297044 |
| nasty | 0.000553 | 0.003559 | 6.435943 |
| villa | 0.000276 | 0.003559 | 12.871886 |
| beloved | 0.000553 | 0.001779 | 3.217972 |
| textoperator | 0.000276 | 0.005338 | 19.307829 |
# examine the DataFrame sorted by spam_ratio
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('spam_ratio', ascending=False)


| token | ham | spam | spam_ratio |
|---|---|---|---|
| claim | 0.000276 | 0.158363 | 572.798932 |
| prize | 0.000276 | 0.135231 | 489.131673 |
| 150p | 0.000276 | 0.087189 | 315.361210 |
| tone | 0.000276 | 0.085409 | 308.925267 |
| guaranteed | 0.000276 | 0.076512 | 276.745552 |
| 18 | 0.000276 | 0.069395 | 251.001779 |
| cs | 0.000276 | 0.065836 | 238.129893 |
| www | 0.000553 | 0.129893 | 234.911922 |
| 1000 | 0.000276 | 0.056940 | 205.950178 |
| awarded | 0.000276 | 0.053381 | 193.078292 |
| 150ppm | 0.000276 | 0.051601 | 186.642349 |
| uk | 0.000553 | 0.099644 | 180.206406 |
| 500 | 0.000276 | 0.048043 | 173.770463 |
| ringtone | 0.000276 | 0.044484 | 160.898577 |
| 000 | 0.000276 | 0.042705 | 154.462633 |
| mob | 0.000276 | 0.042705 | 154.462633 |
| co | 0.000553 | 0.078292 | 141.590747 |
| collection | 0.000276 | 0.039146 | 141.590747 |
| valid | 0.000276 | 0.037367 | 135.154804 |
| 2000 | 0.000276 | 0.037367 | 135.154804 |
| 800 | 0.000276 | 0.037367 | 135.154804 |
| 10p | 0.000276 | 0.037367 | 135.154804 |
| 8007 | 0.000276 | 0.035587 | 128.718861 |
| 16 | 0.000553 | 0.067616 | 122.282918 |
| weekly | 0.000276 | 0.033808 | 122.282918 |
| tones | 0.000276 | 0.032028 | 115.846975 |
| land | 0.000276 | 0.032028 | 115.846975 |
| http | 0.000276 | 0.032028 | 115.846975 |
| national | 0.000276 | 0.030249 | 109.411032 |
| 5000 | 0.000276 | 0.030249 | 109.411032 |
| ... | ... | ... | ... |
| went | 0.012718 | 0.001779 | 0.139912 |
| ll | 0.052530 | 0.007117 | 0.135494 |
| told | 0.013824 | 0.001779 | 0.128719 |
| feel | 0.013824 | 0.001779 | 0.128719 |
| gud | 0.014100 | 0.001779 | 0.126195 |
| cos | 0.014929 | 0.001779 | 0.119184 |
| but | 0.090683 | 0.010676 | 0.117731 |
| amp | 0.015206 | 0.001779 | 0.117017 |
| something | 0.015206 | 0.001779 | 0.117017 |
| sure | 0.015206 | 0.001779 | 0.117017 |
| ok | 0.061100 | 0.007117 | 0.116488 |
| said | 0.016312 | 0.001779 | 0.109084 |
| morning | 0.016865 | 0.001779 | 0.105507 |
| yeah | 0.017694 | 0.001779 | 0.100562 |
| lol | 0.017694 | 0.001779 | 0.100562 |
| anything | 0.017971 | 0.001779 | 0.099015 |
| my | 0.150401 | 0.014235 | 0.094646 |
| doing | 0.019077 | 0.001779 | 0.093275 |
| way | 0.019630 | 0.001779 | 0.090647 |
| ask | 0.019630 | 0.001779 | 0.090647 |
| already | 0.019630 | 0.001779 | 0.090647 |
| too | 0.021841 | 0.001779 | 0.081468 |
| come | 0.048936 | 0.003559 | 0.072723 |
| later | 0.030688 | 0.001779 | 0.057981 |
| lor | 0.032900 | 0.001779 | 0.054084 |
| da | 0.032900 | 0.001779 | 0.054084 |
| she | 0.035665 | 0.001779 | 0.049891 |
| he | 0.047000 | 0.001779 | 0.037858 |
| lt | 0.064142 | 0.001779 | 0.027741 |
| gt | 0.064971 | 0.001779 | 0.027387 |

7456 rows × 3 columns

# look up the spam_ratio for a given token
tokens.loc['dating', 'spam_ratio']
83.667259786476862
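
The whole spam_ratio calculation can be reproduced by hand for a single token. The sketch below uses the raw counts of two of the sampled tokens shown earlier ("very" and "textoperator") and the class sizes from `nb.class_count_`. The helper name `spam_ratio` is an assumption made for illustration.

```python
# class sizes reported by nb.class_count_ above
n_ham_messages, n_spam_messages = 3617.0, 562.0

def spam_ratio(ham_count, spam_count):
    """Add 1 to each count, normalize by class size, and divide spam by ham."""
    ham_frequency = (ham_count + 1) / n_ham_messages
    spam_frequency = (spam_count + 1) / n_spam_messages
    return spam_frequency / ham_frequency

# raw counts taken from the sampled rows shown earlier
very_ratio = spam_ratio(ham_count=64, spam_count=2)
textoperator_ratio = spam_ratio(ham_count=0, spam_count=2)

print(round(very_ratio, 6), round(textoperator_ratio, 6))
```

A ratio below 1 means the token is relatively more common in ham, and a ratio above 1 means it is relatively more common in spam. Tokens with very small counts get extreme ratios, so the ranking is only approximate.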

Part 8: Practicing this workflow on another dataset

Please open the exercise.ipynb notebook (or the exercise.py script).
Part 9: Tuning the vectorizer (discussion)

Thus far, we have been using the default parameters of CountVectorizer:
# show default parameters for CountVectorizer
vect
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:

stop_words: string {'english'}, list, or None (default)

If 'english', a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
If None, no stop words will be used.


# remove English stop words
vect = CountVectorizer(stop_words='english')
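
A quick way to see the effect is to compare vocabularies with and without the stop word list. The two messages below are hypothetical, chosen so that "the" and "and" appear.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical messages, chosen so that "the" and "and" appear
messages = ['claim the prize and the ringtone', 'the cab and the villa']

default_vect = CountVectorizer().fit(messages)
stop_vect = CountVectorizer(stop_words='english').fit(messages)

# vocabulary_ maps each token to its column index
print(sorted(default_vect.vocabulary_))
print(sorted(stop_vect.vocabulary_))
```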

ngram_range: tuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different n-grams to be extracted.
All values of n such that min_n <= n <= max_n will be used.


# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
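
Applied to the three example messages from Part 2, this setting adds two-word sequences to the vocabulary alongside the single words. The sketch below is only an illustration on that toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

unigram_vect = CountVectorizer().fit(simple_train)
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(simple_train)

# vocabulary_ maps each token (or token pair) to its column index
print(len(unigram_vect.vocabulary_))
print(len(bigram_vect.vocabulary_))
print(sorted(bigram_vect.vocabulary_))
```

Note that "me cab" becomes a 2-gram because the one-letter token "a" is removed before the pairs are formed. On a real corpus the vocabulary grows much faster than it does here.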

max_df: float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
If float, the parameter represents a proportion of documents.
If integer, the parameter represents an absolute count.


# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)

min_df: float in range [0.0, 1.0] or int, default=1

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
If float, the parameter represents a proportion of documents.
If integer, the parameter represents an absolute count.


# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
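
Both thresholds are based on document frequency, meaning the number of documents that contain a term, not the total number of occurrences. The sketch below applies them to the three example messages from Part 2, where "call" appears in all three documents, "me" in two, and every other token in one.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# keep only terms that appear in at least 2 documents
min_df_vect = CountVectorizer(min_df=2).fit(simple_train)
print(sorted(min_df_vect.vocabulary_))

# drop terms that appear in more than 50% of the documents
max_df_vect = CountVectorizer(max_df=0.5).fit(simple_train)
print(sorted(max_df_vect.vocabulary_))
```

"please" occurs twice, but only within one message, so its document frequency is 1 and `min_df=2` removes it.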
Guidelines for tuning CountVectorizer:

Use your knowledge of the problem and the text, and your understanding of the tuning parameters, to help you decide what parameters to tune and how to tune them.
Experiment, and let the data tell you the best approach!