@ogrisel
Created October 2, 2011 19:46
CountVectorizer profiling
Timer unit: 1e-06 s
File: sklearn/feature_extraction/text.py
Function: fit_transform at line 290
Total time: 16.3795 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
290 def fit_transform(self, raw_documents, y=None):
291 """Learn the vocabulary dictionary and return the count vectors
292
293 This is more efficient than calling fit followed by transform.
294
295 Parameters
296 ----------
297 raw_documents: iterable
298 an iterable which yields either str, unicode or file objects
299
300 Returns
301 -------
302 vectors: array, [n_samples, n_features]
303 """
304 1 9 9.0 0.0 if not self.fit_vocabulary:
305 return self.transform(raw_documents)
306
307 # result of document conversion to term count dicts
308 1 8 8.0 0.0 term_counts_per_doc = []
309 1 32 32.0 0.0 term_counts = Counter()
310
311 # term counts across entire corpus (count each term maximum once per
312 # document)
313 1 21 21.0 0.0 document_counts = Counter()
314
315 1 6 6.0 0.0 max_df = self.max_df
316 1 7 7.0 0.0 max_features = self.max_features
317
318 # TODO: parallelize the following loop with joblib?
319 # (see XXX up ahead)
320 501 1849 3.7 0.0 for doc in raw_documents:
321 500 391586 783.2 2.4 term_count_current = Counter(self.analyzer.analyze(doc))
322 500 15772081 31544.2 96.3 term_counts += term_count_current
323
324 500 3185 6.4 0.0 if max_df < 1.0:
325 document_counts.update(term_count_current)
326
327 500 2347 4.7 0.0 term_counts_per_doc.append(term_count_current)
328
329 1 4 4.0 0.0 n_doc = len(term_counts_per_doc)
330
331 # filter out stop words: terms that occur in almost all documents
332 1 2 2.0 0.0 if max_df < 1.0:
333 max_document_count = max_df * n_doc
334 stop_words = set(t for t, dc in document_counts.iteritems()
335 if dc > max_document_count)
336 else:
337 1 5 5.0 0.0 stop_words = set()
338
339 # list the terms that should be part of the vocabulary
340 1 3 3.0 0.0 if max_features is None:
341 1 4668 4668.0 0.0 terms = set(term_counts) - stop_words
342 else:
343 # extract the most frequent terms for the vocabulary
344 terms = set()
345 for t, tc in term_counts.most_common():
346 if t not in stop_words:
347 terms.add(t)
348 if len(terms) >= max_features:
349 break
350
351 # convert to a document-token matrix
352 1 13323 13323.0 0.1 self.vocabulary = dict(((t, i) for i, t in enumerate(terms)))
353
354 # the term_counts and document_counts might be useful statistics, are
355 # we really sure want we want to drop them? They take some memory but
356 # can be useful for corpus introspection
357
358 1 190330 190330.0 1.2 return self._term_count_dicts_to_matrix(term_counts_per_doc)
@larsmans commented Oct 3, 2011

Can you post the command used to produce this, so I can try it myself?

@ogrisel commented Oct 3, 2011

The command I used must have been something like:

 from sklearn.datasets import load_20newsgroups
 from sklearn.feature_extraction import text
 twenty_newsgroups = load_20newsgroups()
 %lprun -f text.CountVectorizer.fit_transform text.Vectorizer().fit(twenty_newsgroups.data)

But with your changeset applied, running text.Vectorizer().fit(twenty_newsgroups.data[:1000]) took more than 10 seconds, so the issue was very visible (on my machine at least). If you cannot reproduce it on yours, I will investigate.
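
A rough standalone equivalent of that session, using line_profiler's Python API rather than the %lprun magic (just a sketch: it assumes fetch_20newsgroups and CountVectorizer are available, since the load_20newsgroups / Vectorizer names used above may differ across scikit-learn versions):

 from line_profiler import LineProfiler
 from sklearn.datasets import fetch_20newsgroups
 from sklearn.feature_extraction.text import CountVectorizer

 docs = fetch_20newsgroups(subset='train').data[:1000]

 profiler = LineProfiler()
 profiler.add_function(CountVectorizer.fit_transform)  # profile this function line by line
 profiler.runcall(CountVectorizer().fit_transform, docs)
 profiler.print_stats()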

@larsmans commented Oct 3, 2011

Python 2.6.6 as shipped by Debian 6.0.2 for x86-64: 54.2% of the time in line 321 (term_count_current = Counter(self.analyzer.analyze(doc))), 9.9% in the following line (where you have 96.3%), and 34.1% in the last line, i.e. in _term_count_dicts_to_matrix.

That's 2.6, so using my version of Counter. I'll see if I can find a Python 2.7 somewhere around here.

@larsmans commented Oct 3, 2011

I've found the cause. I must have been profiling on Python 2.6. The problem is that collections.Counter doesn't actually implement __iadd__ the way my version does, so the += falls back to __add__, which copies the entire left-hand Counter and then rebinds the name:

>>> from collections import Counter
>>> a = Counter([1,2,3])
>>> b = a
>>> a += Counter([1,2,3])
>>> a is b
False

I should have been using the update method instead. Will push a patch.
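
Concretely, the fix inside fit_transform presumably amounts to something like this (a sketch based on the profiled code above, not the actual patch):

 # accumulate the corpus-wide counts in place instead of term_counts += ...
 term_counts = Counter()
 term_counts_per_doc = []
 for doc in raw_documents:
     term_count_current = Counter(self.analyzer.analyze(doc))
     term_counts.update(term_count_current)   # in-place, no copy of the accumulator
     term_counts_per_doc.append(term_count_current)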

@ogrisel commented Oct 3, 2011

Interesting. Do you think it's intentional, or is this a bug in the Counter implementation in Python 2.7? If it's a missing optimization in the stdlib, it would be worth contributing your __iadd__ implementation to the Python stdlib.
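
For reference, such an __iadd__ could be as small as the following (a minimal sketch, not the actual implementation from the changeset; a fully faithful += would also drop non-positive counts afterwards, the way Counter.__add__ does):

 from collections import Counter

 class InPlaceCounter(Counter):
     def __iadd__(self, other):
         self.update(other)   # add counts in place instead of building a new Counter
         return self          # so `a += b` rebinds `a` to the same object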

@larsmans commented Oct 3, 2011

I don't know, but it's still the case in the most recent Python 3. I already sent an email to python-dev. I can forward the answer to you if you're interested.

@ogrisel commented Oct 3, 2011

Ok great. Don't worry about forwarding, I will read the archives directly.

@ogrisel commented Oct 3, 2011

BTW, I forgot to say, thank you very much for fixing this :)
