Timer unit: 1e-06 s

File: sklearn/feature_extraction/text.py
Function: fit_transform at line 290
Total time: 16.3795 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                           def fit_transform(self, raw_documents, y=None):
   291                                               """Learn the vocabulary dictionary and return the count vectors
   292
   293                                               This is more efficient than calling fit followed by transform.
   294
   295                                               Parameters
   296                                               ----------
   297                                               raw_documents: iterable
   298                                                   an iterable which yields either str, unicode or file objects
   299
   300                                               Returns
   301                                               -------
   302                                               vectors: array, [n_samples, n_features]
   303                                               """
   304         1            9      9.0      0.0      if not self.fit_vocabulary:
   305                                                   return self.transform(raw_documents)
   306
   307                                               # result of document conversion to term count dicts
   308         1            8      8.0      0.0      term_counts_per_doc = []
   309         1           32     32.0      0.0      term_counts = Counter()
   310
   311                                               # term counts across entire corpus (count each term maximum once per
   312                                               # document)
   313         1           21     21.0      0.0      document_counts = Counter()
   314
   315         1            6      6.0      0.0      max_df = self.max_df
   316         1            7      7.0      0.0      max_features = self.max_features
   317
   318                                               # TODO: parallelize the following loop with joblib?
   319                                               # (see XXX up ahead)
   320       501         1849      3.7      0.0      for doc in raw_documents:
   321       500       391586    783.2      2.4          term_count_current = Counter(self.analyzer.analyze(doc))
   322       500     15772081  31544.2     96.3          term_counts += term_count_current
   323
   324       500         3185      6.4      0.0          if max_df < 1.0:
   325                                                       document_counts.update(term_count_current)
   326
   327       500         2347      4.7      0.0          term_counts_per_doc.append(term_count_current)
   328
   329         1            4      4.0      0.0      n_doc = len(term_counts_per_doc)
   330
   331                                               # filter out stop words: terms that occur in almost all documents
   332         1            2      2.0      0.0      if max_df < 1.0:
   333                                                   max_document_count = max_df * n_doc
   334                                                   stop_words = set(t for t, dc in document_counts.iteritems()
   335                                                                    if dc > max_document_count)
   336                                               else:
   337         1            5      5.0      0.0          stop_words = set()
   338
   339                                               # list the terms that should be part of the vocabulary
   340         1            3      3.0      0.0      if max_features is None:
   341         1         4668   4668.0      0.0          terms = set(term_counts) - stop_words
   342                                               else:
   343                                                   # extract the most frequent terms for the vocabulary
   344                                                   terms = set()
   345                                                   for t, tc in term_counts.most_common():
   346                                                       if t not in stop_words:
   347                                                           terms.add(t)
   348                                                       if len(terms) >= max_features:
   349                                                           break
   350
   351                                               # convert to a document-token matrix
   352         1        13323  13323.0      0.1      self.vocabulary = dict(((t, i) for i, t in enumerate(terms)))
   353
   354                                               # the term_counts and document_counts might be useful statistics, are
   355                                               # we really sure want we want to drop them? They take some memory but
   356                                               # can be useful for corpus introspection
   357
   358         1       190330 190330.0      1.2      return self._term_count_dicts_to_matrix(term_counts_per_doc)
I used http://packages.python.org/line_profiler/ (which I wrapped as an IPython plugin, as explained here:
http://scikit-learn.sourceforge.net/dev/developers/performance.html#profiling-python-code).
The command I used must have been something like:
from sklearn.datasets import load_20newsgroups
from sklearn.feature_extraction import text
twenty_newsgroups = load_20newsgroups()
%lprun -f text.CountVectorizer.fit_transform text.Vectorizer().fit(twenty_newsgroups.data)
But with your changeset applied, running `text.Vectorizer().fit(twenty_newsgroups.data[:1000])`
would take more than 10 seconds, so the issue was very visible (on my machine at least). If you cannot reproduce it on yours, I will investigate.
Python 2.6.6 as shipped by Debian 6.0.2 for x86-64: 54.2% in line 321 (`term_count_current = Counter(self.analyzer.analyze(doc))`), 9.9% in the following line (where you have 96.3%), and 34.1% in the last line, i.e. in `_term_count_dicts_to_matrix`.
That's 2.6, so it is using my version of `Counter`. I'll see if I can find a Python 2.7 somewhere around here.
I've found the cause. I must have been profiling on Python 2.6. The problem is that `collections.Counter` doesn't actually implement `__iadd__`, unlike my version, so the `+=` falls back to `__add__`, copying the entire left-hand `Counter`, followed by an assignment:
>>> from collections import Counter
>>> a = Counter([1,2,3])
>>> b = a
>>> a += Counter([1,2,3])
>>> a is b
False
I should have been using the `update` method instead. Will push a patch.
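A minimal sketch of the fix (the token lists are hypothetical): `Counter.update` adds counts in place, so each loop iteration costs time proportional to the current document only, rather than to the whole accumulated vocabulary copied by `+=`:

```python
from collections import Counter

# hypothetical tokenized documents standing in for analyzer output
docs_tokens = [["a", "b", "a"], ["b", "c"]]

term_counts = Counter()
before = term_counts
for tokens in docs_tokens:
    # update() mutates term_counts in place; += would build and
    # assign a full copy of the left-hand Counter on every iteration
    term_counts.update(Counter(tokens))

assert term_counts is before  # no copy was ever made
assert term_counts == Counter({"a": 2, "b": 2, "c": 1})
```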
Interesting. Do you think it's intentional, or is this a bug in the `Counter` implementation in Python 2.7? If it is a missing optimization in the stdlib, it would be worth contributing your `__iadd__` implementation to the Python stdlib.
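For illustration, a hedged sketch of what such an `__iadd__` could look like as a `Counter` subclass (this is not the stdlib code, and unlike stdlib `+` it keeps non-positive counts rather than discarding them):

```python
from collections import Counter

class InPlaceCounter(Counter):
    def __iadd__(self, other):
        # add counts in place instead of copying the whole left-hand side,
        # as Counter.__add__ does when += falls back to it
        for key, count in other.items():
            self[key] += count
        return self

a = InPlaceCounter("aab")
b = a
a += Counter("abc")
assert a is b  # += no longer rebinds the name to a fresh copy
assert a == Counter({"a": 3, "b": 2, "c": 1})
```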
I don't know, but it's still the case in the most recent Python 3. I already sent an email to python-dev. I can forward the answer to you if you're interested.
OK, great. Don't worry about forwarding; I will read the archives directly.
BTW, I forgot to say: thank you very much for fixing this :)
Can you post the command used to produce this, so I can try it myself?