@codez266
Last active December 13, 2017 16:26
Profile of the _cross_score runs in revscoring when run with multilabel random forests on the WikiProjects labeled dataset.
Wed Dec 13 04:56:17 2017 stats [93/1765]
3844350339 function calls (3843797676 primitive calls) in 30453.783 seconds
Ordered by: cumulative time
List reduced from 260 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 30453.783 30453.783 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:209(cross_validate)
1 1.160 1.160 30453.733 30453.733 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:242(_cross_score)
1 0.912 0.912 29308.131 29308.131 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:249(<listcomp>)
11148 171.564 0.015 29307.219 2.629 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/sklearn.py:159(score)
33442 1665.965 0.050 29019.662 0.868 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:514(predict_proba)
33443 58.356 0.002 28459.225 0.851 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:759(__call__)
16754760 195.438 0.000 28307.637 0.002 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:596(dispatch_one_batch)
16721318 133.835 0.000 27057.859 0.002 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:554(_dispatch)
16721318 55.127 0.000 26841.160 0.002 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:177(__init__)
16721318 43.096 0.000 26786.033 0.002 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:71(__call__)
16721318 74.759 0.000 26742.937 0.002 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:72(<listcomp>)
16720818 74.964 0.000 25531.124 0.002 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:123(_parallel_helper)
16720818 17306.384 0.001 25434.040 0.002 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/tree/tree.py:648(predict_proba)
11148 92.672 0.008 9775.910 0.877 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:478(predict)
769191039 1151.734 0.000 7108.682 0.000 {method 'sum' of 'numpy.ndarray' objects}
769191039 655.695 0.000 5956.948 0.000 /home/codezee/ai/venv/lib/python3.4/site-packages/numpy/core/_methods.py:31(_sum)
769192039 5301.307 0.000 5301.307 0.000 {method 'reduce' of 'numpy.ufunc' objects}
1 0.038 0.038 1144.438 1144.438 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/sklearn.py:87(train)
1 0.006 0.006 1141.016 1141.016 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:185(fit)
500 0.177 0.000 1137.054 2.274 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:92(_parallel_build_trees)
500 16.404 0.033 1136.043 2.272 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/tree/tree.py:113(fit)
16754760 96.994 0.000 1046.251 0.000 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:67(__init__)
500 994.727 1.989 994.729 1.989 {method 'build' of 'sklearn.tree._tree.DepthFirstTreeBuilder' objects}
16754259 88.192 0.000 944.274 0.000 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:545(<genexpr>)
16721318 139.342 0.000 848.303 0.000 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:144(delayed)
16720818 615.495 0.000 673.860 0.000 {method 'predict' of 'sklearn.tree._tree.Tree' objects}
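
For readers who want to produce a listing in the same shape, here is a minimal sketch using only the standard library's `cProfile` and `pstats`. The `toy_workload` function is a hypothetical stand-in for the real cross-validation run, and nothing here is taken from revscoring itself; sorting by `"cumulative"` and restricting to 50 rows is what yields the "Ordered by: cumulative time" header and the row limit seen above.

```python
import cProfile
import io
import pstats


def toy_workload(n=2000):
    """Hypothetical stand-in for the profiled cross-validation run."""
    return sum(len(str(i)) for i in range(n))


# Collect profile data around the call of interest.
profiler = cProfile.Profile()
profiler.enable()
toy_workload()
profiler.disable()

# Render the report sorted by cumulative time, limited to 50 rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(50)
report = buffer.getvalue()
print(report)
```

In this layout `tottime` is time spent in the function body itself, while `cumtime` also includes everything it calls, so a large gap between the two points at callees.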
@adamwight

The line that jumped out on our call is https://gist.github.com/codez266/1bf7ca71442071c3f290e05b3a4f23ba#file-cross_score_profile-L25, and it's a mystery what actually takes time in there. Note that the numpy subroutines account for about 7,100 s of cumulative time, which is less than half of the time spent in tree.predict_proba (17,306 s own time, 25,434 s cumulative).

However, I think the biggest culprit is simply the number of times we call dispatch_one_batch. 16M is a lot of times to call anything. Is this expected? Or maybe we're accidentally turning something into NxN complexity?
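
One hedged way to sanity-check the 16M figure is to divide the call counts copied from the profile listing. This sketch assumes the 500 `_parallel_build_trees` calls correspond to a 500-tree forest; the variable names are illustrative only.

```python
# Call counts copied from the ncalls column of the profile.
score_calls = 11148                 # revscoring sklearn.py:159(score)
forest_predict_proba_calls = 33442  # sklearn forest.py:514(predict_proba)
tree_predict_proba_calls = 16720818 # sklearn tree.py:648(predict_proba)
trees_built = 500                   # forest.py:92(_parallel_build_trees), assumed = forest size

# How many per-tree predictions does each forest-level call fan out to?
trees_per_forest_call = tree_predict_proba_calls / forest_predict_proba_calls

# How many forest-level predict_proba calls does each score() trigger?
forest_calls_per_score = forest_predict_proba_calls / score_calls

print("tree calls per forest call:", round(trees_per_forest_call, 2))
print("forest calls per score call:", round(forest_calls_per_score, 2))
print("matches forest size:", abs(trees_per_forest_call - trees_built) < 1)
```

If the ratios come out near 500 and 3, the dispatch count would be consistent with every `score` call fanning out to all trees a few times over, which grows linearly with the number of `score` calls rather than as NxN. That is one reading of the counts, not a confirmed diagnosis.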

Try printing n_jobs here; it seems to be enormous: https://github.com/scikit-learn/scikit-learn/blob/0.17.1/sklearn/ensemble/forest.py#L540

How big is n_outputs_ here? https://github.com/scikit-learn/scikit-learn/blob/0.17.1/sklearn/tree/tree.py#L686
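
Until someone prints it, a rough estimate of n_outputs_ can be read off the profile itself. The sketch below assumes that tree.predict_proba issues one `ndarray.sum` per output on each call and that nearly all of the `sum` calls in the listing originate there; both assumptions should be checked against the linked source before trusting the number.

```python
# Call counts copied from the ncalls column of the profile.
ndarray_sum_calls = 769191039        # {method 'sum' of 'numpy.ndarray' objects}
tree_predict_proba_calls = 16720818  # sklearn tree.py:648(predict_proba)

# Assumption: one ndarray.sum per output per tree.predict_proba call,
# so this ratio approximates n_outputs_.
sums_per_tree_call = ndarray_sum_calls / tree_predict_proba_calls
estimated_n_outputs = round(sums_per_tree_call)

print("sum calls per tree.predict_proba call:", round(sums_per_tree_call, 3))
print("estimated n_outputs_:", estimated_n_outputs)
```

A ratio very close to a whole number would support the one-sum-per-output assumption; printing `n_outputs_` on the fitted model remains the direct check.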
