Issues with scikit-learn that I need to try to articulate more clearly and report to the issue tracker.
  1. CountVectorizer seems a bit dense and tries to do too much. That is not a bad thing in itself, but from a design standpoint it could benefit from the Pipeline framework: define the preprocessor and tokenizer (or even just the analyzer) as transformers in their own right, and define CountVectorizer as a pipeline of these operations. Where possible, the library should conform to and reuse its existing frameworks rather than extend functionality to accommodate exceptions. (A sketch of this decomposition follows the list.)
  2. Functions from the metrics module such as confusion_matrix and classification_report ought to support cross-validation, perhaps via a cv parameter. This might be awkward given that cross_val_score and learning_curve already exist and are supposed to be the functions that take care of cross-validated scoring. Using confusion_matrix as the scoring function for cross_val_score would be natural, but the latter only accepts scoring functions that return a single value, so that won't work. Moreover, cross_val_score can only work with one scoring function at a time, so if we wanted to define our own cross-validated classification_report, or simply use cross_val_score with different metrics, we would need to retrain the model for every metric used to evaluate it under cross-validation. (A sketch of fold-wise aggregation follows the list.)
  3. The learning_curve function should behave consistently with cross_val_score in that it should return a value (or in this case a learning curve) for each cross-validation fold, rather than just the mean. (See the cross_val_score example after the list.)
  4. The learning_curve function still has the bug where a partition may contain only a single class once the training set is split into the specified training sizes. _safe_split, or whichever helper function is used to partition the training data, should ensure that no partition contains only one class. Reordering the instances of the training data so that the targets are ordered like ['class_0', 'class_1', 'class_2', 'class_0', 'class_1', 'class_2', ..., 'class_0', 'class_1', 'class_2'] (as has been suggested in scikit-learn/scikit-learn#2701 (comment)) may not be a good idea: for many learning algorithms, the search through the hypothesis space is strongly affected, and sometimes even guided, by the order of the training data. A shuffle might be enough, but we would need to calculate and document the probability that a partition contains only one class after a uniform shuffle. (A back-of-the-envelope calculation follows the list.)
  5. The chi2 univariate feature selection metric remains something of a mystery, as it is not clear what it does with frequencies vs. categorical variables. In fact, it is not even clear what happens in the categorical vs. categorical case, since the chi-squared statistic obtained by explicitly enumerating the contingency table is inconsistent with chi2. See http://stackoverflow.com/questions/21281328/scikit-learn-chi-squared-statistic-and-corresponding-contingency-table (a small reproduction follows the list).
  6. I've just been trying to work with RFECV for the last few hours and found a number of issues. Some of these have already been mentioned above, but I will list them again for completeness and to elaborate.
  7. grid_scores_ is of shape [n_features], but the documentation states it is [n_subsets_of_features]. Note that with a normalized value of step, i.e.

```python
if 0.0 < self.step < 1.0:
    step = int(self.step * n_features)
else:
    step = int(self.step)
```

     whenever step > 1 the number of feature subsets actually evaluated is smaller than n_features, so the documented shape and the actual shape disagree.
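A minimal sketch of the decomposition suggested in item 1: each stage of CountVectorizer (preprocessing, tokenization, counting) becomes a transformer in its own right, composed with Pipeline. The Preprocessor, Tokenizer and Counts classes below are hypothetical stand-ins made up for illustration, not existing scikit-learn components, and each implements only a trivial version of its stage:

```python
from collections import Counter

from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class Preprocessor(BaseEstimator, TransformerMixin):
    """Stand-in for CountVectorizer's preprocessor: lowercase each document."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.lower() for doc in X]


class Tokenizer(BaseEstimator, TransformerMixin):
    """Stand-in for the tokenizer/analyzer step: split on whitespace."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.split() for doc in X]


class Counts(BaseEstimator, TransformerMixin):
    """Learn the vocabulary on fit, emit a sparse term-count matrix."""

    def fit(self, X, y=None):
        vocab = sorted({token for doc in X for token in doc})
        self.vocabulary_ = {token: j for j, token in enumerate(vocab)}
        return self

    def transform(self, X):
        rows, cols, data = [], [], []
        for i, doc in enumerate(X):
            for token, count in Counter(doc).items():
                j = self.vocabulary_.get(token)
                if j is not None:  # ignore out-of-vocabulary tokens
                    rows.append(i)
                    cols.append(j)
                    data.append(count)
        return csr_matrix((data, (rows, cols)),
                          shape=(len(X), len(self.vocabulary_)))


# CountVectorizer expressed as a pipeline of its constituent operations.
vectorizer = Pipeline([
    ('preprocess', Preprocessor()),
    ('tokenize', Tokenizer()),
    ('count', Counts()),
])
X = vectorizer.fit_transform(['The quick brown fox', 'the lazy dog'])
```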
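For item 2, a sketch of the kind of fold-wise aggregation a cv parameter might provide, done by hand since cross_val_score only accepts scorers returning a single number. This assumes the sklearn.cross_validation API (with StratifiedKFold taking y and n_folds directly):

```python
import numpy as np
from sklearn.base import clone
from sklearn.cross_validation import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

clf = SVC(kernel='linear')
labels = np.unique(y)
cm = np.zeros((len(labels), len(labels)), dtype=int)

# Sum the per-fold confusion matrices; passing `labels` keeps the shape
# fixed even if some fold's test set happens to miss a class.
for train, test in StratifiedKFold(y, n_folds=5):
    pred = clone(clf).fit(X[train], y[train]).predict(X[test])
    cm += confusion_matrix(y[test], pred, labels=labels)

print(cm)
```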
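For item 3, the asymmetry being pointed at: cross_val_score already returns one score per fold, with no averaging, and the suggestion is that learning_curve return one curve per fold in the same way:

```python
from sklearn.cross_validation import cross_val_score
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
# Shape (5,): one score per cross-validation fold, no averaging.
scores = cross_val_score(SVC(kernel='linear'), iris.data, iris.target, cv=5)
print(scores)
```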
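For item 4, a back-of-the-envelope for the probability in question, restricted to two classes for illustration: the chance that the first k examples after a uniform shuffle all belong to one class, given class sizes n0 and n1. The helper name is mine:

```python
from scipy.special import binom

def prob_single_class(n0, n1, k):
    """P(the first k examples are all one class) after a uniform shuffle
    of a two-class dataset with class sizes n0 and n1."""
    return (binom(n0, k) + binom(n1, k)) / binom(n0 + n1, k)

# 50 examples per class, smallest training-size tick of 10 examples:
print(prob_single_class(50, 50, 10))  # ~0.0012: small, but not negligible
```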
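For item 5, a small reproduction of the discrepancy on a single boolean feature. My reading (an assumption worth verifying, and part of what makes chi2 confusing) is that chi2 treats each column as a count and tests only the per-class sums of the feature, whereas the explicit contingency table also includes a feature-absent row, so the two statistics generally differ:

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2

X = np.array([[1], [1], [0], [0], [0], [1]])  # one boolean feature
y = np.array([0, 0, 0, 1, 1, 1])

# scikit-learn's chi2 statistic and p-value for the feature.
print(chi2(X, y))

# Chi-squared from the explicit 2x2 contingency table
# (feature present/absent x class).
table = np.array([[np.sum((X[:, 0] == 1) & (y == c)) for c in (0, 1)],
                  [np.sum((X[:, 0] == 0) & (y == c)) for c in (0, 1)]])
print(chi2_contingency(table, correction=False)[:2])  # statistic, p-value
```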