- `CountVectorizer` seems a bit dense and tries to do too much. That is not a bad thing in itself, but from a design viewpoint it could benefit from the `Pipeline` framework: define the `preprocessor` and `tokenizer` (or even just the `analyzer`) as transformers in their own right, and define `CountVectorizer` as a pipeline of these operations. The library should conform to, or make use of, its existing frameworks where possible rather than extending functionality to allow for exceptions.
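A minimal sketch of that decomposition. The `Preprocessor` and `Tokenizer` transformer names are hypothetical (not part of scikit-learn), and `CountVectorizer` is reused only for the counting step by giving it an identity `analyzer` over pre-tokenized documents:

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class Preprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical stand-alone preprocessing transformer."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.lower() for doc in X]


class Tokenizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-alone tokenizing transformer."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [re.findall(r"\b\w\w+\b", doc) for doc in X]


# CountVectorizer reduced to the counting step: documents arrive
# pre-tokenized, so the analyzer is the identity on token lists.
vectorizer = Pipeline([
    ("preprocessor", Preprocessor()),
    ("tokenizer", Tokenizer()),
    ("counter", CountVectorizer(analyzer=lambda tokens: tokens)),
])

counts = vectorizer.fit_transform(["Hello world", "hello again"])
```

Each step then becomes independently swappable and grid-searchable via the usual `Pipeline` parameter syntax.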
- Functions from the `metrics` module such as `confusion_matrix` and `classification_report` ought to support cross-validation, perhaps by allowing a `cv` parameter. This might be awkward given that we already have `cross_val_score` and `learning_curve`, and those are supposed to be the functions that take care of cross-validated scoring. It might make sense to use `confusion_matrix` as the scoring function for `cross_val_score`, but the latter only accepts scoring functions that return a single value, so that won't work. The other problem is that `cross_val_score` can only work with one scoring function at a time, so if we wanted to define our own cross-validated `classification_report`, or just reuse `cross_val_score` for different metrics, we would need to retrain the model for every metric we want to evaluate under cross-validation.
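In the meantime a cross-validated confusion matrix can be assembled by hand; a sketch, where the dataset, estimator, and splitter are arbitrary choices (module paths assume a current scikit-learn with `model_selection`):

```python
import numpy as np

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# One confusion matrix per fold, each computed on held-out data only.
per_fold = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    per_fold.append(confusion_matrix(y[test_idx], clf.predict(X[test_idx])))

# Because every sample is held out exactly once, the per-fold matrices
# can be summed into a single cross-validated confusion matrix.
total = np.sum(per_fold, axis=0)
```

Note this still trains the model once per fold per metric unless the fold predictions themselves are cached and shared between metrics.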
- The `learning_curve` function should behave consistently with `cross_val_score` in that it should return a value (or in this case a learning curve) for each cross-validation fold, rather than just the mean.
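A hand-rolled version of the per-fold behaviour being asked for; the dataset, estimator, and training sizes are illustrative placeholders:

```python
import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
train_sizes = [20, 50, 100]
rng = np.random.RandomState(0)

# One learning curve per fold, shape (n_folds, n_train_sizes);
# the mean over folds is then just one possible aggregate.
curves = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    shuffled = rng.permutation(train_idx)  # avoid ordered-class subsets
    row = []
    for size in train_sizes:
        clf = GaussianNB().fit(X[shuffled[:size]], y[shuffled[:size]])
        row.append(clf.score(X[test_idx], y[test_idx]))
    curves.append(row)

curves = np.asarray(curves)
mean_curve = curves.mean(axis=0)
```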
- The `learning_curve` function still has the bug where a partition may contain only a single class once the training set is split into the requested training sizes. `_safe_split`, or whichever helper function partitions the training data, should ensure that no partition has only one class. It might not be a good idea to reorder the training instances so that the targets look like `['class_0', 'class_1', 'class_2', 'class_0', 'class_1', 'class_2', ..., 'class_0', 'class_1', 'class_2']` (as has been suggested in scikit-learn/scikit-learn#2701 (comment)), because for many learning algorithms the search through the hypothesis space is strongly affected, or even guided, by the order of the training data. A shuffle might be enough, but we would need to calculate and document the probability that a partition contains only one class after a uniformly distributed shuffle.
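That probability is straightforward to compute exactly: after a uniform shuffle, the first `m` rows form a uniformly drawn subset, so the chance that they all share a class is hypergeometric. A sketch (the function name is mine):

```python
from math import comb


def p_single_class_partition(class_counts, m):
    """Probability that m rows drawn uniformly without replacement
    (equivalently: the first m rows after a uniform shuffle) all
    belong to one class, given the per-class row counts."""
    n = sum(class_counts)
    # comb(c, m) is 0 whenever m > c, so small classes drop out naturally
    return sum(comb(c, m) for c in class_counts) / comb(n, m)


# Three balanced classes of 50 rows, smallest training size of 10:
p = p_single_class_partition([50, 50, 50], 10)
```

For balanced classes the probability decays very quickly with the partition size, which supports the "a shuffle might be enough" intuition, while tiny partitions or heavily imbalanced classes remain risky.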
- The `chi2` univariate feature selection metric is still somewhat of a mystery, as it is not clear what it does with frequencies vs. categorical variables. In fact, it is not even clear what happens in the categorical vs. categorical case, since the chi-squared statistic obtained by explicitly enumerating the contingency table is inconsistent with `chi2`. See http://stackoverflow.com/questions/21281328/scikit-learn-chi-squared-statistic-and-corresponding-contingency-table
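The inconsistency is easy to reproduce. A small sketch on arbitrary toy data, comparing `chi2` (which appears to sum the feature values per class, i.e. a frequency view) against `scipy.stats.chi2_contingency` on the explicitly enumerated table:

```python
import numpy as np
from scipy.stats import chi2_contingency

from sklearn.feature_selection import chi2

# A single binary "categorical" feature against a binary target.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 1])

# scikit-learn's chi2: observed counts are per-class sums of the column.
sk_stat, sk_p = chi2(x.reshape(-1, 1), y)

# Classical test on the explicit 2x2 contingency table.
table = np.array([[np.sum((x == i) & (y == j)) for j in (0, 1)]
                  for i in (0, 1)])
ct_stat, ct_p, dof, expected = chi2_contingency(table, correction=False)

# sk_stat[0] and ct_stat disagree on this data (1/9 vs 5/18),
# because 0-valued cells never contribute to chi2's observed counts.
```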
- I've just been trying to work with `RFECV` for the last few hours and found a number of issues. Some of these have already been mentioned above, but I will list them again for completeness and elaboration. `grid_scores_` is of shape `[n_features]`, but the documentation states it is `[n_subsets_of_features]`. Note that with a normalized value of `step`, i.e.

```python
if 0.0 < self.step < 1.0:
    step = int(self.step * n_features)
else:
    step = int(self.step)
```
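the documented `[n_subsets_of_features]` and the actual `[n_features]` can only coincide when `step` is 1. A sketch of how many subsets the elimination loop actually visits under that normalization (the function name is mine, and it assumes elimination proceeds down to a single feature, as `RFECV` does):

```python
def n_feature_subsets(n_features, step, n_features_to_select=1):
    # same normalization as the snippet quoted above
    if 0.0 < step < 1.0:
        step = int(step * n_features)
    else:
        step = int(step)
    # count the feature subsets the elimination loop evaluates
    n, subsets = n_features, 1
    while n > n_features_to_select:
        n = max(n - step, n_features_to_select)
        subsets += 1
    return subsets


n_feature_subsets(10, step=1)  # one subset per feature count
n_feature_subsets(10, step=3)  # the sequence 10 -> 7 -> 4 -> 1
```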