Applying transformations on a subset of columns
import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler


class GetDummiesCatCols(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Replace `cols` with their dummies (One Hot Encoding).

    `cols` should be a list of column names holding categorical data.
    Furthermore, this class streamlines the implementation of one hot encoding
    as available on [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)
    """

    def __init__(self, cols=None):
        self.cols = cols

    def transform(self, df, **transform_params):
        cols_dummy = pd.get_dummies(df[self.cols])
        df = df.drop(self.cols, axis=1)
        df = pd.concat([df, cols_dummy], axis=1)
        return df

    def fit(self, df, y=None, **fit_params):
        return self


class StandartizeFloatCols(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Standard-scale the columns in the data frame.

    `cols` should be a list of columns in the data.
    """

    def __init__(self, cols=None):
        self.cols = cols
        self.standard_scaler = StandardScaler()

    def transform(self, df, **transform_params):
        standartize_cols = pd.DataFrame(
            # StandardScaler returns a NumPy array, and thus indexing
            # breaks. Explicitly fixed next.
            self.standard_scaler.transform(df[self.cols]),
            columns=self.cols,
            # The index of the resulting DataFrame should be assigned and
            # equal to the one of the original DataFrame. Otherwise, upon
            # concatenation NaNs will be introduced.
            index=df.index
        )
        df = df.drop(self.cols, axis=1)
        df = pd.concat([df, standartize_cols], axis=1)
        return df

    def fit(self, df, y=None, **fit_params):
        self.standard_scaler.fit(df[self.cols])
        return self
drorata commented Jun 21, 2017

When working with pandas.DataFrame, it is likely that one would like to apply different preprocessing steps to different columns. For example, apply one hot encoding to the categorical columns (or a subset of them) and standard scale the other numerical ones.

Here's my attempt at doing this in a way that plays nicely with sklearn.model_selection.cross_val_score and sklearn.pipeline.make_pipeline, for example.

For example, here's a pipeline:

sklearn.pipeline.make_pipeline(
    StandartizeFloatCols(cols=X.dtypes[X.dtypes != 'category'].index.tolist()),
    GetDummiesCatCols(cols=X.dtypes[X.dtypes == 'category'].index.tolist()),
    DecisionTreeClassifier(random_state=42)
)
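
For completeness, here is a minimal sketch of how such a pipeline could be scored with cross-validation. It assumes a feature DataFrame X (with the categorical columns already set to the 'category' dtype) and a target y, neither of which is defined in this gist:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# X (a DataFrame whose categorical columns have the 'category' dtype) and y
# (the target) are assumed to exist already; they are not part of this gist.
pipe = make_pipeline(
    StandartizeFloatCols(cols=X.dtypes[X.dtypes != 'category'].index.tolist()),
    GetDummiesCatCols(cols=X.dtypes[X.dtypes == 'category'].index.tolist()),
    DecisionTreeClassifier(random_state=42),
)

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean(), scores.std())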

I'll be happy to discuss this and hear feedback.

drorata commented Jun 22, 2017

So, I realized that this version is wrong. The reason is that StandartizeFloatCols() fits the scaling factors during the transform. Consider the following example.

import numpy as np
import pandas as pd

custom_std_scale = StandartizeFloatCols(cols=['v1', 'v2'])

N = 1000
np.random.seed(42)
df = pd.DataFrame(
    {
        "v1": np.random.random(size=N),
        "v2": np.random.geometric(0.2, size=N)
    }
)

df_test = pd.DataFrame(
    {
        "v1": np.random.random(size=N),
        "v2": np.random.geometric(0.2, size=N)
    }
)

df_test_swap = pd.DataFrame(
    {
        "v1": np.random.geometric(0.2, size=N),
        "v2": np.random.random(size=N)
    }
)

Let us fit the class using df:

custom_std_scale.fit(df)

Now, we can use it to transform df_test. We expect (and this is the desired behavior) that the fitted class will use the scales learned in the previous step and apply them to df_test. As the distributions used are the same, the resulting columns should have zero mean and unit STD. Indeed:

custom_std_scale.transform(df_test).describe()

yields:

          v1              v2
mean     -7.460699e-17    9.148238e-17
std       1.000500e+00    1.000500e+00

However, if we use custom_std_scale on df_test_swap, we expect non-zero means and non-unit STDs, since the distributions are swapped. BUT:

custom_std_scale.transform(df_test_swap).describe()

yields:

          v1              v2
mean     -6.217249e-17   -7.016610e-17
std       1.000500e+00    1.000500e+00

The reason is that the instance re-fitted the underlying StandardScaler() instance.
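
To illustrate, here is a sketch of what such a broken transform looks like (this is not the code at the top of the gist; it only demonstrates the bug described above): calling fit_transform inside transform re-learns the scaling factors from whatever DataFrame is passed in, so every input comes out with zero mean and unit STD.

# Illustrative only: a transform that (wrongly) re-fits the scaler on its input.
def transform(self, df, **transform_params):
    standartize_cols = pd.DataFrame(
        # fit_transform re-learns mean and std from df itself, discarding
        # whatever was learned in fit() -- this is the bug.
        self.standard_scaler.fit_transform(df[self.cols]),
        columns=self.cols,
        index=df.index,
    )
    df = df.drop(self.cols, axis=1)
    return pd.concat([df, standartize_cols], axis=1)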

I will fix it and commit a new version.
