
@kljensen
Forked from anonymous/onehot_pandas_scikit.py
Last active May 18, 2020 23:17
# -*- coding: utf-8 -*-
""" Small script that shows hot to do one hot encoding
of categorical columns in a pandas DataFrame.
See:
http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.DictVectorizer.html
"""
import pandas
import random
import numpy
from sklearn.feature_extraction import DictVectorizer
def one_hot_dataframe(data, cols, replace=False):
    """ Takes a dataframe and a list of columns that need to be encoded.
        Returns a 3-tuple comprising the data, the vectorized data,
        and the fitted vectorizer.
    """
    vec = DictVectorizer()
    mkdict = lambda row: dict((col, row[col]) for col in cols)
    vecData = pandas.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
    vecData.columns = vec.get_feature_names()
    vecData.index = data.index
    if replace is True:
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData, vec)


def main():
    # Get a random DataFrame
    df = pandas.DataFrame(numpy.random.randn(25, 3), columns=['a', 'b', 'c'])
    # Make some random categorical columns
    df['e'] = [random.choice(('Chicago', 'Boston', 'New York')) for i in range(df.shape[0])]
    df['f'] = [random.choice(('Chrome', 'Firefox', 'Opera', 'Safari')) for i in range(df.shape[0])]
    print df
    # Vectorize the categorical columns: e & f
    df, _, _ = one_hot_dataframe(df, ['e', 'f'], replace=True)
    print df


if __name__ == '__main__':
    main()
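For reference, a minimal sketch of what DictVectorizer does with the row dicts that mkdict builds. This assumes the same era of scikit-learn as the gist; newer releases rename get_feature_names to get_feature_names_out.

# Minimal sketch: one dict per row for the chosen columns, which
# DictVectorizer expands into binary indicator columns.
from sklearn.feature_extraction import DictVectorizer

rows = [{'e': 'New York', 'f': 'Opera'},
        {'e': 'Chicago', 'f': 'Firefox'}]
vec = DictVectorizer()
X = vec.fit_transform(rows).toarray()
print(vec.get_feature_names())
# ['e=Chicago', 'e=New York', 'f=Firefox', 'f=Opera']
print(X)
# [[ 0.  1.  0.  1.]
#  [ 1.  0.  1.  0.]]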
Example output
Original DataFrame
------------------
a b c e f
0 -0.219222 -0.368154 0.388479 New York Opera
1 1.879536 -0.033210 -0.099437 New York Firefox
2 0.909419 -0.498084 0.084163 New York Safari
3 -0.002199 -0.692806 -0.844436 New York Opera
4 -0.109549 -0.367305 -0.520999 Chicago Firefox
5 -0.400515 -1.202466 -1.664337 New York Chrome
6 -2.241892 -0.888160 -0.332380 New York Chrome
7 -0.432767 -1.794931 0.975878 Chicago Chrome
8 -1.401193 -0.478224 0.112729 Chicago Safari
9 -1.493518 0.584824 0.652820 New York Opera
10 0.525359 -0.885912 0.474492 Boston Firefox
11 0.671226 -0.733788 0.272915 Boston Chrome
12 0.775901 -0.163745 0.628414 Boston Opera
13 -1.158007 -0.495240 1.183522 New York Chrome
14 -1.200085 1.083380 -0.692171 Boston Safari
15 0.872763 -2.119172 -0.169185 Boston Chrome
16 1.423514 -1.802891 -2.947628 Boston Safari
17 -0.547940 -0.788654 -1.065005 Boston Safari
18 -0.380440 2.050783 1.548453 New York Firefox
19 -0.095913 1.260104 0.196552 Boston Opera
20 -1.558961 1.240931 -0.165927 Boston Safari
21 1.111618 -0.309371 -0.803404 Chicago Chrome
22 0.348182 -1.200900 0.307754 New York Firefox
23 -0.834901 0.188590 -1.115227 New York Chrome
24 1.463240 -1.559017 0.954684 New York Chrome
Encoded DataFrame
-----------------
a b c e=Boston e=Chicago e=New York f=Chrome f=Firefox f=Opera f=Safari
0 -0.219222 -0.368154 0.388479 0 0 1 0 0 1 0
1 1.879536 -0.033210 -0.099437 0 0 1 0 1 0 0
2 0.909419 -0.498084 0.084163 0 0 1 0 0 0 1
3 -0.002199 -0.692806 -0.844436 0 0 1 0 0 1 0
4 -0.109549 -0.367305 -0.520999 0 1 0 0 1 0 0
5 -0.400515 -1.202466 -1.664337 0 0 1 1 0 0 0
6 -2.241892 -0.888160 -0.332380 0 0 1 1 0 0 0
7 -0.432767 -1.794931 0.975878 0 1 0 1 0 0 0
8 -1.401193 -0.478224 0.112729 0 1 0 0 0 0 1
9 -1.493518 0.584824 0.652820 0 0 1 0 0 1 0
10 0.525359 -0.885912 0.474492 1 0 0 0 1 0 0
11 0.671226 -0.733788 0.272915 1 0 0 1 0 0 0
12 0.775901 -0.163745 0.628414 1 0 0 0 0 1 0
13 -1.158007 -0.495240 1.183522 0 0 1 1 0 0 0
14 -1.200085 1.083380 -0.692171 1 0 0 0 0 0 1
15 0.872763 -2.119172 -0.169185 1 0 0 1 0 0 0
16 1.423514 -1.802891 -2.947628 1 0 0 0 0 0 1
17 -0.547940 -0.788654 -1.065005 1 0 0 0 0 0 1
18 -0.380440 2.050783 1.548453 0 0 1 0 1 0 0
19 -0.095913 1.260104 0.196552 1 0 0 0 0 1 0
20 -1.558961 1.240931 -0.165927 1 0 0 0 0 0 1
21 1.111618 -0.309371 -0.803404 0 1 0 1 0 0 0
22 0.348182 -1.200900 0.307754 0 0 1 0 1 0 0
23 -0.834901 0.188590 -1.115227 0 0 1 1 0 0 0
24 1.463240 -1.559017 0.954684 0 0 1 1 0 0 0
@ericmjl commented Feb 25, 2014

Hi Kyle, this code was really helpful for me earlier on! However, I found that it doesn't work anymore. I'm using the latest versions of scikit-learn and pandas. Might you happen to know what's going on?

@saihttam commented Mar 2, 2014

I was also having problems. For me, it seemed the apply call no longer worked correctly. With the most recent version of pandas (0.13), I used the to_dict method and passed the following to the vectorizer instead of the apply result:
data[cols].to_dict(outtype='records')
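
A minimal sketch of the function with the apply call swapped out for to_dict, as saihttam suggests. Note that outtype='records' was the pandas 0.13 spelling; later pandas versions use orient='records'.

# Sketch of one_hot_dataframe using DataFrame.to_dict instead of apply.
# Assumes a later pandas, where the keyword is orient='records'.
import pandas
from sklearn.feature_extraction import DictVectorizer

def one_hot_dataframe(data, cols, replace=False):
    """ Same interface as the gist version: returns (data, vecData, vec). """
    vec = DictVectorizer()
    # One dict per row, e.g. {'e': 'Boston', 'f': 'Chrome'}
    records = data[cols].to_dict(orient='records')
    vecData = pandas.DataFrame(vec.fit_transform(records).toarray())
    vecData.columns = vec.get_feature_names()
    vecData.index = data.index
    if replace:
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData, vec)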

@svkerr commented Mar 12, 2014

saihttam: thanks very much for the to_dict recommendation. The code now works!

@slegroux commented Jun 3, 2014

Problem:
Hi, I didn't manage to make your example work. I am on Python 2.7 with scikit-learn 0.14.1 and pandas 0.13.1.
Error:
It tells me "AttributeError: 'builtin_function_or_method' object has no attribute 'iteritems'".
Any idea what could go wrong?


AttributeError Traceback (most recent call last)
in ()
7
8 # Vectorize the categorical columns: e & f
----> 9 df, _, _ = learn.one_hot_dataframe(df, ['e', 'f'], replace=True)
10

/usr/local/lib/python2.7/site-packages/mirlib/learn.pyc in one_hot_dataframe(data, cols, replace)
68 vec = feature_extraction.DictVectorizer()
69 mkdict = lambda row: dict((col, row[col]) for col in cols)
---> 70 vecData = pandas.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
71 vecData.columns = vec.get_feature_names()
72 vecData.index = data.index

/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.pyc in fit_transform(self, X, y)
140 """
141 X = _tosequence(X)
--> 142 self.fit(X)
143 return self.transform(X)
144

/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.pyc in fit(self, X, y)
105 feature_names = set()
106 for x in X:
--> 107 for f, v in six.iteritems(x):
108 if isinstance(v, six.string_types):
109 f = "%s%s%s" % (f, self.separator, v)

/usr/local/lib/python2.7/site-packages/sklearn/externals/six.pyc in iteritems(d)
266 def iteritems(d):
267 """Return an iterator over the (key, value) pairs of a dictionary."""
--> 268 return iter(getattr(d, _iteritems)())
269
270

AttributeError: 'builtin_function_or_method' object has no attribute 'iteritems'

@funpan commented Jun 17, 2014

Hi,

I got the same AttributeError even after calling dropna on 'embarked' and trying to call one_hot_dataframe() with 'embarked' only.

titanic = titanic.dropna(subset=['embarked'])
titanic, titanic_n = one_hot_dataframe(titanic, ['embarked'], replace=True)

@a-whitej

I ran into the same issue.

IPython version: 2.1.0
numpy version: 1.8.1
scikit-learn version: 0.15.1
matplotlib version: 1.3.1

@felipeclopes

The modified version runs, but it does not generate the vector properly.

@pmdscully

Does this work semantically for both ordinal and non-ordinal categorical data? Any thoughts about that?

@saihttam

I currently use this slightly modified version: https://gist.github.com/saihttam/cad6d3d223fc8d769227

@ashok0587

I have a column whose values are all numeric, but they are categorical. I am not able to binarize them using this code; only columns with non-numeric values are binarized. Kindly help if any modification is needed.

Sorry, the table format is not clear. Hope you understand my question. Thanks in advance.
E.g.:

col1  col2
1     4
2     4
3     3
4     3

should get converted to

col1  col2_4  col2_3
1     1       0
2     1       0
3     0       1
4     0       1
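
A minimal sketch of one common workaround (not from the gist itself): cast the numeric categorical column to strings before vectorizing, since DictVectorizer only expands string values into indicator columns and passes numbers through unchanged.

# Sketch: make a numeric categorical column string-valued so it gets binarized.
import pandas

df = pandas.DataFrame({'col1': [1, 2, 3, 4], 'col2': [4, 4, 3, 3]})
df['col2'] = df['col2'].astype(str)  # '4', '4', '3', '3'

# one_hot_dataframe(df, ['col2'], replace=True) would now yield indicator
# columns 'col2=3' and 'col2=4' instead of passing 4/4/3/3 through unchanged.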

@carlgieringer

One-hot encoding is supported in pandas (I think since 0.13.1) as pd.get_dummies.
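
For comparison, a minimal sketch of the same e/f encoding done with pandas alone, assuming a reasonably recent pandas (the columns keyword for get_dummies arrived a bit after 0.13.1). prefix_sep='=' is only there to mimic the 'e=Boston' style of column names that DictVectorizer produces; get_dummies defaults to 'e_Boston'.

# Sketch: one-hot encoding with pandas.get_dummies, no scikit-learn involved.
import pandas as pd

df = pd.DataFrame({'e': ['New York', 'Chicago', 'Boston'],
                   'f': ['Opera', 'Firefox', 'Chrome']})
encoded = pd.get_dummies(df, columns=['e', 'f'], prefix_sep='=')
print(encoded.columns.tolist())
# ['e=Boston', 'e=Chicago', 'e=New York', 'f=Chrome', 'f=Firefox', 'f=Opera']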
