"# Question: How should I transform multiple key/value columns in a scikit-learn pipeline?\n",
"Input data:"
" k1 v1 k2 v2\n",
"0 a 1 b 2\n",
"1 b 2 c 3\n"
"import pandas as pd\n",
"D = pd.DataFrame([ ['a', 1, 'b', 2], ['b', 2, 'c', 3]], columns = ['k1', 'v1', 'k2', 'v2'])\n",
"This is the type of output data that is required:"
"[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]\n"
"from sklearn.feature_extraction import DictVectorizer\n",
"row1 = {'a':1, 'b':2}\n",
"row2 = {'b':2, 'c':3}\n",
"data = [row1, row2]\n",
"DictVectorizer( sparse=False ).fit_transform(data)"
"# Solution\n",
"Courtesy of [Mike]( and extended into a general pipeline transformer.\n",
"Here is the transformer:"
"from sklearn.base import TransformerMixin\n",
"from sklearn.pipeline import Pipeline, FeatureUnion\n",
"class KVExtractor(TransformerMixin):\n",
" def __init__(self, kvpairs):\n",
" self.kpairs = kvpairs\n",
" \n",
" def transform(self, X, *_):\n",
" result = []\n",
" for index, rowdata in X.iterrows():\n",
" rowdict = {}\n",
" for kvp in self.kpairs:\n",
" rowdict.update( { rowdata[ kvp[0] ]: rowdata[ kvp[1] ] } )\n",
" result.append(rowdict)\n",
" return result\n",
" \n",
" def fit(self, *_):\n",
" return self"
"Lets try it out:"
"kvpairs = [ ['k1', 'v1'], ['k2', 'v2'] ]\n",
"KVExtractor( kvpairs ).transform(D)"
"Now try it out in a pipeline with `DictVectorizer`:"
" k1 v1 k2 v2\n",
"0 a 1 b 2\n",
"1 b 2 c 3\n",
"(2, 3)\n",
"[[ 1. 2. 0.]\n",
" [ 0. 2. 3.]]\n"
"pipeline = Pipeline(\n",
" [( 'kv', KVExtractor( kvpairs ) )] +\n",
" [( 'dv', DictVectorizer(sparse=False) )] +\n",
" []\n",
"print A.shape\n",
"print A"
"Try a new key without transforming:"
" k1 v1 k2 v2\n",
"0 a 1 x 2\n",
"1 b 2 c 3\n",
"[[ 1. 0. 0.]\n",
" [ 0. 2. 3.]]\n"
"D['k2'] = ['x', 'c']\n",
"print D\n",
"print pipeline.transform(D)"