Proposed handling for 'list of dicts' in pandas.DataFrame
"""
Sketch of proposed behaviour: make a 'list of dicts' create
a (potentially) 'ragged' array, with auto-guessed column names
and sensible default values where the keys don't match.

Current behaviour

In [215]: pandas.DataFrame([dict(a=1),dict(a=2)],columns=['a'])
Out[215]:
          a
0  {'a': 1}
1  {'a': 2}

(I happen to find this very surprising/useless behaviour!)

(one) Proposed behaviour...

print(DataFrame2([dict(a=1,c=1,d=True),dict(b=2,c='abc')]))
     a    c     d
0    1    1  True
1  NaN  abc   NaN

Proposed code follows...
This is entirely a straw-man implementation. A real one might affect the
.pyx files, and should be reviewed for sensibleness.
Grossnesses:
* default_iget is *foul!*
* adding another potential arg for the 'default' value...
* is this *guessing*? If so, is it unpythonic levels of guessing?
* so much gross around itemgetter and single/multiple... ugh!
* this adds a lot of potential interactions with other 'init' forms of DataFrame
Wins:
--------
* this is *super lazy* and even lazier than R data.frame
Extensions
-------------
* should this go to collections.namedtuple as well?
* rather than guessing based on data[0], get the set of all keys over data
(so nothing will be lost)
"""
import operator


def default_iget(default=None, *fields):
    """
    Note: it is gross the default must be first arg, but
    *fields...
    Note: *always* returns a list... unlike itemgetter,
    which can return tuples or 'singles'
    """
    myiget = operator.itemgetter(*fields)
    L = len(fields)

    def f(thing):
        try:
            ans = myiget(thing)
            # itemgetter returns a bare value for a single field and a
            # tuple for several; normalise both cases to a list
            if L < 2:
                return [ans]
            return list(ans)
        except KeyError:
            # at least one field is missing: fall back to per-key lookup
            return [thing.get(x, default) for x in fields]

    f.__doc__ = "itemgetter with default %r for fields %r" % (default, fields)
    f.__name__ = "default_itemgetter"
    return f


import pandas


def DataFrame2(data=None, columns=None, *args, **kwargs):
    # we do some preprocessing...
    if data and isinstance(data, list) and isinstance(data[0], dict):
        if columns:  # we could guard here too... using 'contracts' module?
            # this is gross that it's a full copy
            f = default_iget(None, *columns)
        else:
            columns = sorted(data[0].keys())
            f = default_iget(None, *columns)
        data = [f(x) for x in data]
        #print data
    # now data and columns are both 'clean'-ish...
    # data goes first so that any extra positional args follow it
    # instead of colliding with a data= keyword
    return pandas.DataFrame(data, *args, columns=columns, **kwargs)


print(DataFrame2([dict(a=1, c=1, d=True), dict(b=2, c='abc')]))
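
The second item under Extensions in the docstring (take the set of all keys over data rather than guessing from data[0]) could be sketched roughly as below. This is only an illustration: `union_columns` and `rows_to_lists` are hypothetical names that are not part of the gist or of pandas, and the sketch assumes the keys are mutually sortable (for example, all strings).

```python
def union_columns(rows):
    """Return the sorted union of the keys of every dict in rows.

    Hypothetical helper; assumes the keys can be sorted together.
    """
    return sorted(set().union(*rows))


def rows_to_lists(rows, columns, default=None):
    """Turn each dict into a list aligned with columns.

    Keys missing from a row are filled with default.
    """
    return [[row.get(col, default) for col in columns] for row in rows]


rows = [dict(a=1, c=1, d=True), dict(b=2, c='abc')]
columns = union_columns(rows)         # every key from every row
table = rows_to_lists(rows, columns)  # one list per row, gaps filled with None
```

With the example rows, the `b` key from the second dict now gets its own column instead of being dropped, and `table` and `columns` can be handed to `pandas.DataFrame(table, columns=columns)`. The price is one extra pass over the data before the frame is built.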