kghose/p01.py

## p01.py
import pandas as pd, pylab

idx = pd.MultiIndex.from_tuples([('a','x'),('a','y'),('b','x'),('b','y')])
col = pd.MultiIndex.from_tuples([('c1',0),('c1',1),('c2',0)],names=['f','s'])
dat = pylab.randn(len(idx),len(col))
df1 = pd.DataFrame(dat, index=idx, columns=col)

## p02.txt
In [6]: df1
Out[6]:
f          c1                  c2
s           0         1         0
a x  0.091242  0.700668  1.755267
  y  0.933405  0.129897 -0.082570
b x -0.828174 -0.614804  0.299050
  y -2.034121 -0.560468 -0.443721

As can be seen, the function MultiIndex enables us to turn a
list of tuples into a hierarchical index. The cool thing about
the hierarchical index is that we can now select out groups
of columns/rows like so:

In [13]: df1.c1
Out[13]:
s           0         1
a x  0.091242  0.700668
  y  0.933405  0.129897
b x -0.828174 -0.614804
  y -2.034121 -0.560468

Or like so:

In [17]: df1.loc['b']
Out[17]:
f        c1                  c2
s         0         1         0
x -0.828174 -0.614804  0.299050
y -2.034121 -0.560468 -0.443721

A more powerful example of label based selection over a range is given later.


## p03.py
col = pd.MultiIndex.from_tuples([('c3','x'),('c3','y')])
dat = pylab.randn(len(idx),len(col))
df2 = pd.DataFrame(dat, index=idx, columns=col)

## p04.txt
In [7]: df2
Out[7]:
           c3
            x         y
a x -0.524225  0.245902
  y  0.286759  0.365465
b x  0.542192  1.605373
  y -0.554882  0.124332

## p05.py
col = pd.MultiIndex.from_tuples([('c4','x'),('c4','y')])
dat = pylab.randn(len(idx),len(col))
df3 = pd.DataFrame(dat, index=idx, columns=col)

## p06.txt
In [8]: df3
Out[8]:
           c4
            x         y
a x -0.339814 -0.399197
  y  0.053150  0.368418
b x  0.314094  0.228735
  y  0.030094  0.005102

## p07.py
df_a = pd.concat([df1,df2,df3])

## p08.txt
In [9]: df_a
Out[9]:
     c1      c2        c3                  c4
      0   1   0         x         y         x         y
a x NaN NaN NaN       NaN       NaN       NaN       NaN
  y NaN NaN NaN       NaN       NaN       NaN       NaN
b x NaN NaN NaN       NaN       NaN       NaN       NaN
  y NaN NaN NaN       NaN       NaN       NaN       NaN
a x NaN NaN NaN -0.524225  0.245902       NaN       NaN
  y NaN NaN NaN  0.286759  0.365465       NaN       NaN
b x NaN NaN NaN  0.542192  1.605373       NaN       NaN
  y NaN NaN NaN -0.554882  0.124332       NaN       NaN
a x NaN NaN NaN       NaN       NaN -0.339814 -0.399197
  y NaN NaN NaN       NaN       NaN  0.053150  0.368418
b x NaN NaN NaN       NaN       NaN  0.314094  0.228735
  y NaN NaN NaN       NaN       NaN  0.030094  0.005102

If you find those NaNs in the topmost rows and leftmost
columns odd, I agree with you. What has happened here is
that the column subindex for df1 is an integer (0,1) while
df2 and df3 it is a string ('x','y').

Pandas (I used v 0.11.0) for some reason does not combine
those indeces consistently, for example below you see that
a join on axis=1 goes smoothly.

And if you skip down a bit you will see that if we set the
indeces as strings ('0','1') the combination goes fine.

## p09.py
df_b = pd.concat([df1,df2,df3], axis=1)

## p10.txt
In [10]: df_b
Out[10]:
f          c1                  c2        c3                  c4
s           0         1         0         x         y         x         y
a x  0.091242  0.700668  1.755267 -0.524225  0.245902 -0.339814 -0.399197
  y  0.933405  0.129897 -0.082570  0.286759  0.365465  0.053150  0.368418
b x -0.828174 -0.614804  0.299050  0.542192  1.605373  0.314094  0.228735
  y -2.034121 -0.560468 -0.443721 -0.554882  0.124332  0.030094  0.005102

Note how we don't have an issue with some of the subindeces starting out
as numbers

## p11.py
df_c = pd.concat([df1,df2,df3], axis=0)

## p12.txt
In [11]: df_c
Out[11]:
     c1      c2        c3                  c4
      0   1   0         x         y         x         y
a x NaN NaN NaN       NaN       NaN       NaN       NaN
  y NaN NaN NaN       NaN       NaN       NaN       NaN
b x NaN NaN NaN       NaN       NaN       NaN       NaN
  y NaN NaN NaN       NaN       NaN       NaN       NaN
a x NaN NaN NaN -0.524225  0.245902       NaN       NaN
  y NaN NaN NaN  0.286759  0.365465       NaN       NaN
b x NaN NaN NaN  0.542192  1.605373       NaN       NaN
  y NaN NaN NaN -0.554882  0.124332       NaN       NaN
a x NaN NaN NaN       NaN       NaN -0.339814 -0.399197
  y NaN NaN NaN       NaN       NaN  0.053150  0.368418
b x NaN NaN NaN       NaN       NaN  0.314094  0.228735
  y NaN NaN NaN       NaN       NaN  0.030094  0.005102

This is the same as axis=0

## p13.py
col = pd.MultiIndex.from_tuples([('c1','0'),('c1','1'),('c2','0')],names=['f','s'])
dat = pylab.randn(len(idx),len(col))
df4 = pd.DataFrame(dat, index=idx, columns=col)
df_d = pd.concat([df4,df2,df3], axis=0)

## p14.txt
In [12]: df_d
Out[12]:
           c1                  c2        c3                  c4
            0         1         0         x         y         x         y
a x  0.337708 -0.677786  2.774935       NaN       NaN       NaN       NaN
  y  1.661593 -1.035824 -1.680486       NaN       NaN       NaN       NaN
b x -0.797928 -1.212866 -1.161657       NaN       NaN       NaN       NaN
  y -0.750026  0.442908 -0.086385       NaN       NaN       NaN       NaN
a x       NaN       NaN       NaN -0.524225  0.245902       NaN       NaN
  y       NaN       NaN       NaN  0.286759  0.365465       NaN       NaN
b x       NaN       NaN       NaN  0.542192  1.605373       NaN       NaN
  y       NaN       NaN       NaN -0.554882  0.124332       NaN       NaN
a x       NaN       NaN       NaN       NaN       NaN -0.339814 -0.399197
  y       NaN       NaN       NaN       NaN       NaN  0.053150  0.368418
b x       NaN       NaN       NaN       NaN       NaN  0.314094  0.228735
  y       NaN       NaN       NaN       NaN       NaN  0.030094  0.005102

Note how the sub indecies are not a problem anymore, since they are all
of the same type ('string')
	import pandas as pd, pylab

	idx = pd.MultiIndex.from_tuples([('a','x'),('a','y'),('b','x'),('b','y')])
	col = pd.MultiIndex.from_tuples([('c1',0),('c1',1),('c2',0)],names=['f','s'])
	dat = pylab.randn(len(idx),len(col))
	df1 = pd.DataFrame(dat, index=idx, columns=col)
	In [6]: df1
	Out[6]:
	f c1 c2
	s 0 1 0
	a x 0.091242 0.700668 1.755267
	y 0.933405 0.129897 -0.082570
	b x -0.828174 -0.614804 0.299050
	y -2.034121 -0.560468 -0.443721

	As can be seen, the function MultiIndex enables us to turn a
	list of tuples into a hierarchical index. The cool thing about
	the hierarchical index is that we can now select out groups
	of columns/rows like so:

	In [13]: df1.c1
	Out[13]:
	s 0 1
	a x 0.091242 0.700668
	y 0.933405 0.129897
	b x -0.828174 -0.614804
	y -2.034121 -0.560468

	Or like so:

	In [17]: df1.loc['b']
	Out[17]:
	f c1 c2
	s 0 1 0
	x -0.828174 -0.614804 0.299050
	y -2.034121 -0.560468 -0.443721

	A more powerful example of label based selection over a range is given later.
	col = pd.MultiIndex.from_tuples([('c3','x'),('c3','y')])
	dat = pylab.randn(len(idx),len(col))
	df2 = pd.DataFrame(dat, index=idx, columns=col)
	In [7]: df2
	Out[7]:
	c3
	x y
	a x -0.524225 0.245902
	y 0.286759 0.365465
	b x 0.542192 1.605373
	y -0.554882 0.124332
	col = pd.MultiIndex.from_tuples([('c4','x'),('c4','y')])
	dat = pylab.randn(len(idx),len(col))
	df3 = pd.DataFrame(dat, index=idx, columns=col)
	In [8]: df3
	Out[8]:
	c4
	x y
	a x -0.339814 -0.399197
	y 0.053150 0.368418
	b x 0.314094 0.228735
	y 0.030094 0.005102
	In [9]: df_a
	Out[9]:
	c1 c2 c3 c4
	0 1 0 x y x y
	a x NaN NaN NaN NaN NaN NaN NaN
	y NaN NaN NaN NaN NaN NaN NaN
	b x NaN NaN NaN NaN NaN NaN NaN
	y NaN NaN NaN NaN NaN NaN NaN
	a x NaN NaN NaN -0.524225 0.245902 NaN NaN
	y NaN NaN NaN 0.286759 0.365465 NaN NaN
	b x NaN NaN NaN 0.542192 1.605373 NaN NaN
	y NaN NaN NaN -0.554882 0.124332 NaN NaN
	a x NaN NaN NaN NaN NaN -0.339814 -0.399197
	y NaN NaN NaN NaN NaN 0.053150 0.368418
	b x NaN NaN NaN NaN NaN 0.314094 0.228735
	y NaN NaN NaN NaN NaN 0.030094 0.005102

	If you find those NaNs in the topmost rows and leftmost
	columns odd, I agree with you. What has happened here is
	that the column subindex for df1 is an integer (0,1) while
	df2 and df3 it is a string ('x','y').

	Pandas (I used v 0.11.0) for some reason does not combine
	those indeces consistently, for example below you see that
	a join on axis=1 goes smoothly.

	And if you skip down a bit you will see that if we set the
	indeces as strings ('0','1') the combination goes fine.
	In [10]: df_b
	Out[10]:
	f c1 c2 c3 c4
	s 0 1 0 x y x y
	a x 0.091242 0.700668 1.755267 -0.524225 0.245902 -0.339814 -0.399197
	y 0.933405 0.129897 -0.082570 0.286759 0.365465 0.053150 0.368418
	b x -0.828174 -0.614804 0.299050 0.542192 1.605373 0.314094 0.228735
	y -2.034121 -0.560468 -0.443721 -0.554882 0.124332 0.030094 0.005102

	Note how we don't have an issue with some of the subindeces starting out
	as numbers
	In [11]: df_c
	Out[11]:
	c1 c2 c3 c4
	0 1 0 x y x y
	a x NaN NaN NaN NaN NaN NaN NaN
	y NaN NaN NaN NaN NaN NaN NaN
	b x NaN NaN NaN NaN NaN NaN NaN
	y NaN NaN NaN NaN NaN NaN NaN
	a x NaN NaN NaN -0.524225 0.245902 NaN NaN
	y NaN NaN NaN 0.286759 0.365465 NaN NaN
	b x NaN NaN NaN 0.542192 1.605373 NaN NaN
	y NaN NaN NaN -0.554882 0.124332 NaN NaN
	a x NaN NaN NaN NaN NaN -0.339814 -0.399197
	y NaN NaN NaN NaN NaN 0.053150 0.368418
	b x NaN NaN NaN NaN NaN 0.314094 0.228735
	y NaN NaN NaN NaN NaN 0.030094 0.005102

	This is the same as axis=0
	col = pd.MultiIndex.from_tuples([('c1','0'),('c1','1'),('c2','0')],names=['f','s'])
	dat = pylab.randn(len(idx),len(col))
	df4 = pd.DataFrame(dat, index=idx, columns=col)
	df_d = pd.concat([df4,df2,df3], axis=0)