Load the NumPy module:
import numpy as np
Use the NumPy genfromtxt function to load the data, manually defining the column names.
array = np.genfromtxt('logfile.txt',
                      names=['DATA', 'TEMP', 'UMIDADE', 'PRESSAO', 'LUMINOSIDADE'])
Now check out the size of the array.
array.shape
And now the type and names of each column.
array.dtype
We can access each column by name.
array['TEMP']
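To try these steps without logfile.txt, here is a minimal sketch that reads a small made-up sample from memory. The sample values and the names sample and sample_array are assumptions for illustration, not part of the course data, and the sketch assumes Python 3 with a recent NumPy.

```python
import io
import numpy as np

# Hypothetical stand-in for logfile.txt; every value here is made up.
sample = io.StringIO(
    "1 21.5 40.0 1013.2 300\n"
    "2 22.0 41.5 1013.0 320\n"
    "3 20.5 43.0 1012.8 310\n"
)

names = ['DATA', 'TEMP', 'UMIDADE', 'PRESSAO', 'LUMINOSIDADE']
sample_array = np.genfromtxt(sample, names=names)

print(sample_array.shape)        # one entry per row of the sample
print(sample_array.dtype.names)  # the column names we passed in
print(sample_array['TEMP'])      # a single column as a float array
```

Because names is given, the result is a one-dimensional structured array, so shape counts rows only and the columns live in the dtype.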
And we can easily compute some statistics.
print('Max temp is {}'.format(array['TEMP'].max()))
print('Min temp is {}'.format(array['TEMP'].min()))
print('Mean temp is {}'.format(array['TEMP'].mean()))
print('Std of temp is {}'.format(array['TEMP'].std()))
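One detail worth knowing about the last line: NumPy's std() divides by N by default (ddof=0), while the sample standard deviation divides by N - 1 (ddof=1). A small sketch with made-up temperatures (the temps array is an assumption, not course data) shows the difference.

```python
import numpy as np

# Made-up temperatures, for illustration only.
temps = np.array([20.0, 22.0, 24.0])

population_std = temps.std()     # default ddof=0, divides by N
sample_std = temps.std(ddof=1)   # divides by N - 1; 2.0 for these values

print('Population std is {}'.format(population_std))
print('Sample std is {}'.format(sample_std))
```

For a long log file the two values are nearly the same, but they differ noticeably for short series.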
Set up inline plotting in the IPython Notebook.
%pylab inline
Next we import the matplotlib plotting module.
import matplotlib.pyplot as plt
And we can make a simple plot.
plt.figure()
plt.plot(array['DATA'], array['TEMP'], 'r.')
plt.xlabel('DATA')
plt.ylabel('TEMP')
plt.title('TEMP vs DATA')
plt.show()
Pandas is an awesome data analysis tool. Check out the webpage here: http://pandas.pydata.org/
First we have to install Pandas on our virtual machines. To do this we first need to install pip, the Python package installer. In the folder where you're keeping your Python work from the course, run the following:
$ wget https://bootstrap.pypa.io/get-pip.py
$ sudo python get-pip.py
Now, because our VMs have very little memory, close everything else you have open (other shells, web browsers, etc.) and then run the following to install pandas:
$ sudo pip install pandas
If the VM seems frozen, be patient. They very rarely crash, so just give it a minute.
Import Pandas and set up the pretty plotting.
import pandas as pd
pd.options.display.mpl_style = 'default'
Read in the data file as a tab-separated table, skipping the first 5 rows, using the DATA column as the index, manually providing the column names, and parsing the index as dates.
df = pd.read_table('../shell/data.txt',
skiprows=5,
index_col='DATA',
names=['DATA', 'TEMP', 'UMIDADE', 'PRESSAO','LUMINOSIDADE'],
parse_dates=True)
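The same call can be tried without the data file. This is a minimal sketch, assuming Python 3 and a recent pandas; the in-memory sample, its two junk header lines, its dates and values, and the name sample_df are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.txt: two header lines to skip,
# then tab-separated rows. Every value here is made up.
sample = io.StringIO(
    "logger header line 1\n"
    "logger header line 2\n"
    "2015-01-01 00:00:00\t21.5\t40.0\t1013.2\t300\n"
    "2015-01-01 00:10:00\t22.0\t41.5\t1013.0\t320\n"
    "2015-01-01 00:20:00\t20.5\t43.0\t1012.8\t310\n"
)

sample_df = pd.read_table(sample,
                          skiprows=2,          # skip the two header lines
                          index_col='DATA',    # use the date column as index
                          names=['DATA', 'TEMP', 'UMIDADE', 'PRESSAO', 'LUMINOSIDADE'],
                          parse_dates=True)    # parse the index as dates

print(sample_df.index)    # a DatetimeIndex rather than plain strings
print(sample_df.columns)  # DATA is now the index, not a column
```

With a date index in place, pandas labels the time axis for us when we plot.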
Check out the first few rows.
df.head()
Check out some quick stats.
df.describe()
Make a nice plot, dropping the 'PRESSAO' (pressure) column because it stretches the Y-axis out too much.
df.drop('PRESSAO', axis=1).plot()
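Note that drop returns a new table and leaves the original untouched, so the pressure data is still there for later. A small sketch, using a made-up frame (sample_df and trimmed are hypothetical names) and passing axis=1 by keyword to mean "drop a column":

```python
import pandas as pd

# Made-up values; pressure is far larger than the other columns,
# which is why it dominates a shared Y-axis.
sample_df = pd.DataFrame({'TEMP': [21.5, 22.0],
                          'UMIDADE': [40.0, 41.5],
                          'PRESSAO': [1013.2, 1013.0]})

trimmed = sample_df.drop('PRESSAO', axis=1)  # new frame without the column

print(trimmed.columns.tolist())    # PRESSAO is gone here
print(sample_df.columns.tolist())  # but still present in the original
```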