Basic Data Science in Python guide (pandas, numpy, scikit-learn, nltk, stats)

An introduction to Data Science: Python

Immanuel Potter

This guide may assume some very basic programming knowledge, but not necessarily Python. It aims to give anyone enough knowledge to start diving into data problems themselves using these techniques, and tries to explain (fairly comprehensively) everything that is going on for the newcomer.

This guide will be covering several different topics and should equip you with enough knowledge to try some of what you read out on real problems.

Contents:

• Data Structures & Analysis using pandas

• Machine Learning with scikit-learn

• Data Viz

• Time Series and forecasting

• Data Engineering

• DB Connectivity

• Inferential Statistics

Prerequisites:

• Ideally, some/any programming experience in any language

• Download and install Anaconda

Data analysis using python with pandas

Starting from the top, this is how to create basic data structures using the pandas library in python.

The first structure we are going to create using pandas is a Series. We'll construct this from a dictionary (which is similar to a hashmap for Java users, although it has stricter requirements on which data types can act as a key). A dictionary assigns values to specific keys, so a value can be looked up by its key.

The name pandas comes from "panel data", an econometrics term for multi-dimensional structured datasets. Below, we are creating a dictionary with capital letters as keys and ints as values. We import our two libraries and state the names we'll refer to them by when calling their pre-built functions. On the next line we define 'd' as a dictionary and assign it.

import numpy as np
import pandas as pd
d = {'A': 11, 'B':28, 'C':34}

Now let's put this dictionary object into a Series using pandas.

Series

pd.Series(d)
A    11
B    28
C    34
dtype: int64

Okay, so we have our dictionary d in Series form now thanks to pandas: the keys A, B, C have become the index, and 11, 28, 34 are the corresponding values.

Next we are going to use the random number generator from numpy to generate 5 random numbers. It'll spit out 5 floats drawn from the Gaussian (standard normal) distribution - there's no hard upper or lower limit, but the values are far more likely to be close to 0 than several units away from it.

Since we imported numpy as np earlier, we call this by typing np.random.randn() and passing in how many values we want to generate.

pd.Series(np.random.randn(5))
0    0.373289
1   -0.893447
2   -0.004254
3   -0.296780
4    0.113821
dtype: float64

So now we've created a series of random floats. As we didn't specify the index of any of them, note pandas has generated us the standard array index of [0] to [4] to mark the first and fifth positions respectively.

What if we want the indexes to be something more meaningful? We can do this quite easily, by passing an index parameter to the Series function. Below we're going to pass in an array of indices, and pandas will assign these to the corresponding values in the series.

d = pd.Series(np.random.randn(5), index=['1a', '2b', '3c', '4d', '5e'])
d
1a    0.643820
2b    0.648194
3c    0.994418
4d    0.790750
5e   -0.783578
dtype: float64

We can observe that each index corresponds to a particular float value now. However, a series is only a one-dimensional representation of data - the indices are just names of the positions.
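
As a quick aside, once a Series has a labelled index you can pull values out by label or by position. A minimal sketch using the Series d from above:

d['3c']            # single value at the label '3c'
d[['1a', '4d']]    # a smaller Series containing just those two labels
d.iloc[0]          # first value by position, regardless of label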

DataFrame

A more common structure is a "DataFrame". A DataFrame is a 2-dimensional grid of data, like a matrix, and we can think of it like a spreadsheet or SQL table. It's also like a dictionary of Series objects. It can accept many different types of input such as a dict of 1d ndarrays, lists, dicts and Series, 2-d numpy.ndarrays, structured/record ndarray, a Series (outside a dict), and even other DataFrames.

Let's create one from a dict of two Series. (This means fitting two 1-D Series together to form a 2-D DataFrame).

Notice the decimal points in the second series - this is to make sure these values are stored as floats.

d = {'column1':pd.Series(['E', 'F', 'G', 'H']),'column2':pd.Series([3., 4., 5., 6.])}
d
{'column1': 0    E
 1    F
 2    G
 3    H
 dtype: object, 'column2': 0    3.0
 1    4.0
 2    5.0
 3    6.0
 dtype: float64}

This looks good: two 1d columns called column1 and column2. But at the moment that's all they are - just standalone, 1-dimensional columns. So let's arrange these two columns next to each other, using the pandas DataFrame constructor, to form a grid:

df = pd.DataFrame(d)
df
column1 column2
0 E 3.0
1 F 4.0
2 G 5.0
3 H 6.0

There we have it. Our first DataFrame. Here we created one out of two series, but if you already have sufficient data to create a dataframe, we can make one straight away.

Here we're going to assign the headings as column1 and column2 respectively, and pass the values (in two separate arrays) straight into one dict. This method gives an identical result, but we don't have to bother using the Series function (like above) - though it's worth knowing you can merge Series as well as use the method below.

d = {'column1': ['A1', 'B2', 'C3', 'D4'],
'column2': [1.1, 2.4, 3.6, 4.2]}
df = pd.DataFrame(d)
df
column1 column2
0 A1 1.1
1 B2 2.4
2 C3 3.6
3 D4 4.2

So as we can see - same result, less code.

So far so good, we've created a dictionary object, some Series and now a DataFrame - what's next?

Panel

Panels are where pandas gets its name. They are seen less often, but are nonetheless important; they can represent 3-D data. In the same way that we can create a 2-D DataFrame from two 1-D objects, we can create a 3-D Panel from two DataFrames... if they share an axis.

Let's create one. First we'll pass the first item as a DataFrame of 5 rows and 4 columns, and the second item as a DataFrame of 5 rows and 2 columns.

By calling panel on these, pandas sets the items, major_axis and minor_axis to fit with the values we passed in.

d = {'Item1': pd.DataFrame(np.random.randn(5,4)),
'Item2':pd.DataFrame(np.random.randn(5,2))}
pd.Panel(d)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
d['Item1']
0 1 2 3
0 0.729543 -0.358357 1.052881 0.549296
1 -0.344916 0.818123 0.977408 -0.118194
2 1.743073 -0.657648 -1.578111 2.213557
3 -1.037148 -0.256341 -0.878754 1.642333
4 -0.759792 -1.505197 0.899062 1.070015
d['Item2']
0 1
0 0.310244 -1.243522
1 0.321864 -0.203512
2 0.005516 -0.052490
3 0.269082 0.930782
4 1.817205 -1.328925
d
{'Item1':           0         1         2         3
 0  0.729543 -0.358357  1.052881  0.549296
 1 -0.344916  0.818123  0.977408 -0.118194
 2  1.743073 -0.657648 -1.578111  2.213557
 3 -1.037148 -0.256341 -0.878754  1.642333
 4 -0.759792 -1.505197  0.899062  1.070015, 'Item2':           0         1
 0  0.310244 -1.243522
 1  0.321864 -0.203512
 2  0.005516 -0.052490
 3  0.269082  0.930782
 4  1.817205 -1.328925}

So above we can observe: the Item1 grid has 5 rows and 4 columns as we passed in, and the Item2 grid has 5 rows and 2 columns. They both share the 5 rows as the major axis, and each shoots off in a different dimension with its columns - think of the 5 as a common x axis, the 4 as a y axis and the 2 as a z axis in a 3-dimensional plot. Notice there is no column titled '2' or '3' in the second item, as its columns don't go that far.
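
A quick note if you're running a recent pandas release: Panel was deprecated in pandas 0.20 and removed altogether in 1.0, so the call above only works on older versions. The same idea is now usually expressed as a DataFrame with a hierarchical index (covered in the next section), or with the separate xarray library. A minimal sketch of the hierarchical-index route, reusing the dict d of two DataFrames from above:

panel_like = pd.concat(d)     # the dict keys become an outer index level
panel_like.loc['Item1']       # recovers the original Item1 DataFrame

Because the two frames have different numbers of columns, the Item2 rows pick up NaN in columns 2 and 3 - much like the Panel did across its minor axis.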

Hierarchical Indexing

What is hierarchical indexing?

This is also known as multi-level indexing, and is used on multi-indexed objects.

It allows for the storage and manipulation of multi-dimensional data in data structures with fewer dimensions, such as Series inside DataFrames and DataFrames inside panels.

Let's go ahead and implement one as it may be easier to see in action.

First, we'll create an array of two arrays.

ourarrays = [['blue', 'blue', 'blue', 'blue', 'red', 'red', 'red', 'green', 'green','yellow'],
['true', 'false', 'true', 'false', 'true', 'false', 'true', 'false', 'true', 'false']]

Okay, so we've created the array of arrays. Now, by calling the pandas MultiIndex function, we can assign the titles of these two arrays to match the position of the two values we pass in - in another array. See below for clarification.

ourindex = pd.MultiIndex.from_arrays(ourarrays, names=['colour', 'boolean'])
ourindex
MultiIndex(levels=[['blue', 'green', 'red', 'yellow'], ['false', 'true']],
           labels=[[0, 0, 0, 0, 2, 2, 2, 1, 1, 3], [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]],
           names=['colour', 'boolean'])

Now we can see that the levels have been sorted alphabetically: blue corresponds to position [0], green to [1], red to [2] and yellow to [3]. Hence the pattern [0, 0, 0, 0, 2, 2, 2, 1, 1, 3] in the labels represents our original colour sequence from ourarrays. In the second levels array, position [0] corresponds to false and position [1] to true, which encodes the true-false-true-false pattern we also created in ourarrays.

Successfully multi-indexed.

Let's now create a Series of random numbers, but this time using the multi-level index we've just made.

myseries = pd.Series(np.random.randn(10), index=ourindex)
myseries
colour  boolean
blue    true      -0.953461
        false     -0.977766
        true       1.157950
        false      0.692890
red     true      -0.185875
        false      0.809776
        true       1.736262
green   false      0.583749
        true      -2.049528
yellow  false      0.043418
dtype: float64

Everything is grouped together nicely in hierarchical form now.

So quickly, if we wanted to find the values corresponding to "green" colour and then the "blue", "false" value, we would type:

myseries['green']
boolean
false    0.583749
true    -2.049528
dtype: float64
myseries['blue']['false']
false   -0.977766
false    0.692890
dtype: float64

Which we can see is as we expected.
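
You can also take a cross-section on the inner level rather than the outer one, using xs(). A minimal sketch against the myseries object above:

myseries.xs('true', level='boolean')   # every row whose boolean label is 'true', across all colours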

Querying data in pandas

Here, we're going to have a look at how to query data using pandas. We're going to use a small example to simplify the process and enable you to see what's going on - but you will be able to imagine how useful this is in larger scenarios.

First, we're going to create a dict of 4 arrays with titles. Think of these as stray columns, that aren't currently joined together in any way.

d = {'column_A': [1, 2, 3, 4],
'column_B': [2, 4, 6, 8],
'column_C':[4, 8, 16, 32],
'column_D': [10, 20, 30, 40]}
d
{'column_A': [1, 2, 3, 4],
 'column_B': [2, 4, 6, 8],
 'column_C': [4, 8, 16, 32],
 'column_D': [10, 20, 30, 40]}

Let's combine these all together using our DataFrame function.

df = pd.DataFrame(d, index=['row_one', 'row_two', 'row_three', 'row_four'])
df
column_A column_B column_C column_D
row_one 1 2 4 10
row_two 2 4 8 20
row_three 3 6 16 30
row_four 4 8 32 40

Okay, so we have our DataFrame - now let's run some simple queries so you can get a flavour for the syntax. What are the values of the column_C column?

df['column_C']
row_one       4
row_two       8
row_three    16
row_four     32
Name: column_C, dtype: int64

That makes sense. What about column C up to and including the third value?

df['column_C'][:3]
row_one       4
row_two       8
row_three    16
Name: column_C, dtype: int64

That gave us the trimmed result we were after. What about if we want to find information by row rather than column? For example row 4?

df.loc['row_four']
column_A     4
column_B     8
column_C    32
column_D    40
Name: row_four, dtype: int64

This has given us the row by passing its name - but supposing we didn't know the name, we can query by row position instead using iloc:

df.iloc[3]
column_A     4
column_B     8
column_C    32
column_D    40
Name: row_four, dtype: int64

This gives us the same result as above, but we passed in the row position [3].

How about if we want just the third and fourth row?

df[2:4]
column_A column_B column_C column_D
row_three 3 6 16 30
row_four 4 8 32 40

This means: give us the rows from position 2 up to (but not including) position 4 - i.e. the third and fourth rows. So this makes sense too - slicing like this is useful for large DataFrames.

We're going to add a column called column_New to our DataFrame based on the values of an existing column. Defining 'column_New' as the column name, we are saying that this column should hold the result of testing whether column_A modulo 2 equals 0 - that is, whether each value of column_A leaves no remainder when divided by 2... so whether it is even.

This results in a boolean as it can only be true or false.

df['column_New'] = df['column_A'] % 2 == 0
df
column_A column_B column_C column_D column_New
row_one 1 2 4 10 False
row_two 2 4 8 20 True
row_three 3 6 16 30 False
row_four 4 8 32 40 True

Reading down column_A, we can see that 1 is not divisible by 2, 2 is, 3 isn't and 4 is - hence the pattern observed in column_New. We just added a column to a DataFrame using a conditional statement based on the values of other data in our DataFrame.
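
One more query pattern worth knowing: you can filter rows with a boolean condition rather than by name or position. A minimal sketch against the same df:

df[df['column_B'] > 4]                  # only the rows where column_B is greater than 4
df.loc[df['column_New'], 'column_C']    # column_C values for the rows where column_New is True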

You can now navigate DataFrames and query data you want from them.

Data Aggregation using pandas

So far, we've looked at data structures, how to set them up and how to navigate them - let's jump into an actual data set from the internet and use some pandas functions on it to try and learn something about the data we're presented with.

For this example we're going to look at a famous, free online data set on flower petals called "iris". It's in .csv format. If you'd like to read around the iris dataset before diving in I suggest having a read at http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

It's possible to retrieve this actual dataset straight from scikit learn, but for URL scraping purposes we're going to grab it via urllib in this section.

We're going to use the urllib library here - it's part of Python's standard library, so if you have Anaconda (or any Python installation) you already have it. The read_csv function from pandas will read a .csv format file straight into a DataFrame.

import urllib

Okay, let's assign the webpage URL where our data set lies to a variable called myUrl:

myUrl = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Now we're using our library import's built-in request function to start communicating with the page:

urlRequest = urllib.request.Request(myUrl)

Now it's opening the url and storing everything in the iris_fp variable we're creating.

iris_fp = urllib.request.urlopen(urlRequest)

Okay, now everything's in iris_fp we can actually read the data into pandas using the read_csv function. Don't be alarmed at the length of this command: iris_fp is everything above condensed into one variable; sep is how the data is separated (if you open the URL you can indeed see the commas that separate the values); header - we don't need one in this case, but feel free to set one. Decimals are represented by '.' - this is specified because it isn't the convention in every country.

The names array is what each data value represents, or the "column headers" if you like.

iris = pd.read_csv(iris_fp, sep =',', header=None, decimal ='.', names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'])
iris.head()
sepal_length sepal_width petal_length petal_width target
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Nice. Calling .head() on a DataFrame will give you the first 5 rows so you don't have to take up lots of room by viewing the entire dataset top to bottom. You can also pass in how many rows you want to see as a parameter.
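
Incidentally, pd.read_csv can fetch a URL for you directly, so the urllib steps above are only needed when you want finer control over the request (custom headers, authentication and so on). A minimal sketch, assuming the same myUrl and column names as before:

iris = pd.read_csv(myUrl, header=None, decimal='.',
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'])
iris.head()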

In the real world, data nearly always comes in "dirty", so we would normally apply some forms of pre-processing, normalization or standardization - but this data set is clean.

For now, this means we can jump straight into analysis.

Let's run some functions to tell us a little bit more about the overall data set.

type(iris)
pandas.core.frame.DataFrame

We already knew this was a DataFrame, as .read_csv() reads data into a DataFrame when in .csv format - but calling type(yourData) will let you know what you're dealing with.

iris.tail(7)
#This is like calling .head(), but shows the last of the DataFrame rather than the first
sepal_length sepal_width petal_length petal_width target
143 6.8 3.2 5.9 2.3 Iris-virginica
144 6.7 3.3 5.7 2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
iris['target'].count()
#This shows how many entries there are in the dataset.
150

From the head and tail, we can see there are different types of plant in the "target" section.

If you type "iris" in your console, you can observe the whole table. There are only 150 entries, so we can inspect data of this size by eye to some extent - look at the "target" column. Can you see the other types of plant?

How many entries are NOT virginicas?

notVirginicas = iris[iris['target'] != 'Iris-virginica']
notVirginicas.count()
sepal_length    100
sepal_width     100
petal_length    100
petal_width     100
target          100
dtype: int64

Exactly 100 aren't virginicas.

What if we were interested in just the setosa plants, and not the virginica/other plants?

df = iris[iris['target'] == 'Iris-setosa']
df.count()
sepal_length    50
sepal_width     50
petal_length    50
petal_width     50
target          50
dtype: int64

What did we just do? We've made two smaller DataFrames out of the main Iris DataFrame - one with NOT virginicas, and one that just uses the setosa plants. We called count() on them, and it showed that there are 100 plants that are not virginicas, and 50 results in the smaller Setosa DataFrame.

We've got the plants we're interested in stored here. Suppose we wanted to know what the mean, standard deviation, smallest, largest and median values of petal length, width or others were for our setosa plants?

It's as easy as:-

df.mean()
#mean of our setosa dataframe
sepal_length    5.006
sepal_width     3.418
petal_length    1.464
petal_width     0.244
dtype: float64
df['petal_width'].std()
#standard deviation of petal width for our setosa dataframe
0.10720950308167837
df['sepal_length'].min()
#smallest value of sepal length in our setosa dataframe
4.2999999999999998
df['sepal_width'].max()
#largest value of sepal width in our setosa dataframe
4.4000000000000004
df.median()
#median values of each attribute in our setosa dataframe
sepal_length    5.0
sepal_width     3.4
petal_length    1.5
petal_width     0.2
dtype: float64
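
If you want all of these summary statistics in one go, describe() reports the count, mean, standard deviation, min, quartiles and max for every numeric column. A minimal sketch on the same setosa DataFrame:

df.describe()     # count, mean, std, min, 25%, 50%, 75% and max for each numeric column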

Data Merging with DataFrames

If you know SQL, you'll be comfortable with the logic behind merging with DataFrames as it works in just the same way as table joins.

Let's make two DataFrames, df1 and df2. They both have the same key - this makes our life easier when merging them. Their other columns are different, and that's absolutely fine.

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
'B': ['B0', 'B1', 'B2', 'B3', 'C4']})

df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4'],
'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
'D': ['D0', 'D1', 'D2', 'D3', 'C5']})
pd.merge(df1, df2, on='key')
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
4 A4 C4 K4 C4 C5

We called merge from pandas, passed in our two DataFrames, specified to join on our key column - and our two DataFrames have been merged into one with no issues.

Let's create another DataFrame, df3, but with some variation in its key; we'll add some duplicates and some values that go beyond the scope of df1.

df3 = pd.DataFrame({'key': ['K0', 'K0', 'K5', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3', None],
'D': ['D0', 'D1', 'D2', 'D3', 'D4']})
pd.merge(df1, df3, on='key', how='left')
A B key C D
0 A0 B0 K0 C0 D0
1 A0 B0 K0 C1 D1
2 A1 B1 K1 NaN NaN
3 A2 B2 K2 C3 D3
4 A3 B3 K3 None D4
5 A4 C4 K4 NaN NaN

Hmmm... what's going on in rows 2 and 5? We have four "NaN" (Not a Number) entries there, plus a None at row 4 that was already present in df3. Even though 'K3' sat in a different position in df3's key, its data still merged correctly. The NaNs arise because 'K1' and 'K4' appear in df1's key but not in df3's, so in this left join those rows are kept but their C and D values cannot be filled in. The 'K5' row of df3, which has no match in df1, is dropped from the result entirely.

So in a left join, every row of the left DataFrame is kept (with NaN where there is no match), while rows that only exist in the right DataFrame are dropped.

What if we have multiple duplicate keys?

dfleft = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1'],
'key2': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})

dfright = pd.DataFrame({'key1': ['K0', 'K0', 'K2', 'K1'],
'key2': ['K0', 'K1', 'K1', 'K2'],
'A': ['C0', 'C1', 'C2', 'C3'],
'B': ['D0', 'D1', 'D2', 'D3']})
dfleft
A B key1 key2
0 A0 B0 K0 K0
1 A1 B1 K0 K1
2 A2 B2 K1 K2
3 A3 B3 K1 K3
dfright
A B key1 key2
0 C0 D0 K0 K0
1 C1 D1 K0 K1
2 C2 D2 K2 K1
3 C3 D3 K1 K2
pd.merge(dfleft, dfright, on=['key1', 'key2'])
A_x B_x key1 key2 A_y B_y
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 C1 D1
2 A2 B2 K1 K2 C3 D3

Data is lost as we have specified to join on BOTH key columns. This means if any key for each row doesn't match, the row will be omitted. The first row of both DataFrames is fine as they both use a 'K0' key1 and key2. The second row of both also makes it into the merge as they both use a 'K0' and 'K1' key. The third row of dfleft also makes it through as it uses a 'K1' and 'K2' key, which matches with the last row of dfright. However, both the last row of dfleft and the third row of dfright are lost - because there is no 'K3' to join in dfright, and dfleft doesn't use a 'K2' key 1 and 'K1' key 2.

That was a lot of Ks and keys, but if you're having trouble understanding what's going on, try and match up the two original tables side by side with the merge to see what's gone and why.
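
If losing unmatched rows is a problem, the how parameter controls the join type just as it would in SQL: 'inner' (the default), 'left', 'right' or 'outer'. A minimal sketch with the same dfleft and dfright, keeping every key combination from both sides:

pd.merge(dfleft, dfright, on=['key1', 'key2'], how='outer')
# unmatched rows from either side are kept, with NaN filling the columns that have no counterpart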

Machine Learning using Scikit-learn

In this section we're going to discuss using scipy and scikit-learn, standardization of data to be able to apply machine-learning algorithms, normalization of data, basic linear regression and also an introduction to supervised machine-learning.

SciPy provides toolboxes of numerical algorithms, including signal processing and optimization. It's built on top of the NumPy library and extends NumPy's existing functions.

You can use statistical functions, Fourier transform routines, interpolation and smoothing splines, linear algebra, optimization routines, orthogonal distance regression, integration operations, ordinary differential equation solvers, N-dimensional image processing, and much more.

Scikit-learn is a great machine learning library that incorporates a tonne of ready-to-use algorithms, and can be used for all things ML.

Let's revisit our friend numpy and use some of its functionality to learn how to concatenate arrays.

Concatenating Arrays

import numpy as np
np.concatenate(([4], [0]*3, np.arange(-2, 2.25, 0.25)))
array([ 4.  ,  0.  ,  0.  ,  0.  , -2.  , -1.75, -1.5 , -1.25, -1.  ,
       -0.75, -0.5 , -0.25,  0.  ,  0.25,  0.5 ,  0.75,  1.  ,  1.25,
        1.5 ,  1.75,  2.  ])

Let's break our concatenate call down.

We had a list with one element in it, [4]. Then we had another one-element list, [0], which we multiplied by 3 to get [0, 0, 0] - a single list of three zeros. We then specified a range using numpy's arange, which started at -2, stopped before 2.25 and went up in increments of a quarter - you can see it ascending in the output above.

These were all fed into np.concatenate, and our output is one array with our 4, three 0s, and 17 various floats - so we merged them all into one array with 21 terms.

Quicker way to write this:

np.r_[4,[0]*3,-2:2:17j]
array([ 4.  ,  0.  ,  0.  ,  0.  , -2.  , -1.75, -1.5 , -1.25, -1.  ,
       -0.75, -0.5 , -0.25,  0.  ,  0.25,  0.5 ,  0.75,  1.  ,  1.25,
        1.5 ,  1.75,  2.  ])

NOTE: The first method is not inclusive of the upper limit of the range - np.arange stops before 2.25, so the values end at 2.0. If we wanted 2.25 included, we'd have to specify a stop value larger than 2.25.

np.r_ allows arrays to be quickly concatenated, and the range follows a similar pattern - the imaginary-style suffix j tells np.r_ how many evenly spaced terms to produce between the two limits (inclusive of both), in this case 17.

Polynomials

The poly1d class from numpy makes working with polynomials - and a little calculus - easier.

You can represent expressions and equations from pure maths in python in a very readable format.

from numpy import poly1d
p = poly1d([3,2,9])
print(p)
   2
3 x + 2 x + 9

Inside the poly1d constructor, just specify the coefficients in descending powers of x - obviously, the longer the array, the higher the starting power of x will be.

p*p
poly1d([ 9, 12, 58, 36, 81])
print(p*p)
   4      3      2
9 x + 12 x + 58 x + 36 x + 81

We can square equations as above,

Integrate them:

print(p.integ(k=12))
   3     2
1 x + 1 x + 9 x + 12

Where k is the constant of integration.

Differentiation? Let's find the derivative of p*p:

psquared = p*p
print(psquared.deriv())
    3      2
36 x + 36 x + 116 x + 36

And it's also worth noting that once our function is in the poly1d format, we can call that function for values of x. E.g, if we call p at x=4 and x=7:

p([4,7])
array([ 65, 170])

We return 65 and 170 respectively.

Standardization

This is essentially the process of removing the mean from the data and scaling what remains to unit variance.

This is a commonly required operation for many machine learning estimators, as they require the data to closely resemble a standard normal distribution.

It involves centering and scaling the data in the given training set so that it ends up with a mean of 0 and a standard deviation of 1 (ideally). Since the standard deviation is the square root of the variance, a standard deviation close to 1 means the variance is close to 1 too.

From scikit-learn, let's import preprocessing.

from sklearn import preprocessing
A = np.array([[2.0, 1.0, 4.0], [4.0, 2.0, 7.8], [3.0, 1.4, 6.2]])
A
array([[ 2. ,  1. ,  4. ],
       [ 4. ,  2. ,  7.8],
       [ 3. ,  1.4,  6.2]])

Okay, so A is a 3x3 array of floats. Let's create an instance of a scaler from the preprocessing module we imported - fitting it to the data will record the statistics (the mean and scale of each column) needed to standardize it.

myscaler = preprocessing.StandardScaler().fit(A)

We've created a scaler fitted to our A array. Next we'll look at its mean_ and scale_ attributes.

With those values, we can then transform A to our standardized form.

myscaler.mean_
array([ 3.        ,  1.46666667,  6.        ])
myscaler.scale_
array([ 0.81649658,  0.41096093,  1.55777619])
myscaler.transform(A)
array([[-1.22474487, -1.13554995, -1.28388148],
       [ 1.22474487,  1.29777137,  1.15549333],
       [ 0.        , -0.16222142,  0.12838815]])

There... a scaled array. However - we need to get test and train versions for our ML model, and this is because we need to provide ourselves an honest assessment of whether the predictive model will perform well or not.

The train data is what you use to teach a Machine Learning model, and the test data is what you want to actually predict: You keep them separate to prevent "overfitting" or cross-contamination, as you want to make sure the model performs on any test data you pass it.

For our purposes here, I'm just showing you how to standardize the data into these forms.

A_test = np.array([[2.0, 1.0, 4.0], [3.0, 4.0, 6.3], [2.8, 3.0, 5.8]])
myscaler.transform(A_test)
array([[-1.22474487, -1.13554995, -1.28388148],
       [ 0.        ,  6.164414  ,  0.19258222],
       [-0.24494897,  3.73109269, -0.12838815]])

So that's the test array we will eventually apply our algorithm to, transformed with the scaling that was fitted on A.

Next, a different kind of scaling: MinMaxScaler rescales each feature into a given range - between 0 and 1 by default - which suits some machine learning algorithms better than standardization.

Our training dataset in this case can just be A, as we already have a separate dataset for test.

A_train = A
minmaxscaler = preprocessing.MinMaxScaler()
A_train_min_max = minmaxscaler.fit_transform(A_train)
A_train_min_max
array([[ 0.        ,  0.        ,  0.        ],
       [ 1.        ,  1.        ,  1.        ],
       [ 0.5       ,  0.4       ,  0.57894737]])

A_train_min_max has been scaled into the 0-1 range.

A_test_min_max = minmaxscaler.transform(A_test)
A_test_min_max
array([[ 0.        ,  0.        ,  0.        ],
       [ 0.5       ,  3.        ,  0.60526316],
       [ 0.4       ,  2.        ,  0.47368421]])

Now A_test_min_max has also been transformed with the same scaling.

We started off with A, a 3x3 array. We created a scaler and fitted it to A, which recorded the mean and scale of each column, and we then transformed the array.

After that we set A_train to be our original A array. We then created a MinMaxScaler instance, which scales and translates each feature individually so that it falls within a given range on the training set - between zero and one by default. We used it to fit and transform A_train, and to transform A_test, leaving us with final train and test arrays.

Thus we have standardized these two sets and we are on the way to being able to process them through ML.
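
In practice you would usually let scikit-learn split the data for you, and fit the scaler on the training portion only, so that nothing about the test set leaks into the scaling. A minimal sketch, assuming a feature matrix X and a label vector y already exist (both names are just placeholders here):

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = preprocessing.StandardScaler().fit(X_train)   # learn the mean and scale from the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)                  # re-use that same transformation on the test data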

Normalization

The steps here are very similar to the standardization process, but normalization is for scaling individual samples so they have unit norm.

This is a common operation in text classification and clustering.

The Normalizer is described in the scikit-learn documentation here:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html

"Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one. This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion)."

from sklearn import preprocessing
import numpy as np
A = np.array([[3.0, 1.0, 5.0,], [1.0, 2.0, 3.0], [5.0, 2.0, 2.0]])
A
array([[ 3.,  1.,  5.],
       [ 1.,  2.,  3.],
       [ 5.,  2.,  2.]])

So, we've made a new 3x3 array called A.

Let's now create an instance of a normalizer from preprocessing and apply it to A:

mynormalizer = preprocessing.Normalizer().fit(A)
mynormalizer.transform(A)
array([[ 0.50709255,  0.16903085,  0.84515425],
       [ 0.26726124,  0.53452248,  0.80178373],
       [ 0.87038828,  0.34815531,  0.34815531]])

So now as before, let's make an A_test:

A_test = np.array([[2.0, 1.0, 4.0], [1.0, 3.0, 2.0], [5.5, 2.2, 1.9]])
mynormalizer.transform(A_test)
array([[ 0.43643578,  0.21821789,  0.87287156],
       [ 0.26726124,  0.80178373,  0.53452248],
       [ 0.88411184,  0.35364474,  0.30542045]])

And there we have a normalized Test version.
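
A quick way to convince yourself the normalizer has done its job is to check that every row of the output now has an L2 norm of 1. A minimal sketch:

np.linalg.norm(mynormalizer.transform(A), axis=1)   # should print an array of ones - each sample now has unit length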

Linear Regression

Right, now we've looked at prepping data sets for machine learning, let's actually apply some of this knowledge in an algorithm.

Let's revisit the iris dataset from earlier - this time we'll grab it from scikit-learn, and jump straight into applying a first machine-learning algorithm to it.

from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
iris = datasets.load_iris()

Feel free to type "iris" in your console, and you should see the entire dataset and its description - which is how I knew "feature_names" would work below.

iris.feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Okay, so the raw data is there. Let's suppose we wanted to set up a regression plot of how sepal width varies with sepal length?

We can see above the data is split into arrays, so we can manipulate which part of the data we want to extract.

Setting up our axes:

X = iris.data[0:49,[0]]
Y = iris.data[0:49,[1]]

This says: take rows 0 to 48 (Python slices stop before the end index, so 0:49 actually gives the first 49 rows - use 0:50 if you want all 50 setosa samples), with column position 0 (sepal length) as our X and position 1 (sepal width) as our Y.

fig, ax = plt.subplots(1,1)

This was just setting up our figure and axes to become plotted... well, on our plot.

Our axis should plot Y against X, width versus length:

ax.scatter(X,Y)
<matplotlib.collections.PathCollection at 0xb822eb8>

Setting our axis labels:

ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal width')
<matplotlib.text.Text at 0xafdda90>
plt.show()

And this is looking pretty good. It's definitely fair to say there's some correlation here so we picked good properties to test for a relationship. Let's add a linear regression line to it.

First we import the linear model functionality, then store it as a variable, and add our axes to it.

from sklearn import linear_model
lm = linear_model.LinearRegression()
fitted = lm.fit(X,Y)
#Now we want to add our line to the plot
X = iris.data[0:49,[0]]
Y = iris.data[0:49,[1]]
fig, ax = plt.subplots(1,1)
ax.scatter(X,Y)
#new line is below
ax.plot(X, lm.predict(X), color = 'red')
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
plt.show()

print('Sepal Length coefficient = ' + str(lm.coef_) + '\n')
Sepal Length coefficient = [[ 0.80711987]]

And that bonus line at the end told us the gradient.
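
If you'd like a single number for how well the line fits, the estimator's score() method returns the R-squared value for the data you pass in, and intercept_ completes the equation of the line alongside coef_. A minimal sketch with the model fitted above:

print(lm.score(X, Y))   # R-squared: 1.0 is a perfect fit, 0 means no better than predicting the mean
print(lm.intercept_)    # where the fitted line crosses the y axis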

Nice. Now how about for the petal length and width of versicolor plants?

X = iris.data[49:100,[2]]
Y = iris.data[49:100,[3]]
#refit the model on the new petal data before drawing its line
lm.fit(X,Y)
fig, ax = plt.subplots(1,1)
ax.scatter(X,Y)
ax.plot(X, lm.predict(X), color = 'gold')
ax.set_xlabel('Petal length')
ax.set_ylabel('Petal width')
plt.show()

What about how petal length varies with sepal length across versicolor and virginica plants?

X = iris.data[49:149,[0]]
Y = iris.data[49:149,[2]]
#refit again for sepal length versus petal length
lm.fit(X,Y)
fig, ax = plt.subplots(1,1)
ax.scatter(X,Y)
ax.plot(X, lm.predict(X), color = 'green')
ax.set_xlabel('Sepal length')
ax.set_ylabel('Petal length')
plt.show()

So now we've actually applied a machine learning algorithm, albeit a simple one, linear regression, let's look at supervised machine learning.

Supervised machine-learning

Let's create a program that can recognise hand-written digits and classify them as to what number they represent by scanning them optically.

Let's jump straight in, and take a look at a dataset we've never seen before.

import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

This dataset contains 8x8 matrix representations of bitmap images of handwritten digits, along with the 10 target classes, each one representing one of the 10 digits 0-9. In the case of supervised learning, the learning targets are stored in the 'target' member of the dataset: 'target' is the number each image is meant to represent, and 'target_names' holds the labels for these.

We have our set now... How many records/columns are there?

digits.data.shape
(1797, 64)

64 columns, but how are these formatted?

digits.images.shape
(1797, 8, 8)

Ah, in 8x8 matrices. This tallies, as we know the images are 8x8 bitmap images.

print(digits.images[-2].shape)
(8, 8)

Here we checked that the shape of the second to last image is indeed 8 x 8, which it is.

If we needed to reshape one (for example, to turn a flat 64-value row from digits.data back into an 8 x 8 grid), we can call:

digits.images[-2].reshape(8,8)
array([[  0.,   0.,   2.,  10.,   7.,   0.,   0.,   0.],
       [  0.,   0.,  14.,  16.,  16.,  15.,   1.,   0.],
       [  0.,   4.,  16.,   7.,   3.,  16.,   7.,   0.],
       [  0.,   5.,  16.,  10.,   7.,  16.,   4.,   0.],
       [  0.,   0.,   5.,  14.,  14.,  16.,   4.,   0.],
       [  0.,   0.,   0.,   0.,   0.,  16.,   2.,   0.],
       [  0.,   0.,   4.,   7.,   7.,  16.,   2.,   0.],
       [  0.,   0.,   5.,  12.,  16.,  12.,   0.,   0.]])
digits.images[-2].reshape(8,8).tolist()
[[0.0, 0.0, 2.0, 10.0, 7.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 14.0, 16.0, 16.0, 15.0, 1.0, 0.0],
 [0.0, 4.0, 16.0, 7.0, 3.0, 16.0, 7.0, 0.0],
 [0.0, 5.0, 16.0, 10.0, 7.0, 16.0, 4.0, 0.0],
 [0.0, 0.0, 5.0, 14.0, 14.0, 16.0, 4.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 2.0, 0.0],
 [0.0, 0.0, 4.0, 7.0, 7.0, 16.0, 2.0, 0.0],
 [0.0, 0.0, 5.0, 12.0, 16.0, 12.0, 0.0, 0.0]]

And this gives us a pixel by pixel breakdown of the image here.

Let's take this second-last image and make a plot of it using imshow, passing it in, setting the colour mapping to grayscale, and interpolation to nearest. Interpolation is an estimation of a value within two known values in a sequence of values.

myimg = digits.images[-2]
plt.imshow(myimg, cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

There's our image... It looks an awful lot like a number 9, but we need to let the machine know what a no. 9 looks like to be able to predict anything.

Let's store the number of samples and obtain the support vector machine module out of scikit-learn.

SVC is the support vector classifier, and a gamma value means how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The C parameter trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors.

numSamples = len(digits.images)
data = digits.data
from sklearn import svm
myEstimator = svm.SVC(gamma=0.001, C=100)

Let's fit our data to myEstimator now:

myEstimator.fit(data[:int(numSamples/2)], digits.target[:int(numSamples/2)])
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

This said: fit the model on the first half of the images, and on the targets for that same first half.

The 'target names' column in digits told us which number the bitmap is trying to represent.

Let's now actually use this estimator to "predict" this image that we have been inspecting.

We'll say: predict the image at slice [-2:-1] - that is, just the second-to-last image.

myEstimator.predict(data[-2:-1])
array([9])

It's realised for itself it's a 9. Have a go at predicting other numbers in the dataset! Can you obtain data about those? What about an image the computer hasn't seen before? How accurate was it?
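
To answer the "how accurate was it?" question properly, we can predict the entire unseen half in one go and compare against the known targets using scikit-learn's metrics module. A minimal sketch:

from sklearn import metrics

expected = digits.target[int(numSamples/2):]              # the half we did not train on
predicted = myEstimator.predict(data[int(numSamples/2):])

print(metrics.accuracy_score(expected, predicted))        # fraction of digits classified correctly
print(metrics.classification_report(expected, predicted)) # precision and recall for each digit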

Python for Data Visualization

It's important to be able to put data into formats that can easily display patterns and trends, and for people of all backgrounds to be able to make sense of it.

In this section, we're going to look at using the R-like ggplot, heatmaps and 3-D charts.

Let's see what ggplot can do.

ggplot

import numpy as np
import matplotlib.pyplot as plt

We'll start by storing pi from numpy, and creating a range for x from -8 pi to +8 pi in increments of 0.01.

pi = np.pi
x = np.arange(-8*pi, 8*pi, 0.01)

Now we're creating our function, and here we're creating sinc(x) which is sin(x)/x.

y = np.sin(x)/x

We're setting the style to ggplot here, and mapping the points of our x range and function of x from above onto the line.

plt.style.use('ggplot')
line = plt.plot(x,y)

Here is the setup of our plot. We're telling it that we want to use the line we've created above (which uses x and y), the colour to be green, a linewidth of 2 and a solid (continuous) linestyle.

plt.setp(line, color = 'g', linewidth=2.0, linestyle='-')
[None, None, None]

Now we're just setting the labels and calling the show() function to display us the finished result.

plt.xlabel('x')
<matplotlib.text.Text at 0xb882eb8>
plt.ylabel('sinc(x)')
<matplotlib.text.Text at 0xcaa0e48>
plt.show()

This looks great! What about if we change the styling on the setup line? let's have a red line that's thicker and composed of slashes and dots.

pi = np.pi
x = np.arange(-12*pi, 12*pi, 0.01)
y = np.sin(x)/x
plt.style.use('ggplot')
line = plt.plot(x, y)
plt.setp(line, color='r', linewidth=3.0, linestyle='-.')
plt.xlabel('x')
plt.ylabel('sinc(x)')
plt.show()

So this is simple once you've seen the syntax!

What about if you wanted to have more than one plot show at the same time, one below the other?

For this we can use subplots.

If they are using the same range for their x axis we can use that same axis for both charts.

Always import matplotlib.pyplot first, and in this case numpy for the special maths functionality.

Notice we've added a different y function this time round, cos(x).

pi = np.pi
x = np.arange(-10*pi, 10*pi, 0.01)
y1 = np.sin(x)/x
y2 = np.cos(x)

The subplot() command is the same as in MATLAB, if you've ever used it - it creates a grid organised into rows and columns, plotting a different figure in each cell. In its three-digit shorthand form it takes three integers, subplot(number_rows, number_columns, plot_number), so each of the three numbers must be a single digit (fewer than ten plots in total). What we're about to do is very similar to making a normal plot, except that we give the two plots subplot identifier integers and specify which version of y we want to use. So subplot 211 will give us sinc(x) and subplot 212 will give us cos(x).

And just to clarify the 211 and 212:

211 in the subplot command means 2 rows (i.e. two separate charts), 1 column, 1st plot.

212 means the same 2 rows and 1 column, 2nd plot.

plt.subplot(211)
plt.style.use('ggplot')
line1 = plt.plot(x, y1)
plt.setp(line1, color='b', linewidth=2.0, linestyle='-')
plt.xlabel('x')
plt.ylabel(('sinc(x)'))

plt.subplot(212)
plt.style.use('ggplot')
line2 = plt.plot(x, y2)
plt.setp(line2, color='g', linewidth=2.0, linestyle='-.')
plt.xlabel('x')
plt.ylabel('cos(x)')
plt.show()

Boxplots

#import numpy and matplotlib.pyplot as you usually would

For this boxplot demonstration, we're going to use some data from a normal distribution as it'll aid in explaining what's going on visually. To do this we will seed numpy's random number generator, then draw the samples using random.normal from numpy.

np.random.seed(8)

Next we are passing in the centre of the distribution (the mean), 300 here, the standard deviation, which we want as 25, and a size/output shape of 300.

mydata = np.random.normal(300,25,300)

Due to the nature of boxplots, we set our configuration in a slightly different way. Don't be alarmed - we first create the boxplot from our data, then set the box colour and linewidth, then whisker colours and linewidth, and the flier markers, colour and transparency.

mybp = plt.boxplot(mydata)
for box in mybp['boxes']:
    box.set(color='#4F319F', linewidth=3.0)
for whisker in mybp['whiskers']:
    whisker.set(color='#FF0000', linewidth=3.0)
for flier in mybp['fliers']:
    flier.set(marker='o',color='#D34F62', alpha=0.5)
plt.show()

• Centre of plot is box itself – the stem

• Top box line is Q3 or 75% value of data / upper hinge

• Bottom line of box is Q1 or 25% value of data/ lower hinge

• Middle red box line is median

• Top black horizontal is the "max" or upper inner fence, which sits 1.5x the height of the box (the interquartile range) above Q3

• Bottom black horizontal is the "min" or lower inner fence, which sits 1.5x the interquartile range below Q1 (these fence values are checked numerically in the short sketch after this list)

• The whiskers are the lines between the stem and the min and max

• Outliers lie out of all these ranges
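
To tie the picture back to numbers, you can compute the quartiles and fences yourself with numpy and compare them with what the boxplot draws. A minimal sketch on the same mydata:

q1, median, q3 = np.percentile(mydata, [25, 50, 75])
iqr = q3 - q1                        # interquartile range - the height of the box
print(q1, median, q3)
print(q1 - 1.5*iqr, q3 + 1.5*iqr)    # the inner fences the whiskers cannot extend beyond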

Heatmaps

Useful for all sorts of applications - they display higher frequencies as a darker/more intense colour so we can quickly see where the action is.

We're just going to use 14000 random numbers for our x and the same for y, and create a heatmap of where they fall - since numpy's random.randn draws from the standard normal distribution, we would expect the highest frequency to appear close to the middle.

Let's find out - we'll start by setting x and y to these randomly generated numbers, so yours will look a little different.

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(14000)
y = np.random.randn(14000)

Next using numpy's 2d histogram function, we pass in our x and y datapoints:

heatmap, xedges, yedges = np.histogram2d(x, y, bins=100)

A "bin" is like the grouping the data is split by, so by having 100 bins we can plot this relatively large number of data points on the heatmap. If we used 50 bins, we'd see less "dots" as the data points have been grouped together in a wider range.

This function returns three arrays, the first of which is the 2d histogram of the data samples x and y.

The second array contains the bin edges of the first dimension, xedges, and the third array contains the bin edges on the second dimension, yedges.

Next, we can create an extent array - which contains scalar values for positioning the heatmap on display.

It specifies the coordinates of the lower left and upper right of the plot.

extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

• This means the left limit is the first element in the bin edges array for x samples

• The right limit is the last element in the bin edges array for x samples

• The bottom limit is the first element in the bin edges array for y samples

• The top limit is the last element in the bin edges array for y samples

Since our data is normally distributed, we would expect the centre of the graph to have the highest concentration of data...

plt.imshow(heatmap, extent=extent)
plt.grid(b='on', color='#ffffff')
plt.show()

Let's try the same with a different probability distribution, a spectral colourmap, and fewer "bins".

n, p = 10, 0.65  # number of trials, probability of each trial succeeding

x = np.random.binomial(n, p, 10000)
y = np.random.binomial(n, p, 10000)

heatmap, xedges, yedges = np.histogram2d(x, y, bins=10)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.imshow(heatmap, extent=extent, cmap=("Spectral"))
plt.grid(b='on', color='#ffffff')
plt.show()

Legends + annotations

First, we'll create a dataset of sinc(x)-squared and cos(x), similar to earlier.

We can add a title to this using

plt.suptitle('Sine Cardinal-squared versus Cosine', fontsize=14)
<matplotlib.text.Text at 0x16141f98>

and add a legend with

plt.legend(['sinc(x) squared', 'cos(x)'], loc='upper right')
<matplotlib.legend.Legend at 0x16d4fa58>
# so now we have, from the start
import numpy as np
import matplotlib.pyplot as plt

pi = np.pi
x = np.arange(-4*pi, 4*pi, 0.01)
y1 = (np.sin(x)**2)/x
y2 = np.cos(x)
plt.plot(x, y1, x, y2)
plt.ylim(-1.2, 1.2 ,7)
plt.grid(b='on', color='k')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.suptitle('Sine Cardinal squared versus Cosine', fontsize=10)
plt.legend(['sinc(x) squared', 'cos(x)'], loc='upper right')
<matplotlib.legend.Legend at 0x16d62438>

Now we can annotate our plot. Let's pick out the coordinate (0,1) on our plot, and let the label appear at a slight offset so we don't cover the point itself... so at (0,1.05). We then specify with xycoords that these are actual data coordinates from the graph.

plt.annotate('0,1', (0,1.05), xycoords='data')
<matplotlib.text.Annotation at 0x16d54390>

And now to specify the actual text:

Here we're passing the first two values for where the text should appear on the axes, so x=15 and y=1.12. In this case we want the red box to appear outside of the plot.

The next string value is the text itself, i.e. what we actually want to say about our plot, and the box is passed a dict of the colour and transparency.

plt.text(15, 1.12, '(sinc(x))squared offset \n from cos() peak', bbox=dict(facecolor='red', alpha=0.5))
<matplotlib.text.Text at 0x16d546d8>
plt.show()

Creating a Scatterplot Matrix

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Scatterplot matrices are a great way to roughly determine whether we have a linear correlation between two variables and whether the data is skewed. Each column is plotted against every other column, and on the leading diagonal - where a column would be plotted against itself - a kernel density estimate of that column is shown instead... it'll become clear visually.

First we'll make a DataFrame of 700 random numbers and 3 columns.

df = pd.DataFrame(np.random.randn(700,3), columns =['A', 'B', 'C'])

Next, we pass in this dataframe, the transparency, the size of the figure (in inches), and that we want the kernel density estimation in the leading diagonal.

myspm = pd.tools.plotting.scatter_matrix(df, alpha=0.3, figsize=(7,7), diagonal='kde')
plt.show()

We now have a visual representation of each column plotted against the others, with the kernel density estimate on the leading diagonal where a column would be plotted against itself. Since we used numbers from the normal distribution, the data is not particularly skewed at all.

What do you think this would look like in a distribution where the data may be skewed?
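
To see the contrast for yourself, try the same plot on a skewed distribution. A minimal sketch using exponentially distributed samples (and the newer pandas.plotting location for scatter_matrix - older pandas releases expose it under pd.tools.plotting as above), where the kde curves on the diagonal should lean heavily to one side:

df_skewed = pd.DataFrame(np.random.exponential(1.0, (700, 3)), columns=['A', 'B', 'C'])
pd.plotting.scatter_matrix(df_skewed, alpha=0.3, figsize=(7, 7), diagonal='kde')
plt.show()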

3-Dimensional plots

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

Let's start by creating a figure, and some 3D axes to fit to this figure.

fig = plt.figure()
ax = Axes3D(fig)

Now let's set an x range and y range. The parameters are startpoint, endpoint and increment.

X = np.arange(-3, 3, 0.2)
Y = np.arange(-3, 3, 0.2)

Fitting these two to a grid:

X, Y = np.meshgrid(X, Y)

Let's now say our Z function is X multiplied by e to the power of (-X squared - Y squared + 1).

Z = X*np.exp((-X**2 - Y**2)+1)

Below, we're specifying that we want to plot X, Y and Z, where rstride is the array row step size, cstride is the column step size, and the colour mapping is in this case 'winter'.

ax.plot_surface(X, Y, Z, rstride = 1, cstride = 1, cmap='winter')
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x9c0b550>
plt.show()

Nice. Wouldn't it be even better if we could see more clearly what was happening below too?

fig = plt.figure()
ax = Axes3D(fig)
X = np.arange(-3, 3, 0.2)
Y = np.arange(-3, 3, 0.2)
X, Y = np.meshgrid(X, Y)
Z = X*np.exp((-X**2 - Y**2)+1)
#this rotates the view such that we can see both peaks/troughs
ax.view_init(elev=-10, azim=70)
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='winter')
plt.show()

Time series and forecasting data

Time series

Using pandas, first we want to specify a date range. The parameters are start and end dates, in YYYY-MM-DD format.

myTimeSeries = pd.date_range('2016-08-01', '2016-08-31')
myTimeSeries
DatetimeIndex(['2016-08-01', '2016-08-02', '2016-08-03', '2016-08-04',
               '2016-08-05', '2016-08-06', '2016-08-07', '2016-08-08',
               '2016-08-09', '2016-08-10', '2016-08-11', '2016-08-12',
               '2016-08-13', '2016-08-14', '2016-08-15', '2016-08-16',
               '2016-08-17', '2016-08-18', '2016-08-19', '2016-08-20',
               '2016-08-21', '2016-08-22', '2016-08-23', '2016-08-24',
               '2016-08-25', '2016-08-26', '2016-08-27', '2016-08-28',
               '2016-08-29', '2016-08-30', '2016-08-31'],
              dtype='datetime64[ns]', freq='D')

What if we want to extract the hours in this range? Let's see using the freq = 'H' parameter:

myTseriesHrs = pd.date_range('2016-08-01', '2016-08-31', freq='H')
myTseriesHrs
DatetimeIndex(['2016-08-01 00:00:00', '2016-08-01 01:00:00',
               '2016-08-01 02:00:00', '2016-08-01 03:00:00',
               '2016-08-01 04:00:00', '2016-08-01 05:00:00',
               '2016-08-01 06:00:00', '2016-08-01 07:00:00',
               '2016-08-01 08:00:00', '2016-08-01 09:00:00',
               ...
               '2016-08-30 15:00:00', '2016-08-30 16:00:00',
               '2016-08-30 17:00:00', '2016-08-30 18:00:00',
               '2016-08-30 19:00:00', '2016-08-30 20:00:00',
               '2016-08-30 21:00:00', '2016-08-30 22:00:00',
               '2016-08-30 23:00:00', '2016-08-31 00:00:00'],
              dtype='datetime64[ns]', length=721, freq='H')

Using numpy and pandas together, we can match our series with random numbers from a normal distribution to form a sequence. We can then index these with our original time series.

myTseriesSeq = pd.Series(np.random.normal(150, 10, len(myTimeSeries)), index = myTimeSeries)

If you want to act on a specific region in the series, you can use slice notation:

#Give the region from 14th august to 19th august
myTseriesSeq['2016-08-14':'2016-08-19']
2016-08-14    153.384297
2016-08-15    154.191592
2016-08-16    148.298714
2016-08-17    143.826965
2016-08-18    157.011562
2016-08-19    161.111695
Freq: D, dtype: float64
#the slice -5:-1 gives the four values before the last one (use [-4:] for the final four)
myTseriesSeq[-5:-1]
2016-08-27    140.296871
2016-08-28    156.400220
2016-08-29    146.177502
2016-08-30    160.264075
Freq: D, dtype: float64

Let's create a time series now ranging from July 2016 to October 2016.

timeSer = pd.date_range('2016-07-01', '2016-10-31', freq='H')

Okay, and now let's assign random numbers to each hour in the series, with the first two parameters meaning mean and standard deviation:

timeSerSeq = pd.Series(np.random.normal(150, 10, len(timeSer)), index=timeSer)
timeSerSeq.head()
2016-07-01 00:00:00    148.793396
2016-07-01 01:00:00    152.556226
2016-07-01 02:00:00    141.878933
2016-07-01 03:00:00    153.199367
2016-07-01 04:00:00    155.348435
Freq: H, dtype: float64
timeSerSeq.tail()
2016-10-30 20:00:00    129.060665
2016-10-30 21:00:00    158.953779
2016-10-30 22:00:00    163.155597
2016-10-30 23:00:00    160.383886
2016-10-31 00:00:00    146.089862
Freq: H, dtype: float64
timeSerSeq.size
2929

What about if we want to find out the total of these assignments by month? We can resample the above to group the total weightings monthly:

timeSerSeq.resample('M').sum()
2016-07-31    111357.614518
2016-08-31    111404.574895
2016-09-30    107671.583094
2016-10-31    108740.233203
Freq: M, dtype: float64

Other handy frequency aliases: M = month end, Y = year end, D = day, S = second and ms = millisecond.
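
resample() is really a grouping operation, so any aggregation works after it, not just sum(). A minimal sketch on the same hourly series:

timeSerSeq.resample('W').mean()   # the average reading for each week
timeSerSeq.resample('D').max()    # the highest reading on each day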

Timedeltas

A timedelta is a scalar - expressed in various units, e.g. minutes, hours, days etc, and timedeltas represent relative difference in time values so can be positive or negative. We can utilise several different methods for creating them.

One such method is to pass a time string to the timedelta class constructor, as below:

a = pd.Timedelta('2 days 6 hours 11 min 1s 14ms')
a
Timedelta('2 days 06:11:01.014000')

We can also create one by passing key words with int values to the timedelta constructor.

b = pd.Timedelta(days =2, hours = 6, minutes=11, seconds=1, milliseconds=14)
b
Timedelta('2 days 06:11:01.014000')

As can be seen, both methods produced the exact same timedelta.

Here's an example of negative timedeltas.

pd.Timedelta('-3 hours')
Timedelta('-1 days +21:00:00')
pd.Timedelta('-1 days +21:00:00')
Timedelta('-1 days +21:00:00')
c = pd.to_timedelta('2 days 6 hours 11 min 1s 14ms')
c
Timedelta('2 days 06:11:01.014000')
d = pd.to_timedelta('2 days 06:11:01.014000')
d
Timedelta('2 days 06:11:01.014000')

The above display different ways to pass in the values we want a timedelta to take.

Next we're going to generate 5 timedeltas in units of seconds.

e = pd.to_timedelta(np.arange(5), unit='s')
e
TimedeltaIndex(['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04'], dtype='timedelta64[ns]', freq=None)

Now we're going to index three timedeltas with strings:

f = pd.to_timedelta(['6 days 3 hours 0 min', '14 min', '753 ms'])
f
TimedeltaIndex(['6 days 03:00:00', '0 days 00:14:00',
                '0 days 00:00:00.753000'],
               dtype='timedelta64[ns]', freq=None)

What are the largest and smallest values a timedelta can take?

pd.Timedelta.max
Timedelta('106751 days 23:47:16.854775')
pd.Timedelta.min
Timedelta('-106752 days +00:12:43.145224')

Let's now create a time series, and try and perform an operation on it with a timedelta.

We're passing in the start date, how many periods, and what measurement of time the periods are.

ts = pd.Series(pd.date_range('2016-07-01', periods = 5, freq='D'))
ts
0   2016-07-01
1   2016-07-02
2   2016-07-03
3   2016-07-04
4   2016-07-05
dtype: datetime64[ns]

Cool, so what if we wanted to say that 3 hours and 25 minutes have been added to this time series?

ts += pd.Timedelta(hours=3, minutes=25)
ts
0   2016-07-01 03:25:00
1   2016-07-02 03:25:00
2   2016-07-03 03:25:00
3   2016-07-04 03:25:00
4   2016-07-05 03:25:00
dtype: datetime64[ns]

We've universally changed the time for all values in the series - we can pick specific days too.

ts[2] += pd.Timedelta(hours = 5, seconds = 32, milliseconds = 434)
ts
0   2016-07-01 03:25:00.000
1   2016-07-02 03:25:00.000
2   2016-07-03 08:25:32.434
3   2016-07-04 03:25:00.000
4   2016-07-05 03:25:00.000
dtype: datetime64[ns]

We can also make a series out of timedeltas if we pass an array of timedeltas.

mySeries = pd.Series(pd.to_timedelta(['3 days +03:19:07', '2 days 34 min', '-5 hours']))
mySeries
0     3 days 03:19:07
1     2 days 00:34:00
2   -1 days +19:00:00
dtype: timedelta64[ns]

We can also observe the mean, median and sum of a time series:

mySeries.mean()
Timedelta('1 days 15:37:42.333333')
mySeries.median()
Timedelta('2 days 00:34:00')
mySeries.sum()
Timedelta('4 days 22:53:07')

And also by dividing our time series by a timedelta, we can produce an array of scalars - in this case showing us how many "4 hours" can be fitted into each value.

mySeries/np.timedelta64(4, 'h')
0    18.829653
1    12.141667
2    -1.250000
dtype: float64

And last but not least - we can use our seriesName.dt.components to give us a full, tabular breakdown of all the timedeltas in our series.

mySeries.dt.components
days hours minutes seconds milliseconds microseconds nanoseconds
0 3 3 19 7 0 0 0
1 2 0 34 0 0 0 0
2 -1 19 0 0 0 0 0
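
If you'd rather have each timedelta as a single number than a component table, the .dt accessor also exposes total_seconds(). A minimal sketch, using the same mySeries as above:

#Convert each timedelta in the series to a float number of seconds
mySeries.dt.total_seconds()
#For the series above this gives 271147.0, 174840.0 and -18000.0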

Data Engineering in Python

Problems faced with data quality:

• Missing values

• Null values

• Character compatibilities

• Duplicated data

• Corrupted data

Data almost always comes in ‘dirty’.

This means it’s not usually in a ready-to-use form. We have to cleanse the data before we can use it to perform meaningful analysis. Dirty data may contain anything listed above. We want to remove these issues and make it much more possible to draw valid conclusions from the given data or datasets.

Let's create a dataframe with some dirty data in and have a go at cleansing it for ourselves.

Data Cleansing

import pandas as pd
df = pd.DataFrame({'key': [1.0, None, 'K2', 'K3', 'K3', None, None],
                   'A': ['A0', 'pink', 'A2+', 'A3', 'A3', None, 2.0],
                   'B': ['hello there .', 'B1', 'blue', '  , ', '  , ', None, '. dirty']})
df
A B key
0 A0 hello there . 1
1 pink B1 None
2 A2+ blue K2
3 A3 , K3
4 A3 , K3
5 None None None
6 2 . dirty None

We'll start by dropping that duplicate row.

df = df.drop_duplicates()
df
A B key
0 A0 hello there . 1
1 pink B1 None
2 A2+ blue K2
3 A3 , K3
5 None None None
6 2 . dirty None

So that duplicate row has been removed. Now let's check for any null values:

df['A'].isnull().value_counts()
False    5
True     1
Name: A, dtype: int64
#None in A. What about B?
df['B'].isnull().value_counts()
False    5
True     1
Name: B, dtype: int64
#None in B either. What about in the key?
df['key'].isnull().value_counts()
True     3
False    3
Name: key, dtype: int64
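
Incidentally, rather than checking one column at a time, you can count the nulls in every column in a single call. A minimal sketch on the same df:

#Count the null values in each column of the DataFrame at once
df.isnull().sum()
#For the de-duplicated df above: A has 1, B has 1, key has 3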

Ah, there are null values in here. With such a tiny dataset we can spot the "None" values by eye anyway, but on larger data sets it won't always be obvious from a visual scan. We know there are missing values in all these columns, so how should we go about handling them? Hold that thought - we'll first look at manipulating strings and regular expressions, and then see how these can tie in with null values.

Manipulating Strings

df['B'].str.replace(' ', '')
0    hellothere.
1             B1
2           blue
3              ,
5           None
6         .dirty
Name: B, dtype: object

What this has done is, for column B in our DataFrame, replace every space with an empty string - effectively deleting them. The str.replace() method takes two parameters: first, what you want to replace, and second, what you want to put in its place - in our case, nothing. The beauty of this method is that for simple substitutions like this we don't need a complex regular expression, although as we'll see in a moment it accepts them too.

A more practical usage of this would be to create a new column in an existing data frame, and assign that column to be the output of the replacement operation - as in the next example.

Just as we can extract string values, we can replace strings too using the string replace method.

df['newColumn'] = df['B'].str.replace('^blue', 'pink')
C:\Users\mapotter\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
df
A B key newColumn
0 A0 hello there . 1 hello there .
1 pink B1 None B1
2 A2+ blue K2 pink
3 A3 , K3 ,
5 None None None None
6 2 . dirty None . dirty

So we've taken 'B', and where it says 'blue', it now says 'pink' in the new column. Nice. The ^ character matches the start of a string without consuming any characters.

Now for those null values!

Handling nulls/editing

df = df.dropna(how='any')
df
A B key newColumn
0 A0 hello there . 1 hello there .
2 A2+ blue K2 pink
3 A3 , K3 ,

We just axed every row containing any null value. There are cases where this is useful, but it can often be detrimental - there may still be meaning to be derived from the other values in the row.
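
If dropping every row with any missing value feels too aggressive, dropna() has gentler options. A minimal sketch, applied to the DataFrame as it looked before this drop:

#Only drop rows where every value is null (row 5 in our example)
df.dropna(how='all')

#Only drop rows where the 'key' column specifically is null
df.dropna(subset=['key'])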

Let's reset our DataFrame to how it was before.

#Resetting our DataFrame to how it was pre-cleansing

df = pd.DataFrame({'key': [1.0, None, 'K2', 'K3', 'K3', None, None],
                   'A': ['A0', 'pink', 'A2+', 'A3', 'A3', None, 2.0],
                   'B': ['hello there .', 'B1', 'blue', '  , ', '  , ', None, '. dirty']})
df
A B key
0 A0 hello there . 1
1 pink B1 None
2 A2+ blue K2
3 A3 , K3
4 A3 , K3
5 None None None
6 2 . dirty None

Filling in null values

#Filling all null data values with a 0

df.fillna(0)
A B key
0 A0 hello there . 1
1 pink B1 0
2 A2+ blue K2
3 A3 , K3
4 A3 , K3
5 0 0 0
6 2 . dirty 0

This filled in our null values with a 0, as that's what we passed in to the fillna() function. Now let's have a look at forward propagation - this will fill all our data values with the last non-null item it has come across in that column.

df = pd.DataFrame({'key': [1.0, None, 'K2', 'K3', 'K3', None, None],
                   'A': ['A0', 'pink', 'A2+', 'A3', 'A3', None, 2.0],
                   'B': ['hello there .', 'B1', 'blue', '  , ', '  , ', None, '. dirty']})

df.fillna(method='pad')
A B key
0 A0 hello there . 1
1 pink B1 1
2 A2+ blue K2
3 A3 , K3
4 A3 , K3
5 A3 , K3
6 2 . dirty K3

We can also fill in the null values with the mean value of the overall column.

Here I'm changing the DataFrame to contain numerical floats, so that each column has a meaningful mean.

df = pd.DataFrame({'key': [1.0, None, 2.3, 4.2, 1.0, None, None],
                   'A': [3.3, 1.0, 4.2, 3.6, 2.0, None, 2.0],
                   'B': [2.2, 2.7, 0.01, 3.2, 2.9, None, 2.1]})

df.fillna(df.mean())
A B key
0 3.300000 2.200 1.000
1 1.000000 2.700 2.125
2 4.200000 0.010 2.300
3 3.600000 3.200 4.200
4 2.000000 2.900 1.000
5 2.683333 2.185 2.125
6 2.000000 2.100 2.125
df.mean()
A      2.683333
B      2.185000
key    2.125000
dtype: float64

These are the means of each column, and these have plugged the null values in our DataFrame above.

What about filtering specific rows?

Let's store this mean-filling-in permanently to our df:

df = df.fillna(df.mean())
df
A B key
0 3.300000 2.200 1.000
1 1.000000 2.700 2.125
2 4.200000 0.010 2.300
3 3.600000 3.200 4.200
4 2.000000 2.900 1.000
5 2.683333 2.185 2.125
6 2.000000 2.100 2.125

Let's say we want to look at only the rows of the DataFrame where the value of the 'key' column is equal to 2.125.

df[df['key'] == 2.125]
A B key
1 1.000000 2.700 2.125
5 2.683333 2.185 2.125
6 2.000000 2.100 2.125

We've returned the rows where the key column is what we wanted.
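
You can also combine conditions with & (and) and | (or), wrapping each comparison in parentheses. A minimal sketch on the same df:

#Rows where key equals 2.125 and column A is greater than 2
df[(df['key'] == 2.125) & (df['A'] > 2)]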

Forms of cleansing:

• Exploratory cleansing

• Production cleansing

• Machine learning to clean data

• Data merging

• Rebuilding missing data

• Standardize data

• Normalising

• De-duplication

These are the basics of handling nulls and processing data - let's do something more heavyweight, and use data preprocessing alongside a method known as text mining to form a wordcloud...

Text mining

You can use these ideas to make a wordcloud about practically anything you want. Follow these steps with your own example if you like.

#empty container to put data into
data = {}

We're going to read in a text file - this is Will Smith's biographical wikipedia page stored in a .txt file locally on my drive. Our aim here is to spit out a wordcloud with all the most frequently used words appearing larger in the image, so we can get a rough idea of what the article is about without even having to read it.

Below, we're opening this file, in read mode 'r', giving the encoding and calling read() so python stores all of this in a new 'willsmith' part of the container.

data['willsmith'] = open('D:/Users/mapotter/Documents/Python Scripts/wollsmoth.txt', 'r', encoding='utf-8').read()
---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-270-26351e5b85b0> in <module>()
----> 1 data['willsmith'] = open('D:/Users/mapotter/Documents/Python Scripts/wollsmoth.txt', 'r', encoding='utf-8').read()


C:\Users\mapotter\AppData\Local\Continuum\Anaconda3\lib\codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 517: invalid start byte

I've left this error here purposefully!

It's likely you will encounter something similar sooner or later, so I'll quickly show you how to fix it.

In the open() call I specified utf-8 encoding, but my text file was saved in a different encoding. If you want to convert a file to utf-8, do the following:

Download Notepad++, open your text file from there, click "encoding" on the toolbar, and select "convert to utf-8".

Then save the file. I saved mine with a new name.
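
If you'd rather handle this from Python instead of re-saving the file, you can read it with a more permissive encoding, or tell Python what to do with bytes it can't decode. A hedged sketch - the file's true encoding is unknown here, so treat both options as guesses to experiment with:

#Option 1: try a permissive single-byte encoding (latin-1 will decode any byte sequence)
raw = open('D:/Users/mapotter/Documents/Python Scripts/wollsmoth.txt', 'r', encoding='latin-1').read()

#Option 2: keep utf-8 but replace undecodable bytes with a placeholder character
raw = open('D:/Users/mapotter/Documents/Python Scripts/wollsmoth.txt', 'r', encoding='utf-8', errors='replace').read()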

data['willsmith'] = open('D:/Users/mapotter/Documents/Python Scripts/wollsmothutf8.txt', 'r', encoding='utf-8').read()
print("First 200 characters: " + str(data['willsmith'][:200]))
First 200 characters: Will Smith
From Wikipedia, the free encyclopedia
Jump to: navigation, search 
This article is about the actor. For other people with a similar name, see William Smith.
Will Smith
Actor Will Smith at t

Let's start by converting all characters to lower case.

The keys method returns what Python 3 calls a view of its underlying keys.

This for loop is saying: for each key in our dictionary, take the text stored there and convert it to lower case.

for k in data.keys():
    data[k] = data[k].lower()
print("First 300 characters, lower case: \n" + str(data['willsmith'][:300]))
First 300 characters, lower case: 
will smith
from wikipedia, the free encyclopedia
jump to: navigation, search 
this article is about the actor. for other people with a similar name, see william smith.
will smith
actor will smith at the 2016 san diego comic-con international. 
smith at the 2016 san diego comic-con international prom

Indeed, everything is lower-case now.

Now to remove all the punctuation:

import re
for k in data.keys():
    data[k] = re.sub(r'[-./?!,":;()\']', ' ', data[k])

We imported regular expressions (we mentioned these earlier.)

For each key in our data, re.sub() substitutes every character matching the pattern in the first string literal (the punctuation characters inside the square brackets) with a space (the second parameter, ' '), and stores the result back under the same key k.

Here is an excellent site for telling you exactly what it is your regex is doing - click on the "tokens" and search to decipher what a regex means if you come across one you've never seen before.

https://regex101.com/r/cJ2zT8/1

print("First 400 characters, lower case, no punctuation: " + str(data['willsmith'][:400]))
First 400 characters, lower case, no punctuation: will smith
from wikipedia  the free encyclopedia
jump to  navigation  search 
this article is about the actor  for other people with a similar name  see william smith 
will smith
actor will smith at the 2016 san diego comic con international  
smith at the 2016 san diego comic con international promoting suicide squad
born
willard carroll smith jr 
september 15  1968  age 48 
philadelphia  pennsyl

Indeed our punctuation has disappeared too now! Let's kill off our numerical characters:

The pattern below matches any digit in the range 0-9, plus any remaining hyphens or pipe characters, and replaces each one with a space too.

for k in data.keys():
    data[k] = re.sub('[-|0-9]', ' ', data[k])
print("Now first 400 chars with numerical characters removed: \n" + str(data['willsmith'][:400]))
Now first 400 chars with numerical characters removed: 
will smith
from wikipedia  the free encyclopedia
jump to  navigation  search 
this article is about the actor  for other people with a similar name  see william smith 
will smith
actor will smith at the      san diego comic con international  
smith at the      san diego comic con international promoting suicide squad
born
willard carroll smith jr 
september           age    
philadelphia  pennsyl

Now let's import the natural language toolkit:

import nltk

Run this command below. A window should pop up.

nltk.download_gui()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

Navigate to the "all packages" tab at the top, and scroll down until you find the "stopwords" package. Select it and click Download to install it.

Now we can use this package to help get rid of any common, generally unhelpful words in our text!

from nltk.corpus import stopwords
print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

So these words aren't generally that helpful. What else may not be helpful? For example, if I don't want to see some other common words pop up such as "Will" and "Smith" since it's obvious they're going to be in my example, I can append these strings to the stopwords list.

stopwords_list = stopwords.words('english') + ['will', 'smith', 'jaden', 'willow']

Here we are just printing the last 20 stopwords in our stopwords_list, so we can see if they have been appended without looking through the entire list.

print(stopwords_list[-20:])
['didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn', 'will', 'smith', 'jaden', 'willow']

Indeed, these have been appended to the end of our list.

Now we've sorted out our stopwords, we can split our data into individual words.

for k in data.keys():
    data[k] = data[k].split()
print("First 50 words after splitting: " + str(data['willsmith'][:50]))
First 50 words after splitting: ['will', 'smith', 'from', 'wikipedia', 'the', 'free', 'encyclopedia', 'jump', 'to', 'navigation', 'search', 'this', 'article', 'is', 'about', 'the', 'actor', 'for', 'other', 'people', 'with', 'a', 'similar', 'name', 'see', 'william', 'smith', 'will', 'smith', 'actor', 'will', 'smith', 'at', 'the', 'san', 'diego', 'comic', 'con', 'international', 'smith', 'at', 'the', 'san', 'diego', 'comic', 'con', 'international', 'promoting', 'suicide', 'squad']

Now every word has its own position in the list, rather than sitting together in one long string. This is similar to using a regular expression to split on whitespace, but str.split() has done it for us.
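
For reference, here's roughly what the regular-expression version of that split would look like, applied to a raw string - just an illustrative sketch, not part of the actual pipeline:

import re

#Split a string on runs of whitespace - roughly what str.split() did for us
sample = 'will smith  from wikipedia  the free encyclopedia'
words = re.split(r'\s+', sample)
#['will', 'smith', 'from', 'wikipedia', 'the', 'free', 'encyclopedia']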

Now we're about to remove all words in our split array that appear in our stopwords array. Watch the syntax carefully.

for k in data.keys():
    data[k] = [w for w in data[k] if w not in stopwords_list]

We've said:

For each key in our dataset, keep a word only if it does not appear in the stopwords_list array.

So basically: remove from our data every word that appears in the stopwords list we made.
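
If the list comprehension syntax is new to you, here's the same logic written as an explicit loop - a sketch of what the one-liner above is doing, not something you need to run again:

for k in data.keys():
    kept_words = []
    for w in data[k]:
        if w not in stopwords_list:
            kept_words.append(w)
    data[k] = kept_words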

Now let's look at the first 100 words in our cleansed Will Smith set.

print("Cleansed will smith data: " + str(data['willsmith'][:100]))
Cleansed will smith data: ['wikipedia', 'free', 'encyclopedia', 'jump', 'navigation', 'search', 'article', 'actor', 'people', 'similar', 'name', 'see', 'william', 'actor', 'san', 'diego', 'comic', 'con', 'international', 'san', 'diego', 'comic', 'con', 'international', 'promoting', 'suicide', 'squad', 'born', 'willard', 'carroll', 'jr', 'september', 'age', 'philadelphia', 'pennsylvania', 'u', 'residence', 'los', 'angeles', 'california', 'u', 'names', 'fresh', 'prince', 'occupation', 'actor', '·', 'producer', '·', 'rapper', '·', 'songwriter', 'years', 'active', '–present', 'net', 'worth', 'increase', '$', 'million', '[', ']', 'spouse', 'sheree', 'zampino', 'div', 'jada', 'koren', 'pinkett', 'children', 'including', 'musical', 'career', 'genres', 'hip', 'hop', 'labels', 'jive', 'rca', '·', 'columbia', 'sme', '·', 'interscope', 'universal', 'associated', 'acts', 'dj', 'jazzy', 'jeff', '·', 'mary', 'j', 'blige', '·', 'christina', 'vidal', '·', 'kenny', 'greene']

And now we can make a wordcloud with this processed word array.

Open up a terminal and type in:

conda install -c conda-forge wordcloud=1.2.1

Then, back in your Python session:

import wordcloud as wc
import matplotlib.pyplot as plt

wordcloud = wc.WordCloud(width=1000, height=500).generate(' '.join(data['willsmith']))

So this is saying: our wordcloud will be 1000px wide and 500px tall, and will use our willsmith data - generate() expects a single string, which is why we use ' '.join() to stitch the word list back together.

plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

This library sure does make it easy...

This is a rather basic looking wordcloud, but we filtered unstructured data to produce this result.
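
If you want to keep the image rather than just display it, the wordcloud object can be written straight to a PNG - the filename here is just an example:

#Save the generated wordcloud image to disk
wordcloud.to_file('willsmith_wordcloud.png')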

Working with databases

To start with, we're going to look at using pandas with a MySQL database. For this we're going to import create_engine from the sqlalchemy library, with pandas.

We also need to make sure we install the correct driver for our DB. If we are using MySQL, we are going to need the pymysql driver.

Bear in mind you can use sqlalchemy with other databases, just configure it as necessary with the correct drivers for the DB you want to connect to. Use the command below in a terminal to install the pymysql driver:

conda install pymysql

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('mysql+pymysql://immanuelpotter:notmypassword@localhost:8889/laterbase')

This is saying our engine is to become:

dialect: mysql

driver: pymysql

username: immanuelpotter

password: notmypassword

host and port: localhost:8889

DB name: laterbase

After you've matched these up, you can pass the name of the table you want to read, along with the engine, into the pandas read_sql_table() method and store the result in a DataFrame.

df = pd.read_sql_table('mytablename', engine)

And your table should appear as a DataFrame!

This example is a dummy one, as I haven't opened this port and the database doesn't exist.

However, if you do have a real database with the correct credentials, you should be able to read this into Python seamlessly following the exact method above.
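
You aren't limited to pulling in whole tables, either - you can run an arbitrary SQL query, or push a DataFrame back into the database. A sketch, assuming the same (dummy) engine, and with the query, column and table names purely illustrative:

#Read the result of a query into a DataFrame
df = pd.read_sql('SELECT * FROM mytablename WHERE some_column > 10', engine)

#Write a DataFrame out to a (new) table
df.to_sql('my_new_table', engine, if_exists='replace', index=False)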

An Introduction to Inferential Statistics

Probability mass function (pmf): A function that gives the probability that a discrete random variable is exactly equal to some value.

Normal Distribution

A normal distribution is continuous, and one whose data distribution is symmetrical about its mean.

It's also known as the Gaussian distribution, and is very common. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.

Its probability density curve resembles a bell shape, which is why it's often known as the "bell curve" - a kernel density estimate of normally distributed data will trace out the same shape.

It makes a lot of statistics easier and more tractable - as we mentioned in the machine learning section, many algorithms prefer the data to be roughly normally distributed, which is why we standardize and normalise. The standard normal distribution in particular has a mean of 0 and a standard deviation of 1.

Here we've made a plot of 700 random normally distributed floats so you can see the shape.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

seriesnd = pd.Series(np.random.randn(700))
seriesnd.plot(kind='kde', title='Normal Distribution', grid=True)
plt.show()

mean = seriesnd.mean()
sd = seriesnd.std()
mean
0.029998614350023753
sd
1.0008592530246194

After generating this, we can see the distribution is almost symmetrical, centred close to x = 0.
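
Since we mentioned standardization above, here's how you could standardize this series by hand - subtract the mean and divide by the standard deviation. A minimal sketch using the seriesnd we just created:

#Standardized series: mean approximately 0, standard deviation approximately 1
standardized = (seriesnd - seriesnd.mean()) / seriesnd.std()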

Binomial distribution

A binomial distribution is discrete, not continuous. It models the number of successes in a fixed number of independent trials, where each trial has the same probability of success and either succeeds or it doesn't.

Let's create a binomial distribution for coin flips, heads or tails. Of course, the probability of a coin landing heads or tails is 0.5. Let's see how the probability of getting each possible number of heads across 10 flips is distributed.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
x = np.arange(11)
n = 10
p = 0.5

We've got the binomial distribution from scipy's stats library, and created a range of 11 ints which will automatically start at 0... so 0 - 10.

n is the number of trials - in this case, 10 coin flips.

p is our set probability of success, in this case half/half.

fig, ax = plt.subplots(1,1)
randvars=binom(n,p)

We want our figure and axes on a single subplot. randvars is our frozen binomial distribution: 10 trials with a probability of success of 0.5.

ax.vlines(x, 0, randvars.pmf(x), colors='b', linestyles='-', linewidth=1.2, label='Probability')
<matplotlib.collections.LineCollection at 0x113e35c0>

The vlines function on the axes plots vertical lines.

x is how we stored our range for the axes, so we're passing this in as the axis range.

0 is the start, or y minimum value.

randvars.pmf(x) means calculate the probability mass function of our binomial distribution for the given range x, and is the y max value of each point.

colors='b' means blue, linestyles='-' means standard line, linewidth and label speak for themselves too!

ax.legend(loc='upper right')
<matplotlib.legend.Legend at 0x11585b38>
plt.suptitle('Binomial Distribution of coin flips', fontsize=12)
<matplotlib.text.Text at 0x113e5780>
plt.show()

As we can see, the greatest probability is at 5 heads out of 10, which is what we would expect given p = 0.5.
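
You can read the exact height of that central bar straight off the probability mass function - a quick check using the same frozen distribution:

#Probability of exactly 5 heads in 10 fair flips
randvars.pmf(5)
#roughly 0.246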

Poisson Distribution

The poisson distribution is used to determine the probability of getting a certain number of occurrences in a given time interval given a particular mean count in that interval.

We are at a certain location in a specific town – an average of 18 vehicles are caught speeding per month. Independent of other external factors – what would be the probability of catching 27 vehicles per month at that same location?

We can use the poisson function to work this out.

from scipy.stats import poisson

rvCatch18 = poisson(18)

Okay, so we've stored our Poisson distribution with a mean of 18 as rvCatch18. Now we want to calculate the probability of catching exactly 27 vehicles, given we usually catch 18... seems unlikely, doesn't it? Let's see:

rvCatch27 = rvCatch18.pmf(27)
rvCatch27
0.010916267579831881

Wow! There's only a little over a one in a hundred chance of that happening. What's the probability we catch exactly 16 vehicles that month, given we usually catch 18?

rvCatch16 = rvCatch18.pmf(16)
rvCatch16
0.088397465572661954

Nearly 9 times in 100 we'll catch exactly 16!
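
The pmf gives the probability of an exact count; for "at most" or "at least" questions, the cdf and its complement (the survival function) are the tools to reach for. A quick sketch with the same distribution:

#Probability of catching 16 or fewer vehicles in the month
rvCatch18.cdf(16)

#Probability of catching 27 or more vehicles in the month
rvCatch18.sf(26)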

Bernoulli Distribution

The Bernoulli distribution is used to model the occurrence of success and failure in a given sample size. It is one of the simplest discrete distributions, having two possible outcomes, labelled by n=0 and n=1 in which n=1 ("success") occurs with probability p and n=0 ("failure") occurs with probability q=1-p. It's one of the building blocks for other, more complicated distributions.

from scipy.stats import bernoulli
import matplotlib.pyplot as plt

We're calling the rvs function to generate 200 random samples from a Bernoulli distribution with probability of success 0.8. Let's think of this as a very biased coin, where a head is much more likely than a tail.

berRVS = bernoulli.rvs(0.8, size=200)
p = 0.8

Now we're telling our plot to show this Bernoulli sample against the probability mass function of the distribution, with red markers at the top.

plt.plot(berRVS, bernoulli.pmf(berRVS,p), 'ro', markersize=7, label='Bernoulli pmf')
[<matplotlib.lines.Line2D at 0xf1f29e8>]

Now we're specifying the vertical lines like we've seen before:

plt.vlines(berRVS, 0, bernoulli.pmf(berRVS, p), colors='g', linewidth=4.0, alpha=0.5)
<matplotlib.collections.LineCollection at 0xf1ea630>
plt.xlim(-0.2, 1.2)
plt.show()

And there we have a Bernoulli distribution. It just tells us that on any single toss of this biased coin there is an 80% chance of success and a 20% chance of failure... given we set the success probability to 0.8, we probably could have worked this out! But nonetheless, this is how to produce a Bernoulli distribution plot.
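
As a final sanity check, scipy can hand you the theoretical mean and variance of a Bernoulli variable directly - a quick sketch with the same p:

#Mean of a Bernoulli(0.8) variable is p = 0.8
bernoulli.mean(0.8)

#Variance is p * (1 - p) = 0.16
bernoulli.var(0.8)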
