Pandas, Numpy, Python Cheatsheet


Intro

I assembled a super quick cheatsheet of the most common Pandas, Numpy and Python tasks I tend to do. Let me know if I missed anything important in the comments below!

If you like this kernel, please give it an upvote. Thank you! :)


Data Structures

There are two things to keep in mind for each of the data types you are using:

  1. Are they mutable?
    • Mutability is about whether or not we can change an object once it has been created. A list can be changed so it is mutable. However, strings cannot be changed without creating a completely new object, so they are immutable.
  2. Are they ordered?
    • Order is about whether the position of the elements in an object matters, and whether that position can be used to access an element. Both strings and lists are ordered. We can use the order to access parts of a list or string.
  • For each of the upcoming data structures, it is useful to understand how you index it, whether it is mutable, and whether it is ordered.
  • Additionally, each data structure has different methods, so which one you choose largely depends on these properties and on what you can easily do with it! A quick example follows.
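For example, a quick illustration of both properties:

my_list = ['a', 'b', 'c']
my_list[0] = 'z'   # lists are mutable: item assignment works
print(my_list)     # ['z', 'b', 'c'] (ordered, so index 0 is the first element)

s = 'abc'
# s[0] = 'z'       # strings are immutable: this would raise a TypeError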

Lists[]

mutable, ordered sequence of elements.

  • Mix of data types.
  • Are ordered - can lookup elements by index.
  • Are mutable - can be changed.

List Comprehensions - (do.. for)

names = ['dumbledore', 'beeblebrox', 'skywalker', 'hermione', 'leia']
capitalized_names = []
for name in names:
    capitalized_names.append(name.title())

# equals (do.. for)
capitalized_names = [name.title() for name in names]
capitalized_names
['Dumbledore', 'Beeblebrox', 'Skywalker', 'Hermione', 'Leia']

adding conditionals (do.. for.. if)

squares = [x**2 for x in range(9) if x % 2 == 0]
# to add else statements, move the conditionals to the beginning
squares = [x**2 if x % 2 == 0 else x + 3 for x in range(9)]

examples

# example
names = ["Rick S", "Morty Smith", "Summer Smith", "Jerry Smith", "Beth Smith"]
first_names = [name.split(' ')[0] for name in names]
print(first_names)
# ['Rick', 'Morty', 'Summer', 'Jerry', 'Beth']

# example
multiples_3 = [i*3 for i in range(1,21)]
print(multiples_3)
# [3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60]

# example
scores = {
             "Rick": 70,
             "Morty Smith": 35,
             "Summer Smith": 82,
             "Jerry Smith": 23,
             "Beth Smith": 98
          }
passed = [name for name, score in scores.items() if score>=65]
print(passed)
# ['Rick', 'Summer Smith', 'Beth Smith']

print([score for name, score in scores.items() if score >= 25])
# [70, 35, 82, 98]

Lambda Functions

# Lambda Functions - lambda (arg1, arg2): do_a_thing_and_return_it
multiply = lambda x, y: x * y

# Equivalent of:
def multiply(x, y):
    return x * y

# Can call both of the above like:
multiply(4, 7)
28

map() - apply lambda function to a list

# example of using map() to apply lambda function to a list
numbers = [
              [34, 63, 88, 71, 29],
              [90, 78, 51, 27, 45],
              [63, 37, 85, 46, 22],
              [51, 22, 34, 11, 18]
           ]
mean = lambda num_list: sum(num_list)/len(num_list)
averages = list(map(mean, numbers))
# or
averages = list(map(lambda num_list: sum(num_list)/len(num_list), numbers))
print(averages)
# [57.0, 58.2, 50.6, 27.2]
print(list(map(lambda item: sum(item)/len(item), numbers)))
[57.0, 58.2, 50.6, 27.2]
[57.0, 58.2, 50.6, 27.2]

filter() - apply lambda function to a list

# example of using filter() to apply lambda function to a list
cities = ["New York City", "Los Angeles", "Chicago", "Mountain View", "Denver", "Boston"]
is_short = lambda name: len(name) < 10
short_cities = list(filter(is_short, cities))
# or
short_cities = list(filter(lambda name: len(name) < 10, cities))
print(short_cities)
# ['Chicago', 'Denver', 'Boston']

list(filter(lambda item: len(item) < 8, cities))
# ['Chicago', 'Denver', 'Boston']

Generators

# Generators
def my_range(x):
    i = 0
    while i < x:
        yield i
        i += 1
# since this returns an iterator, we can convert it to a list
# or iterate through it in a loop to view its contents
for x in my_range(5):
    print(x)
'''
0
1
2
3
4
'''

# You can create a generator in the same way you'd normally write a list comprehension, except with
# parentheses instead of square brackets.
# this list comprehension produces a list of squares
sq_list = [x**2 for x in range(10)]
# this generator produces an iterator of squares
sq_iterator = (x**2 for x in range(10))

# example
# generator function that works like the built-in function enumerate
lessons = ["Why Python Programming", "Data Types and Operators", "Control Flow", "Functions", "Scripting"]
def my_enumerate(iterable, start=0):
    i = start
    for element in iterable:
        yield i, element
        i = i + 1
for i, lesson in my_enumerate(lessons, 1):
    print("Lesson {}: {}".format(i, lesson))
'''
Lesson 1: Why Python Programming
Lesson 2: Data Types and Operators
Lesson 3: Control Flow
Lesson 4: Functions
Lesson 5: Scripting
'''

# example
# If you have an iterable that is too large to fit in memory in full
# (e.g., when dealing with large files), being able to take and use
# chunks of it at a time can be very valuable.
# Implementing a generator function, chunker, that takes in an
# iterable and yields a chunk of a specified size at a time.
def chunker(iterable, size):
    for i in range(0, len(iterable), size):
        yield iterable[i:i + size]
for chunk in chunker(range(25), 4):
    print(list(chunk))
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10, 11]
[12, 13, 14, 15]
[16, 17, 18, 19]
[20, 21, 22, 23]
[24]

create list

list_of_random_things = [1, 3.4, 'a string', True]

access list elements

list_of_random_things[0]   # first element
list_of_random_things[-1]  # last element
list_of_random_things[len(list_of_random_things) - 1]  # also the last element

slicing [,)

list_of_random_things[1:3] # returns [3.4, 'a string']
list_of_random_things[:2] # returns [1, 3.4]
list_of_random_things[1:] # returns all of the elements to the end of the list [3.4, 'a string', True]
[3.4, 'a string', True]

in, not in

# in, not in
'this' in 'this is a string' # True
'in' in 'this is a string' # True
'isa' in 'this is a string' # False
5 not in [1, 2, 3, 4, 6] # True
5 in [1, 2, 3, 4, 6] # False
False

Mutable and ordered

# Mutable and ordered
my_lst = [1, 2, 3, 4, 5]
my_lst[0] = 0
print(my_lst)
[0, 2, 3, 4, 5]

length of list

# length of list
len(list_of_random_things)
4

smallest and greatest element in list

# returns the smallest element of the list
min(list_of_random_things)

# returns the greatest element of the list. min() and max() are defined in
# terms of the less-than comparison operator, so they are undefined for lists
# whose elements have different, incomparable types, as the error below shows.
max(list_of_random_things)
# TypeError: '<' not supported between instances of 'str' and 'int'

sort list

# returns a copy of a list in order from smallest to largest,
# leaving the list unchanged.
sorted(list_of_random_things)

join()

# join() returns a string consisting of the list elements joined by a separator string.
# Takes only a list of strings as an argument
name = "-".join(["Grace", "Kelly"])
print(name) # Grace-Kelly

Creating a new list

cities = ['new york city', 'mountain view', 'chicago', 'los angeles']
capitalized_cities = []
for city in cities:
    capitalized_cities.append(city.title())

Adding an element to the end of a list - append()

letters = ['a', 'b', 'c', 'd']
letters.append('z')
print(letters) # ['a', 'b', 'c', 'd', 'z']
# Note: letters[4] = 'z' would raise an IndexError here; use append() to grow a list
['a', 'b', 'c', 'd', 'z']

Modifying an existing list

cities = ['new york city', 'mountain view', 'chicago', 'los angeles']
for index in range(len(cities)):
    cities[index] = cities[index].title()
    
print('amir'.title())  # str.title() capitalizes each word
# Amir

Print a formatted string built from the items of a list

items = ['first string', 'second string']
html_str = "<ul>\n"
for item in items:
    html_str += "<li>{}</li>\n".format(item)
html_str += "</ul>"
print(html_str)
<ul>
<li>first string</li>
<li>second string</li>
</ul>

Convert an iterable (tuple, string, set, dictionary) to a list - list()

# vowel string
vowelString = 'aeiou'
print(list(vowelString))

# vowel tuple
vowelTuple = ('a', 'e', 'i', 'o', 'u')
print(list(vowelTuple))

# vowel list
vowelList = ['a', 'e', 'i', 'o', 'u']
print(list(vowelList))

# All Print: ['a', 'e', 'i', 'o', 'u']
['a', 'e', 'i', 'o', 'u']
['a', 'e', 'i', 'o', 'u']
['a', 'e', 'i', 'o', 'u']

Tuples()

immutable ordered sequences of elements.

  • They are often used to store related pieces of information. The parentheses are optional when defining tuples.
  • Are ordered - can lookup elements by index.
  • Are immutable - cannot be changed. You can't add or remove items from tuples, or sort them in place.

create tuple

location = (13.4125, 103.866667)

access tuple

print("Latitude:", location[0])
print("Longitude:", location[1])
Latitude: 13.4125
Longitude: 103.866667

tuple packing

# can also be used to assign multiple variables in a compact way
dimensions = 52, 40, 100

tuple unpacking

# tuple unpacking
length, width, height = dimensions
print("The dimensions are {} x {} x {}".format(length, width, height))
The dimensions are 52 x 40 x 100

Sets{}

mutable, unordered collections of unique elements.

  • Are unordered - you cannot look up elements by index.
  • Are mutable - can be changed.
  • Sets support the in operator, the same as lists do.
  • One application of a set is to quickly remove duplicates from a list.
  • You cannot have the same item twice, and you cannot sort a set. If you need either of those properties, a list is more appropriate.
  • You can add elements to a set with the add() method and remove elements with the pop() method, similar to lists. However, when you pop an element from a set, an arbitrary element is removed: sets, unlike lists, are unordered, so there is no "last element".
  • Sets also support the operations of mathematical sets. Methods like union, intersection, and difference are easy to perform and are much faster than the equivalent operations on other containers, as sketched below.
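A quick sketch of the mathematical set methods mentioned above:

a = {1, 2, 3, 4}
b = {3, 4, 5}
print(a.union(b))         # {1, 2, 3, 4, 5}
print(a.intersection(b))  # {3, 4}
print(a.difference(b))    # {1, 2}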

create sets

numbers = [1, 2, 6, 3, 1, 1, 6]
unique_nums = set(numbers)
print(unique_nums) # {1, 2, 3, 6}
{1, 2, 3, 6}
fruit = {"apple", "banana", "orange", "grapefruit"}

check for element

print("watermelon" in fruit)
False

add an element

fruit.add("watermelon")
print(fruit)
{'orange', 'apple', 'grapefruit', 'watermelon', 'banana'}

remove a random element

print(fruit.pop())
print(fruit)
orange
{'apple', 'grapefruit', 'watermelon', 'banana'}

Dicts{}

mutable data type that stores mappings of unique keys to values.

  • Lookup is by key, not by numeric position (as of Python 3.7, dicts also preserve insertion order).
  • Are mutable - can be changed.
  • Dictionaries can have keys of any immutable type, like integers or tuples, not just strings. It's not even necessary for every key to have the same type (see the sketch below).
  • We can look up values or insert new values in the dictionary using square brackets that enclose the key.
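For instance, a dict keyed by tuples (the coordinate data here is just illustrative):

locations = {(35.68, 139.69): 'Tokyo', (40.71, -74.00): 'New York'}
print(locations[(35.68, 139.69)])  # Tokyo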

create dict

elements = {"hydrogen": 1, "helium": 2, "carbon": 6}

accessing an element's value

print(elements["helium"])
2

adding elements

elements["lithium"] = 3 

Iterating through a dictionary

# cast is assumed here to be a dict mapping actors to roles, e.g.:
cast = {"Harrison Ford": "Han Solo", "Carrie Fisher": "Princess Leia"}
# Just keys
for key in cast:
    print(key)
# Keys and values
for key, value in cast.items():
    print("Actor: {}    Role: {}".format(key, value))

check whether a key is in a dictionary

# the in keyword tests a dictionary's keys (not its values), the same way we check membership in a list or set
print("carbon" in elements) # True

get() looks up values in a dictionary

# get() looks up values in a dictionary, but unlike square brackets, get returns None (or a default value of your choice) if the key isn't found.
# If you expect lookups to sometimes fail, get might be a better tool than normal square bracket lookups.
print(elements.get("dilithium")) # None
print(elements.get('kryptonite', 'There\'s no such element!'))
# "There's no such element!"

Identity Operators

n = elements.get("dilithium")
print(n is None) # True
print(n is not None) # False

Equality (==) and identity (is)

a = [1, 2, 3]
b = a
c = [1, 2, 3]
print(a == b) # True
print(a is b) # True
print(a == c) # True
print(a is c) # False
# List a and list b are equal and identical.
# List c is equal to a (and b for that matter) since they have the same contents. But a and c (and b for that matter, again) point to two different objects, i.e., they aren't identical objects.
# That is the difference between checking for equality vs. identity.

Compound Data Structures

elements = {"hydrogen": {"number": 1,
                         "weight": 1.00794,
                         "symbol": "H"},
              "helium": {"number": 2,
                         "weight": 4.002602,
                         "symbol": "He"}}
helium = elements["helium"]  # get the helium dictionary
hydrogen_weight = elements["hydrogen"]["weight"]  # get hydrogen's weight
oxygen = {"number":8,"weight":15.999,"symbol":"O"}  # create a new oxygen dictionary 
elements["oxygen"] = oxygen  # assign 'oxygen' as a key to the elements dictionary
print('elements = ', elements)

Dict frequency counter

words = ['great', 'expectations', 'the', 'adventures', 'of', 'sherlock', 'holmes', 'the', 'great', 'gatsby', 'hamlet', 'adventures', 'of', 'huckleberry', 'fin']
word_counter = {}
for word in words:
    word_counter[word] = word_counter.get(word, 0) + 1
print(word_counter)
# Prints {'fin': 1, 'huckleberry': 1, 'hamlet': 1, 'holmes': 1, 'adventures': 2, 'sherlock': 1, 'expectations': 1, 'great': 2, 'the': 2, 'of': 2, 'gatsby': 1}

Numpy


  • NumPy provides Python with an extensive math library capable of performing numerical computations effectively and efficiently.
    • Even though Python lists are great on their own, NumPy has a number of key features that give it great advantages over Python lists. One such feature is speed. When performing operations on large arrays NumPy can often perform several orders of magnitude faster than Python lists. This speed comes from the nature of NumPy arrays being memory-efficient and from optimized algorithms used by NumPy for doing arithmetic, statistical, and linear algebra operations.
    • Another great feature of NumPy is that it has multidimensional array data structures that can represent vectors and matrices. Another great advantage of NumPy over Python lists is that NumPy has a large number of optimized built-in mathematical functions. These functions allow you to do a variety of complex mathematical computations very fast and with very little code (avoiding the use of complicated loops) making your programs more readable and easier to understand.
  • At the core of NumPy is the ndarray, where nd stands for n-dimensional.
    • An ndarray is a multidimensional array of elements all of the same type.
    • Unlike Python lists, all the elements of an ndarray must be of the same type. If you provide the np.array() function with a Python list that has both integers and strings, NumPy will interpret all elements as strings.
    • When we create an ndarray with both floats and integers, NumPy assigns its elements a float64 dtype. This is called upcasting. Since all the elements of an ndarray must be of the same type, NumPy upcasts the integers to floats in order to avoid losing precision in numerical computations (see the quick check below).
  • We refer to 1D arrays as rank 1 arrays. In general N-Dimensional arrays have rank N. Therefore, we refer to a 2D array as a rank 2 array.
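A quick check of the dtype rules described above:

import numpy as np
print(np.array([1, 2, 'three']).dtype)  # a string dtype, e.g. <U21: everything becomes a string
print(np.array([1, 2.5, 4]).dtype)      # float64: the integers are upcast to floats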

Read this about how data is arranged in numpy and using reshape: https://stackoverflow.com/questions/22053050/difference-between-numpy-array-shape-r-1-and-r

import numpy as np

Create ndarray

# Create a 1D ndarray that contains only integers
x = np.array([1, 2, 3, 4, 5])
print('x = ', x) # x = [1 2 3 4 5]
print('x has dimensions:', x.shape) # x has dimensions: (5,)
print('The elements in x are of type:', x.dtype) # The elements in x are of type: int64

# Create a rank 2 ndarray that only contains integers
Y = np.array([[1,2,3],[4,5,6],[7,8,9], [10,11,12]])
print('Y has dimensions:', Y.shape) # Y has dimensions: (4, 3)
print('Y has a total of', Y.size, 'elements') # Y has a total of 12 elements
print('Y is an object of type:', type(Y)) # Y is an object of type: class 'numpy.ndarray'
print('The elements in Y are of type:', Y.dtype) # The elements in Y are of type: int64

Create ndarray with dtype

# Specify the dtype when creating the ndarray
x = np.array([1.5, 2.2, 3.7, 4.0, 5.9], dtype = np.int64)
# the floats are truncated toward zero: x = [1 2 3 4 5]

Save and load

# Save the array into a file
np.save('my_array', x)

# Load the saved array from current directory
y = np.load('my_array.npy')

Zeros

# Create ndarray using built-in functions
# 3 x 4 ndarray full of zeros
# np.zeros(shape)
X = np.zeros((3,4))

Ones

# a 3 x 2 ndarray full of ones
# np.ones(shape)
X = np.ones((3,2))

Full

# 2 x 3 ndarray full of fives
# np.full(shape, constant value)
X = np.full((2,3), 5)

Identity Matrix

# Identity Matrix
# Since all Identity Matrices are square, the np.eye() function only takes a single integer as an argument
# 5 x 5 Identity matrix
X = np.eye(5)

Diagonal Matrix

# Diagonal Matrix
# 4 x 4 diagonal matrix that contains the numbers 10,20,30, and 50 on its main diagonal
X = np.diag([10,20,30,50])
X
array([[10,  0,  0,  0],
       [ 0, 20,  0,  0],
       [ 0,  0, 30,  0],
       [ 0,  0,  0, 50]])

Arange

# Arange
# rank 1 ndarray that has sequential integers from 0 to 9
# x = [0 1 2 3 4 5 6 7 8 9]
x = np.arange(10)

# rank 1 ndarray that has sequential integers from 4 to 9
# [start, stop)
# x = [4 5 6 7 8 9]
x = np.arange(4,10)

# rank 1 ndarray that has evenly spaced integers from 1 to 13 in steps of 3.
# np.arange(start,stop,step)
# x = [ 1 4 7 10 13]
x = np.arange(1,14,3)

Linspace

# Linspace
# Even though the np.arange() function allows for non-integer steps,
# such as 0.3, the output is usually inconsistent, due to the finite
# floating point precision. For this reason, in the cases where
# non-integer steps are required, it is usually better to use linspace()
# because np.linspace() uses the number of elements we want in a
# particular interval, instead of the step between values.
# linspace returns N evenly spaced numbers over the closed interval [start, stop]
# np.linspace(start, stop, N)
# x = [ 0. 2.77777778 5.55555556 8.33333333 11.11111111 13.88888889 16.66666667 19.44444444 22.22222222 25. ]
x = np.linspace(0,25,10)

Reshape

# Reshape
# np.reshape(ndarray, new_shape)
# converts the given ndarray into the specified new_shape
x = np.arange(20)
x = np.reshape(x, (4,5))
# or
x = np.arange(20).reshape(4, 5) # does the same thing as above
# and the same thing with linspace
y = np.linspace(0,50,10, endpoint=False).reshape(5,2)
# One great feature about NumPy, is that some functions can also be
# applied as methods. This allows us to apply different functions in
# sequence in just one line of code

Slicing

# Slicing
# ndarray[start:end]
# ndarray[start:]
# ndarray[:end]
# ndarray[<start>:<stop>:<step>]

# Note: in all of these forms the end index is excluded, i.e. [start, end)
X = np.arange(20).reshape(4, 5)

# select all the elements that are in the 2nd through 4th rows and in the 3rd to 5th columns
Z = X[1:4,2:5]
# or
Z = X[1:,2:5]

# elements = a_list[<start>:<stop>:<step>]
# select all the elements in the 3rd row
v = X[2,:] # v = [10 11 12 13 14]
# select all the elements in the 3rd column
q = X[:,2] # q = [ 2 7 12 17]
# select all the elements in the 3rd column but return a rank 2 ndarray
R = X[:,2:3]
'''
[[ 2]
 [ 7]
 [12]
 [17]]
'''
# Note: Slicing creates a view, not a copy
# when we make assignments, such as: Z = X[1:4,2:5]
# the slice of the original array X is not copied in the variable Z.
# Rather, X and Z are now just two different names for the same ndarray.
# We say that slicing only creates a view of the original array.
# This means if we make changes to Z, X changes as well.
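# A minimal check of the view behavior described above:
Z = X[1:4,2:5]
Z[0,0] = 555
print(X[1,2])  # 555, the change made through Z shows up in X as well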

Random

# Random
# 3 x 3 ndarray with random floats in the half-open interval [0.0, 1.0).
# np.random.random(shape)
X = np.random.random((3,3))
# np.random.randint(start, stop, size = shape)
# [start, stop)
X = np.random.randint(4,15,size=(3,2))

# create ndarrays with random numbers that satisfy certain statistical properties
# 1000 x 1000 ndarray of random floats drawn from normal (Gaussian)
# distribution with a mean of zero and a standard deviation of 0.1.
# np.random.normal(mean, standard deviation, size=shape)
X = np.random.normal(0, 0.1, size=(1000,1000))

Mutability

# Mutability
# Change ndarray
x[3] = 20
X[0,0] = 20

Delete

# Delete
# np.delete(ndarray, elements, axis)
x = np.array([1, 2, 3, 4, 5])
# delete the first and fifth element of x
x = np.delete(x, [0,4])

Y = np.array([[1,2,3],[4,5,6],[7,8,9]])
# delete the first row of Y
w = np.delete(Y, 0, axis=0)
# delete the first and last column of Y
v = np.delete(Y, [0,2], axis=1)
v
array([[2],
       [5],
       [8]])

Append

# Append
# np.append(ndarray, elements, axis)
# append the integer 6 to x
x = np.append(x, 6)
# append the integer 7 and 8 to x
x = np.append(x, [7,8])
# append a new row containing 10,11,12 to Y
v = np.append(Y, [[10,11,12]], axis=0)
# append a new column containing 13, 14 and 15 to Y
q = np.append(Y,[[13],[14],[15]], axis=1)

Insert

# Insert
# np.insert(ndarray, index, elements, axis)
# inserts the given list of elements to ndarray right before
# the given index along the specified axis
x = np.array([1, 2, 5, 6, 7])
Y = np.array([[1,2,3],[7,8,9]])
# insert the integers 3 and 4 between 2 and 5 in x. 
x = np.insert(x,2,[3,4])
# insert a row between the first and last row of Y
w = np.insert(Y,1,[4,5,6],axis=0)
# insert a column full of 5s between the first and second column of Y
v = np.insert(Y,1,5, axis=1)

Stacking

# Stacking
# NumPy also allows us to stack ndarrays on top of each other,
# or to stack them side by side. The stacking is done using either
# the np.vstack() function for vertical stacking, or the np.hstack()
# function for horizontal stacking. It is important to note that in
# order to stack ndarrays, the shape of the ndarrays must match.
x = np.array([1,2])
Y = np.array([[3,4],[5,6]])
z = np.vstack((x,Y)) # [[1,2], [3,4], [5,6]]
w = np.hstack((Y,x.reshape(2,1))) # [[3,4,1], [5,6,2]]

Copy

# Copy
# if we want to create a new ndarray that contains a copy of the
# values in the slice we need to use the np.copy()
# create a copy of the slice using the np.copy() function
Z = np.copy(X[1:4,2:5])
#  create a copy of the slice using the copy as a method
W = X[1:4,2:5].copy()

Extract elements along the diagonal

# Extract elements along the diagonal
d0 = np.diag(X)
# The default is k=0, which refers to the main diagonal.
# Values of k > 0 are used to select elements in diagonals above
# the main diagonal, and values of k < 0 are used to select elements
# in diagonals below the main diagonal.
d1 = np.diag(X, k=1)
d2 = np.diag(X, k=-1)

Find Unique Elements

# Find Unique Elements in ndarray
u = np.unique(X)

Boolean Indexing

# Boolean Indexing
X = np.arange(25).reshape(5, 5)
print('The elements in X that are greater than 10:', X[X > 10])
print('The elements in X that are less than or equal to 7:', X[X <= 7])
print('The elements in X that are between 10 and 17:', X[(X > 10) & (X < 17)])

# use Boolean indexing to assign the elements that
# are between 10 and 17 the value of -1
X[(X > 10) & (X < 17)] = -1
The elements in X that are greater than 10: [11 12 13 14 15 16 17 18 19 20 21 22 23 24]
The elements in X that are less than or equal to 7: [0 1 2 3 4 5 6 7]
The elements in X that are between 10 and 17: [11 12 13 14 15 16]

Set Operations

# Set Operations
x = np.array([1,2,3,4,5])
y = np.array([6,7,2,8,4])
print('The elements that are both in x and y:', np.intersect1d(x,y))
print('The elements that are in x that are not in y:', np.setdiff1d(x,y))
print('All the elements of x and y:',np.union1d(x,y))
The elements that are both in x and y: [2 4]
The elements that are in x that are not in y: [1 3 5]
All the elements of x and y: [1 2 3 4 5 6 7 8]

Sorting

# Sorting
# When used as a function, it doesn't change the original ndarray
s = np.sort(x)
# When used as a method, the original array will be sorted
x.sort()

# sort x but only keep the unique elements in x
s = np.sort(np.unique(x))

# sort the columns of X
s = np.sort(X, axis = 0)

# sort the rows of X
s = np.sort(X, axis = 1)

Math Functions

# NumPy allows element-wise operations on ndarrays as well as
# matrix operations. In order to do element-wise operations,
# NumPy sometimes uses something called Broadcasting.
# Broadcasting is the term used to describe how NumPy handles
# element-wise arithmetic operations with ndarrays of different shapes.
# For example, broadcasting is used implicitly when doing arithmetic
# operations between scalars and ndarrays.
x = np.array([1,2,3,4])
y = np.array([5.5,6.5,7.5,8.5])
np.add(x,y)
np.subtract(x,y)
np.multiply(x,y)
np.divide(x,y)

# the ndarrays being operated on must have the same shape
# or be broadcastable for these operations to work
X = np.array([1,2,3,4]).reshape(2,2)
Y = np.array([5.5,6.5,7.5,8.5]).reshape(2,2)
np.add(X,Y)
np.subtract(X,Y)
np.multiply(X,Y)
np.divide(X,Y)

# apply mathematical functions to all elements of an ndarray at once
np.exp(x)
np.sqrt(x)
np.power(x,2)
array([ 1,  4,  9, 16])

Statistical Functions

# Statistical Functions
print('Average of all elements in X:', X.mean())
print('Average of all elements in the columns of X:', X.mean(axis=0))
print('Average of all elements in the rows of X:', X.mean(axis=1))
print()
print('Sum of all elements in X:', X.sum())
print('Standard Deviation of all elements in X:', X.std())
print('Median of all elements in X:', np.median(X))
print('Maximum value of all elements in X:', X.max())
print('Minimum value of all elements in X:', X.min())
Average of all elements in X: 2.5
Average of all elements in the columns of X: [2. 3.]
Average of all elements in the rows of X: [1.5 3.5]

Sum of all elements in X: 10
Standard Deviation of all elements in X: 1.118033988749895
Median of all elements in X: 2.5
Maximum value of all elements in X: 4
Minimum value of all elements in X: 1

Broadcasting

# Broadcasting
# NumPy works behind the scenes to broadcast the scalar 4 along the
# ndarray so that the operands have compatible shapes. This lets us
# combine 4 with each element of X in just one line of code.
print(4*X)
print(4+X)
print(4-X)
print(4/X)
# NumPy is able to add 1 x 3 and 3 x 1 ndarrays to 3 x 3 ndarrays
# by broadcasting the smaller ndarrays along the big ndarray so that
# they have compatible shapes. In general, NumPy can do this provided
# that the smaller ndarray can be expanded to the shape of the larger
# ndarray in such a way that the resulting broadcast is unambiguous.
x = np.array([1,2,3])
Y = np.array([[1,2,3],[4,5,6],[7,8,9]])
Z = np.array([1,2,3]).reshape(3,1)
print(x + Y)
print(Z + Y)
[[ 4  8]
 [12 16]]
[[5 6]
 [7 8]]
[[3 2]
 [1 0]]
[[4.         2.        ]
 [1.33333333 1.        ]]
[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]
[[ 2  3  4]
 [ 6  7  8]
 [10 11 12]]

Pandas


  • Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas incorporates two additional data structures into Python, namely Pandas Series and Pandas DataFrame. These data structures allow us to work with labeled and relational data in an easy and intuitive manner.
  • Pandas Series and DataFrames are designed for fast data analysis and manipulation, as well as being flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:
    • Allows the use of labels for rows and columns
    • Can calculate rolling statistics on time series data
    • Easy handling of NaN values
    • Is able to load data of different formats into DataFrames
    • Can join and merge different datasets together
    • It integrates with NumPy and Matplotlib
  • Documentation: https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd

Series

1D array-like object that can hold many data types. One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. Another big difference is that Pandas Series can hold data of different data types.

pd.Series(data, index)

Create Series

groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])
groceries
eggs       30
apples      6
milk      Yes
bread      No
dtype: object

shape, size, values, index, ndim

print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')
print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)
Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements
The data in Groceries is: [30 6 'Yes' 'No']
The index of Groceries is: Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')

check whether an index label exists in Series

# check whether an index label exists in Series
x = 'bananas' in groceries

Accessing Elements

# Accessing Elements
# using index labels:
# single index label
print('How many eggs do we need to buy:', groceries['eggs'])
# access multiple index labels
print('Do we need milk and bread:\n', groceries[['milk', 'bread']]) 
# use loc to access multiple index labels
print('How many eggs and apples do we need to buy:\n', groceries.loc[['eggs', 'apples']]) 

# access elements in Groceries using numerical indices:
# use multiple numerical indices
print('How many eggs and apples do we need to buy:\n',  groceries[[0, 1]]) 
# use a negative numerical index
print('Do we need bread:\n', groceries[[-1]]) 
# use a single numerical index
print('How many eggs do we need to buy:', groceries[0]) 
# use iloc (stands for integer location) to access multiple numerical indices
print('Do we need milk and bread:\n', groceries.iloc[[2, 3]])
# Since we can access elements in various ways, in order to remove
# any ambiguity to whether we are referring to an index label
# or numerical index, Pandas Series have two attributes,
# .loc and .iloc to explicitly state what we mean. The attribute
# .loc stands for location and it is used to explicitly state that
# we are using a labeled index. Similarly, the attribute .iloc stands
# for integer location and it is used to explicitly state that we are
# using a numerical index.
How many eggs do we need to buy: 30
Do we need milk and bread:
 milk     Yes
bread     No
dtype: object
How many eggs and apples do we need to buy:
 eggs      30
apples     6
dtype: object
How many eggs and apples do we need to buy:
 eggs      30
apples     6
dtype: object
Do we need bread:
 bread    No
dtype: object
How many eggs do we need to buy: 30
Do we need milk and bread:
 milk     Yes
bread     No
dtype: object
# access using Boolean indexing (requires comparable values; time_light
# below is a hypothetical numeric Series created just for this example)
time_light = pd.Series([8.3, 43.2, 35.0], index = ['Earth', 'Jupiter', 'Mars'])
time_light[time_light < 40]
# Earth     8.3
# Mars     35.0
# dtype: float64

Change Elements

# Change Elements
groceries['eggs'] = 2

Delete Elements

# Delete Elements
# doesn't change the original Series being modified
groceries.drop('apples')
# delete items from Series in place by setting keyword inplace to True
groceries.drop('apples', inplace = True)

Arithmetic Operations

# Arithmetic Operations
# we can perform element-wise arithmetic operations on Pandas Series
fruits = pd.Series(data = [10, 6, 3,], index = ['apples', 'oranges', 'bananas'])
fruits + 2 # Adds 2 to all elements in the series
fruits - 2
fruits * 2
fruits / 2
# apply mathematical functions from NumPy to all elements of a Series
np.exp(fruits)
np.sqrt(fruits)
np.power(fruits,2)
# only apply arithmetic operations on selected items in Series
fruits['bananas'] + 2
fruits.iloc[0] - 2
fruits[['apples', 'oranges']] * 2
# you can apply arithmetic operations on a Series of mixed data
# type provided that the arithmetic operation is defined for all
# data types in the Series, otherwise you will get an error
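# a quick sketch of that rule:
mixed = pd.Series([1, 2, 'hello'])
mixed * 2    # works: the numbers are doubled and the string is repeated ('hellohello')
# mixed + 2  # would raise TypeError: + is not defined between str and int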

Dataframe

Pandas DataFrames are two-dimensional data structures with labeled rows and columns, that can hold many data types.
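Many of the snippets below operate on a generic df. A minimal placeholder you could run them against (the column names here are hypothetical):

df = pd.DataFrame({'column_x': [1, 2, 3], 'column_y': [10, 20, 30]})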

Axes

# understanding axes
df.sum()       
# sums “down” the 0 axis (rows)
df.sum(axis=0) 
# equivalent (since axis=0 is the default)
df.sum(axis=1) 
# sums “across” the 1 axis (columns)

Loading Data into DF

# Loading Data into DF
df = pd.read_csv('marauders_map.csv')

# limit which rows are read when reading in a file
pd.read_csv('df.csv', nrows=10)        
# only read first 10 rows

pd.read_csv('df.csv', skiprows=[1, 2]) 
# skip the first two rows of data

# randomly sample a DataFrame
train = df.sample(frac=0.75, random_state=1) 
# will contain 75% of the rows

test = df[~df.index.isin(train.index)] 
# will contain the other 25%

# change the maximum number of rows and columns printed ('None' means unlimited)
pd.set_option('display.max_rows', None) 
# default is 60 rows

pd.set_option('display.max_columns', None) 
# default is 20 columns
print(df)

# reset options to defaults
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

# change the options temporarily (settings are restored when you exit the 'with' block)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

Create Dataframe

# Create a DataFrame manually from a dictionary of Pandas Series

# create a dictionary of Pandas Series 
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# print the type of items to see that it is a dictionary
print(type(items)) # class 'dict'

# create a Pandas DataFrame by passing it a dictionary of Series
shopping_carts = pd.DataFrame(items)

# create a DataFrame that only has a subset of the data/columns
bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])

# create a DataFrame that only has selected keys
sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])

# combine both of the above - selected keys for selected columns
alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])

# create DataFrames from a dictionary of lists (arrays)
# In this case, however, all the lists (arrays) in the dictionary must be of the same length

# create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# create a DataFrame 
df = pd.DataFrame(data)

# create a DataFrame and provide the row index
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

# create DataFrames from a list of Python dictionaries
# create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# create a DataFrame 
store_items = pd.DataFrame(items2)

# create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

Create df from Series, dicts

# Create dictionary from a bunch of Series/data
books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', ' H. G. Wells', 'Lewis Carroll' ])
user_1 = pd.Series(data = [3.2, np.nan ,2.5])
user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])

# Create a dictionary with the data given above
a_dict = {'Author':authors,'Book Title':books,'User 1':user_1, 'User 2':user_2, 'User 3':user_3, 'User 4':user_4}

# Use the dictionary to create a Pandas DataFrame
book_ratings = pd.DataFrame(a_dict)
book_ratings[:5]
# convert to numpy array (remove the column names, get just the values to convert it into a numpy array)
book_ratings_numpy = book_ratings.values
book_ratings_numpy
# create a DataFrame from a dictionary
pd.DataFrame({'column_x': ['value_x1', 'value_x2', 'value_x3'], 'column_y': ['value_y1', 'value_y2', 'value_y3']})

# create a DataFrame from a list of lists
pd.DataFrame([['value_x1', 'value_y1'], ['value_x2', 'value_y2'], ['value_x3', 'value_y3']], columns=['column_x', 'column_y'])

Access Elements

# Access Elements
print()
print('How many bikes are in each store:\n', store_items[['bikes']])
print()
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
print()
print('What items are in Store 1:\n', store_items.loc[['store 1']])
print()
print('How many bikes are in Store 2:', store_items['bikes']['store 2'])
# when accessing individual elements in a DataFrame, the labels
# should always be provided with the column label first,
# i.e. in the form dataframe[column][row]
# store_items for reference:
#          bikes	glasses	pants	watches
# store 1	    20	   NaN	   30	     35
# store 2	    15	  50.0	    5	     10

Modify Elements

# Modify Elements
# Add new column (adds it to the end of the df)
store_items['shirts'] = [15,2]

# New column via arithmetic operations b/w columns
store_items['suits'] = store_items['pants'] + store_items['shirts']

# Add new row

# To add rows to our df, create a new df then append it to the original df
# create a dictionary from a list of Python dictionaries
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]

# create new DataFrame with the new_items and provide and index labeled store 3
new_store = pd.DataFrame(new_items, index = ['store 3'])

# append store 3 to our store_items DataFrame
# (note: DataFrame.append was removed in pandas 2.0; pd.concat([store_items, new_store]) is the modern equivalent)
store_items = store_items.append(new_store)

# insert a new column with label shoes right before the column with numerical index 4
store_items.insert(4, 'shoes', [8,5,0])

Delete Element

# Delete Element

# .pop() method only allows us to delete columns, while the .drop()
# method can be used to delete both rows and columns by use of the axis keyword

# remove the new watches column
store_items.pop('new watches')

# remove the watches and shoes columns
store_items = store_items.drop(['watches', 'shoes'], axis = 1)

# remove the store 2 and store 1 rows
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)

Rename the row and column labels

# Rename the row and column labels
# change the column label
store_items = store_items.rename(columns = {'bikes': 'hats'})
# change the row label
store_items = store_items.rename(index = {'store 3': 'last store'})

Change index

# change the index to be one of the columns in the DataFrame
store_items = store_items.set_index('pants')

Dealing with NaN values (missing data)

# Dealing with NaN values (missing data)

# create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# We create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# check if we have any NaN values in our dataset
# .any() performs an or operation. If any of the values along the
# specified axis is True, this will return True.
df.isnull().any()
'''
Date   False
Open   True
High   False
Low    False
Close  False
Volume False
dtype: bool
'''

# count the number of NaN values in DataFrame
x =  store_items.isnull().sum().sum()
# count the number of non-NaN values in DataFrame
x = store_items.count()

# remove rows or columns from our DataFrame that contain any NaN values

# drop any rows with NaN values
store_items.dropna(axis = 0)

# drop any columns with NaN values
store_items.dropna(axis = 1)

# the original DataFrame is not modified by default
# to remove missing values from original df, use inplace = True
store_items.dropna(axis = 0, inplace = True)

# replace all NaN values with 0
store_items.fillna(0)

# forward filling: replace NaN values with the values that come before
# them in the DataFrame. When forward filling, the previous values can
# be taken along columns (axis=0) or rows (axis=1).
# replace NaN values with the previous value in the column
store_items.fillna(method = 'ffill', axis = 0)

# backward filling: replace the NaN values with the values that
# go after them in the DataFrame
# replace NaN values with the next value in the row
store_items.fillna(method = 'backfill', axis = 1)

# replace NaN values by using linear interpolation using column values
store_items.interpolate(method = 'linear', axis = 0)

# the original DataFrame is not modified. replace the NaN values
# in place by setting inplace = True inside function
store_items.fillna(method = 'ffill', axis = 0, inplace = True)
store_items.interpolate(method = 'linear', axis = 0, inplace = True)

head, tail, describe, max, memory_usage

df.head()
df.tail()
df.describe()
# prints max value in each column
df.max()

# display the memory usage of a DataFrame
# total usage
df.info()
# usage by column
df.memory_usage()

corr

# get the correlation between different columns
df.corr()

Groupby

# Groupby
data.groupby(['Year'])
data.groupby(['Year'])['Salary']

# display the average salary per year
data.groupby(['Year'])['Salary'].mean()

# display the total salary each employee received in all the years they worked for the company
data.groupby(['Name'])['Salary'].sum()

# group the data by Year and by Department
data.groupby(['Year', 'Department'])['Salary'].sum()

Replace Values

# Replace Values
s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
s.map({'cat': 'kitten', 'dog': 'puppy'})
# another e.g.
df['label'] = df['label'].map({'ham':0,'spam':1})

Reading Files

# reading in a file from local computer or directly from a URL

# various file formats that can be read in or written out
'''
Format Type     Data Description      Reader           Writer
text                  CSV            read_csv          to_csv
text                 JSON            read_json         to_json
text                 HTML            read_html         to_html
text             Local clipboard  read_clipboard     to_clipboard
binary             MS Excel          read_excel        to_excel
binary            HDF5 Format        read_hdf           to_hdf
binary           Feather Format     read_feather      to_feather
binary              Msgpack         read_msgpack      to_msgpack
binary               Stata           read_stata        to_stata
binary                SAS             read_sas 
binary        Python Pickle Format   read_pickle       to_pickle
SQL                   SQL             read_sql          to_sql
SQL             Google Big Query      read_gbq          to_gbq
'''

# to read about different types of files, and further functionality of reading in files, visit: http://pandas.pydata.org/pandas-docs/version/0.20/io.html
df = pd.read_csv('local_path/file.csv')
df = pd.read_csv('https://file_path/file.csv')

# when reading in tables, you can specify separators and a column to be used as the index; separators can include tabs ("\t"), commas (","), pipes ("|"), etc.
df = pd.read_table('https://file_path/file', sep='|', index_col='column_x')

Summarizing

# examine the df data
df           
# print the first 30 and last 30 rows
type(df)     
# DataFrame
df.head()    
# print the first 5 rows
df.head(10)  
# print the first 10 rows
df.tail()    
# print the last 5 rows
df.index     
# “the index” (aka “the labels”)
df.columns   
# column names (which is “an index”)
df.dtypes    
# data types of each column
df.shape
# number of rows and columns
df.shape[0]
# display only the number of rows
df.values    
# underlying numpy array; DataFrames are stored on top of numpy arrays for efficiency

# summarize (describe) the DataFrame
# describe all numeric columns
df.describe()

# describe all object columns
df.describe(include=['object'])

# describe all columns
df.describe(include='all')

Working with Columns

# select a column
df['column_y']         
# select one column
type(df['column_y'])   
# determine datatype of column (e.g., Series)
df.column_y            
# select one column using the DataFrame attribute — not effective if column names have spaces

# summarize a Series/column
df.column_y.describe()   
# describe a single column
df.column_z.mean()       
# only calculate the mean
df["column_z"].mean()    
# alternate method for calculating mean

# count the number of occurrences of each value
df.column_y.value_counts()   
# most useful for categorical variables, but can also be used with numeric variables

# filter df by one column, and print out values of another column
# when using numeric values, no quotations
df[df.column_y == "string_value"].column_z
df[df.column_y == 20].column_z    
 
# display the 3 most frequent occurrences of column_y in 'df'
df.column_y.value_counts()[0:3]

Filtering and Sorting

# boolean filtering: only show df with column_z < 20
filter_bool = df.column_z < 20    
# create a Series of booleans…
df[filter_bool]                
# …and use that Series to filter rows
df[filter_bool].describe()     
# describes a data frame filtered by filter_bool
df[df.column_z < 20]           
# or, combine into a single step
df[df.column_z < 20].column_x  
# select one column from the filtered results
df[df["column_z"] < 20].column_x     
# alternate method 
df[df.column_z < 20].column_x.value_counts()   
# value_counts of resulting Series, can also use .mean(), etc. instead of .value_counts()

# boolean filtering with multiple conditions; indexes are in square brackets, conditions are in parens
df[(df.column_z < 20) & (df.column_y == 'string')] 
# ampersand for AND condition 
df[(df.column_z < 20) | (df.column_z > 60)] 
# pipe for OR condition

# can also filter df using pandas.Series.isin 
df[df.column_x.isin(["string_1", "string_2"])]

# display a cross-tabulation of two Series
pd.crosstab(df.column_x, df.column_y)

# alternative syntax for boolean filtering (noted as “experimental” in the documentation)
df.query('column_z < 20') 
# df[df.column_z < 20]
df.query("column_z < 20 and column_y=='string'")  
# df[(df.column_z < 20) & (df.column_y == 'string')]
df.query('column_z < 20 or column_z > 60')        
# df[(df.column_z < 20) | (df.column_z > 60)]

Sorting

# sorting
df.column_z.sort_values()          
# sort a column (the old Series.order() was removed from pandas)
df.sort_values('column_z')   
# sort a DataFrame by a single column
df.sort_values('column_z', ascending=False)     
# use descending order instead

# Sort dataframe by multiple columns
df = df.sort_values(['col1','col2','col3'], ascending=[True, True, False])

Selecting Multiple Columns and Filtering Rows

# select multiple columns
my_cols = ['column_x', 'column_y']  
# create a list of column names…
df[my_cols]                   
# …and use that list to select columns
df[['column_x', 'column_y']]  
# or, combine into a single step (double brackets because we index with a list)

# use loc to select columns by name
df.loc[:, 'column_x']    
# colon means "all rows", then select one column
df.loc[:, ['column_x', 'column_y']]  
# select two columns
df.loc[:, 'column_x':'column_y']     
# select a range of columns (i.e., selects all columns including first through last specified)

# loc can also filter rows by “name” (the index)
df.loc[0, :]       
# row 0, all columns
df.loc[0:2, :]     
# rows 0/1/2, all columns
df.loc[0:2, 'column_x':'column_y'] 
# rows 0/1/2, range of columns

# use iloc to filter rows and select columns by integer position
df.iloc[:, [0, 3]]     
# all rows, columns in position 0/3
df.iloc[:, 0:4]        
# all rows, columns in position 0/1/2/3
df.iloc[0:3, :]        
# rows in position 0/1/2, all columns

# filtering out and dropping rows based on a condition (e.g., where column_x values are null)
drop_rows = df[df["column_x"].isnull()]
new_df = df[~df.isin(drop_rows)].dropna(how='all')

Renaming, Adding, and Removing Columns

# rename one or more columns
df.rename(columns={'original_column_1':'column_x', 'original_column_2':'column_y'}, inplace=True) 
# saves changes 

# replace all column names (in place)
new_cols = ['column_x', 'column_y', 'column_z']
df.columns = new_cols

# replace all column names when reading the file
df = pd.read_csv('df.csv', header=0, names=new_cols)

# add a new column as a function of existing columns
df['new_column_1'] = df.column_x + df.column_y
df['new_column_2'] = df.column_x * 1000   
# can create new columns without for loops

# removing columns
df.drop('column_x', axis=1)   
# axis=0 for rows, 1 for columns; does not drop in place
df.drop(['column_x', 'column_y'], axis=1, inplace=True) 
# drop multiple columns

Lower-case all DataFrame column names

# Lower-case all DataFrame column names
df.columns = df.columns.str.lower()

# Even more fancy DataFrame column re-naming
# lower-case all DataFrame column names (for example)
df.rename(columns=lambda x: x.split('.')[-1], inplace=True)

Handling Missing Values

# missing values are usually excluded by default
df.column_x.value_counts()             
# excludes missing values

df.column_x.value_counts(dropna=False) 
# includes missing values

# find missing values in a Series
df.column_x.isnull()  
# True if missing

df.column_x.notnull() 
# True if not missing

# use a boolean Series to filter DataFrame rows
df[df.column_x.isnull()]  
# only show rows where column_x is missing

df[df.column_x.notnull()] 
# only show rows where column_x is not missing


# adding booleans
pd.Series([True, False, True])       
# create a boolean Series
pd.Series([True, False, True]).sum() 
# converts False to 0 and True to 1

# find missing values in a DataFrame
df.isnull() 
# DataFrame of booleans
df.isnull().sum() 
# count the missing values in each column

# drop missing values
df.dropna(inplace=True)   
# drop a row if ANY values are missing, defaults to rows, but can be applied to columns with axis=1
df.dropna(how='all', inplace=True)  
# drop a row only if ALL values are missing

# fill in missing values
df.column_x.fillna(value='NA', inplace=True) 

# fill in missing values with 'NA'
# value does not have to be a string; it can be a calculated value like df.column_x.mode(), or just a number like 0 

# turn off the missing value filter
df = pd.read_csv('df.csv', header=0, names=new_cols, na_filter=False)

# Clean up missing values in multiple DataFrame columns
df = df.fillna({
  'col1': 'missing',
  'col2': '99.999',
  'col3': '999',
  'col4': 'missing',
  'col5': 'missing',
  'col6': '99'
})

# Concatenate two DataFrame columns into a new, single column - (useful when dealing with composite keys, for example)
df['newcol'] = df['col1'].map(str) + df['col2'].map(str)

# Doing calculations with DataFrame columns that have missing values

# In the example below, swap in 0 for df['col1'] cells that contain null
df['new_col'] = np.where(pd.isnull(df['col1']), 0, df['col1']) + df['col2']

Handling Duplicated Values

# detecting duplicate rows
df.duplicated()

# True if a row is identical to a previous row
df.duplicated().sum()

# count of duplicates
df[df.duplicated()]

# only show duplicates
df.drop_duplicates()

# drop duplicate rows
df.column_z.duplicated()

# check a single column for duplicates
df.duplicated(['column_x', 'column_y', 'column_z']).sum()  
# specify columns for finding duplicates

Split-Apply-Combine

# for each value in column_x, calculate the mean column_y 
df.groupby('column_x').column_y.mean()

# for each value in column_x, count the number of occurrences
df.column_x.value_counts()

# for each value in column_x, describe column_y
df.groupby('column_x').column_y.describe()

# similar, but outputs a DataFrame and can be customized
df.groupby('column_x').column_y.agg(['count', 'mean', 'min', 'max'])
df.groupby('column_x').column_y.agg(['count', 'mean', 'min', 'max']).sort_values('mean')

# if you don't specify a column to which the aggregation function should be applied, it will be applied to all numeric columns
df.groupby('column_x').mean()
df.groupby('column_x').describe()

# can also groupby a list of columns, i.e., for each combination of column_x and column_y, calculate the mean column_z
df.groupby(["column_x", "column_y"]).column_z.mean()

# to take groupby results out of hierarchical index format (e.g., to present as a table), use the .unstack() method
df.groupby("column_x").column_y.value_counts().unstack()

# conversely, if you want to transform a table into a hierarchical index, use the .stack() method
df.stack()

Merging and Concatenating Dataframes

# concatenating two dfs together (just smooshes them together, does not pair them in any meaningful way); axis=1 concats df2 to the right side of df1, axis=0 concats df2 to the bottom of df1
new_df = pd.concat([df1, df2], axis=1)

# merging dfs based on paired columns; columns do not need to have the same name, but should match values; left_on column comes from df1, right_on column comes from df2
new_df = pd.merge(df1, df2, left_on='column_x', right_on='column_y')

# can also merge slices of dfs together, though slices need to include the columns used for merging
new_df = pd.merge(df1[['column_x1', 'column_x2']], df2, left_on='column_x2', right_on='column_y')

# merging two dataframes based on shared index values (left is df1, right is df2)
new_df = pd.merge(df1, df2, left_index=True, right_index=True)

Frequently Used Features

map existing values to a different set of values

df['column_x'] = df.column_y.map({'F':0, 'M':1})

encode strings as integer values (automatically starts at 0)

df['column_x_num'] = df.column_x.factorize()[0]

count the number of unique values in a column

df.column_x.nunique() 

determine the unique values in a column

df.column_x.unique()    
# returns the unique values

replace all instances of a value in a column (must match entire value)

df.column_y.replace('old_string', 'new_string', inplace=True)

alter values in one column based on values in another column

# changes occur in place
df.loc[df["column_x"] == 5, "column_y"] = 1
df.loc[df.column_x == "string_value", "column_y"] = "new_string_value"
# (the old .ix indexer was removed from pandas; use .loc)

transpose data frame (i.e. rows become columns, columns become rows)

df.T

string methods are accessed via 'str'

df.column_y.str.upper()

converts to uppercase

df.column_y.str.contains('value', na=False) 
# checks for a substring; returns a boolean Series (na=False treats missing values as False)

convert a string column to the datetime format

df['time_column'] = pd.to_datetime(df.time_column)
df.time_column.dt.hour 

the datetime format exposes convenient attributes

(df.time_column.max() - df.time_column.min()).days

boolean filtering with the datetime format

df[df.time_column > pd.Timestamp(2014, 1, 1)]   
# also allows you to do datetime "math"

setting and then removing an index; resetting the index can help remove hierarchical indexes while preserving the table in its basic structure

df.set_index('time_column', inplace=True)
df.reset_index(inplace=True)

sort a column by its index

df.column_y.value_counts().sort_index()

change the data type of a column

df['column_x'] = df.column_x.astype('float')

change the data type of a column when reading in a file

pd.read_csv('df.csv', dtype={'column_x': float})

create dummy variables for 'column_x' and exclude the first dummy column

column_x_dummies = pd.get_dummies(df.column_x).iloc[:, 1:]

concatenate two DataFrames (axis=0 for rows, axis=1 for columns)

df = pd.concat([df, column_x_dummies], axis=1)

Loop through rows in a DataFrame

# Loop through rows in a DataFrame
for index, row in df.iterrows():
    print(index, row['column_x'])

# Much faster way to loop through DataFrame rows if you can work with tuples
for row in df.itertuples():
    print(row)

Get rid of non-numeric values throughout a DataFrame

for col in df.columns.values:
    df[col] = df[col].replace('[^0-9.-]+', '', regex=True)  # strip everything except digits, dots, and minus signs

Change all NaNs to None (useful before loading to a db)

df = df.where((pd.notnull(df)), None)

Split delimited values in a DataFrame column into two new columns

df['new_col1'], df['new_col2'] = zip(*df['original_col'].apply(lambda x: x.split(': ', 1)))

Collapse hierarchical column indexes

df.columns = df.columns.get_level_values(0)

change a Series to the 'category' data type (reduces memory usage and increases performance)

df['column_y'] = df.column_y.astype('category')

temporarily define a new column as a function of existing columns

df.assign(new_column = df.column_x + df.column_y)

If you like this kernel, please give it an upvote. Thank you! :)
