@jtrive84
Created November 28, 2016 02:24
An introduction to common Python data structures and iteration.

Python Data Structures and Iteration

The first part of this introduction discusses two very useful built-in functions, enumerate and zip:

enumerate(iterable[, start=0])
zip(iterable_1, iterable_2,...)

enumerate can be thought of as a special case of zip. It takes a single iterable (a list, tuple, string, set, etc.) as its argument and returns an iterator of 2-tuples, each pairing a running index with the corresponding element of the iterable. The index starts at 0 by default, or at any integer passed as start. For example:

# List of last 5 closing prices for American Express:
AXP = [59.48, 59.15, 60.69, 60.76, 59.63]

# Associate `AXP` closing prices with an index starting at 0:
enum_AXP = enumerate(AXP)
print("0-indexed starting point:")
for i in enum_AXP: print(i)
    
# You can start the enumeration at 1 by passing `start=1` to enumerate:
enum_AXP = enumerate(AXP, start=1)
print("1-indexed starting point:")
for i in enum_AXP: print(i)   
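Because each item produced by enumerate is an (index, element) pair, the pair can be unpacked directly in the for statement. The sketch below reuses the AXP prices from above and also illustrates the "special case of zip" remark by pairing a counter from itertools.count with the same list; the variable names day, price, pairs_from_enumerate and pairs_from_zip are chosen only for illustration:

```python
from itertools import count

# Closing prices copied from the example above.
AXP = [59.48, 59.15, 60.69, 60.76, 59.63]

# Unpack each (index, price) pair directly in the for statement.
for day, price in enumerate(AXP, start=1):
    print("Day {}: {:.2f}".format(day, price))

# enumerate(AXP, start=1) yields the same pairs as zipping a counter with AXP.
pairs_from_enumerate = list(enumerate(AXP, start=1))
pairs_from_zip = list(zip(count(1), AXP))
print(pairs_from_enumerate == pairs_from_zip)
```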

zip is a generalization of enumerate. Whereas enumerate associates the elements of an iterable with a consecutive stream of integers, zip associates elements position-wise from any number of iterables, returning a stream of n-tuples:

peril  = ['BLD_WATR', 'BLD_FIRE', 'CONT_WTHR', 'GL_AIPI', 'EB']
base   = [.0534, .0349, .0368, .0743, .0030]
expos  = [700000, 700000, 240000, 375.95, 940000]

# Calling `zip` on peril, base and expos results in an iterator of
# 3-tuples, which can be consumed by any iteration scheme
# (for loop, list comprehension, etc.). The loop variable is named
# `item` so that it does not overwrite the `peril` list:
policy = zip(peril, base, expos)
for item in policy: print(item)
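One detail worth keeping in mind is that the object returned by zip is a one-shot iterator: once it has been looped over, it is empty. The following sketch, which reuses the three lists from above, shows this and also unpacks each 3-tuple in the for statement. The names first_pass, second_pass, name, rate and exposure are illustrative only:

```python
peril = ['BLD_WATR', 'BLD_FIRE', 'CONT_WTHR', 'GL_AIPI', 'EB']
base  = [.0534, .0349, .0368, .0743, .0030]
expos = [700000, 700000, 240000, 375.95, 940000]

policy = zip(peril, base, expos)

# Materializing the iterator consumes it...
first_pass = list(policy)
# ...so a second pass over the same zip object finds nothing left.
second_pass = list(policy)
print(len(first_pass), len(second_pass))

# Call zip again for a fresh iterator, unpacking each 3-tuple as we go.
for name, rate, exposure in zip(peril, base, expos):
    print(name, rate, exposure)
```

If the tuples are needed more than once, storing them in a list (as first_pass does) is the simplest option.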

A sequence of n-tuples (such as the tuples produced by policy above) can be "unpacked" into n separate sequences by calling the zip function with the iterable name preceded by *. The following extracts the 3 original sequences, each returned as a tuple, from a list holding the same 3-tuples:

policy = [('BLD_WATR', 0.0534, 700000),('BLD_FIRE', 0.0349, 700000),
          ('CONT_WTHR', 0.0368, 240000),('GL_AIPI', 0.0743, 375.95),
          ('EB', 0.003, 940000)]

#unpacking each member list:
peril, base, expos = zip(*policy)
print(peril)
print(base)
print(expos)

Note that when using zip, the resulting stream of n-tuples will only be as long as the shortest iterable passed to the function. zip works best when dealing with sequences of equal length. If the sequences in question are of unequal length, the itertools module offers the zip_longest function, which takes a fillvalue argument to replace missing values:

import itertools

s1 = [1, 2, 3]
s2 = ['a', 'b']
s3 = [100, 200, 300, 400]

result = itertools.zip_longest(s1, s2, s3, fillvalue="N/A")
for r in result: print(r)
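To make the difference concrete, the sketch below runs both zip and itertools.zip_longest over the same three sequences. The names truncated and padded are illustrative:

```python
import itertools

s1 = [1, 2, 3]
s2 = ['a', 'b']
s3 = [100, 200, 300, 400]

# zip stops as soon as the shortest input (s2) is exhausted.
truncated = list(zip(s1, s2, s3))
print(truncated)   # [(1, 'a', 100), (2, 'b', 200)]

# zip_longest runs to the end of the longest input (s3) and pads the rest.
padded = list(itertools.zip_longest(s1, s2, s3, fillvalue="N/A"))
print(padded)
```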

Python Dictionaries

Dictionaries are collections of key-value pairs. Items are stored and fetched by key as opposed to positional offset.

  • The dict data structure is a mutable mapping, and values may be altered in place.

  • dicts in Python are associative arrays (hash tables), allowing for fast lookups whose average cost is independent of the number of data elements.

  • dicts are variable length, heterogeneous and arbitrarily nestable.

  • Dictionary key-value pairs are not stored by index. Since Python 3.7, dict elements are returned in insertion order; in earlier versions the order is not guaranteed.

  • Keys need not always be strings: any hashable object, which includes the built-in immutable types, can be used as a dict key (for example, a tuple of strings and numbers).
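The properties listed above can be seen in a small sketch. The policy_record dict and all of its contents are hypothetical and exist only to illustrate in-place mutation, mixed value types, nesting, and the requirement that keys be hashable:

```python
# A hypothetical record mixing value types and nesting levels.
policy_record = {
    'number' : '00001',
    'premium': 564.32,
    'perils' : ['BLD_FIRE', 'BLD_WATR'],
    'insured': {'name': 'Example Co.', 'state': 'VT'},
    ('00001', 1): 'value stored under a tuple key',
}

# Values can be replaced or modified in place.
policy_record['premium'] = 600.00
policy_record['perils'].append('BLD_WTHR')

# Nested dicts are reached by chaining key lookups.
state = policy_record['insured']['state']

# A list is mutable and unhashable, so it is rejected as a key.
try:
    policy_record[['00001', 1]] = 'value stored under a list key'
except TypeError as err:
    list_key_error = str(err)
    print("TypeError:", list_key_error)
```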

One way to initialize a dict is to list its key-value pairs, with each key separated from its value by a colon and the whole collection surrounded by {}:

>>> all_pols = {'00001':564.32, '00002':1123.09, '00003':3427.65, '00004':876.38}

dict elements are retrieved using the dictionary name along with the key enclosed in square brackets:

>>> all_pols['00001']
564.32
>>> all_pols['00002']
1123.09
>>> all_pols['00003']
3427.65
>>> all_pols['00004']
876.38

Iterating over a dict yields its keys. To iterate over a dict's key-value pairs, it is therefore only necessary to reference the dictionary name as the iterable in the for-loop and look up each value by its key:

>>> all_pols = {'00001':564.32, '00002':1123.09, '00003':3427.65, '00004':876.38}
>>> for item in all_pols: print(item, all_pols[item])
00001 564.32
00002 1123.09
00003 3427.65
00004 876.38
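An alternative that avoids the extra lookup inside the loop is the items method, which yields (key, value) tuples that can be unpacked in the for statement. A minimal sketch using the same all_pols dict follows; policy_number, premium and total_premium are illustrative names:

```python
all_pols = {'00001':564.32, '00002':1123.09, '00003':3427.65, '00004':876.38}

# items() yields (key, value) tuples, unpacked here into two names.
for policy_number, premium in all_pols.items():
    print(policy_number, premium)

# values() can be passed to any function that accepts an iterable.
total_premium = sum(all_pols.values())
print("Total premium: {:.2f}".format(total_premium))
```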

In Python 2, calling dict.keys() or dict.values() returned a list of the dict's keys or values. In Python 3, calling dict.keys() or dict.values() returns a dictionary view, which does not replicate the contents of the underlying dictionary, but still reflects immediately any changes to the underlying data structure. In addition, the dictionary view can be used as the iterable in any iteration scheme. If it is necessary to obtain an actual list of the dict's keys or values (or both), simply wrap the dictionary view in a call to list:

# create list from dict keys:
dict_keys_list = list(all_pols.keys())

# create list from dict values:
dict_values_list = list(all_pols.values())

# create a list of tuple pairs from dict key-values:
dict_pairs = list(all_pols.items())

# display each list. Note that order follows insertion order in
# Python 3.7+ and is arbitrary in earlier versions:
print("List of all_pols keys              : ", dict_keys_list)
print("List of all_pols values            : ", dict_values_list)
print("List of all_pols (key,value) tuples: ", dict_pairs)
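The difference between a live dictionary view and a list built from it can be seen by modifying the dict after both have been created. In this sketch the policy number '00005' and its premium are made up for illustration:

```python
all_pols = {'00001':564.32, '00002':1123.09}

keys_view     = all_pols.keys()       # live view onto the dict
keys_snapshot = list(all_pols.keys()) # independent copy taken now

# Add a (hypothetical) new policy after the view and the list exist.
all_pols['00005'] = 250.00

print('00005' in keys_view)      # the view sees the new key
print('00005' in keys_snapshot)  # the list does not
```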

The comprehension syntax can also be used to generate dicts. Key-value pairs are separated by a colon, with the initialization statement surrounded by {}. The dict comprehension is useful in situations where two sequences have a relationship, and referring to the sequence elements by name instead of index offset makes more sense:

# associate a list of state abbreviations with a list of state capitals
# using a dict comprehension:
states     = ['VT', 'TN', 'NH', 'VA']
capitals   = ['Montpelier','Nashville', 'Concord', 'Richmond']
state_dict = {i:j for i,j in zip(states, capitals)}
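When the comprehension does nothing more than pair up two sequences, passing the zip object straight to the dict constructor gives the same result. The comprehension form remains the better fit once keys or values need to be transformed or filtered. A short sketch, where same_dict and lower_dict are illustrative names:

```python
states   = ['VT', 'TN', 'NH', 'VA']
capitals = ['Montpelier', 'Nashville', 'Concord', 'Richmond']

# dict comprehension, as above.
state_dict = {i: j for i, j in zip(states, capitals)}

# dict() accepts any iterable of (key, value) pairs, including a zip object.
same_dict = dict(zip(states, capitals))
print(state_dict == same_dict)

# The comprehension earns its keep when a transformation is involved.
lower_dict = {i.lower(): j for i, j in zip(states, capitals)}
print(lower_dict['vt'])
```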

Similar to list comprehensions, dict comprehensions are commonly used for filtering or subsetting existing, larger dicts:

# depths is a dict of maximum depths in ft.:
depths = {'Indian':25938,'Atlantic':27490,'Pacific':35797,'Arctic':17880}

#subset depths to return a dict of oceans with max depth > 20,000 ft.:
depths_gt_20k = {i:j for (i,j) in depths.items() if j > 20000}
print('depths subset: ', depths_gt_20k)

#to extract only the ocean names discarding the depths, use a list comprehension:
oceans = [ocean for ocean in depths if depths[ocean] > 20000]
print('Oceans with max depth>20000: ', oceans)

dict values can be any valid Python object. For example, we can create a dict of lists of mathematicians by country of origin:

by_country = {
    'German' : ['Gauss', 'Riemann', 'Hilbert', 'Weierstrass', 'Cantor'],
    'French' : ['Pascal', 'Fermat', 'Lagrange', 'Cauchy'],
    'British': ['Newton', 'Hamilton', 'Hardy']   
    }

#obtain reference to first element of the `German` mathematician list:
first = by_country['German'][0]

#obtain reference to last element of the `German` mathematician list:
last = by_country['German'][-1]

#determine the length of each key's associated list:
for country in by_country:
    iterlen = len(by_country[country])
    print("Country: {} | List Length: {}".format(country, iterlen))

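Independent dicts can be combined into a single dict using update, which modifies the calling dict in place. If a key is present in both dicts, the value from the dict passed to update is kept: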

#average distance from sun in AU (1AU~93,000,000 miles)
inner_planets = {'Mercury':.387, 'Venus':.722, 'Earth' :1.000, 'Mars':1.520}
outer_planets = {'Jupiter':5.20, 'Saturn':9.58, 'Uranus':19.20, 'Neptune':30.10}

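# `update` adds outer_planets' pairs to inner_planets in place and returns None:
inner_planets.update(outer_planets)
planets = inner_planets   # `planets` is a second name for the same, now merged, dict
print(planets)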
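Since update changes the dict it is called on, inner_planets ends up holding all eight planets. If the two source dicts should stay untouched, one option (a sketch, not the only approach) is to build a new dict by unpacking both with **, which is available from Python 3.5 onward:

```python
inner_planets = {'Mercury':.387, 'Venus':.722, 'Earth':1.000, 'Mars':1.520}
outer_planets = {'Jupiter':5.20, 'Saturn':9.58, 'Uranus':19.20, 'Neptune':30.10}

# Build a new dict; neither input is modified. If a key appeared in
# both inputs, the value from the right-most dict would be kept.
planets = {**inner_planets, **outer_planets}

print(len(planets))         # all eight planets
print(len(inner_planets))   # still only the four inner planets
```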

Generally, if a key is looked up and no such key exists in the dict, a KeyError is raised. This can be avoided by using the dict.get(key[, default]) method, which returns default (None if omitted) when the requested key doesn't exist:

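# Behavior without using `dict.get`: a KeyError is raised, caught here
# so that the rest of the example still runs:
try:
    print(planets['Pluto'])
except KeyError as err:
    print("KeyError raised for missing key:", err)

# Behavior using `dict.get`: the default is returned instead of raising:
print(planets.get('Pluto', 'No longer a planet'))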

Any hashable datatype, which includes Python's built-in immutable types, can be used for a dictionary key. Tuples can be used when more than a single field is necessary to differentiate items in the input dataset. For example, if premium is calculated at the peril-level, differentiating by policy number alone is not sufficient. Instead, we can use a combination of policy number, location and peril in the form of a tuple as the key, with the associated peril-level premium as the value:

peril_level = {
    ('00001', 1, 'BLD_FIRE') : 122.31,
    ('00001', 1, 'CONT_FIRE'):  97.64,
    ('00001', 1, 'GL_PREMOP'): 147.77,
    ('00001', 1, 'BLD_WATR') :  73.19,
    ('00001', 1, 'BLD_WTHR') : 123.41,
    ('00002', 1, 'BLD_FIRE') : 432.52,
    ('00002', 1, 'CONT_FIRE'): 400.15,
    ('00002', 1, 'GL_PREMOP'): 100.01,
    ('00002', 1, 'BLD_WATR') :  63.84,
    ('00002', 1, 'BLD_WTHR') : 126.57,   
}

#print the premium associated with key ('00001',1,'BLD_FIRE'):
print(peril_level[('00001',1,'BLD_FIRE')])
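Tuple keys can also be unpacked while iterating, which makes it straightforward to roll peril-level values up to a coarser level. The sketch below sums premium by policy number using collections.defaultdict; the name policy_totals and the choice of defaultdict are illustrative, and a plain dict with get would work equally well:

```python
from collections import defaultdict

peril_level = {
    ('00001', 1, 'BLD_FIRE') : 122.31,
    ('00001', 1, 'CONT_FIRE'):  97.64,
    ('00001', 1, 'GL_PREMOP'): 147.77,
    ('00001', 1, 'BLD_WATR') :  73.19,
    ('00001', 1, 'BLD_WTHR') : 123.41,
    ('00002', 1, 'BLD_FIRE') : 432.52,
    ('00002', 1, 'CONT_FIRE'): 400.15,
    ('00002', 1, 'GL_PREMOP'): 100.01,
    ('00002', 1, 'BLD_WATR') :  63.84,
    ('00002', 1, 'BLD_WTHR') : 126.57,
}

# Missing keys start at 0.0, so totals can be accumulated directly.
policy_totals = defaultdict(float)

# Unpack the 3-tuple key and the premium in the for statement.
for (policy_number, location, peril), premium in peril_level.items():
    policy_totals[policy_number] += premium

for policy_number, total in policy_totals.items():
    print("Policy {}: {:.2f}".format(policy_number, total))
```

Rounded to two decimals, the totals come to 564.32 and 1123.09, which agree with the premiums stored for policies '00001' and '00002' in all_pols earlier.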

Here are some additional methods and operations available on Python's dict type:

planets = {
    'Mercury': .387, 
    'Venus'  : .722, 
    'Earth'  :1.000, 
    'Mars'   :1.520,
    'Jupiter': 5.20,
    'Saturn' : 9.58,
    'Uranus' :19.20,
    'Neptune':30.10
}

len(planets)              #returns `8`, the number of key-value pairs; same as `len(planets.keys())`
'Mars' in planets         #returns `True`
'Pluto' in planets        #returns `False`
planets.popitem()         #remove and return a key-value pair (the last one inserted in Python 3.7+)
planets.pop('Jupiter')    #remove `Jupiter` from planets and return its value
planets.clear()           #remove all key-value pairs from the dict
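Like get, the pop method accepts an optional default that is returned when the key is absent, which avoids a KeyError. The sketch below uses a shortened planets dict so the results are easy to follow; the names removed, absent and last are illustrative, and the popitem result assumes Python 3.7 or later, where the most recently inserted pair is removed:

```python
planets = {'Mercury': .387, 'Venus': .722, 'Earth': 1.000}

# pop returns the value stored under the removed key.
removed = planets.pop('Venus')

# With a default, popping a missing key returns the default instead of raising.
absent = planets.pop('Pluto', None)

# popitem removes and returns the last inserted pair (Python 3.7+).
last = planets.popitem()

print(removed, absent, last)
print(planets)   # only Mercury remains
```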