benhoyt/python-stdlib.md

## python-stdlib.md

      
    I'm going to demo a bunch of Python builtin and stdlib functions. There's a lot to get through, so I'll be going fast, but please stop me and ask questions as we go. The goal is to give you a taste of Python's power and expressivity if you're not a Python person, or maybe teach you a few new tricks if you are already.
Built-in functions

# enumerate: iterate with index *and* item
>>> strings = ['123', '0', 'x']
>>> for i, s in enumerate(strings):
...     print(f'{i} - {s}')  # f-strings!
...
0 - 123
1 - 0
2 - x
>>> [i*s for i, s in enumerate('abc', start=1)]
['a', 'bb', 'ccc']

# all, any: are all (or any) of these items true?
>>> all(s.isdigit() for s in strings)  # True if all are integers
False
>>> import threading                   # True if any thread running
>>> any(t.is_alive() for t in threading.enumerate())
True                                   # note: generator expressions

# min, max, round, sum: math!
>>> sides = [1.5, 3, 1.5, 3]
>>> sum(sides)                # perimeter
9.0
>>> sum(n*n for n in sides)   # sum of squares of sides
22.5
>>> sum(sides) / len(sides)   # average length of a side (/ vs //)
2.25
>>> min(sides)
1.5
>>> max(sides)
3
>>> min(42, -4, 9)
-4
>>> letters = ['aaa', 'b', 'cc']
>>> min(letters)
'aaa'
>>> min(letters, key=len)
'b'
>>> min([], default=-1)
-1
>>> round(12.34)
12
>>> round(12.89)
13
>>> round(12.3456789, ndigits=2)
12.35
>>> round(12.34, ndigits=-1)
10.0

# sorted
# ASIDE about TimSort
>>> ''.join(sorted('qwerty'))  # sorts an iterable, returns a list
'eqrtwy'
>>> d = {"foo": 2, "the": 3, "goo": 1}
>>> d.items()  # dictionaries iterate by insertion order in Python 3.7+
dict_items([('foo', 2), ('the', 3), ('goo', 1)])
>>> for k, v in sorted(d.items()):  # sort by item tuple, key then value
...     print(k, v)
foo 2
goo 1
the 3
>>> sorted(d.items(), key=lambda x: x[1])  # sort by value (only)
[('goo', 1), ('foo', 2), ('the', 3)]
>>> d["too"] = 2
>>> sorted(d.items(), key=lambda x: (x[1], x[0]), reverse=True)  # by value then key, reversed
[('the', 3), ('too', 2), ('foo', 2), ('goo', 1)]
>>> sorted(d, key=d.get)  # keys only, sorted by value
['goo', 'foo', 'too', 'the']
>>> lst = [89, 0, 42]
>>> lst.sort()  # same as sorted but in-place (if you already have a list)
>>> lst
[0, 42, 89]
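
One practical consequence of the TimSort aside: Python's sort is stable, so items that compare equal keep their original relative order. Here is a minimal sketch of a two-pass sort that relies on this. The records and variable names are made up for illustration.

```python
# Hypothetical (name, team) records, in arbitrary order.
records = [('dave', 'b'), ('alice', 'a'), ('carol', 'b'), ('bob', 'a')]

# Pass 1: sort by the secondary key (name, ascending).
by_name = sorted(records, key=lambda r: r[0])

# Pass 2: sort by the primary key (team, descending). Stability means
# names stay alphabetical within each team, even with reverse=True.
by_team_desc = sorted(by_name, key=lambda r: r[1], reverse=True)

print(by_team_desc)
```

This is handy when the keys need different directions, which a single tuple key can't express for strings.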

# open: opening files
>>> f = open("test.txt", encoding="utf-8")  # open file in text mode
>>> for line in f:
...     print(line.strip().upper())
“THE QUICK BROWN FOX
JUMPS OVER THE LAZY DOG.”
>>> with open("test.txt", "rb") as f:  # open in binary mode (context mgr)
...     print(f.read())
b'\xe2\x80\x9cThe quick brown fox\njumps over the lazy dog.\xe2\x80\x9d'

# filter, map (use list comprehensions instead)
>>> strings = ["foo", "x", "bar", ""]
>>> list(filter(lambda s: len(s) == 3, strings))
['foo', 'bar']
>>> [s for s in strings if len(s) == 3]
['foo', 'bar']

>>> list(map(lambda s: s.upper(), strings))
['FOO', 'X', 'BAR', '']
>>> list(map(str.upper, strings))
['FOO', 'X', 'BAR', '']
>>> [s.upper() for s in strings]
['FOO', 'X', 'BAR', '']
>>> [s.upper() for s in strings if len(s) == 3]
['FOO', 'BAR']
Collections

# dict
>>> counts = {'the': 3, 'foo': 2, 'barts': 1}
>>> counts['foo']
2
>>> counts['baz']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'baz'
>>> counts.get('baz')
>>> counts.get('baz', 0)
0
>>> 'barts' in counts
True

# set
>>> set('benjamin') - set('jambing!')
{'e'}
>>> set('jambing!') - set('benjamin')
{'g', '!'}
>>> set('benjamin') & set('jambing!')
{'j', 'n', 'i', 'a', 'm', 'b'}
>>> set('benjamin') | set('jambing!')
{'j', 'g', 'n', 'i', 'a', '!', 'm', 'b', 'e'}
>>> set('benjamin') ^ set('jambing!')
{'g', '!', 'e'}

# collections.Counter
>>> from collections import Counter
>>> c = Counter('the foo barts foo the the'.split())
>>> c
Counter({'the': 3, 'foo': 2, 'barts': 1})
>>> c['nothing']
0
>>> for k, v in c.most_common():
...     print(k, v)
the 3
foo 2
barts 1
>>> c.most_common(2)
[('the', 3), ('foo', 2)]

# collections.defaultdict
>>> from collections import defaultdict
>>> index = defaultdict(set)  # if key missing, call set() to create one
>>> for i, word in enumerate("the foo barts foo the the".split()):
...     index[word].add(i)
>>> index
defaultdict(set, {'the': {0, 4, 5}, 'foo': {1, 3}, 'barts': {2}})
>>> index['baz']
set()
>>> 4 in index['the']
True

# collections.namedtuple
>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y', 'z'])
>>> p = Point(3, 4, 5)
>>> p
Point(x=3, y=4, z=5)
>>> p[1]
4
>>> p._replace(y=6)
Point(x=3, y=6, z=5)
>>> class Point:  # if you didn't have namedtuple
...     def __init__(self, x, y, z):
...         self.x = x
...         self.y = y
...         self.z = z
...     def __str__(self):
...         return f"Point(x={self.x}, y={self.y}, z={self.z})"

# collections.OrderedDict
# Ordered by insertion order (not a "sorted dict"). Python dicts were
# unordered until Python 3.7, when the language guaranteed that dicts
# preserve insertion order. So if you're on Python 3.7+ you rarely need
# OrderedDict -- just use dict!
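
A small sketch of that claim, with one hedge: OrderedDict still has two behaviours a plain dict lacks (order-sensitive equality and `move_to_end`), so it's worth checking for those before swapping it out. The keys reuse the earlier example.

```python
from collections import OrderedDict

d = {}
d['foo'] = 2
d['the'] = 3
d['goo'] = 1
keys_in_order = list(d)   # insertion order, guaranteed since Python 3.7

# Plain dicts ignore order when comparing; OrderedDicts do not.
plain_equal = {'a': 1, 'b': 2} == {'b': 2, 'a': 1}
ordered_equal = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)

# OrderedDict can also move an existing key to the end.
od = OrderedDict(d)
od.move_to_end('foo')
moved = list(od)

print(keys_in_order, plain_equal, ordered_equal, moved)
```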

# bisect
>>> import bisect
>>> lst = ['a', 'c', 'e', 'g']  # must be sorted already
>>> bisect.bisect(lst, 'b')
1
>>> bisect.bisect(lst, 'c')
2
>>> bisect.bisect_left(lst, 'c')
1
>>> bisect.bisect_left(lst, 'A')
0
# bisect.insort() -- but note that insertion is O(N)
# see also: http://www.grantjenks.com/docs/sortedcontainers/
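
A quick sketch of `bisect.insort()` from the note above, plus one common use of `bisect_left` as a membership test on sorted data. The `contains` helper and the sample scores are hypothetical, not part of the stdlib.

```python
import bisect

scores = [10, 30, 50]        # must already be sorted
bisect.insort(scores, 40)    # binary search to find the slot, then an O(N) insert
bisect.insort(scores, 5)

def contains(sorted_list, item):
    """Return True if item is in sorted_list, using binary search."""
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

print(scores, contains(scores, 40), contains(scores, 41))
```
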
Text

# re
>>> import re
>>> text = 'Processed 42 files in 5 seconds.'
>>> m = re.match(r'Processed (\d+) file', text)
>>> int(m.group(1))  # note 1-based numbering; m.group(0) is entire match
42
>>> re.findall(r'\d+', text)
['42', '5']
>>> re.sub(r'\d+', 'N', text)
'Processed N files in N seconds.'

# difflib
>>> import difflib
>>> old = ['foo', 'bar', 'baz']
>>> new = ['bar', 'baz', 'buzz']
>>> for line in difflib.unified_diff(old, new, lineterm=''):
...     print(line)

>>> d = difflib.HtmlDiff()
>>> html = d.make_file(old, new)
>>> with open('diff.html', 'w') as f:
...     f.write(html)

>>> m = difflib.SequenceMatcher(None, old, new)
>>> m.ratio()
0.6666666666666666

# json
>>> import json
>>> json.dumps({"a": 1, "b": 2})
'{"a": 1, "b": 2}'
>>> print(json.dumps({"b": 2, "a": 1}, sort_keys=True, indent=4))
{
    "a": 1,
    "b": 2
}
>>> json.loads("3.14159265358979323846264338")
3.141592653589793
>>> import decimal
>>> json.loads("3.14159265358979323846264338", parse_float=decimal.Decimal)
Decimal('3.14159265358979323846264338')

# csv
>>> import csv
>>> for row in csv.DictReader(open('test.csv')):  # csv.reader() if you don't have headers
...     print(row['email'])
ben.hoyt@compass.com
foo.bar@compass.com
# use csv.writer or csv.DictWriter() to write CSVs
# also TSV parsing
# why not use str.split(',')? handles quoting, multi-line values
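
To make the `str.split(',')` point concrete, here is a small sketch with made-up data: one quoted field contains a comma and another contains a newline. It reads from an in-memory `io.StringIO` rather than a file so that it's self-contained.

```python
import csv
import io

# Hypothetical CSV text: a quoted comma and a quoted multi-line value.
data = 'name,note\n"Hoyt, Ben","line one\nline two"\n'

# Naive approach: split lines, then split on commas. Both quirks break it.
naive = data.splitlines()[1].split(',')

# csv.reader understands the quoting rules.
rows = list(csv.reader(io.StringIO(data)))

# TSV is the same reader with a different delimiter.
tsv_rows = list(csv.reader(io.StringIO('a\tb\n1\t2\n'), delimiter='\t'))

print(naive)
print(rows)
```

The naive split produces three broken fragments for the second record, while `csv.reader` returns the two intended fields.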

# hashlib
>>> import hashlib
>>> password = 'password123'
>>> hashlib.sha1(password.encode('utf-8')).hexdigest()
'cbfdac6008f9cab4083784cbd1874f76618d2a97'

# zlib
>>> import zlib
>>> zlib.compress(b'abc'*100)
b'x\x9cKLJN\x1cE\xc4!\x00\x88Mr\xd9'
>>> len(_)
15
>>> next(n for n in range(100) if len(zlib.compress(b'a'*n)) < n)
12
# also: gzip, zipfile, tarfile
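
For completeness, a minimal round-trip sketch: `zlib.decompress` undoes `zlib.compress`, and the `gzip` module offers the same one-shot helpers for the gzip container format. The sample data is the same `b'abc'*100` used above.

```python
import gzip
import zlib

data = b'abc' * 100

packed = zlib.compress(data)
restored = zlib.decompress(packed)

# gzip uses the same DEFLATE compression but adds its own header and
# trailer, so the result is readable by the gzip command-line tool.
gz = gzip.compress(data)
gz_restored = gzip.decompress(gz)

print(len(data), len(packed), len(gz))
```
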
Date and time

# time
>>> import time
>>> start = time.time()
>>> time.time() - start
5.419049978256226
>>> time.sleep(1.5)
>>>

# datetime.date
>>> from datetime import date, datetime, timedelta, timezone
>>> date.today()
datetime.date(2019, 10, 14)
>>> date.today() + timedelta(days=7)
datetime.date(2019, 10, 21)
>>> date(2019, 12, 25) - date.today()
datetime.timedelta(days=72)

# datetime.datetime
>>> datetime.now()
datetime.datetime(2019, 10, 14, 17, 52, 0, 92412)
>>> _.isoformat()
'2019-10-14T17:52:00.092412'
>>> datetime.utcnow()  # naive UTC; deprecated since Python 3.12, prefer datetime.now(tz=timezone.utc)
datetime.datetime(2019, 10, 14, 21, 52, 10, 964842)
>>> _.strftime('%b %d, %Y at %H:%M:%S')
'Oct 14, 2019 at 21:52:10'
>>> datetime.now(tz=timezone.utc)
datetime.datetime(2019, 10, 14, 21, 52, 18, 890299, tzinfo=datetime.timezone.utc)
>>> datetime.strptime('Jan 1, 2020', '%b %d, %Y')
datetime.datetime(2020, 1, 1, 0, 0)
Math

# math: all the usual suspects
>>> import math
>>> math.sqrt(8)
2.8284271247461903
>>> math.cos(math.pi)
-1.0

# decimal: accurate decimal math, eg: for money
>>> from decimal import Decimal
>>> sum(Decimal('0.1') for i in range(1000))
Decimal('100.0')
>>> sum(0.1 for i in range(1000))
99.9999999999986
>>> Decimal(0.1)  # be careful converting from floats!
Decimal('0.1000000000000000055511151231257827021181583404541015625')

# fractions
>>> from fractions import Fraction
>>> Fraction(1, 2) * Fraction(1, 3)
Fraction(1, 6)
>>> sum(Fraction(1, 10) for i in range(1000))
Fraction(100, 1)
>>> Fraction(0.1)
Fraction(3602879701896397, 36028797018963968)

# random
>>> import random
>>> random.seed(0)
>>> random.random()
0.8444218515250481
>>> [random.randrange(10) for i in range(10)]
[6, 0, 4, 8, 7, 6, 4, 7, 5, 9]
>>> lst = ['foo', 'bar', 'baz']
>>> random.choice(lst)  # cf: lst[random.randrange(len(lst))]
'foo'
>>> random.choice(lst)
'baz'
>>> random.shuffle(lst)  # note no return; hard to get right yourself
>>> lst
['bar', 'foo', 'baz']
>>> random.sample(lst, 2)
['baz', 'foo']
>>> random.sample(lst, 2)
['foo', 'bar']
Functional tools

# itertools
>>> import itertools
>>> for k, v in itertools.groupby('aaabbc'):
...    print(k, list(v))
a ['a', 'a', 'a']
b ['b', 'b']
c ['c']
>>> for k, v in itertools.groupby('AaaBbC', key=str.upper):
...    print(k, list(v))
A ['A', 'a', 'a']
B ['B', 'b']
C ['C']
>>> for p in itertools.permutations('abc'):
...     print(''.join(p))
abc
acb
bac
bca
cab
cba
# and many more; if you need to iterate a lot, try them

# functools
>>> import functools
>>> def fib(n):
...     if n < 2:
...         return n
...     return fib(n-1) + fib(n-2)
>>> fib(35)  # takes about 5s
9227465
>>> fib = functools.lru_cache(maxsize=None)(fib)
>>> fib(35)  # almost instant
9227465
# usual to use as decorator: @lru_cache(maxsize=1000)
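
The decorator form mentioned above, as a minimal sketch. It uses `maxsize=None` (an unbounded cache) to match the earlier call; `cache_info()` is the stdlib way to see how the cache is doing.

```python
import functools

@functools.lru_cache(maxsize=None)   # unbounded; pass a number to cap the cache
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

result = fib(35)
info = fib.cache_info()   # named tuple: hits, misses, maxsize, currsize

print(result, info)
```

Each distinct argument is computed once, so `fib(35)` needs only 36 real calls (for n = 0 through 35) instead of millions.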

# reduce
>>> functools.reduce(lambda x, y: x*y, range(1, 5))
24
# but Python isn't geared towards functional programming; just write the loop
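
For comparison, here is the same product written three ways: the plain loop suggested above, `reduce` with `operator.mul` instead of a lambda, and `math.prod`, which assumes Python 3.8 or later.

```python
import functools
import math
import operator

numbers = range(1, 5)

# The plain loop: nothing to look up, easy to step through.
product = 1
for n in numbers:
    product *= n

via_reduce = functools.reduce(operator.mul, numbers)
via_prod = math.prod(numbers)   # Python 3.8+

print(product, via_reduce, via_prod)
```
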
Operating system and paths

# os: a ton of OS calls (low-level file operations, etc)
>>> import os
>>> os.makedirs('foo/bar')
>>> os.makedirs('foo/bar', exist_ok=True)
>>> os.removedirs('foo/bar')
>>> open('foo.txt', 'w').close()
>>> os.remove('foo.txt')

# shutil: copy, move, chown
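
A minimal sketch of `shutil.copy` and `shutil.move`, run inside a temporary directory so it cleans up after itself. The file names are made up, and `tempfile` is only there to keep the example self-contained.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'a.txt')
    with open(src, 'w') as f:
        f.write('hello')

    # copy() returns the path of the new file
    copied = shutil.copy(src, os.path.join(tmp, 'b.txt'))

    # move() renames (or copies then deletes, across filesystems)
    moved = shutil.move(copied, os.path.join(tmp, 'c.txt'))

    names = sorted(os.listdir(tmp))
    with open(moved) as f:
        content = f.read()

print(names, content)
```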

# scandir
>>> ee = list(os.scandir())
>>> ee
[<DirEntry 'venv'>, <DirEntry 'diff.html'>]
>>> ee[0].name
'venv'
>>> ee[0].path
'./venv'
>>> ee[0].is_dir()  # usually free: no extra system call needed
True
>>> ee[0].stat()  # free on Windows; costs a system call on Linux/macOS
os.stat_result(st_mode=16877, st_ino=1928698, st_dev=16777220, st_nlink=7, st_uid=502, st_gid=20, st_size=224, st_atime=1571177199, st_mtime=1570768467, st_ctime=1570768467)
>>> def get_size(path):
...     size = 0
...     for e in os.scandir(path):
...         if e.is_dir():
...             size += get_size(e.path)
...         else:
...             size += e.stat().st_size
...     return size
>>> get_size('.')
27851100

# walk
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         print(os.path.join(root, file))
OUTPUT ...
# walk is great! and faster now that it uses os.scandir

# sys
>>> import sys
>>> sys.argv
['/Users/ben.hoyt/development/python-stdlib/venv/bin/ipython']
>>> sys.exit(1)
SystemExit: 1
>>> sys.platform
'darwin'
Virtual environments

# venv: now built into Python 3
$ python3 -m venv new_env
$ source new_env/bin/activate
$ pip install requests
$ pip freeze
$ deactivate
Threads and processes

# threading
>>> import threading
>>> import time
>>> def run():
...     time.sleep(0.5)
...     print(threading.current_thread().name)
>>> ts = [threading.Thread(target=run) for i in range(100)]
>>> for t in ts:
...     t.start()
OUTPUT ...
# can even tell print() is printing the string separately from the newline!

# NOTE: there's a lot of Fear, Uncertainty, and Doubt about Python threads.
# People say Python can't do threads because of the Global Interpreter Lock.
# It's actually a lot more nuanced than that. Python handles threads just
# fine, it just can't execute Python bytecode on multiple threads at once --
# that's what the GIL prevents. But it turns out that most of the stuff you
# want to do on threads is not execute Python bytecode, it's waiting for I/O
# like in a web server, or calling C libraries to do some number crunching
# or image processing. Python releases the GIL when waiting for I/O and when
# calling a C library.
#
# So I'm not saying the GIL is a non-issue, but it's much less of an issue
# than you might think. The GIL is an issue when you're doing CPU-bound work
# in Python itself.
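
A rough sketch of the I/O-bound case described above. It uses `time.sleep` as a stand-in for waiting on a socket or file (an assumption for illustration; `sleep` releases the GIL in the same way). Ten threads each wait 0.2 seconds, and the whole batch finishes in roughly 0.2 seconds rather than 2.

```python
import threading
import time

def wait_for_io():
    time.sleep(0.2)   # stand-in for a network or disk wait; releases the GIL

start = time.perf_counter()
threads = [threading.Thread(target=wait_for_io) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f'{elapsed:.2f}s for 10 threads')
```

Swap the sleep for a tight pure-Python loop and the threads would take turns on the GIL, so the total time would look much closer to the serial sum.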

# multiprocessing
>>> import multiprocessing
>>> def square(n):
...     return n*n
>>> with multiprocessing.Pool(5) as p:
...     print(p.map(square, range(100)))
>>> with multiprocessing.pool.ThreadPool(5) as p:
...     print(p.map(square, range(100)))
# a ton going on under the hood for you!

# subprocess
>>> import subprocess
>>> subprocess.run(['aws', 's3', 'ls'])
OUTPUT ...
>>> _.returncode
0
>>> subprocess.run(['ls', 'foo'], check=True)
CalledProcessError ...
>>> subprocess.run(['ls'], capture_output=True, encoding='utf-8')
CompletedProcess(args=['ls'], returncode=0, stdout='diff.html\nvenv\n', stderr='')
>>> _.stdout.splitlines()
['diff.html', 'venv']
# in reality, you'd call os.listdir() or os.scandir(), but this is powerful when you need it
Internet

# urllib.request - NOTE: better to use requests (even mentioned in stdlib docs!)
>>> from urllib.request import urlopen
>>> urlopen('https://httpbin.org/json')
<http.client.HTTPResponse at 0x1072151d0>
>>> print(_.read().decode('utf-8'))
OUTPUT ...

# http.server
$ python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
Programs

# argparse
>>> import argparse
>>> parser = argparse.ArgumentParser()
>>> parser.add_argument('--format', choices=['json', 'xml'], default='json')
>>> args = parser.parse_args()
>>> args.format
'json'
>>> args = parser.parse_args(['--format', 'xml'])
>>> args.format
'xml'
>>> args = parser.parse_args(['--format', 'csv'])
SystemExit: 2
>>> print(parser.format_help())
OUTPUT ...
# very powerful: sub-commands, positional args, different arg types
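
A small sketch of those three features together. The program name `tool`, the `convert` sub-command, and its arguments are all hypothetical; `required=True` on `add_subparsers` assumes Python 3.7 or later.

```python
import argparse

parser = argparse.ArgumentParser(prog='tool')   # made-up program name
subparsers = parser.add_subparsers(dest='command', required=True)

# Sub-command with a positional arg, a choices option, and a typed option.
convert = subparsers.add_parser('convert', help='convert a file')
convert.add_argument('path')
convert.add_argument('--format', choices=['json', 'xml'], default='json')
convert.add_argument('--indent', type=int, default=2)

# Pass a list explicitly so the example doesn't depend on sys.argv.
args = parser.parse_args(['convert', 'data.txt', '--indent', '4'])

print(args)
```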

# logging
>>> import logging
>>> logging.error('Done in %g seconds', 123.4)
ERROR:root:Done in 123.4 seconds
>>> try:
...     asdf
... except NameError:
...     logging.exception('No find name')
ERROR:root:No find name
Traceback (most recent call last):
  File "<ipython-input-14-05b577b8ebd8>", line 1, in <module>
    try: asdf
NameError: name 'asdf' is not defined
# again, very powerful: Loggers, Formatters, LogRecords, full control
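
A minimal sketch of a named Logger with its own Handler and Formatter. The logger name `demo.app` and the format string are made up, and output goes to an in-memory `io.StringIO` only so the result is easy to inspect.

```python
import io
import logging

stream = io.StringIO()                        # capture output for the demo
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s %(name)s: %(message)s'))

log = logging.getLogger('demo.app')           # hypothetical logger name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False                         # keep the root logger out of it

log.debug('not shown')                        # below INFO, so filtered out
log.info('Done in %g seconds', 123.4)

output = stream.getvalue()
print(output, end='')
```

In a real program you'd typically attach handlers once at startup and call `logging.getLogger(__name__)` in each module.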

# fileinput
>>> import fileinput
>>> for line in fileinput.input():
...     print(line.upper())
asdf
ASDF

foo bar
FOO BAR
# handles: myscript.py foo.txt *.log (and *.log.gz if you pass openhook=fileinput.hook_compressed)
Other


Testing: unittest, unittest.mock (including unittest.mock.patch)
Database: sqlite3
Text: textwrap, unicodedata, io, base64, mimetypes, gettext
Compression: zlib, gzip, zipfile
OS: tempfile, glob
Internet: email, smtplib
Static typing: typing (plus mypy, a third-party type checker)