"Python High Performance" notes

"Python High Performance" - things I didn't know or forgot about.

I worked on Python web apps, wrote some Tornado code, crunched data in Pandas and PySpark. Hell, I even knitted something together in OpenCV and scikit-learn (university, I miss you). However, no task can ever give you a chance to practice each and every part of the language and package ecosystem. It's always good to sharpen your saw, so I decided to refresh my knowledge a bit and got my hands on the excellent "Python High Performance" by Gabriele Lanaro. I will share my notes with you; they come in three flavours:

  • snippets I want to keep,
  • things I didn't know,
  • things I forgot about and never want to forget again.

The format is very loose, so feel free to just skim through the topics. This is not intended to be a tutorial; these are only my notes. If something catches your eye or you spot a mistake/typo, please ping me on Twitter: @jacekbj.

Benchmarking

time command

>> time python simulation.py
real 0m1.051s
user 0m1.022s
sys 0m0.028s
  • Not available on Windows.
  • Many metrics are available (consult manual).
  • Multiple processors may work in parallel, so it's possible that real < user + sys.

timeit module

Runs the code in a loop n times and repeats the measurement for r loops. Measures the total execution time as well as the time of each repetition.

Can be used as:

  • Python package,
  • CLI tool,
  • IPython utility.
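
A minimal sketch of the three usage styles (the statement being timed is just an example):

import timeit

# as a Python package: run the statement 10000 times
timeit.timeit('sum(range(100))', number=10000)

# as a CLI tool:
#   python -m timeit -n 10000 -r 5 "sum(range(100))"

# as an IPython utility:
#   %timeit sum(range(100))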

pytest-benchmark

Simplifies writing benchmarks and offers an elegant way to embed them in test code. Produces nice output. Hide benchmarks behind a flag to avoid running them together with other tests.
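
A minimal sketch using the benchmark fixture (requires pip install pytest-benchmark; the test itself is a made-up example):

def test_sum(benchmark):
  result = benchmark(sum, range(1000))  # benchmark calls sum(range(1000)) repeatedly and records timings
  assert result == 499500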

cProfile

Performance tuning: analyze how long your program takes to execute and identify bottlenecks.

Can be used as:

  • Python package,
  • CLI tool,
  • IPython utility.
>> python -m cProfile -o prof.out simul.py

Use a GUI app to visualise the output (e.g. SnakeViz).
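
It can also be invoked directly from Python (a minimal sketch):

import cProfile

cProfile.run('sum(range(10**6))', 'prof.out')  # profile the statement, write stats to prof.out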

dis (disassemble) module

import dis

def my_func(x):  # any function object works here
  return x * 2

dis.dis(my_func)

Prints the list of bytecode instructions for each line of the function.

memory_profiler

This module summarizes the memory consumption of a process. Install psutil (an optional dependency) to speed up the profiler. Use the @profile decorator.
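
A minimal sketch (run the script normally; line-by-line memory stats are printed for decorated functions):

from memory_profiler import profile

@profile
def grow():
  data = [0] * 10**6  # allocate a large list
  return data

if __name__ == '__main__':
  grow()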

__slots__

Magic attribute which limits the attributes that can be assigned to a class instance. Decreases memory consumption by not storing instance variables in an internal dictionary.
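
A minimal sketch (Point is a made-up example class):

class Point:
  __slots__ = ('x', 'y')  # no per-instance __dict__; only these attributes allowed

  def __init__(self, x, y):
    self.x = x
    self.y = y

p = Point(1, 2)
# p.z = 3  # would raise AttributeError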

Pure Python Optimizations

Data Structures

List

Good at accessing, modifying and appending. Accessing and modifying are approx. O(1); appending is amortized O(1), since memory reallocation is triggered only once the list is full, which costs O(N).

Bad at adding/removing at the list's beginning: insertion or removal there requires shifting all other elements, O(N).

Deque

Double-ended queue, implemented as a doubly-linked list. Fast at add/remove at both ends: pop and append plus popleft and appendleft.

Accessing an element in the middle is O(N).
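
A minimal sketch:

from collections import deque

d = deque([1, 2, 3])
d.append(4)       # O(1) at the right end
d.appendleft(0)   # O(1) at the left end
d.pop()           # -> 4
d.popleft()       # -> 0
d[1]              # indexing works, but is O(N) towards the middle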

list.index() and bisect module

list.index() is O(N). bisect searches sorted collections in O(log(N)).
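
A minimal sketch on a sorted list:

import bisect

collection = [1, 3, 5, 7]
bisect.bisect_left(collection, 5)  # -> 2, found in O(log N)
bisect.insort(collection, 4)       # keeps the list sorted; O(N) overall due to shifting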

Dictionary

Hashmap, very good at insertion, deletion and access: O(1). In Python <3.6 dictionaries are unordered collections. From Python 3.6 on, dictionaries preserve insertion order (an implementation detail in 3.6, guaranteed since 3.7).

Set

Sets are hash-based: insertion, deletion and membership tests are O(1). Removing duplicates from a collection is O(N).

Heap and queue

Designed to quickly find the min/max in a collection. Typical task: process a series of incoming tasks respecting priority. Finding the min is O(1) (heapq implements a min-heap) and insertion is O(log(N)). (Maintaining a plain sorted list instead makes insertion O(N) because of shifting, even though finding the insertion point is O(log(N)).) Module: heapq.

queue.PriorityQueue is handy and thread-safe (for processes, use the multiprocessing queues instead).
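
A minimal sketch of both:

import heapq
from queue import PriorityQueue

values = [5, 1, 4]
heapq.heapify(values)      # O(N); the min is now at values[0]
heapq.heappush(values, 2)  # O(log N)
heapq.heappop(values)      # -> 1, O(log N)

q = PriorityQueue()
q.put((2, 'low priority'))
q.put((1, 'high priority'))
q.get()                    # -> (1, 'high priority')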

Trie

Prefix tree. Fast at matching a list of strings against a prefix. Useful for search-as-you-type and autocompletion. No standard module; use e.g. patricia-trie.

Caching and memoization

Caching - store expensive results in a temporary location.

Memoization - store and reuse the results of previous function calls.

LRU memoization

functools.lru_cache - simple, in-memory cache for results of a function:

from functools import lru_cache

@lru_cache(maxsize=16)
def mult(a, b):
  print(f'Calculating {a} x {b}')
  return a * b

LRU - least recently used; maxsize restricts the number of cached items. Use maxsize=None to make the cache unbounded.

joblib

Like lru_cache but on-disk.

from joblib import Memory

memory = Memory(location='/path/to/cachedir')  # older joblib versions used cachedir= instead of location=

@memory.cache
def mult(a, b):
  return a * b

Includes efficient handling of Numpy arrays.

Comprehensions and generators

Comprehensions are fairly optimized and should be preferred over for-loops. Use generators to decrease memory consumption.

Pass data to generator

It is possible to inject a value into a generator through the yield statement. Use the send method to insert data. A generator which can receive data is called a generator-based coroutine.

def repeater():
  while True:
    message = yield
    print(f"Did you say: {message}?")

generator = repeater()
generator.send(None)    # This is required to bring us to first yield.
generator.send("Hello")
generator.send("World")

Compilers

Numba - compiles small functions on the fly, directly into machine code. Developed to increase the performance of numerical code (mostly Numpy arrays). Can gain a lot of performance from type specializations (telling the compiler about data types).
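
A minimal sketch of type-specialized compilation (requires pip install numba):

import numpy as np
from numba import njit

@njit  # compiled to machine code on first call, specialized for the argument types
def total(arr):
  s = 0.0
  for x in arr:
    s += x
  return s

total(np.arange(1000.0))  # first call triggers compilation; later calls are fast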

PyPy - analyzes code at runtime and automatically compiles slow sections. Uses tracing JIT compilation: PyPy profiles the code and identifies the hottest loops. The compiler then traces (observes) the operations and compiles optimized, interpreter-free versions. Once the compiled, improved version is available, PyPy uses it instead of the interpreted code.

Numba and PyPy are both JIT compilers. Numba focuses on compiling functions and methods, while PyPy focuses on slow loops. Numba - for numerical code; PyPy - as a possible CPython replacement.

Async programming

Embarrassingly parallel problems are excellent candidates for efficient parallelization.

Async programs can deal with multiple unpredictable resources concurrently and efficiently.

Hollywood principle

Don't call us, we will call you.

The threading module can implement this with callbacks (e.g. threading.Timer takes a delay and a callback to invoke).

Futures

A future allows us to keep track of the requested resource. concurrent.futures.Future represents a value that is not yet available.

my_future.set_result() - "solves" the future

my_future.result() - returns the result

my_future.add_done_callback() - registers a callback
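
A minimal sketch tying these together (creating a Future by hand is normally done only by executors or in tests):

from concurrent.futures import Future

fut = Future()
fut.add_done_callback(lambda f: print(f.result()))  # register a callback
fut.set_result(42)  # "solves" the future; the callback fires and prints 42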

Event loops simplify dealing with shared variables, data structures and resources.

busy-waiting - continuously polling

asyncio.get_event_loop() - returns event loop

asyncio.ensure_future() - schedules a future or coroutine for execution; returns a Task instance if a coroutine is passed

asyncio.gather(*coroutine_list) - gathers results of all coroutines

loop.run_until_complete() - runs until future (or coroutine) is finished

loop.call_later() - takes a delay in seconds and callback

loop.run_forever() - starts the loop

loop.stop() - stops the loop
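
A minimal sketch of the loop API described above (this is the pre-3.7 style; newer code would use asyncio.run):

import asyncio

async def greet():
  await asyncio.sleep(0.1)
  return "hello"

loop = asyncio.get_event_loop()
task = asyncio.ensure_future(greet())  # returns a Task wrapping the coroutine
print(loop.run_until_complete(asyncio.gather(task)))  # -> ['hello']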

Coroutines

Write async code so that it resembles sync code. Coroutine - mental model: a function that can be stopped and resumed.

Coroutines can be implemented in asyncio using yield statements. From Python 3.5 on, the await keyword is available.

async def hello_world():
  print("Hello, async!")

coro = hello_world()  # This is a coroutine, not function execution!

native coroutine - a coroutine defined with the async def statement
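
A minimal sketch of awaiting one coroutine from another (network_request is a made-up stand-in for real I/O):

import asyncio

async def network_request():
  await asyncio.sleep(1)  # non-blocking wait; the loop can run other tasks meanwhile
  return {"success": True}

async def fetch():
  response = await network_request()
  print(response)

asyncio.get_event_loop().run_until_complete(fetch())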

Convert blocking code into non-blocking code

Convert blocking tasks into futures.

import time

from concurrent.futures import ThreadPoolExecutor
# from concurrent.futures import ProcessPoolExecutor - use for CPU-intensive tasks

executor = ThreadPoolExecutor(max_workers=3)

def wait_and_return(msg):
  time.sleep(1)
  return msg

executor.submit(wait_and_return, "Executor")  # runs immediately, returns a Future
# or, given an asyncio loop (loop = asyncio.get_event_loop()):
loop.run_in_executor(executor, wait_and_return, "Asyncio executor")  # runs once the loop starts

# Result:
# <Future at 0x7ff616ff6748 state=running>

Reactive programming

Manifesto:

  • Responsive: The system responds immediately to the user.
  • Elastic: The system is capable of handling different levels of load and is able to adapt to accommodate increasing demands.
  • Resilient: The system deals with failure gracefully. This is achieved by modularity and avoiding having a single point of failure.
  • Message driven: The system should not block and take advantage of events and messages. A message-driven application helps achieve all the previous requirements.

RxPY - Reactive Extensions (Rx, of RxJS fame) for Python.

A property of a cold observable is that if we attach two subscribers, the source will be triggered twice.

The publish() method transforms an Observable into a ConnectableObservable, which pushes data only after we call its connect() method.

When data is produced independently of subscribers, it's a hot observable.
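
A hedged sketch using the RxPY 3.x API (the book used the older 1.x API, where publish() was a method on Observable):

import rx
from rx import operators as ops

cold = rx.of(1, 2, 3)           # cold: each subscription re-triggers the source
cold.subscribe(print)
cold.subscribe(print)           # the source runs a second time

hot = cold.pipe(ops.publish())  # ConnectableObservable
hot.subscribe(lambda x: print('A', x))
hot.subscribe(lambda x: print('B', x))
hot.connect()                   # data starts flowing only now, shared by both subscribers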

Parallel programming

Communication between processes is costly. Two ways to handle data communication in parallel programs:

  • shared memory,
  • distributed memory.

Two execution models:

  • multiple execution contexts and a shared memory space - threads,
  • multiple execution contexts and separate memory spaces - processes.

Python can spawn and handle threads, but they can't be used to increase the performance of CPU-bound code; due to the Python interpreter design, only one Python instruction is allowed to run at a time - this mechanism is called the Global Interpreter Lock (GIL). What happens is that each time a thread executes a Python statement, the thread acquires a lock and, when the execution is completed, the same lock is released. Since the lock can be acquired only by one thread at a time, other threads are prevented from executing Python statements while some other thread holds the lock.

Even though the GIL prevents parallel execution of Python instructions, threads can still be used to provide concurrency in situations where the lock can be released, such as in time-consuming I/O operations or in C extensions.

The GIL can be completely sidestepped using processes instead of threads. Processes don't share the same memory area and are independent from each other - each process has its own interpreter. Processes have a few disadvantages: starting up a new process is generally slower than starting a new thread, they consume more memory, and inter-process communication can be slow. On the other hand, processes are still very flexible, and they scale better as they can be distributed on multiple machines.

Really cool explanation of GIL: https://realpython.com/python-gil/

locking

multiprocessing.Lock - a lock can be acquired and released through the acquire and release methods, or by using the lock as a context manager. Since the lock can be acquired by only one process at a time, this method prevents multiple processes from executing the protected section of code at the same time.

import multiprocessing

lock = multiprocessing.Lock()

class Process(multiprocessing.Process):
  def __init__(self, counter):
    super(Process, self).__init__()
    self.counter = counter  # e.g. a multiprocessing.Value('i', 0)

  def run(self):
    for i in range(1000):
      with lock:  # acquire the lock
        self.counter.value += 1
      # the lock is released when the with block exits

Cython and OpenMP

Cython provides a convenient interface to perform shared-memory parallel processing through OpenMP. This lets you write extremely efficient parallel code directly in Cython without having to create a C wrapper.

Topics I skipped

  • Numerical computing
  • Improving performance with Cython
  • Exploring compilers in depth
  • Reactive programming
  • Distributed processing (Dask, PySpark, mpi4py)
  • Designing for high performance (organizing source code, environments, containerization, CI)