I worked on Python web apps, wrote some Tornado code, crunched data in Pandas and PySpark. Hell, I even knitted something together in OpenCV and scikit-learn (university, I miss you). However, no task can ever give you a chance to practice each and every part of language and package ecosystem. It's always good to sharpen your saw, so I decided to refresh my knowledge a bit and get my hands on excellent "Python High Performance" by Gabriele Lanaro. I will share with you my notes, which consist of 3 types of notes:
- snippets I want to keep,
- things I didn't know,
- things I forgot about and I never want to again.
The format is very loose so feel free to just skim through the topics. This is not intended to be a tutorial, these are only my notes. If something catches your eye or you spot a mistake/typo, please ping me at Twitter @jacekbj.
>> time python simulation.py
real 0m1.051s
user 0m1.022s
sys 0m0.028s
- Not available on Windows.
- Many metrics are available (consult manual).
- Multiple processors may work in parallel, so it's possible that real < user + sys.
Runs code in a loop n
times, runs r
loops.
Measures total execution time, as well as subtimes.
Can be used as:
- Python package,
- CLI tool,
- IPython utility.
Simplifies writing benchmarks. An elegant way to embed them in code. Produces nice output. Hide it behind a flag to avoid execution with other tests.
Performance tuning. Analyze how long your program executes and identify bottlenecks.
Can be used as:
- Python package,
- CLI tool,
- IPython utility.
>> python -m cProfile -o prof.out simul.py
Use GUI app to to visualise (i.e. SnakeViz).
import dis
dis.dis(my_func)
Prints list of bytecode instructions for each line in function.
This module summarizes memory consumption of a process.
Install psutil
(optional dependency) to speed up the profiler.
Use @profile
decorator.
Magic attribute which limits attributes that can be assigned to class instance. Decreases memory consumption by not storing the variables of an instance in an internal dictionary.
Good at accessing, modifying and appending. Accessing and modifying - approx O(1). Memory allocation triggered once list is full O(N).
Bad at add/remove at lists beginning. Insertion or removal requires shifting all other elements O(N).
Double-ended queue.
Doubly-linked list.
Fast at add/remove at both ends.
pop
and append
+ popleft
and appendleft
.
Accessing element is O(N).
list.index()
is O(N).
bisect
searches order collections with O(log(N)).
Hashmap, very good at insertion, deletion and access O(1). In Python <3.6 dictionaries are unordered collections. In Python 3.6 dictionaries store the order of insertion.
Set are hash-based O(1). Removing duplicates is O(N).
Designed to quickly find the min/max in a collection.
Typical tasks: process series of incoming tasks respecting priority.
Finding the min/max is O(1).
Insertion is O(N) because of shifting (even though finding the element is O(log(N)).
Module: heapq
.
queue.PriorityQueue
is handy and thread- and process-safe.
Prefix tree.
Fast at matching a list of strings againsy a prefix.
Useful for search-as-you-type and autocompletion.
No standard module, use i.e. patricia-tree
.
Caching - store expensive results in temporary location.
Memoization - store and reuse results of previous function call.
functools.lru_cache
- simple, in-memory cache for results of a function:
from functools import lru_cache
@lru_cache(max_size=16)
def mult(a, b):
print(f'Calculating {a} x {b}')
return a * b
LRU - least recently used, max_size
restricts the number of cache items.
Use None
to make the cache unbounded.
Like lru_cache
but on-disk.
from joblib import Memory
memory = Memory(cachedir='/path/to/cachedir')
@memory.cache
def mult(a, b):
return a * b
Includes efficient handling of Numpy arrays.
Comprehesions are fairly optimized and should be preferred instead of for-loops. Use generators to decrease memory consumption.
It is possible to inject value into a generator through the yield
statement.
Use send
method to insert data.
Generator which can receive data is called generator-based coroutine.
def repeater():
while True:
message = yield
print(f"Did you say: {message}?")
generator = repeater()
generator.send(None) # This is required to bring us to first yield.
generator.send("Hello")
generator.send("World")
Numba - compiles small functions on the go. Compiles functions directly into machine code. Developed to increase performance of numerical code (mostly Numpy arrays). Can gain a lot of performance from type specializations (telling interpreter about data types).
Pypy - analyzes code at runtime and automatically compiles slow sections. Uses tracing JIT compilation - PyPy profiles code and identifies the most intense loops. The compiler traces (observes) the operations and compiles optimized, interpreter-free versions. Once the compiled, improved version is available, Pypy uses that instead of interpreted code.
Numba and Pypy are JIT compilers. Numba focuses on compiling functions and methods while Pypy on slow loops. Numba - for numerical code, Pypy - as possible CPython replacement.
embarrassingly parallel problems are excellent candidates for efficient parallelization.
Async programs can deal with multiple unpredictable resources concurrently and efficiently.
Don't call us, we will call you.
threading
module uses callbacks.
A future allows us to keep track of the requested resource.
concurrent.futures.Future
represents a value that is not yet available.
my_future.set_result()
- "solves" the future
my_future.result()
- returns the result
my_future.add_done_callback()
- registers a callback
Event loops simplify dealing with shared variables, data structures and resources.
busy-waiting - continously polling
asyncio.get_event_loop()
- returns event loop
asyncio.ensure_future()
- schedules future or coroutine for execution; returns a Task
instance if coroutine is passed
asyncio.gather(*coroutine_list)
- gathers results of all coroutines
loop.run_until_complete()
- runs until future (or coroutine) is finished
loop.call_later()
- takes a delay in seconds and callback
loop.run_forever()
- starts the loop
loop.stop()
- stops the loop
Write async code so that it resembles sync code. Couroutine - mental model: a function that can be stopped and resumed.
Coroutine can be implemented in asyncio
using yield
statements.
From Python 3.5 on await
keyword is available.
async def hello_world():
print("Hello, async!")
coro = hello_world() # This is a coroutine, not function execution!
native coroutine - a couroutine defined with async def
statement
Convert blocking tasks into futures.
from concurrent.futures import ThreadPoolExecutor
# from concurrent.futures import ProcessPoolExecutor - use for CPU intensive tasks
executor = ThreadPoolExecutor(max_workers=3)
def wait_and_return(msg):
time.sleep(1)
return msg
executor.submit(wait_and_return, "Executor") # will run immediately
# or
loop.run_in_executor(executor, wait_and_return, "Asyncio executor") # will run once loop starts
# Result:
# <Future at 0x7ff616ff6748 state=running>
Manifesto:
- Responsive: The system responds immediately to the user.
- Elastic: The system is capable of handling different levels of load and is able to adapt to accommodate increasing demands.
- Resilient: The system deals with failure gracefully. This is achieved by modularity and avoiding having a single point of failure.
- Message driven: The system should not block and take advantage of events and messages. A message-driven application helps achieve all the previous requirements.
RxPython - RxJS in Python.
A property of a cold observable is that if we attach two subscribers, the source will be triggered twice.
publish()
method will transform Observable
into ConnectableObservable
.
It will push data only after we call connect()
method.
When data is produced independently of subscribers - it's a hot observable.
Communication between processes is costly. Two ways to handle data communication in parallel programs:
- shared memory,
- distributed memory.
- Multiple execution contexts and shared memory space - threads.
- Multiple execution contexts and separate memory space - processes.
Python can spawn and handle threads, but they can't be used to increase performance; due to the Python interpreter design, only one Python instruction is allowed to run at a time - this mechanism is called Global Interpreter Lock (GIL). What happens is that each time a thread executes a Python statement, the thread acquires a lock and, when the execution is completed, the same lock is released. Since the lock can be acquired only by one thread at a time, other threads are prevented from executing Python statements while some other thread holds the lock.
Even though the GIL prevents parallel execution of Python instructions, threads can still be used to provide concurrency in situations where the lock can be released, such as in timeconsuming I/O operations or in C extensions.
The GIL can be completely sidestepped using processes instead of threads. Processes don't share the same memory area and are independent from each other - each process has its own interpreter. Processes have a few disadvantages: starting up a new process is generally slower than starting a new thread, they consume more memory, and inter-process communication can be slow. On the other hand, processes are still very flexible, and they scale better as they can be distributed on multiple machines.
Really cool explanation of GIL: https://realpython.com/python-gil/
multiprocessing.Lock
- a lock can be acquired and
released through the acquire
method and release
, or using the lock as a context
manager.
Since the lock can be acquired by only one process at a time, this method prevents
multiple processes from executing the protected section of code at the same time.
lock = multiprocessing.Lock()
class Process(multiprocessing.Process):
def __init__(self, counter):
super(Process, self).__init__()
self.counter = counter
def run(self):
for i in range(1000):
with lock: # acquire the lock
self.counter.value += 1
# release the lock
Cython provides a convenient interface to perform shared-memory parallel processing through OpenMP. This lets you write extremely efficient parallel code directly in Cython without having to create a C wrapper.
- Numerical computing
- Improving performance with Cython
- Exploring compiles in depth
- Reactive programming
- Distributed processing (Dask, PySpark, mpi4py)
- Designing for high performance (organizing source code, environments, contenerization, CI)