satwikkansal/scout_performance_optimizaiton.md Secret

## scout_performance_optimizaiton.md

      
    Raw
  

              scout_performance_optimizaiton.md
            
          
Table of Contents

Introduction
Optimizing performance in Python

What does "Optimizing code" mean anyway?
When to optimize?
Identifying performance bottlenecks
Profilers in Python

Different kinds of profilers

Deterministic and statistical profiling
The level at which resources are measured
Profile visualizers


Profiling CPU performance (and execution time) in Python

Useful CPU performance profilers in python

Inbuilt timing modules
profile and cProfile
pstats
line_profiler
Yappi
hotshot and coldshot
vmprof-python, pyinstrument
stacksampler, pyflame
Werkzeug Application profiler
GreenletProfiler
gprof2dot
pyprof2calltree and KCacheGrind
snakeviz, tuna, and runsnakerun


How do CPU profilers work in Python?
Approach for optimizing performance

Some low-hanging fruits

Waiting (time delays)
Too much I/O time

Logging


Way too many function calls
Object conversion
Using Non-pythonic constructs or outdated methods
Parallelizing


Some caveats (must know things) about profilers and optimizations

Heisenberg's uncertainty
Indirect dependency on the external environment
Limited supported environments
Hard to come up with a one-fits-all solution
Possible side-effects of optimization
Overfocusing on profilers


Where and how an APM (Application Performance Monitoring) tool help here?

Intensive Full Stack recording
Memory metrics
Some very useful High-level features


Conclusion


Title: Identifying performance bottlenecks in your Python code and optimizing it
Introduction

In this post, we will walk through various techniques that can be used to identify the performance bottlenecks in your python codebase and optimize them.
Optimizing performance in Python

What does "Optimizing code" mean anyway?

The term "optimization" can apply to a broad level of metrics. But two general metrics of most interest are; CPU performance (execution time) and memory footprint. For this post, you can think of an optimized code as the one which is either able to run faster or use lesser memory or both.
When to optimize?

There are no hard and fast rules. But generally speaking, you should try optimizing when you are sure that the business logic in the code is correct and not going to change soon.

"First make it work. Then make it right. Then make it fast."
~ Kent Beck

Otherwise, the effort spent on premature-optimization can go futile.
Unless you're developing a performance-intensive product or a code dependency that is going to be used by other projects which might be performance-intensive, optimizing every aspect of the code can be an overkill. For most of the scenarios, the 80-20 principle (80 percent of performance benefits may come from optimizing 20 percent of your code) will be more appropriate. According to professor Donald Knuth,

Programmers waste enormous amounts of time thinking about or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Before writing the optimized version of the code, we need to know what to optimize, "What is that 3% I should be spending my effort on?", or in other words, we need to identify the bottlenecks,
Identifying performance bottlenecks

Most of the times we make intuitive-guesses, but more often than not, our guesses are either wrong or just approximately correct. So, it's always advisable to take help of tools to get a clear definitive picture. The process of measuring resources (CPU usage, RAM, I/O, etc.) used by different sections of a program is known as "profiling," and the tools that help in profiling are usually referred to as "profilers." A profiler typically collects statistics like how often a resource is used and who is using the resource. For instance, a profiler designed for profiling execution time will measure how often and for how various long parts of the code are executed. Using a profiling mechanism becomes a necessity when the codebase grows large, and you still want to maintain efficiency.
Profilers in Python

The world of profilers in Python is pretty large, and if it's the first time that you decided to profile something, you may get overwhelmed by the number of options that you'll have. But the nice thing is some profilers are best suited for particular cases, understanding this can help you avoid a lot of stress when deciding which profiler to use, and this is exactly what we're going to cover in next few sections.
Different kinds of profilers

There are multiple criteria based on which you can segregate profilers in python,
Deterministic and statistical profiling

A profiler collects data and statistics by monitoring the events (function calls, returns, exceptions, etc.) that happen during program execution. There are broadly two approaches to go about it, either the profiler will be monitoring all the events, or it will sample after time intervals to collect that information. The former one is called deterministic profiling, and the later one is called statistical profiling. The tradeoff that we are making in these two approaches is between overhead and accuracy. You may find the terms deterministic and statistical profilers used interchangeably with "tracing" and "sampling" profilers, but they're not the same technically. Tracing simply means recording the entire call graph, and sampling simply means probing in intervals.
The level at which resources are measured

Some profilers will measure resource usage at the module level, some are capable of doing it at a function level, and some can measure at line level.
Profile visualizers

There are some tools which aren't profilers themselves, but they provide an interface to perform analysis on the data collected / reports generated by other popular profilers.
Profiling CPU performance (and execution time) in Python

Useful CPU performance profilers in python

This section will walk you though various options that are available to us. Wherever possible, we'll group similar profilers for simplification.
Inbuilt timing modules

The timing modules (time module and the timeit) are the basic modules that you can use to measure the execution time of various approaches (kind of algorithmic testing). The usage is pretty simple.
import time

start_time = time.time()

# The code to be evaluated

end_time = time.time()

# Time taken in seconds
time_taken = end_time - start_time
However, there are a few external factors that can affect time_taken. To minimize the influence of those external factors (e.g., machine load, I/O), it is often advised to run the test multiple times and choose the fastest.
Another thing to note is that time.time() measures the wall-clock time ( total time taken for the code to get completely executed) and not the CPU time (total time for which the CPU was actually processing the statements related to the code, excluding the time where it was busy somewhere else). To measure CPU time, you can use time.clock() instead.
timeit module was designed to abstract a lot of such intricacies and provide a simple interface for measuring execution times. Here's a sample usage
>>> import timeit
>>> timeit.timeit(stmt='" ".join(some_list)', setup='some_list = ["lets", "join", "some", "strings"]', number=1000, timer=time.clock)
0.00020100000000011775
The above code will run the setup statement once, and then returns the cumulative wall-clock time taken to run the main statement (stmt) 1000 times.
You can also use timeit.repeat to run the same experiment multiple time. And to measure CPU time instead of wall clock time, you can pass the timer as time.process_time). Here's an example,
>>> timeit.repeat(stmt='" ".join(some_list)', setup='some_list = ["lets", "join", "some", "strings"]', number=1000, repeat=4,  timer=time.process_time)
[0.00019499999999972317,
 0.0001930000000003318,
 0.00019100000000005224,
 0.00019799999999969842]
Also, if you're using jupyter notebook, you can use the %timeit magic function to get similar (and nicely formatted) results.
profile and cProfile

The built-in profile and cProfile modules provide deterministic profiling. Both have a similar interface, but profile is implemented in pure python, which adds to the overhead.
Usage:
$ python -m profile somefile.py *args
$ python -m cProfile somefile.py *args
Where *args are the command line arguments to be passed to your python file (if any).
This will print out results on stdout in the following format.
   844321 function calls (711197 primitive calls) in 7.343 seconds
   Ordered by: standard name
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    992    0.001    0.000    0.001    0.000 json_generator.py:49(is_interactive_statement)

You can also use the -o option to save to the results to file so that we can use them later.
What do the different columns in results mean?
ncalls is the number of calls, tottime is the time spent in a given function excluding the time spent in calls to sub-functions, percall is the time per call for a function (essentially tottime divided by ncalls), cumtime is the cumulative time spent in the function including the time spent in calls to sub-functions, the percall next to it is again the per call cumulative time, and finally, filename:lineno provides the reference to the code (at function level) related to these stats.
This and this were the sample outputs when I tried to run them on a couple of my scripts. As you can observe, there is plenty of information to go through in those outputs. If you're interested in only a subset of information, you can also use them as a python module to profile specific portions of the code.
import cProfile
p = cProfile.Profile()

# Enable profiling
p.enbale()

# Some logic that you want to profile

# Disable profiling
p.disable()

# Print the stats
p.print_stats()

# Dump the stats to a file
p.dump_stats("results.prof")
The interface of profile module is slightly different, and it also has a calibration method for correcting the overhead time from results. If you are not extending the profiler and just want to see the stats, you should always use cProfile instead of profile as it has less overhead.
However, there are few cases where cProfile and profile modules won't be helpful for you,

Event loop based applications; cProfile doesn't understand gevent or any greenlet based application.
Line by line profiling; Sometimes you will want to know the lines from which your time-consuming function is being called, and sometimes you may have difficulty in figuring out which line inside the time-consuming-function is the slowest.
Lack of visualization features. No support of exporting the stats to formats like callgrind.
Tricky do certain things like measuring CPU time, profiling multithreading programs, and so on. By default, the CProfiler runs on the main thread only (thread on which your python script is invoked). If you want to profile other threads that are spawned from your program, you might want to use threading.setprofile method of the threading module.

Nothing to worry here though, there are other tools that can complement cProfile or serve as an alternative to cProfiler in some cases. Let's discuss them next,
pstats

The pstats module can be used to parse the data generated by profile or cProfile modules. Once parsed you can do things like

Sorting based on number of calls, time, cumulative time, etc
Combining stats from multiple sources
Stripping directory names from the data
Viewing limited data (based on criterion like top n, containing regex, etc.)
Finding callers and callees of a function

Here's the typical usage
import pstats

# parse the results of cProfile
s = pstats.Stats("results.prof")

# Adding stats
s.add("more_results.prof")

# remove directory paths
s.strip_dirs()

# Sorting based on cumulative time
s.sort_stats("cumulative")

# Printing top 'n' stats
n = 3
s.print_stats(n)

function_name = "foo"

# print callers of foo
stats.print_callers('\({}'.format(function_name))

# pring callees of foo
stats.print_callers('\({}'.format(function_name))
line_profiler

line_profiler performs line by line profling collecting stats about the time spent in each line of the code being profiled.
Usage
$ pip install line_profiler
# You python program

# Use the decorator
@profile
def some_func(*args, **kwargs)
    ...
# kernprof is a convenient script to run different profilers; it comes with line_profiler
# https://github.com/rkern/line_profiler/blob/master/kernprof.py
$ kernprof -l -v python_script.py

# This will generate a binary file named python_script.py.lprof

# You can visaulize this file by
$ python -m line_profiler python_script.py.lprof
Here's a sample output (complete output here)
Timer unit: 1e-06 s

Total time: 0.000825 s
File: json_generator.py
Function: generate_code_block at line 26

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    26                                           @profile
    27                                           def generate_code_block(statements, output):
    28                                               global sequence_num
    29                                               result = {
    30       278        109.0      0.4     13.2          "type": "code",
    31       278        143.0      0.5     17.3          "sequence_num": sequence_num,
    32       278        109.0      0.4     13.2          "statements": statements,
    33       278        220.0      0.8     26.7          "output": output
    34                                               }
    35       278        130.0      0.5     15.8      sequence_num += 1
    36       278        114.0      0.4     13.8      return result

Total time: 0.014138 s
File: json_generator.py
Function: parse_example_parts at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           @profile
    56                                           def parse_example_parts(lines, example_title_line):
    57                                               parts = {
    58        57         38.0      0.7      0.3          "build_up": [],

The column names are kind of self-explanatory. Hits are the number of times the line is executed during the run, and "% time" is the propotion of time spent on the line. (It'll sum up to 100 for every function).
Yappi

Yappi (Yet Another Python Profiler) is a profiler which is an attempt to improve upon some of the lacking features of cProfile. Yappi can,

Profile a multi-threaded python application out of the box. It also provides the functionality of obtaining per-thread function statistics.
Measure CPU time easily.
Export the results to callgrind format as well as pstat formats.

Usage
$ pip install yappi
import yappi

# The default clock is set to CPU, but you can switch to wall clock
# yappi.set_clock_type("wall")
yappi.start()

# Call your functions or perform execution here
some_func()

# Getting all the stats
yappi.get_func_stats().print_all()
Here's the sample  output (complete output here).
name                                  ncall  tsub      ttot      tavg
../lib/python3.6/pprint.py:47 pprint  2      0.000050  5.181650  2.590825
..pprint.py:138 PrettyPrinter.pprint  2      0.000043  5.181576  2.590788
..print.py:154 PrettyPrinter._format  101..  0.234201  5.181516  0.000508
...py:180 PrettyPrinter._pprint_dict  148..  0.047156  4.244369  0.002868
..3 PrettyPrinter._format_dict_items  148..  0.149224  4.232629  0.002860
...py:207 PrettyPrinter._pprint_list  999/2  0.016574  4.169317  0.004173
..py:350 PrettyPrinter._format_items  999/2  0.086727  4.169248  0.004173
..nerator.py:226 convert_to_notebook  1      0.001365  3.705809  3.705809
../pprint.py:391 PrettyPrinter._repr  15608  0.192905  3.533774  0.000226
..pprint.py:400 PrettyPrinter.format  15608  0.120687  3.277539  0.000210
..python3.6/pprint.py:490 _safe_repr  931..  1.554210  3.156852  0.000034
..ython3.6/json/__init__.py:120 dump  1      0.124881  0.937511  0.937511
...6/json/encoder.py:412 _iterencode  19093  0.116629  0.749296  0.000039
..on/encoder.py:333 _iterencode_dict  488..  0.329734  0.632647  0.000013
..t.py:244 PrettyPrinter._pprint_str  1020   0.277267  0.576017  0.000565
..python3.6/pprint.py:94 _safe_tuple  30235  0.352826  0.559835  0.000019
..on/encoder.py:277 _iterencode_list  297..  0.201149  0.515704  0.000017
...6/pprint.py:84 _safe_key.__init__  60470  0.207009  0.207009  0.000003
..n3.6/pprint.py:87 _safe_key.__lt__  32160  0.114377  0.114377  0.000004
..enerator.py:60 parse_example_parts  57     0.043338  0.104288  0.001830
.._generator.py:173 convert_to_cells  112    0.004576  0.044347  0.000396
..62 inspect_and_sanitize_code_lines  278    0.011167  0.037915  0.000136
Just like cProfiler, the metrics are collected at function level. tsub is the total time spent in the function excluding the subcalls, ttot is the total time including them, ncall is the number of calls, tavg is the per call time (ttot divided by ncall).
hotshot and coldshot

hotshot is also a built-in module for profiling. The difference between hotshot and cProfile (or profile) is that hotshot doesn't interfere as much as cProfile does. The idea behind hotshot is to minimize the profiling overhead by doing all the data processing at the end. However, the module is no longer maintained, and the python docs say that it might be dropped in future versions.
The interface of hostshot is similar to that of cProfile; you can read the docs here.
coldshot was an attempt to improve upon hotshot; the major feature is the ability to visualize results in a tool like RunSnakeRun (discussed later).
vmprof-python, pyinstrument

vmprof-python is a statistical line profiler for Python. Due to statistical nature, vmprof-python has a low overhead.
Usage
$ pip install vmprof
$ python -m vmprof --lines -o results.dat somefile.py
$ vmprofshow --lines results.dat
You can omit the --lines option to profile at function level rather. Follwing is the sample output (complete output here)
Line #     Hits   % Hits  Line Contents
=======================================
   226                    def convert_to_notebook(parsed_json):
   227                        result = {
   228                            "cells": [],
   229                            "metadata": {},
   230                            "nbformat": 4,
   231                            "nbformat_minor": 2
   232                        }
   233                        for example in parsed_json:
   234                            parts = example["parts"]
   235                            build_up = parts.get("build_up")
   236                            explanation = parts.get("explanation")
   237                            notebook_path = "test.ipynb"
   238
   239                            if(build_up):
   240        4      2.4              result["cells"] += convert_to_cells(build_up)
   241
   242                            if(explanation):
   243                                result["cells"] += convert_to_cells(explanation)
   244
   245      153     90.0      pprint.pprint(result, indent=2)
   246        1      0.6      with open(notebook_path, "w") as f:
   247       12      7.1          json.dump(result, f)

Since it's a statistical profiler and collects data by probing the stack frame in fixed intervals, it might not be as accurate as expected particularly in the cases when there are not enough repetitions (or recursion), and the program runs quickly.
pyinstrument is another actively maintained statistical profiler that records the call stack every 1ms. pyinstrument can measure CPU time, and it also supports full-stack recording, meaning, it generates output like (expanded output generated using --show-all option here:
  _     ._   __/__   _ _  _  _ _/_   Recorded: 23:00:06  Samples:  583
 /_//_/// /_\ / //_// / //_'/ //     Duration: 3.023     CPU time: 0.519
/   _/                      v3.0.3

Program: json_generator.py

3.022 <module>  json_generator.py:11
├─ 1.583 pprint  pprint.py:47
│     [137 frames hidden]  pprint, re
│        0.606 _pprint_str  pprint.py:244
└─ 1.421 convert_to_notebook  json_generator.py:226
   ├─ 1.381 pprint  pprint.py:47
   │     [155 frames hidden]  pprint, re
   └─ 0.034 dump  json/__init__.py:120
         [17 frames hidden]  json

To view this report with different options, run:
    pyinstrument --load-prev 2019-06-27T23-00-06 [options]

Usage:
$ pip install pyinstrument
$ python -m pyinstrument some_file.py
It also provides middlewares for instrumenting Django and flask; you can read more about that here.
stacksampler, pyflame

Stacksampler is a statistical profiler written in just 100 lines of code. It records stack every 5ms and works well with gevent-based applications. To visualize the results, you can use the complementary stackcollector agent, which will generate the flame graphs for you. A flame graph has the call stack at the y-axis and percent resource utilization at x-axis (CPU time in our case), which enables you to visualize what part of the program is taking what percent of the time. However, due to its simple implementation, there might be some cases (like profiling dynamically spawned processes/threads) where stacksampler won't give the results you want.
Another interesting profiling tool when it comes to generating flame graphs is pyflame (designed by Uber). The functioning of pyflame is different and more optimized from that of stacksampler (discussed later). The coolest feature of the flame graph is the possibility of calculating a "diff" of two flame graphs. This is immensely helpful when your application starts performing slowly in a recent release. By computing the diff, you can see what changed in the call stack, and how much time it is taking, saving time in debugging performance regressions. However, as per the docs,

Are BSD / OS X / macOS Supported?
Pyflame uses a few Linux-specific interfaces, so, unfortunately, it is the only platform supported right now.

So I wasn't able to test it out and present the output in this post :(
Werkzeug Application profiler

Werkzeug is a library consisting of various utilities for WSGI web applications. Flask web framework also wraps werkzeug. One utility that werkzeug library provides is Application Profiler middleware. It uses cProfile to profile each request execution. You can see the usage in docs here.
GreenletProfiler

GreenlentProfiler is a profiler suited for profiling gevent based processes. gevent is a coroutine based networking library that provides a high-level synchronous API on top of the libev event loop. It is used to provide concurrency (but no parallelism) to improve speed by executing I/O bound tasks (network calls) asynchronously under the hood. Here's the post by the owner of GreenletProfiler explaining how cProfiler and YAPPI aren't much helpful for gevent based applications (with examples), and to use GreenletProfiler instead.
gprof2dot

gprof2dot is a tool that you can use to visualize the results of profilers like cProfiler and hotshot (and lots of other language profilers) into what is called a dot graph.

Use cProfile / Hotshot to get results in .dot file
Has a django fontend for visualizing (called graphi)

Usage
$ pip install gprof2dot

# To install dot on osx
$ brew install graphviz
$ python -m cProfile -o results.prof somefile.py
$ gprof2dot -n0 -e0 -f pstats -o results.dot results.prof
$ dot -Tpng -o results.png results.dot
Here's the sample output (view full-size image here), which consists of nodes

Here total time % represents the percentage of time spent in the function, self time % is the time spent in the function excluding sub-function calls, and total calls is the total number of calls.
pyprof2calltree and KCacheGrind

KCacheGrind is a very popular profile data visualization tool. To visualize your python program using it, you can convert cProfile results into the callgrind form using pyprof2calltree and then visualize them using KCacheGrind.
Usage:
$ python -m cProfile -o results.prof somefile.py
$ pyprof2calltree -k -i results.prof
Here are a couple of screenshots,


You can visualize the call graph, caller graph, callee graph, and even the source code level statistics using kcachegrind.
snakeviz, tuna, and runsnakerun

sankeviz is a profile visualizer which lets you convert the output of the cProfile module into browser-based visualizations like sunburst diagrams and nice graphs. Tuna (inspired by SnakeViz) and runsankerun (early tool in this category) are some other options that you can explore.
Snakeviz usage
$ pip install snakeviz
$ python -m cProfile -o results.prof some_file.py
$ snakeviz results.prof
Here's the screenshot of the output it generates,


How do CPU profilers work in Python?

Although you don't need to know how the CPU profilers work under the hood to be able to use them, but knowing how they work will help you reason about their capabilities and overheads.
Most of the profilers will function as a part of your Python process and access the stack frames (which will contain meta information like line number, statements, etc.).
The tracing profilers work by registering a callback to interpreter events (function calls, returns, line execution, exceptions, etc.). When the event is triggered, the callback function gets called, which has the logic for recording stack frames and analyzing them. The C-interface functions for registering callbacks are PyEval_SetProfile and PyEval_SetTrace. The difference between the two lies in the events they monitor. Do check out the Profiling and tracing section of the python docs for more details. Here's the ceval.c module in CPython implementation (which by the way, is the heart of python) where these functions are defined.
For checking how they are used, you can see the implementation of the callback functions in the source codes of different tracing profilers. For instance, in the built-in cProfiler, this is the line of code where callback is registered, and this is where the callback function is defined.
Coming to sampling profilers, most of them use setitimer system call in Linux which asks the OS to send signals after fixed intervals. And just like the callback functions in case of tracing profilers, there are signal handlers which will record the stack frame at that moment for analysis. The "fixed intervals" can be wall time or CPU time, depending on the arguments passed to the system call. Here's documentation for settimer. You can explore source codes of different profilers and see how they are using it. For instance, this is the line of code is stacksampler where the settimer signal is registered, this is where the signal handler is defined, and this is the implementation of the signal handler.
Now you can also figure out when and why tracing profilers can have high overhead. If the program that you are profiling generates a lot of events (Example, a recursive function), it will cause the callback function to execute at every event, resulting in overhead. The ovehead of sampling profilers will still be roughly the same in these cases. However, one caveat with OS signal based sampling profilers is that the Signals can interrupt system calls (so that signal handler can run). This can lead to IOErrors and may even conflict with other libraries/program that depends on the same signal the profiler is depending on.
That being said, the differentiation between the implementations of tracing and sampling profilers is not very clear-cut. There are also some sampling profilers like Pyinstrument that have shifted to PyEval_SetProfile approach, wherein the callback function will not record the stack frames until the "fixed interval" has elapsed (to reduce the overhead). And some sampling profilers (like pyflame) run in a different process altogether making use of ptrace system call to record the information for analysis. For those interested, here's a very interseting blog post by Uber engineering on how exactly pyflame works.
Now that we are aware of how profilers work and all the tools that are available to help us in identifying bottlenecks, let's discuss the rough approach that we can follow to eventually optimize our code.
Approach for optimizing performance

Optimization with the help of profiling is like debugging. A profiler (in most of the cases) won't tell you how to change your code, but it'll help you validate your hypotheses. Usually, the process goes something like this; You identify a culprit section of code, then you come up with some hypothesis

Doing this in X way instead of Y might increase the performance.

And then you try that out to validate your hypotheses. If the hypothese turns out to be true, you can get the performance improvement results in a quantified manner using the profilers itself.
You can find out the functions which are expensive (typically having a high value of the product of "Number of calls" and "Execution time per call"), and explore design decisions like adding a cache (if the input domain of the function is limited). You can go down a level deeper and use a line profiler to figure out what lines inside the function are the most time consuming and try to see if there are alternative ways.
However, synthetic benchmarks aren't enough. Sometimes things can vary a lot between production and development environment. Profiling in production is helpful. However, though, in most of the cases, we would not want to bear the overhead of operating a deterministic profiler in the production environment.
Some low-hanging fruits

Following are some of the easy to fix ideas that developers often come across while profiling for CPU performance.
Waiting (time delays)

This the most obvious low-hanging fruit. Synchronous waiting in any form (time.sleep, thread.wait, lock.acquire, etc.) will block your program to execute anything on CPU. This blocked CPU time can better be utilized by other areas of your program, which don't depend on the wait time.
Too much I/O time

When your program is waiting for I/O, it is not utilizing the CPU. Some examples include disk reads, database queries, network communication (RPCs, REST API calls, etc.), etc.
Logging

Since logging to disk (or screen) is also an I/O bound task, you should be cautious about the things you're logging and the logging levels in production.
The work around with I/O bound taks and time delays is to execute them asynchronously (see asyncio for python3).
Way too many function calls

Every function call has an overhead. While profiling, you can try to classify expensive function calls as

Genuine calls
Un-necessary calls (alogrithmic issues)

This can be done using caller and callee info; time spent info and profile visualizations.
In case of genuine calls, you might want to explore implementing your function at a lower level (using cTypes, or Cython, or C) or use the libraries that already do so. For instance matrix calculations are much-much faster with Numpy than with native Python because numpy is built in CPython and uses SIMD architecture. Similarly, there are libraries like pandas, scipy, pycrypto, lxml, gstreamer, etc. which you can use for doing specific things instead of doing them on your own. There's also another generic suggestion that is hidden here "Don’t reinvent the wheel, look for existing stuff first".
Object conversion

There are some object conversion tasks that are CPU intensive and might not be necessary for your use-case. Examples include,

Serialization
Encoding & decoding, Encryption & decryption
Compression and decompression
Un-necessary conversion from one class to another

Using Non-pythonic constructs or outdated methods

There are multiple ways to do the same things in Python, and their performance impact can differ drastically. Here are a few instances where you should be careful about performance impacts,

String concatenation
Naive string parsing, and using regexes
Loading huge data in RAM

Parallelizing

Parallelizing expensive tasks is also a good idea for optimizing performance. As you might know already, multithreading in python will provide concurrency and not actual parallelization. For that, you can use the multiprocessing module. Ideally, you can spawn upto as many as 1.5 times the number of cores processes (you might need to tweak this ratio a bit to find the "sweet-spot" for your application). Creating more processes also helps if your program uses cooperative multithreading (e.g., Greenlets in gevent). You can also explore the feasiblity of using tools like Dask, and PySpark (see our post on distributed machine learning with pyspark).
Some caveats (must know things) about profilers and optimizations

Heisenberg's uncertainty

Heisenberg's uncertainty principle in quantum mechanics states that the more precisely you know the position of something, the less precisely you can tell its speed and momentum. A similar parallel can be drawn when it comes to profiling, i.e., by measuring one thing closely, you have actually weakened your understanding of the other. You have your code operating one way, and you try to measure it really deeply, put tons of hooks into it to do this monitoring using profilers, and it might actually change the performance of your code. So the final numbers may look somewhat different. Not even the code injected with profiler hooks will run x% slower, this x% might not be evenly distributed as well. So if you want high-precision or you're looking at the percentage-based metrics, this is something you should keep in mind.
Indirect dependency on the external environment


The profiling results might depend on external circumstances (Load avg on the machine, network speed, etc.). So profiling with an intent to compare two programs, you should try to keep the external conditions as much similar as possible.

Limited supported environments

Some tools might support limited versions of Python or only specific operating systems.
Hard to come up with a one-fits-all solution

For instance, a statistical profiler is less likely to distort timings of a program that has a lot of function calls (like a Django template renderer). It also means that the profiler misses recording information about some of the function frames. However, the missed function frames are the fast running ones. This might be an acceptable tradeoff in the production environment since the overhead is very less and profilers are expected to find the slowest code. But in the development environment, we will want to have the definite answers to everything, so we should use deterministic profiler there. Similar tradeoffs exist when it comes to visualization, report formats, etc.
Possible side-effects of optimization

The are also some side-effects of optimization.

Writing optimized code is slow. It requires a lot of human-effort (and that too experts) and time. Examples include writing computing libraries like numpy and tensorflow.
Optimization can compromise with readablity. An optimized code will raise the bar for the level of expertise that a programmer should have to be able to understand it. Depending on the use-case you might have to end up writing code in lower level languages C, Cython, CUDA, etc.
Optimization can also compromise with extensiblity / maintainablity. Some optimizations like Caching, lazy execution, parallelized executions, and hacky-patches are based on assumptions that things will be behaving in a certain way. And if the requirements on which these assumptions are based change in the near future, the code needs to be changed to account for that. A couple of examples can be,

Caching becoming ineffective due to change in assumptions
The side-effects of problems like race condition may suddendly become evident in case of high concurrent scale.
The space-time tradeoff that you took in your design to improve the speed has somehow become a bottleneck now.


Overfocusing on profilers

Writing code that does the job is one thing; writing it in an optimized way is a different game altogether. It not only requires the familiarity with differenet programming language constructs, but it also demands in-depth knowledge of how things work under the hood. Of course, there are tools and tips and tricks out there to help you optimize the code. But trust me, having a better picture of what's happening at a lower level, will help you go a long way in writing optimized code in the first place.
Where and how an APM (Application Performance Monitoring) tool help here?

A good APM actively monitors different parts of your applications and can help you locate the hotspots for performance improvements very quickly. An APM may not be a silver bullet for optimizing your application, but it's a collection of lead bullets that can help in figuring out bottlenecks. For instance, Scout provides,
Intensive Full Stack recording

We mostly covered tools specific to profiling execution speeds native to python, but your application can be much more than a program that runs on CPU. For instance, if you have a web application that serves decent traffic, it is highly likely that your stack will have components databases, external API integrations, consumers and producers, and so on. And sometimes, the hotspots for optimizations can be outside the python runtime. Though there are tools available to profile each ot these components individually, a good APM tool will provide you the big-picture-overview of the system and the capability to dig deeper into the components. It'll record information at different layers of the stack. This is particularly useful for large applications.
Scout works on the concept of transactions (and traces). So it is determisintic in the sense every single transaction under monitoring is measured. To collect statistics related to different components scout utilizes techniques like signals and events, patching, etc.
Memory metrics

We didn't cover python-specific memory profilers in this post (since the post is already too long!), but scout has features like,

Object allocation tracking (garbage collection)
Ability to track process memory increases in RAM and tie them to specific transactions.

Some very useful High-level features

You can attach meaningful context to transactions in Scout to help correlate and identify performance problems to a specific context (users, actions, etc.). Scout ties each transaction to the Git version and will point you directly to the line of code responsible. Scout can aggregate results for various machines in your cluster.
If you have a web application running on a python web framework like Django or Flask, you can integrate flask right out-of-the box with a few lines of configuration.
Scout Usage
$ pip install scout-apm
If you have a flask, django, bottle, falcon, or pyramid application you can configure scout in a few lines and get moving.
In Django
# settings.py
INSTALLED_APPS = [
    "scout_apm.django",  # should be listed first
    # ... other apps ...
]

# Scout settings
SCOUT_MONITOR = True
SCOUT_KEY = "[AVAILABLE IN THE SCOUT UI]"
SCOUT_NAME = "A FRIENDLY NAME FOR YOUR APP."
In flask
from scout_apm.flask import ScoutApm

# Setup a flask 'app' as normal

# Attach ScoutApm to the Flask App
ScoutApm(app)

# Scout settings
app.config["SCOUT_MONITOR"] = True
app.config["SCOUT_KEY"] = "[AVAILABLE IN THE SCOUT UI]"
app.config["SCOUT_NAME"] = "A FRIENDLY NAME FOR YOUR APP"
Scout will automatically instrument code related to following libraries,

PyMongo
Requests
UrlLib3
Redis
ElasticSearch
Jinja2

To monitor something else, you can use custom instrumentation. You can use the @scout_apm.api.BackgroundTransaction decorator or with scout_apm.api.BackgroundTransaction to specify the portion of the code you want to monitor, and if within that code if you want to measure a specific piece of work, you can use @scout_apm.api.instrument() decorator or with with scout_apm.api.instrument context manager for that,
import scout_apm.api

config = {'name': 'My Test App',
          'key': 'MY_SCOUT_KEY',
          'monitor': True}

scout_apm.api.install(config=config)

@scout_apm.api.BackgroundTransaction("Some name")
def some_func():
    # some work
    sub_func()
    # some more work

@scout_apm.api.instrument("Sub function")
def sub_func():
    # some work


Conclusion

In this post, we went-through different tools and techniques to identify performance bottlenecks in a python application, and some ideas to optimize performance of the application. We discussed about different profiling and visualizing tools available for profiling CPU performance in python, and their typical usage. We briefly discussed about the approach that we should follow when trying to optimize and some caveats associated with profiling and optimization. At last, we had a look at some of the features of Scout that can be helpful during profiling large applications.