Python Performance Profiling

Liam Keegan, SSC

2026.01.28

Course Outline

  • Python Profiling
    • Introduction
    • Profiling Python code
    • Profiling multiprocessing Python code
    • Profiling compiled extensions
    • Profiling GPU code
    • Memory profiling
  • Hands on
    • Python code
    • Python code with multiprocessing
    • PyTorch code

Profiling

Profiling

Profiling means finding out where your code spends time when it is running.

This is the key first step to making your code run faster - identifying "bottlenecks" or "hotspots" that account for a large fraction of the runtime, which you can then optimise.

You can also profile memory usage, I/O usage, network use, or any other resource that you care about.

Why profile?

Trying to make your code faster (or use less memory), without first understanding which parts are slow (or use too much memory), is a guessing game.

The end result is often wasted effort optimising code that doesn't actually result in a significant change in the overall performance.

For example, if one function is responsible for 90% of the execution time, then it's probably a waste of time optimising anything else in the codebase.

Without profiling, attempting to improve performance can be very inefficient.
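
The 90% example can be made concrete with Amdahl's law, which gives the overall speedup when only part of a program gets faster. This is illustrative arithmetic only: the function name overall_speedup and the 10x figures are assumptions chosen for the example, not measurements from the course code.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of the runtime
    is made `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical program: one hotspot takes 90% of the runtime.
# Making the hotspot 10x faster:
hotspot = overall_speedup(fraction=0.9, local_speedup=10)
# Making everything else (the remaining 10%) 10x faster:
rest = overall_speedup(fraction=0.1, local_speedup=10)

print(f"optimise the hotspot: {hotspot:.2f}x faster overall")
print(f"optimise the rest:    {rest:.2f}x faster overall")
```

With these numbers the same optimisation effort gives roughly a 5.3x overall speedup when spent on the hotspot, and about 1.1x when spent anywhere else.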

Profiler output

To profile our code we will use a tool called a profiler. There are a variety of profilers which can produce all kinds of different outputs.

But there are two key types of output:

  • Trace (time series data, e.g. a timeline)
    • What is happening as a function of time, i.e. WHEN things happen
  • Profile (time-averaged data, e.g. a flamegraph)
    • How much each function contributes in total, i.e. WHAT happens

Profiler output

Trace output: x-axis is timestamps

Profile output: x-axis is total amount of time
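
The relationship between the two output types can be sketched in a few lines: a trace is a list of timestamped events, and a profile is what is left after summing the durations per function. The event list below is made up for illustration and does not come from a real profiler.

```python
from collections import defaultdict

# A tiny hand-written "trace": (function name, start time, end time) in seconds.
trace = [
    ("load", 0.0, 1.0),
    ("compute", 1.0, 4.0),
    ("save", 4.0, 4.5),
    ("compute", 4.5, 6.5),
]

# Collapsing the trace into a "profile": total time per function.
profile = defaultdict(float)
for name, start, end in trace:
    profile[name] += end - start

for name, total in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {total:.1f} s")
```

The profile says that compute dominates, but only the trace shows that it ran twice, with a save in between.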

Types of profilers

Profilers come in two flavours:

  • Tracing (deterministic) profilers
  • Sampling (statistical) profilers

They both record information about what your code is doing as it runs, in particular

  • Which functions get called
  • Which functions do these functions call
  • How long do the functions run for

Tracing profilers

Instrument and record every single function call in your program.

  • Advantages
    • Accurate (in the sense that every single function call is recorded)
    • Deterministic (no statistical sampling of function calls)
    • Can be very granular (down to individual lines within a function)
  • Disadvantages
    • Large overhead (which can reduce accuracy by affecting the performance)
    • Generates a lot of data (can be too much to easily deal with)
  • Examples
    • cProfile
    • line_profiler
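
To illustrate the idea behind a tracing profiler, here is a minimal sketch that uses the standard-library hook sys.setprofile to count every Python-level function call. It is a toy, not how cProfile is implemented; the functions square and sum_of_squares are hypothetical examples.

```python
import sys
from collections import Counter

call_counts = Counter()


def tracer(frame, event, arg):
    # The interpreter invokes this hook for every call and return event.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1


def square(x):
    return x * x


def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += square(i)
    return total


sys.setprofile(tracer)
try:
    sum_of_squares(100)
finally:
    sys.setprofile(None)  # always remove the hook again

print(call_counts.most_common(3))
```

Because the hook runs on every single call, the overhead grows with the number of calls, which is the main disadvantage listed above.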

Sampling profilers

Periodically interrupt execution and record what is running.

  • Advantages
    • Significantly less overhead than tracing profilers
    • Typically less intrusive (e.g. don't require changes in the code)
  • Disadvantages
    • Missing data (statistically not an issue but may be if you zoom in on a trace/timeline)
    • Biased against rare events (may be a problem if you're worried about worst case / latency)
    • Not deterministic (if you run it again you will sample different function calls)
  • Examples
    • py-spy
    • perf
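
The sampling approach can be sketched with a background thread that periodically checks which function another thread is currently executing. This is a toy under stated assumptions: it relies on sys._current_frames(), which is a CPython implementation detail, and the function busy_work is a hypothetical workload. Real sampling profilers such as py-spy work from outside the process instead.

```python
import sys
import threading
import time
from collections import Counter

samples = Counter()
stop = threading.Event()
target_thread_id = threading.get_ident()  # the thread we want to observe


def sampler(interval=0.001):
    # Periodically record which function the target thread is executing.
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)


def busy_work(duration=0.3):
    # Spin in pure Python so there is something to sample.
    end = time.perf_counter() + duration
    x = 0
    while time.perf_counter() < end:
        x += 1
    return x


thread = threading.Thread(target=sampler, daemon=True)
thread.start()
busy_work()
stop.set()
thread.join()

print(samples.most_common(3))
```

The exact sample counts differ from run to run, which is the non-determinism mentioned above.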

A note on benchmarking

Benchmarking and profiling are both concerned with performance, but they are answering slightly different questions:

  • Profiling is about understanding where and why your code is slow
    • Typically used to find out where and how you can improve performance
  • Benchmarking is about understanding how fast your code is
    • Typically used to compare performance between alternative implementations

A benchmark typically runs the same piece of code many times, and records the average (or best) runtime for that piece of code.
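
A minimal benchmarking sketch using the standard-library timeit module is shown here. The two string-building functions are hypothetical alternatives chosen only to have something to compare.

```python
import timeit


def join_with_loop(n=1000):
    out = ""
    for i in range(n):
        out += str(i)
    return out


def join_with_str_join(n=1000):
    return "".join(str(i) for i in range(n))


for func in (join_with_loop, join_with_str_join):
    # Run each candidate 200 times, repeat that 5 times, keep the best.
    times = timeit.repeat(func, number=200, repeat=5)
    print(f"{func.__name__:20s} best of 5: {min(times) / 200 * 1e6:.1f} us per call")
```

This tells you which version is faster, but not why. That is the question a profiler answers.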

Profiling Python code: cProfile

cProfile

  • Built-in tracing (deterministic) profiling tool in Python
  • Records count and execution time of every function call
  • Profiles only Python-level code

Easy to use, nothing to install, can just prepend cProfile to your script:

python -m cProfile pipeline.py

This command runs the pipeline.py script and prints a summary of the profiling data to the console.

cProfile output for each function

  • ncalls — number of times the function was called
    • For recursive functions shows two values: total_calls (including recursive) / primitive_calls (excluding recursive)
  • tottime — time spent inside the function body itself, excluding time spent in functions it calls
  • percall (tottime) — tottime / ncalls, the average self-time per call
  • cumtime — cumulative time spent in the function including all subcalls
    • This is usually the most useful column for finding bottlenecks
  • percall (cumtime) — cumtime / primitive_calls, the average cumulative time per (non-recursive) call
  • filename:lineno(function) — source file, line number, and function name
    • Built-ins appear in braces, e.g. {built-in method builtins.exec}
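
The difference between tottime and cumtime is easiest to see on a small example. The sketch below profiles two hypothetical functions, outer and inner, and reads the columns back using pstats (get_stats_profile needs Python 3.9 or later).

```python
import cProfile
import pstats


def inner(n):
    return sum(range(n))


def outer():
    # outer does little work itself; most of its time is spent in inner.
    results = []
    for _ in range(3):
        results.append(inner(500_000))
    return results


with cProfile.Profile() as pr:
    outer()

# get_stats_profile() exposes the same columns as the text report.
func_profiles = pstats.Stats(pr).get_stats_profile().func_profiles
for name in ("outer", "inner"):
    fp = func_profiles[name]
    print(f"{name}: ncalls={fp.ncalls} tottime={fp.tottime:.3f} cumtime={fp.cumtime:.3f}")
```

Here outer should show a small tottime but a cumtime that includes all three calls to inner.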

cProfile output example

         58372623 function calls (58372593 primitive calls) in 10.982 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    7.054    7.054   10.504   10.504 pipeline.py:93(find_similar_records)
 49995000    2.895    0.000    2.895    0.000 {built-in method builtins.abs}
  7181275    0.581    0.000    0.581    0.000 {method 'append' of 'list' objects}
        1    0.008    0.008    0.208    0.208 pipeline.py:83(compute_statistics)
    10000    0.085    0.000    0.156    0.000 pipeline.py:30(normalize_name)
        1    0.007    0.007    0.143    0.143 pipeline.py:10(generate_records)
    10000    0.053    0.000    0.093    0.000 pipeline.py:26(generate_values)

cProfile visualization

  • This text output is useful, but not very easy to read!
  • Typically we write the profiling data to a file then visualise with another tool
  • Snakeviz provides an interactive visual version of the profiling data
    • pip install snakeviz
  • gprof2dot + Graphviz provides a call graph
    • pip install gprof2dot
    • Graphviz is often pre-installed on Linux; on macOS: brew install graphviz

To output the profile data to a file we use the -o option:

python -m cProfile -o out.prof pipeline.py

cProfile + snakeviz

snakeviz out.prof

cProfile + gprof2dot

gprof2dot -f pstats out.prof | dot -Tsvg -o out.svg

cProfile as a library

  • For more control over what gets profiled you can use cProfile as a library in your code
  • One way to do this is using it as a context manager:
import cProfile

with cProfile.Profile() as pr:
    ...  # do the work that you want to profile here

# Write the collected data to a file once profiling has finished
pr.dump_stats("out.prof")
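
A saved profile can also be inspected without any extra tools using the standard-library pstats module. The sketch below profiles a hypothetical function slow_sum, writes the data to a temporary file and then reloads it. The temporary directory is only there to keep the example self-contained.

```python
import cProfile
import os
import pstats
import tempfile


def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


prof_path = os.path.join(tempfile.mkdtemp(), "out.prof")

with cProfile.Profile() as pr:
    slow_sum(200_000)
pr.dump_stats(prof_path)

# Load the saved data and print the 5 most expensive entries by cumulative time.
stats = pstats.Stats(prof_path)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
```

The same file can be passed to snakeviz or gprof2dot.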

cProfile summary

  • Built-in profiling tool in Python
  • Use snakeviz and gprof2dot to visualise the output
  • Provides a nice overview
  • Good way to identify hotspots, understand where time is being spent

Profiling Python code: line_profiler

line_profiler

  • Line-by-line tracing profiling tool in Python
  • Records count and execution time of every line of code within a function
  • Profiles only Python-level code
  • Not suited for multi-threaded, multi-processing, or asynchronous code

Install with pip

pip install line_profiler

line_profiler use

In your code import the profile decorator:

from line_profiler import profile

And decorate any functions you want to profile with

@profile

By default this does nothing unless the LINE_PROFILE env var is set.

To see the profiling output, run your script normally (with this env var set):

LINE_PROFILE=1 python script.py
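
Put together, a complete script might look like the sketch below. The function normalize is a hypothetical example, and the try/except fallback is an addition (not part of line_profiler) so that the script still runs where line_profiler is not installed.

```python
try:
    from line_profiler import profile  # available in recent line_profiler releases
except ImportError:
    # Fallback so the script still runs where line_profiler is not installed.
    def profile(func):
        return func


@profile
def normalize(text):
    # Hypothetical example function: each line below gets its own timing row.
    text = text.strip()
    text = text.lower()
    words = text.split()
    return " ".join(words)


print(normalize("  Hello   World  "))
```

Run normally, this behaves like an undecorated script. Profiling output is only produced when the LINE_PROFILE env var is set.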

line_profiler output

line_profiler alternative use

If you just want to profile every line of script.py you can instead do:

python -m kernprof -lv -p script.py script.py

This doesn't require any modifications to the source code.

But it may result in a huge amount of output depending on how large your code is.

line_profiler summary

  • Provides line-level profiling data
  • Useful to investigate which parts of a function are expensive
  • For example a function with a bunch of numpy/pytorch calls
  • Good way to identify hotspots within a function

Profiling Python code: pyinstrument

pyinstrument

  • Sampling (statistical) profiling tool in Python
  • Displays call stack and timeline in a web browser

Install with pip

pip install pyinstrument

Replace python with pyinstrument when running your code.

Several renderers are available; html is a good option:

pyinstrument -r html script.py

Pyinstrument output

You can choose between a call stack view and a timeline view:

Pyinstrument summary

  • Sampling (statistical) profiling tool in Python
  • Simple to use, no need for additional tools to visualise output
  • Offers a timeline view which is not available with cProfile

Profiling Python code: memray

memray

  • Memory profiling tool in Python
  • Displays call stack and timeline in a web browser

Install with pip

pip install memray

Run your script with memray

memray run -o output.bin script.py

Generate a flamegraph of the memory usage

memray flamegraph output.bin

memray

Flame graph and timeline of memory usage:
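
memray needs to be installed separately. As a rough stand-in for trying out the idea of memory profiling, the standard library provides tracemalloc, which tracks allocations made by Python code. It is a different tool with a different output format. The function build_table is a hypothetical workload.

```python
import tracemalloc


def build_table(n):
    # Allocates a list of n small lists.
    return [[i, i * 2] for i in range(n)]


tracemalloc.start()
table = build_table(50_000)
current, peak = tracemalloc.get_traced_memory()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
# Source lines responsible for the most allocated memory.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```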

Other Python profilers

Many other Python profilers are available, including:

  • austin
  • py-spy
  • yappi
  • scalene
  • viztracer

Each has its pros and cons, and we'll cover some of these later in the course - but the key concepts covered in this course apply to all of them.

Python profiling summary

  • So far we have looked at four Python profiling tools
  • Each allows us to understand performance of our code in different ways
  • cProfile: flame graphs and call graphs
  • line_profiler: profile individual lines of code within a function
  • pyinstrument: call stack and timeline
  • memray: memory allocation flame graph and timeline

These tools are often sufficient, even when your Python code is calling compiled or GPU code, to learn how best to improve the performance of your code.

But sometimes you need to go deeper…

Profiling multiprocessing code

Profiling multiprocessing code

If you use multiprocessing with "spawn" in your Python code, then each spawned process is a new process with its own Python interpreter.

Typical profiling will not include any data from these other Python interpreters.

It is possible to use cProfile if you launch it within each sub-process, then manually combine the results at the end, but this is not very convenient.

A better option is to use a profiler that can automatically detect and profile any spawned sub-processes, such as py-spy or austin.
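
For completeness, the manual approach (profile inside each process, then combine) can be sketched as follows. To stay self-contained, the sketch launches separate interpreters with subprocess as a stand-in for multiprocessing workers, and runs them one after the other. The worker script and file names are hypothetical. In a real multiprocessing program, the cProfile calls would go inside the worker function instead.

```python
import glob
import os
import pstats
import subprocess
import sys
import tempfile

# Hypothetical worker script; each copy runs in its own Python interpreter.
WORKER_SOURCE = """
def work(n):
    return sum(i * i for i in range(n))

work(200_000)
"""

work_dir = tempfile.mkdtemp()
worker_path = os.path.join(work_dir, "worker.py")
with open(worker_path, "w") as f:
    f.write(WORKER_SOURCE)

# Launch two separate processes, each profiling itself into its own file.
for i in range(2):
    out_path = os.path.join(work_dir, f"worker_{i}.prof")
    subprocess.run(
        [sys.executable, "-m", "cProfile", "-o", out_path, worker_path],
        check=True,
    )

# Manually combine the per-process results into a single report.
prof_files = sorted(glob.glob(os.path.join(work_dir, "worker_*.prof")))
combined = pstats.Stats(*prof_files)
combined.strip_dirs().sort_stats("cumulative").print_stats(5)
```

The bookkeeping (one output file per process, then a merge step) is what makes this approach inconvenient compared to a profiler that follows subprocesses automatically.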

Profiling multiprocessing code: py-spy

py-spy

  • Sampling (statistical) profiling tool in Python
  • Can attach to a running process
  • Can optionally profile subprocesses
  • Can optionally also profile native (Cython/C/C++) extensions
  • Note: Python 3.14 not yet supported (as of 2026/01/22)

Install with pip

pip install py-spy

Use py-spy record to generate an svg flame graph:

py-spy record -o profile.svg -- python script.py

Without subprocesses

Multiprocessing with 12 workers: normal profiling only shows module loading and argparse. All the multiprocessing work is done in subprocesses that are not visible to the profiler:

With subprocesses

With the -s or --subprocess flag, py-spy is able to profile all the spawned subprocesses, and now we see profiling data for all 12 workers in addition to the original Python process:

Profiling multiprocessing code: viztracer

viztracer

  • Tracing (deterministic) profiling tool in Python
  • Can attach to a running process
  • Supports all variants of multithreading / multiprocessing

Install with pip

pip install viztracer

Use viztracer to generate a result.json trace file, then open with ui.perfetto.dev:

viztracer script.py

Multiprocessing traces

Trace output can be very informative when dealing with multithreaded or multiprocessing code, for example to see which workers are active when. Here is a viztracer trace of the same code displayed using ui.perfetto.dev:

Profiling compiled code

Profiling native extensions

If your Python code calls a compiled/native (Cython / C / C++) extension, the profilers we've looked at so far can't see what the extension is doing.

Most of the time this is not a problem: typically you didn't write the extension and aren't going to modify it, so there is no need to profile what goes on inside it. The only thing relevant to you is how long your Python code that calls the extension runs for.

If, however, you are developing your own compiled extension, then it can be helpful to see both the Python and the compiled code in the profiling traces.

py-spy

If we run py-spy record on this script we see that all the time is spent in line 3 of sim.py, but since this calls a compiled extension, we can't dig any deeper:

py-spy --native

If we add the --native flag (and compile our extension with debugging symbols) then we can also see profiling data for the compiled functions:

py-spy --native -f chrometrace

With -f chrometrace we generate a trace that we can view in ui.perfetto.dev:

py-spy record -s -n -o trace.json -f chrometrace -- python sim.py

Memory use of native extensions

If we use memray as we did before with this script, we see that model.simulate() allocates a bunch of memory, but there is no more information since all these allocations occur within the compiled module:

Memory use of native extensions

If we call memray with the --native option we also see compiled code memory info:

Profiling GPU code

torch.profiler: CPU

To use the pytorch profiler, we need to modify our code and wrap the bit we want to profile with a profile context manager (in the same way we did with cProfile). For example, to profile the CPU activity:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU]) as prof:
    ...  # do some work here

print(prof.key_averages().table(sort_by="cpu_time_total"))

torch.profiler: CPU output

The output is a table of functions with their self and total CPU times.

Here "self" corresponds to tottime in cProfile, and "total" corresponds to cumtime.

torch.profiler: tensor shapes

It can also be helpful to see the shapes of the tensor inputs to these functions, which can be done by setting record_shapes=True:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    ...  # do some work here

print(prof.key_averages(group_by_input_shape=True).table())

torch.profiler: tensor shapes output

torch.profiler: GPU

If we are using a GPU then we can add CUDA (for NVIDIA GPUs) or XPU (for Intel GPUs) to the activities we want to profile:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    ...  # do some work here

print(prof.key_averages().table(sort_by="cuda_time_total"))

torch.profiler: GPU output

torch.profiler: memory

To also see the memory used by functions, set profile_memory=True:

from torch.profiler import profile

with profile(profile_memory=True) as prof:
    ...  # do some work here

print(prof.key_averages().table(sort_by="cuda_memory_usage"))

torch.profiler: memory output

torch.profiler: Traces

The torch profiler can also output traces that can be viewed in ui.perfetto.dev:

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
) as prof:
    ...  # do some work here

prof.export_chrome_trace("trace.json")

torch.profiler: Traces output

torch.profiler: schedule

Often you don't want to profile everything - for example in a long training run you may just want to profile a couple of training steps. You can use profiler.schedule for this:

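import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA]


def trace_handler(p):
    # Called when a profiling window finishes; writes one trace file per window
    p.export_chrome_trace(f"/tmp/trace_{p.step_num}.json")


with profile(
    activities=activities,
    schedule=torch.profiler.schedule(wait=5, warmup=1, active=1, repeat=3),
    on_trace_ready=trace_handler,
) as p:
    for idx in range(100):
        # ...model training step...
        p.step()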
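
To get a feeling for what wait, warmup, active and repeat mean, here is a simplified pure-Python model of the schedule. It reflects my reading of the documented behaviour and is not the library's code. It ignores other options such as skip_first, and the function name schedule_phase is made up for this sketch.

```python
def schedule_phase(step, wait, warmup, active, repeat):
    """Simplified model of which phase a given (0-based) step falls in."""
    cycle = wait + warmup + active
    if repeat > 0 and step >= repeat * cycle:
        return "idle"  # all requested cycles are finished
    position = step % cycle
    if position < wait:
        return "wait"
    if position < wait + warmup:
        return "warmup"
    return "active"


phases = [schedule_phase(s, wait=5, warmup=1, active=1, repeat=3) for s in range(100)]
active_steps = [s for s, phase in enumerate(phases) if phase == "active"]
print(active_steps)
```

In this model only steps 6, 13 and 20 out of the 100 are recorded, so the trace files stay small even for a long training run.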

HolisticTraceAnalysis

If you are training models using many GPU nodes, then profiling (and understanding the profiling data) becomes more challenging.

Holistic Trace Analysis (HTA) is a library that may be useful for this:

hta.readthedocs.io

They also provide example Jupyter Notebooks:

github.com/facebookresearch/HolisticTraceAnalysis/tree/main/examples

(I haven't used it though)

JAX

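If you are using JAX, it also has a profiler that can write traces that you can then view in ui.perfetto.dev:

import jax

f = jax.jit(lambda x: x @ x)
x = jax.random.normal(jax.random.key(0), (4096, 4096))
f(x).block_until_ready()  # warm-up call so compilation is not part of the trace

# The argument is a log directory; create_perfetto_trace=True additionally
# writes a perfetto_trace.json.gz file inside it for ui.perfetto.dev
with jax.profiler.trace("/tmp/jax_trace", create_perfetto_trace=True):
    y = f(x)
    y.block_until_ready()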

Hands on profiling

Hands on profiling

If you haven't already done so, clone the repo and install the dependencies:

git clone https://github.com/ssciwr/python-performance-profiling

cd python-performance-profiling

pip install -r requirements.txt

Hands on: example1

Example 1: pipeline.py

  • The script takes a few seconds to run:
    • time python pipeline.py
  • There are some tests included:
    • python -m pytest
  • Profile with cProfile:
    • inspect console output
      • python -m cProfile
    • snakeviz
    • call graph output
  • What is the current bottleneck?

Example 1: pipeline_opt.py

Main bottleneck: count_similar_record_pairs

  • More than 90% of the execution time is spent in this function
  • O(n^2) algorithm compares means of all pairs of records
  • Can replace with O(n log n) if we first sort the means

  • Run tests to ensure our implementation is correct
  • Re-run profiling - is that already fast enough? If so no need to profile further!
    • Otherwise look for the next bottleneck
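
The idea behind the O(n log n) replacement can be sketched as below. The actual pipeline function is not reproduced here: the sketch assumes that "similar" means two means differ by at most a tolerance tol, and both function names are hypothetical.

```python
import random
from bisect import bisect_right


def count_close_pairs_quadratic(values, tol):
    # Compare every pair: O(n^2) comparisons.
    count = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if abs(values[i] - values[j]) <= tol:
                count += 1
    return count


def count_close_pairs_sorted(values, tol):
    # Sort once, then binary-search the end of each value's window: O(n log n).
    ordered = sorted(values)
    count = 0
    for i, v in enumerate(ordered):
        count += bisect_right(ordered, v + tol, lo=i + 1) - (i + 1)
    return count


rng = random.Random(0)
# Integers keep the two versions exactly comparable (no rounding at the boundary).
means = [rng.randrange(10_000) for _ in range(500)]
print(count_close_pairs_quadratic(means, 25) == count_close_pairs_sorted(means, 25))
```

Keeping the slow version around as a reference implementation makes it easy to test that the fast version gives the same answer.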

Example 1: pipeline_opt2.py

  • Python script loading and module import now non-negligible
    • Consider increasing n so that the main pipeline dominates the runtime
  • The bottleneck is now compute_statistics
    • In particular the normalize_name function
  • Use line_profiler
    • no obvious bottleneck here - instead "death by a thousand cuts"
  • Replace function with call to specialized library like PyICU or Unidecode
    • Compiled specialized libraries - typically significantly faster than pure Python implementation
  • Re-run profiling - is that already fast enough? If so no need to profile further!
    • Otherwise look for the next bottleneck

Example 1: pipeline_opt3.py

  • Variance and mean calculations now the bottleneck
    • Consider increasing n so that the main pipeline dominates the runtime
  • We calculate the mean twice, once for the mean, once for the variance
    • This is wasted computation
  • Combine in one function that calculates both together
    • Same result with less work
  • Could/should consider using a library e.g. numpy for this
    • But would need to refactor to use numpy arrays everywhere to avoid conversion costs
  • Re-run profiling - is that already fast enough? If so no need to profile further!
    • Otherwise look for the next bottleneck
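
Combining the two calculations might look like the sketch below. It assumes the population variance is wanted (dividing by n). The function name mean_and_variance and the data are made up for illustration.

```python
import math
import statistics


def mean_and_variance(values):
    """Return (mean, population variance), computing the mean only once."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance = mean_and_variance(data)
print(mean, variance)
# Cross-check against the standard library implementation.
print(math.isclose(variance, statistics.pvariance(data)))
```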

Example 1: possible further steps

With each new round of profiling, the first question to ask is if you are done.

The gains typically go down after the first few rounds, while the costs (e.g. difficulty of implementation, testing and code maintenance) typically increase.

You also need to take care that you are not "over-fitting" to your test case, but that the improvements translate to real use of the code with real data.

Further steps here could include

  • Using numpy types instead of Python lists
  • Using numba to just-in-time compile numerical code
  • Porting this script to a faster language like Rust or C++

Hands on: example2

Example 2

More pure Python code: reads a text file, modifies it, writes it to an output file.

  • Generate some test data
    • bash generate_data.sh
  • Run the code
    • python fix_headers.py test.fastq out.fastq
  • Profile the code
    • Compare cProfile, pyinstrument and py-spy
    • Use line_profiler
    • Use memray, see how the memory use changes with the data size

Hands on: example3

Example 3

Python multiprocessing code that does some calculations

  • Run the code using 4 workers
    • python script.py --workers=4
  • Profile the code
    • Use py-spy to generate a flame graph
    • Use viztracer to generate a trace and view using ui.perfetto.dev
    • Use the --ignore_c_function option for viztracer to reduce trace size
  • Let's pretend the work_unit function is already optimised
  • Investigate what happens as we change
    • --size
    • --tasks
    • --workers
    • --chunksize

Example 3 multiprocessing scenarios

  • Good parallel performance
    • --workers 4 --tasks 32 --size 300000 --chunksize 1
  • Overhead dominates (tasks too small - actually gets slower with more workers)
    • --workers 4 --tasks 50000 --size 10 --chunksize 1
  • Waiting for one worker to finish (too few / too different tasks)
    • --workers 4 --tasks 10 --size 500000 --chunksize 1
  • Balance vs overhead tradeoff (smaller chunks = better balance but more overhead)
    • --workers 4 --tasks 120000 --size 500 --chunksize 1
    • --workers 4 --tasks 120000 --size 500 --chunksize 250
    • --workers 4 --tasks 120000 --size 500 --chunksize 12000
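
Some quick arithmetic helps to interpret the last three scenarios. The sketch below only counts how many chunks are handed out for each chunksize; it does not run any multiprocessing code, and the helper name chunk_summary is hypothetical.

```python
import math


def chunk_summary(tasks, chunksize, workers):
    """Number of chunks handed out, and average chunks per worker."""
    chunks = math.ceil(tasks / chunksize)
    return chunks, chunks / workers


for chunksize in (1, 250, 12000):
    chunks, per_worker = chunk_summary(tasks=120_000, chunksize=chunksize, workers=4)
    print(f"chunksize={chunksize:5d}: {chunks:6d} chunks, {per_worker:8.1f} per worker")
```

With chunksize 1 there are 120000 separate hand-offs, each with its own communication overhead. With chunksize 12000 there are only 10 chunks, which cannot be split evenly between 4 workers, so some workers will sit idle at the end.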

Hands on: example4

Example 4

PyTorch code that does some calculations

  • Install PyTorch: pytorch.org/get-started/locally
  • Run the code
    • python calc.py --workers=4
  • Inspect the trace using ui.perfetto.dev
  • Compare with calc_batched.py
  • If you don't have a GPU, use the included example traces
  • If you do have a GPU, try different options for the profiler
    • Scheduler
    • Label sections

Summary

Summary

In this course we covered:

  • What profiling is and why it is useful
  • Different ways to profile your code
  • Profiling CPU use of Python code
  • Profiling CPU use of compiled extensions
  • Profiling GPU use of PyTorch / JAX code
  • Profiling memory use
  • Hands on profiling examples

Next steps

  • Profile your code!
    • Pick a profiler and try it, e.g. pyinstrument -r html script.py
  • Try out a few different profilers
    • Compare your results with e.g. cProfile or py-spy
  • Get familiar with ui.perfetto.dev
    • Many profilers (not just Python ones) can export traces to be displayed here
  • SSC consultations