Lunch Time Python¶

Lunch 6: numba¶

numba is a just-in-time (JIT) compiler for Python. With a few simple annotations, array-oriented and math-heavy Python code can be JIT-compiled to performance similar to that of C, C++ and Fortran, without having to switch languages or Python interpreters.


Lunch Time Python, Scientific Software Center, Heidelberg University

Motivation¶

  • Many reasons to use Python, but performance is not one of them
  • What to do when a Python function is too slow?
  • Ideally, find a library (e.g. numpy) with an equivalent function
  • Otherwise:
    • use PyPy instead of CPython (if all your libraries are available)
    • write a Fortran function and compile with f2py or fortranmagic
    • write a Cython function (Python with C types) and compile with Cython
    • write a C++ function and compile using pybind11 or ipybind
    • magically make your slow Python function faster (numba)

numba installation¶

  • Conda: conda install numba
  • Pip: python -m pip install numba
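
A quick sanity check after installing (the printed version will vary):

import numba

print(numba.__version__)  # e.g. 0.57.1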

Vector reduction example¶

Toy example: implement a vector reduction operation:

$r(x, y) = \sum_i \cos(x_i) \sin(y_i)$

Some random vectors to benchmark our functions:

In [1]:
import numpy as np

x = np.random.uniform(low=-1, high=1, size=5000000)
y = np.random.uniform(low=-1, high=1, size=5000000)

Python¶

In [2]:
import math


def r_python(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [3]:
r_python(x, y)
Out[3]:
-1446.1487984663088
In [4]:
%timeit r_python(x,y)
1.85 s ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy¶

In [5]:
def r_numpy(x_vec, y_vec):
    return np.dot(np.cos(x_vec), np.sin(y_vec))
In [6]:
r_numpy(x, y)
Out[6]:
-1446.1487984664032
In [7]:
%timeit r_numpy(x,y)
36.8 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Cython¶

In [8]:
# pip install cython
%load_ext cython
In [9]:
%%cython

import math

def r_cython(x_vec, y_vec):
    s = 0
    for x,y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [10]:
r_cython(x, y)
Out[10]:
-1446.1487984663088
In [11]:
%timeit r_cython(x,y)
1.28 s ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]:
%%cython

import math
# use C math functions
from libc.math cimport sin, cos

# use C types instead of Python types
def r_cython(double[:] x_vec, double[:] y_vec):
    cdef double s = 0
    cdef int i
    for i in range(len(x_vec)):
        s += cos(x_vec[i])*sin(y_vec[i])
    return s
In [13]:
r_cython(x, y)
Out[13]:
-1446.1487984663088
In [14]:
%timeit r_cython(x,y)
113 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fortran¶

In [15]:
if "google.colab" in str(get_ipython()):
    !pip install fortran-magic -qqq
%load_ext fortranmagic
In [16]:
%%fortran

subroutine r_fortran(x_vec, y_vec, res)
    real, intent(in) :: x_vec(:), y_vec(:)
    real, intent(out) :: res
    integer :: i, n
    n = size(x_vec)
    res = 0
    do i=1,n
        res = res + cos(x_vec(i))*sin(y_vec(i))
    enddo
endsubroutine r_fortran
In [17]:
r_fortran(x, y)
Out[17]:
-1446.0894775390625
In [18]:
%timeit r_fortran(x,y)
80.8 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that the result differs from the other versions in the later digits: this subroutine uses single-precision real rather than double precision.

C++ / pybind11¶

In [19]:
if "google.colab" in str(get_ipython()):
    !pip install git+https://github.com/aldanor/ipybind.git -qqq
%load_ext ipybind
In [20]:
%%pybind11

#include <pybind11/numpy.h>
#include <math.h>
PYBIND11_PLUGIN(example) {
    py::module m("example");
    m.def("r_pybind", [](const py::array_t<double>& x, const py::array_t<double>& y) {
        double sum{0};
        auto rx{x.unchecked<1>()};
        auto ry{y.unchecked<1>()};
        for (py::ssize_t i = 0; i < rx.shape(0); i++){
            sum += std::cos(rx[i])*std::sin(ry[i]);
        }
        return sum;
    });
    return m.ptr();
}
In [21]:
r_pybind(x, y)
Out[21]:
-1446.1487984663088
In [22]:
%timeit r_pybind(x, y)
106 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

numba¶

In [23]:
from numba import jit


@jit
def r_numba(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
/tmp/ipykernel_2159/1652357933.py:5: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def r_numba(x_vec, y_vec):
In [24]:
r_numba(x, y)
Out[24]:
-1446.1487984663088
In [25]:
# pure python with numba JIT
%timeit r_numba(x,y)
111 ms ± 62.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numba compilation¶

Two compilation modes

  • nopython mode (default)
    • Fast because it doesn't access the Python C API
    • Needs to be able to infer the native (C) types of all values
  • object mode (fallback)
    • Slow because it uses Python objects and the Python C API
    • Only used if nopython mode is not possible
    • To raise an error instead of falling back to this, set nopython=True or use @njit (see the sketch below)
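
A minimal sketch of the difference, using a plain Python dict as an example of a type numba cannot infer (the exact error message varies between numba versions):

import numpy as np
from numba import njit


@njit
def first_item(seq):
    # fine for types numba can infer, e.g. NumPy arrays
    return seq[0]


first_item(np.array([1.0, 2.0]))  # compiles in nopython mode
# first_item({"a": 1.0})  # would raise a TypingError instead of falling back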

Numba function signatures¶

You can optionally explicitly specify the function signature. Use cases:

  • you want the function to be compiled when it is defined rather than when it is first called
  • you need fine-grained control over types (e.g. if you want 32-bit floats)
In [26]:
from numba import float32


@jit(float32(float32, float32))
def sum(a, b):
    return a + b
/tmp/ipykernel_2159/4188174656.py:4: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @jit(float32(float32, float32))
In [27]:
sum(1, 0.99999999)
Out[27]:
2.0

(The arguments are cast to 32-bit floats, so 0.99999999 rounds to 1.0 and the sum is exactly 2.0.)

Numba options¶

  • nopython=True disable object mode fallback
  • nogil=True release the Python Global Interpreter Lock (GIL)
  • cache=True cache the compiled functions on disk
  • parallel=True enable automatic parallelization
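
These options can be combined. A sketch reusing the reduction from above (r_options is a hypothetical name):

import math

from numba import njit


# nopython mode (via njit), GIL released, compiled code cached on disk
@njit(nogil=True, cache=True)
def r_options(x_vec, y_vec):
    s = 0.0
    for i in range(len(x_vec)):
        s += math.cos(x_vec[i]) * math.sin(y_vec[i])
    return s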

Parallelization¶

  • set parallel=True option to enable
  • use prange to explicitly parallelize a loop over a range
In [28]:
from numba import jit, prange


@jit(parallel=True)
def r_numba(x_vec, y_vec):
    s = 0
    for i in prange(len(x_vec)):
        s += math.cos(x_vec[i]) * math.sin(y_vec[i])
    return s
/tmp/ipykernel_2159/1651332874.py:4: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @jit(parallel=True)
In [29]:
r_numba(x, y)
Out[29]:
-1446.1487984664693
In [30]:
%timeit r_numba(x,y)
53.4 ms ± 65.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

NumPy universal functions¶

  • a numpy ufunc applies a scalar operation elementwise to arrays
  • @numba.vectorize turns a scalar function into a compiled ufunc that can be used like the built-in numpy ufuncs
In [31]:
from numba import vectorize, float64


@vectorize([float64(float64, float64)], target="parallel")
def r(x, y):
    return np.cos(x) * np.sin(y)
In [32]:
r(2, 3)
Out[32]:
-0.05872664492762098
In [33]:
r(x, y)
Out[33]:
array([ 0.37173157,  0.01734083,  0.62410497, ..., -0.66204876,
        0.4349269 ,  0.32355773])
In [34]:
np.sum(r(x, y))
Out[34]:
-1446.1487984663831
In [35]:
%timeit np.sum(r(x,y))
88 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Advanced features¶

  • Ahead of Time (AoT) compilation
    • the compiled module only depends on NumPy
  • Flexible specializations
    • @generated_jit decorator for compile-time logic, e.g. type specializations
  • Stencil
    • @stencil decorator for creating a stencil to apply to an array (sketched after this list)
  • C callbacks
    • @cfunc decorator to generate a C-callback (e.g. to pass to scipy.integrate)
  • CUDA support
    • compile CUDA kernels to run on a GPU
  • see numba.readthedocs.io for more
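
As a small taste of the stencil feature, a sketch (local_mean and smooth are hypothetical names; out-of-bounds neighbours default to zero):

import numpy as np
from numba import njit, stencil


@stencil
def local_mean(a):
    # relative indexing: each output element is the mean of a 3-point window
    return (a[-1] + a[0] + a[1]) / 3


@njit
def smooth(a):
    return local_mean(a)


smooth(np.arange(6.0))  # border elements are 0 by default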