Lunch Time Python¶

Lunch 6: numba¶


numba is a just-in-time (JIT) compiler for Python. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time compiled to performance similar to that of C, C++ and Fortran, without having to switch languages or Python interpreters.


Lunch Time Python, Scientific Software Center, Heidelberg University

Motivation¶

  • There are many reasons to use Python, but performance is not one of them
  • What to do when a Python function is too slow?
  • Ideally, find a library (e.g. numpy) with an equivalent function
  • Otherwise:
    • use PyPy instead of CPython (if all your libraries are available)
    • write a Fortran function and compile with f2py or fortranmagic
    • write a C function and compile with Cython
    • write a C++ function and compile using pybind11 or ipybind
    • magically make your slow Python function faster (numba)

numba installation¶

  • Conda: conda install numba
  • Pip: python -m pip install numba

Vector reduction example¶

Toy example: implement a vector reduction operation:

$r(x, y) = \sum_i \cos(x_i) \sin(y_i)$

Some random vectors to benchmark our functions:

In [1]:
import numpy as np

x = np.random.uniform(low=-1, high=1, size=5000000)
y = np.random.uniform(low=-1, high=1, size=5000000)

Python¶

In [2]:
import math


def r_python(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [3]:
r_python(x, y)
Out[3]:
-2023.021642550174
In [4]:
%timeit r_python(x,y)
824 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy¶

In [5]:
def r_numpy(x_vec, y_vec):
    return np.dot(np.cos(x_vec), np.sin(y_vec))
In [6]:
r_numpy(x, y)
Out[6]:
-2023.021642549968
In [7]:
%timeit r_numpy(x,y)
134 ms ± 310 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Cython¶

In [8]:
# pip install cython
%load_ext cython
In [9]:
%%cython

import math

def r_cython(x_vec, y_vec):
    s = 0
    for x,y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [10]:
r_cython(x, y)
Out[10]:
-2023.021642550174
In [11]:
%timeit r_cython(x,y)
1.01 s ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]:
%%cython

import math
# use C math functions
from libc.math cimport sin, cos

# use C types instead of Python types
def r_cython(double[:] x_vec, double[:] y_vec):
    cdef double s = 0
    cdef int i
    for i in range(len(x_vec)):
        s += cos(x_vec[i])*sin(y_vec[i])
    return s
In [13]:
r_cython(x, y)
Out[13]:
-2023.021642550174
In [14]:
%timeit r_cython(x,y)
100 ms ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fortran¶

In [15]:
if "google.colab" in str(get_ipython()):
    !pip install fortran-magic -qqq
%load_ext fortranmagic
In [16]:
%%fortran

subroutine r_fortran(x_vec, y_vec, res)
    ! note: default `real` is single precision, hence the slightly less accurate result below
    real, intent(in) :: x_vec(:), y_vec(:)
    real, intent(out) :: res
    integer :: i, n
    n = size(x_vec)
    res = 0
    do i=1,n
        res = res + cos(x_vec(i))*sin(y_vec(i))
    enddo
endsubroutine r_fortran
In [17]:
r_fortran(x, y)
Out[17]:
-2023.0682373046875
In [18]:
%timeit r_fortran(x,y)
28.2 ms ± 196 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

C++ / pybind11¶

In [19]:
if "google.colab" in str(get_ipython()):
    !pip install git+https://github.com/aldanor/ipybind.git -qqq
%load_ext ipybind
In [20]:
%%pybind11

#include <pybind11/numpy.h>
#include <math.h>
PYBIND11_PLUGIN(example) {
    py::module m("example");
    m.def("r_pybind", [](const py::array_t<double>& x, const py::array_t<double>& y) {
        double sum{0};
        auto rx{x.unchecked<1>()};
        auto ry{y.unchecked<1>()};
        for (py::ssize_t i = 0; i < rx.shape(0); i++){
            sum += std::cos(rx[i])*std::sin(ry[i]);
        }
        return sum;
    });
    return m.ptr();
}
In [21]:
r_pybind(x, y)
Out[21]:
-2023.021642550174
In [22]:
%timeit r_pybind(x, y)
98 ms ± 827 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

numba¶

In [23]:
from numba import jit


@jit
def r_numba(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [24]:
r_numba(x, y)
Out[24]:
-2023.021642550174
In [25]:
# pure python with numba JIT
%timeit r_numba(x,y)
101 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numba compilation¶

Two compilation modes

  • nopython mode (default)
    • Fast because it doesn't access the Python C API
    • Needs to be able to infer the native (C) types of all values
  • object mode (fallback)
    • Slow because it uses Python objects and the Python C API
    • Only used if nopython mode is not possible
    • To raise an error instead of falling back to this, set nopython=True or use @njit (a minimal sketch follows this list)
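
A minimal sketch (not from the original notebook) of the difference: @njit is shorthand for @jit(nopython=True), so if numba cannot infer native types for every value it raises a TypingError at compile time instead of silently falling back to object mode.

import numpy as np
from numba import njit


@njit
def typed_sum(values):
    # every value here has an inferable native (C) type, so this compiles in nopython mode
    s = 0.0
    for v in values:
        s += v
    return s


typed_sum(np.arange(10.0))  # fine: a float64 array

# typed_sum(["a", "b"])  # with @njit this would raise a TypingError
#                        # instead of running slowly in object mode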

Numba function signatures¶

You can optionally explicitly specify the function signature. Use cases:

  • you want the function to be compiled when it is defined rather than when it is first called
  • you need fine-grained control over types (e.g. if you want 32-bit floats)
In [26]:
from numba import float32


@jit(float32(float32, float32))
def sum(a, b):
    return a + b
In [27]:
sum(1, 0.99999999)  # with 32-bit floats, 0.99999999 rounds to 1.0
Out[27]:
2.0

Numba options¶

  • nopython=True: disable the object mode fallback
  • nogil=True: release the Python Global Interpreter Lock (GIL)
  • cache=True: cache the compiled functions on disk
  • parallel=True: enable automatic parallelization
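
Combining several of these options, a minimal sketch (not from the original notebook; the function name dot and the chosen flags are illustrative):

import numpy as np
from numba import jit


@jit(nopython=True, nogil=True, cache=True)
def dot(a, b):
    # compiled in nopython mode, the GIL is released while it runs,
    # and the compiled machine code is cached on disk between sessions
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s


dot(np.arange(3.0), np.arange(3.0))  # 0*0 + 1*1 + 2*2 = 5.0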

Parallelization¶

  • set parallel=True option to enable
  • use prange to explicitly parallelize a loop over a range
In [28]:
from numba import jit, prange


@jit(parallel=True)
def r_numba(x_vec, y_vec):
    s = 0
    for i in prange(len(x_vec)):
        s += math.cos(x_vec[i]) * math.sin(y_vec[i])
    return s
In [29]:
r_numba(x, y)
Out[29]:
-2023.0216425498936
In [30]:
%timeit r_numba(x,y)
34 ms ± 95 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

NumPy universal functions¶

  • a NumPy ufunc is defined in terms of scalar operations but is applied element-wise over arrays
  • you can create one with @numba.vectorize and use it like the built-in NumPy ufuncs
In [31]:
from numba import vectorize, float64


@vectorize([float64(float64, float64)], target="parallel")
def r(x, y):
    return np.cos(x) * np.sin(y)
In [32]:
r(2, 3)
Out[32]:
-0.05872664492762098
In [33]:
r(x, y)
Out[33]:
array([-0.36093561, -0.67954435,  0.63029642, ...,  0.56882316,
        0.80986332,  0.35954116])
In [34]:
np.sum(r(x, y))
Out[34]:
-2023.021642549969
In [35]:
%timeit np.sum(r(x,y))
56.4 ms ± 164 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Advanced features¶

  • Ahead of Time (AoT) compilation
    • the compiled module only depends on NumPy
  • Flexible specializations
    • @generated_jit decorator for compile-time logic, e.g. type specializations
  • Stencil
    • @stencil decorator for creating a stencil to apply to an array (sketched after this list)
  • C callbacks
    • @cfunc decorator to generate a C-callback (e.g. to pass to scipy.integrate)
  • CUDA support
    • compile CUDA kernels to run on a GPU
  • see numba.readthedocs.io for more
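
To give a flavour of one of these, a minimal @stencil sketch (not from the original notebook; the kernel name local_mean is made up for this example): the kernel is written with relative indices and numba applies it at every point of the array.

import numpy as np
from numba import njit, stencil


@stencil
def local_mean(a):
    # average each element with its two neighbours, using relative indexing
    return (a[-1] + a[0] + a[1]) / 3


@njit
def smooth(a):
    # stencil kernels can be called from jit-compiled functions
    return local_mean(a)


smooth(np.arange(10.0))  # boundary elements are set to 0 by default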