Lunch Time Python¶
Lunch 6: numba¶
numba is a just-in-time (JIT) compiler for Python. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time compiled to performance similar to that of C, C++ and Fortran, without having to switch languages or Python interpreters.
Lunch Time Python, Scientific Software Center, Heidelberg University
Motivation¶
- There are many reasons to use Python, but performance is not one of them
- What to do when a Python function is too slow?
- Ideally, find a library (e.g. numpy) with an equivalent function
- Otherwise:
  - use PyPy instead of CPython (if all your libraries are available)
  - write a Fortran function and compile it with f2py or fortranmagic
  - write a C function and compile it with Cython
  - write a C++ function and compile it using pybind11 or ipybind
  - magically make your slow Python function faster (numba)
numba installation¶
- Conda: conda install numba
- Pip: python -m pip install numba
Vector reduction example¶
Toy example: implement a vector reduction operation:
$r(x, y) = \sum_i \cos(x_i) \sin(y_i)$
Some random vectors to benchmark our functions:
In [1]:
import numpy as np
x = np.random.uniform(low=-1, high=1, size=5000000)
y = np.random.uniform(low=-1, high=1, size=5000000)
Python¶
In [2]:
import math
def r_python(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [3]:
r_python(x, y)
Out[3]:
604.1295065127233
In [4]:
%timeit r_python(x,y)
891 ms ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy¶
In [5]:
def r_numpy(x_vec, y_vec):
    return np.dot(np.cos(x_vec), np.sin(y_vec))
In [6]:
r_numpy(x, y)
Out[6]:
604.1295065127337
In [7]:
%timeit r_numpy(x,y)
134 ms ± 238 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Cython¶
In [8]:
# pip install cython
%load_ext cython
In [9]:
%%cython
import math
def r_cython(x_vec, y_vec):
    s = 0
    for x,y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [10]:
r_cython(x, y)
Out[10]:
604.1295065127233
In [11]:
%timeit r_cython(x,y)
943 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]:
%%cython
import math
# use C math functions
from libc.math cimport sin, cos
# use C types instead of Python types
def r_cython(double[:] x_vec, double[:] y_vec):
    cdef double s = 0
    cdef int i
    for i in range(len(x_vec)):
        s += cos(x_vec[i])*sin(y_vec[i])
    return s
In [13]:
r_cython(x, y)
Out[13]:
604.1295065127233
In [14]:
%timeit r_cython(x,y)
100 ms ± 68.4 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Fortran¶
In [15]:
if "google.colab" in str(get_ipython()):
    !pip install fortran-magic -qqq
%load_ext fortranmagic
In [16]:
%%fortran
subroutine r_fortran(x_vec, y_vec, res)
    ! note: single-precision reals, so the result differs
    ! slightly from the double-precision implementations
    real, intent(in) :: x_vec(:), y_vec(:)
    real, intent(out) :: res
    integer :: i, n
    n = size(x_vec)
    res = 0
    do i = 1, n
        res = res + cos(x_vec(i))*sin(y_vec(i))
    enddo
end subroutine r_fortran
In [17]:
r_fortran(x, y)
Out[17]:
604.1292724609375
In [18]:
%timeit r_fortran(x,y)
27.4 ms ± 375 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
C++ / pybind11¶
In [19]:
if "google.colab" in str(get_ipython()):
    !pip install git+https://github.com/aldanor/ipybind.git -qqq
%load_ext ipybind
In [20]:
%%pybind11
#include <pybind11/numpy.h>
#include <math.h>
// PYBIND11_PLUGIN is the older module macro (modern pybind11 uses PYBIND11_MODULE)
PYBIND11_PLUGIN(example) {
    py::module m("example");
    m.def("r_pybind", [](const py::array_t<double>& x, const py::array_t<double>& y) {
        double sum{0};
        auto rx{x.unchecked<1>()};
        auto ry{y.unchecked<1>()};
        for (py::ssize_t i = 0; i < rx.shape(0); i++){
            sum += std::cos(rx[i])*std::sin(ry[i]);
        }
        return sum;
    });
    return m.ptr();
}
In [21]:
r_pybind(x, y)
Out[21]:
604.1295065127233
In [22]:
%timeit r_pybind(x, y)
97.8 ms ± 187 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
numba¶
In [23]:
from numba import jit
@jit
def r_numba(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [24]:
r_numba(x, y)
Out[24]:
604.1295065127233
In [25]:
# pure python with numba JIT
%timeit r_numba(x,y)
100 ms ± 61.2 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba compilation¶
Two compilation modes:
- nopython mode (default)
  - Fast because it doesn't access the Python C API
  - Needs to be able to infer the native (C) types of all values
- object mode (fallback)
  - Slow because it uses Python objects and the Python C API
  - Only used if nopython mode is not possible
  - To raise an error instead of falling back to this, set nopython=True or use @njit (see the sketch below)
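A minimal sketch of this behaviour (the Point class is just an illustration): @njit raises a TypingError when it cannot infer native types, instead of silently falling back to object mode:
from numba import njit

class Point:
    def __init__(self, x):
        self.x = x

@njit  # shorthand for @jit(nopython=True)
def get_x(p):
    return p.x  # an arbitrary Python object cannot be typed

try:
    get_x(Point(1.0))  # compilation happens on the first call
except Exception as err:
    print(type(err).__name__)  # TypingError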
Numba function signatures¶
You can optionally specify the function signature explicitly. Use cases:
- you want the function to be compiled when it is defined rather than when it is first called
- you need fine-grained control over types (e.g. if you want 32-bit floats)
In [26]:
from numba import float32
@jit(float32(float32, float32))
def add(a, b):
    return a + b
In [27]:
add(1, 0.99999999)  # arguments are cast to float32, so 0.99999999 rounds to 1.0
Out[27]:
2.0
Numba options¶
- nopython=True: disable object mode fallback
- nogil=True: release the Python Global Interpreter Lock (GIL)
- cache=True: cache the compiled functions on disk
- parallel=True: enable automatic parallelization
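These options can be combined; a minimal sketch (the function itself is only illustrative):
import math
from numba import jit

# nopython=True: raise instead of falling back to object mode
# nogil=True: release the GIL so other threads can run concurrently
# cache=True: keep the compiled machine code on disk between sessions
@jit(nopython=True, nogil=True, cache=True)
def cos_sum(v):
    s = 0.0
    for i in range(len(v)):
        s += math.cos(v[i])
    return s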
Parallelization¶
- set the parallel=True option to enable it
- use prange to explicitly parallelize a loop over a range
In [28]:
from numba import jit, prange
@jit(parallel=True)
def r_numba(x_vec, y_vec):
    s = 0
    for i in prange(len(x_vec)):
        s += math.cos(x_vec[i]) * math.sin(y_vec[i])
    return s
In [29]:
r_numba(x, y)
Out[29]:
604.1295065127323
In [30]:
%timeit r_numba(x,y)
34.1 ms ± 218 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
NumPy universal functions¶
- a numpy ufunc is a function that operates on scalars
- you can create one using @numba.vectorize and use it like the built-in numpy ufuncs
In [31]:
from numba import vectorize, float64
@vectorize([float64(float64, float64)], target="parallel")
def r(x, y):
    return np.cos(x) * np.sin(y)
In [32]:
r(2, 3)
Out[32]:
-0.05872664492762098
In [33]:
r(x, y)
Out[33]:
array([-0.40239533,  0.34221466, -0.26402782, ..., -0.54781304,
        0.09943108,  0.0503303 ])
In [34]:
np.sum(r(x, y))
Out[34]:
604.1295065127357
In [35]:
%timeit np.sum(r(x,y))
56.5 ms ± 138 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Advanced features¶
- Ahead of Time (AoT) compilation
  - the compiled module only depends on NumPy
- Flexible specializations
  - @generated_jit decorator for compile-time logic, e.g. type specializations
- Stencil
  - @stencil decorator for creating a stencil to apply to an array
- C callbacks
  - @cfunc decorator to generate a C callback (e.g. to pass to scipy.integrate), as sketched below
- CUDA support
  - compile CUDA kernels to run on a GPU
- see numba.readthedocs.io for more
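A minimal sketch of the C-callback feature (assuming scipy is installed; the integrand is just an illustration): a @cfunc can be passed to scipy.integrate.quad via scipy.LowLevelCallable:
from numba import cfunc
from scipy import LowLevelCallable, integrate

# compile f(x) = x^2 to a C callback with the double(double) signature quad expects
@cfunc("float64(float64)")
def integrand(x):
    return x * x

result, abserr = integrate.quad(LowLevelCallable(integrand.ctypes), 0.0, 1.0)
print(result)  # ~1/3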