Lunch Time Python¶
Lunch 6: numba¶
numba is a just-in-time (JIT) compiler for Python. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time compiled to performance similar to that of C, C++, and Fortran, without having to switch languages or Python interpreters.
Lunch Time Python, Scientific Software Center, Heidelberg University
Motivation¶
- There are many reasons to use Python, but performance is not one of them
- What to do when a Python function is too slow?
  - Ideally, find a library (e.g. numpy) with an equivalent function
  - Otherwise:
    - use PyPy instead of CPython (if all your libraries are available)
    - write a Fortran function and compile with f2py or fortranmagic
    - write a C function and compile with Cython
    - write a C++ function and compile using pybind11 or ipybind
    - magically make your slow Python function faster (numba)
numba installation¶
- Conda: `conda install numba`
- Pip: `python -m pip install numba`
Vector reduction example¶
Toy example: implement a vector reduction operation:
$r(x, y) = \sum_i \cos(x_i) \sin(y_i)$
Some random vectors to benchmark our functions:
In [1]:
import numpy as np
x = np.random.uniform(low=-1, high=1, size=5000000)
y = np.random.uniform(low=-1, high=1, size=5000000)
Python¶
In [2]:
import math
def r_python(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [3]:
r_python(x, y)
Out[3]:
1693.484794895606
In [4]:
%timeit r_python(x,y)
752 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy¶
In [5]:
def r_numpy(x_vec, y_vec):
    return np.dot(np.cos(x_vec), np.sin(y_vec))
In [6]:
r_numpy(x, y)
Out[6]:
1693.4847948957686
In [7]:
%timeit r_numpy(x,y)
132 ms ± 360 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Cython¶
In [8]:
# pip install cython
%load_ext cython
In [9]:
%%cython
import math
import math

def r_cython(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [10]:
r_cython(x, y)
Out[10]:
1693.484794895606
In [11]:
%timeit r_cython(x,y)
890 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]:
%%cython
import math
# use C math functions
from libc.math cimport sin, cos
# use C types instead of Python types
def r_cython(double[:] x_vec, double[:] y_vec):
    cdef double s = 0
    cdef int i
    for i in range(len(x_vec)):
        s += cos(x_vec[i]) * sin(y_vec[i])
    return s
In [13]:
r_cython(x, y)
Out[13]:
1693.484794895606
In [14]:
%timeit r_cython(x,y)
102 ms ± 146 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Fortran¶
In [15]:
if "google.colab" in str(get_ipython()):
!pip install fortran-magic -qqq
%load_ext fortranmagic
In [16]:
%%fortran
subroutine r_fortran(x_vec, y_vec, res)
    real, intent(in) :: x_vec(:), y_vec(:)
    real, intent(out) :: res
    integer :: i, n
    n = size(x_vec)
    res = 0
    do i = 1, n
        res = res + cos(x_vec(i)) * sin(y_vec(i))
    enddo
end subroutine r_fortran
In [17]:
r_fortran(x, y)
Out[17]:
1693.455322265625
In [18]:
%timeit r_fortran(x,y)
61.5 ms ± 136 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
C++ / pybind11¶
In [19]:
if "google.colab" in str(get_ipython()):
!pip install git+https://github.com/aldanor/ipybind.git -qqq
%load_ext ipybind
In [20]:
%%pybind11
#include <pybind11/numpy.h>
#include <cmath>

PYBIND11_PLUGIN(example) {
    py::module m("example");
    m.def("r_pybind", [](const py::array_t<double>& x, const py::array_t<double>& y) {
        double sum{0};
        auto rx{x.unchecked<1>()};
        auto ry{y.unchecked<1>()};
        for (py::ssize_t i = 0; i < rx.shape(0); i++) {
            sum += std::cos(rx[i]) * std::sin(ry[i]);
        }
        return sum;
    });
    return m.ptr();
}
In [21]:
r_pybind(x, y)
Out[21]:
1693.484794895606
In [22]:
%timeit r_pybind(x, y)
98.6 ms ± 68.6 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
numba¶
In [23]:
from numba import jit
@jit
def r_numba(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [24]:
r_numba(x, y)
Out[24]:
1693.484794895606
In [25]:
# pure python with numba JIT
%timeit r_numba(x,y)
101 ms ± 117 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba compilation¶
Two compilation modes:

- `nopython` mode (default)
  - Fast because it doesn't access the Python C API
  - Needs to be able to infer the native (C) types of all values
- `object` mode (fallback)
  - Slow because it uses Python objects and the Python C API
  - Only used if `nopython` mode is not possible
  - To raise an error instead of falling back to this, set `nopython=True` or use `@njit` (see the sketch below)
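A minimal sketch of the difference, assuming a plain Python class `Box` (hypothetical, not from the talk) whose instances numba cannot infer a native type for:

from numba import njit
from numba.core.errors import TypingError

class Box:  # hypothetical plain Python class: no native (C) type to infer
    def __init__(self, value):
        self.value = value

@njit  # shorthand for @jit(nopython=True): no object mode fallback
def unwrap(b):
    return b.value

try:
    unwrap(Box(1.0))
except TypingError:
    print("nopython mode cannot compile unwrap() for a Box argument")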
Numba function signatures¶
You can optionally explicitly specify the function signature. Use cases:
- you want the function to be compiled when it is defined rather than when it is first called
- you need fine-grained control over types (e.g. if you want 32-bit floats)
In [26]:
from numba import float32
@jit(float32(float32, float32))
def sum(a, b):
    return a + b
In [27]:
sum(1, 0.99999999)
Out[27]:
2.0
(in 32-bit precision, 0.99999999 rounds to 1.0, hence the result 2.0)
Numba options¶
- `nopython=True`: disable object mode fallback
- `nogil=True`: release the Python Global Interpreter Lock (GIL)
- `cache=True`: cache the compiled functions on disk
- `parallel=True`: enable automatic parallelization
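These options can be combined; a minimal sketch, with `dot` as a hypothetical example function (not from the talk):

from numba import jit

@jit(nopython=True, nogil=True, cache=True)
def dot(a, b):
    # compiled in nopython mode, runs with the GIL released,
    # and the compiled machine code is cached on disk between sessions
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s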
Parallelization¶
- set the `parallel=True` option to enable automatic parallelization
- use `prange` to explicitly parallelize a loop over a range
In [28]:
from numba import jit, prange
@jit(parallel=True)
def r_numba(x_vec, y_vec):
    s = 0
    for i in prange(len(x_vec)):
        s += math.cos(x_vec[i]) * math.sin(y_vec[i])
    return s
In [29]:
r_numba(x, y)
Out[29]:
1693.4847948957886
In [30]:
%timeit r_numba(x,y)
34.2 ms ± 359 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
NumPy universal functions¶
- a numpy `ufunc` is a function that operates on scalars
- can create one using `@numba.vectorize` and use it like built-in numpy ufuncs
In [31]:
from numba import vectorize, float64
@vectorize([float64(float64, float64)], target="parallel")
def r(x, y):
    return np.cos(x) * np.sin(y)
In [32]:
r(2, 3)
Out[32]:
-0.05872664492762098
In [33]:
r(x, y)
Out[33]:
array([ 0.00958956, 0.01427726, -0.20416025, ..., 0.56578324, 0.55769682, -0.15520663])
In [34]:
np.sum(r(x, y))
Out[34]:
1693.4847948957656
In [35]:
%timeit np.sum(r(x,y))
55.3 ms ± 476 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Advanced features¶
- Ahead of Time (AoT) compilation
  - the compiled module only depends on NumPy
- Flexible specializations
  - `@generated_jit` decorator for compile-time logic, e.g. type specializations
- Stencil
  - `@stencil` decorator for creating a stencil to apply to an array
- C callbacks
  - `@cfunc` decorator to generate a C callback (e.g. to pass to scipy.integrate, as sketched below)
- CUDA support
  - compile CUDA kernels to run on a GPU
- see numba.readthedocs.io for more
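As an illustration of the C-callback feature, a minimal sketch following the pattern in the numba documentation (the integrand itself is a made-up example; scipy.integrate.quad accepts a ctypes function pointer in place of a Python callable):

import numpy as np
from numba import cfunc, float64
from scipy.integrate import quad

@cfunc(float64(float64))  # compile to a C function with signature double(double)
def integrand(t):
    return np.exp(-t) / t**2

# quad calls the compiled C function directly via its ctypes pointer,
# avoiding a Python-level function call per evaluation point
result, error = quad(integrand.ctypes, 1, np.inf)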