
Lunch Time Python¶

Lunch 6: numba¶


numba is a just-in-time (JIT) compiler for Python. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time compiled to reach performance similar to that of C, C++ and Fortran, without having to switch languages or Python interpreters.


Lunch Time Python, Scientific Software Center, Heidelberg University

Motivation¶

  • There are many reasons to use Python, but performance is not one of them
  • What to do when a Python function is too slow?
  • Ideally, find a library (e.g. numpy) with an equivalent function
  • Otherwise:
    • use PyPy instead of CPython (if all your libraries are available)
    • write a Fortran function and compile with f2py or fortranmagic
    • write a C function and compile with Cython
    • write a C++ function and compile using pybind11 or ipybind
    • magically make your slow Python function faster (numba)

numba installation¶

  • Conda: conda install numba
  • Pip: python -m pip install numba

Vector reduction example¶

Toy example: implement a vector reduction operation:

$r(x, y) = \sum_i \cos(x_i) \sin(y_i)$

Some random vectors to benchmark our functions:

In [1]:
import numpy as np

x = np.random.uniform(low=-1, high=1, size=5000000)
y = np.random.uniform(low=-1, high=1, size=5000000)

Python¶

In [2]:
import math


def r_python(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [3]:
r_python(x, y)
Out[3]:
-1006.4435935997055
In [4]:
%timeit r_python(x,y)
768 ms ± 4.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy¶

In [5]:
def r_numpy(x_vec, y_vec):
    return np.dot(np.cos(x_vec), np.sin(y_vec))
In [6]:
r_numpy(x, y)
Out[6]:
-1006.4435935997382
In [7]:
%timeit r_numpy(x,y)
132 ms ± 121 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Cython¶

In [8]:
# pip install cython
%load_ext cython
In [9]:
%%cython

import math

def r_cython(x_vec, y_vec):
    s = 0
    for x,y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [10]:
r_cython(x, y)
Out[10]:
-1006.4435935997055
In [11]:
%timeit r_cython(x,y)
905 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
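
Compiling the unmodified Python code with Cython gives essentially no speed-up (here it is even marginally slower): the loop still manipulates Python objects through the Python C API. Adding C types changes that:
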
In [12]:
%%cython

import math
# use C math functions
from libc.math cimport sin, cos

# use C types instead of Python types
def r_cython(double[:] x_vec, double[:] y_vec):
    cdef double s = 0
    cdef int i
    for i in range(len(x_vec)):
        s += cos(x_vec[i])*sin(y_vec[i])
    return s
In [13]:
r_cython(x, y)
Out[13]:
-1006.4435935997055
In [14]:
%timeit r_cython(x,y)
100 ms ± 80 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fortran¶

In [15]:
if "google.colab" in str(get_ipython()):
    !pip install fortran-magic -qqq
%load_ext fortranmagic
In [16]:
%%fortran

subroutine r_fortran(x_vec, y_vec, res)
    real, intent(in) :: x_vec(:), y_vec(:)
    real, intent(out) :: res
    integer :: i, n
    n = size(x_vec)
    res = 0
    do i=1,n
        res = res + cos(x_vec(i))*sin(y_vec(i))
    enddo
endsubroutine r_fortran
In [17]:
r_fortran(x, y)
Out[17]:
-1006.4127807617188
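
(Note the less accurate result: the default real kind in Fortran is single precision, so the input arrays are down-cast to 32-bit floats.)
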
In [18]:
%timeit r_fortran(x,y)
27.9 ms ± 282 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

C++ / pybind11¶

In [19]:
if "google.colab" in str(get_ipython()):
    !pip install git+https://github.com/aldanor/ipybind.git -qqq
%load_ext ipybind
In [20]:
%%pybind11

#include <pybind11/numpy.h>
#include <math.h>
PYBIND11_PLUGIN(example) {
    py::module m("example");
    m.def("r_pybind", [](const py::array_t<double>& x, const py::array_t<double>& y) {
        double sum{0};
        auto rx{x.unchecked<1>()};
        auto ry{y.unchecked<1>()};
        for (py::ssize_t i = 0; i < rx.shape(0); i++){
            sum += std::cos(rx[i])*std::sin(ry[i]);
        }
        return sum;
    });
    return m.ptr();
}
In [21]:
r_pybind(x, y)
Out[21]:
-1006.4435935997055
In [22]:
%timeit r_pybind(x, y)
97.7 ms ± 78.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
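
Note: PYBIND11_PLUGIN is pybind11's legacy module-initialization macro; a standalone module written today would use PYBIND11_MODULE instead.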

numba¶

In [23]:
from numba import jit


@jit
def r_numba(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s
In [24]:
r_numba(x, y)
Out[24]:
-1006.4435935997055
In [25]:
# pure python with numba JIT
%timeit r_numba(x,y)
100 ms ± 124 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
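
The function was already compiled on its first call in the previous cell, so this timing excludes the one-off JIT compilation cost.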

Numba compilation¶

Two compilation modes

  • nopython mode (default)
    • Fast because it doesn't access the Python C API
    • Needs to be able to infer the native (C) types of all values
  • object mode (fallback)
    • Slow because it uses Python objects and the Python C API
    • Only used if nopython mode is not possible
    • To raise an error instead of falling back to this, set nopython=True or use @njit
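
A minimal sketch of the strict variant (@njit is shorthand for @jit(nopython=True)):

from numba import njit


@njit
def add(a, b):
    # if numba cannot infer native types for the arguments, compilation
    # raises a TypingError instead of silently falling back to object mode
    return a + b

add(1.0, 2.0)  # first call triggers compilation for (float64, float64)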

Numba function signatures¶

You can optionally explicitly specify the function signature. Use cases:

  • you want the function to be compiled when it is defined rather than when it is first called
  • you need fine-grained control over types (e.g. if you want 32-bit floats)
In [26]:
from numba import float32


@jit(float32(float32, float32))
def sum(a, b):
    return a + b
In [27]:
sum(1, 0.99999999)
Out[27]:
2.0
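
Here both arguments are cast to 32-bit floats: 0.99999999 has no float32 representation distinct from 1.0, so the result is exactly 2.0.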

Numba options¶

  • nopython=True: disable object-mode fallback
  • nogil=True: release the Python Global Interpreter Lock (GIL)
  • cache=True: cache the compiled functions on disk
  • parallel=True: enable automatic parallelization
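
As a minimal sketch, these options can be combined (r_options is an illustrative name, not part of the numba API):

import math

from numba import jit


@jit(nopython=True, nogil=True, cache=True)
def r_options(x_vec, y_vec):
    # compiled in nopython mode with the GIL released;
    # the compiled machine code is cached on disk between sessions
    s = 0.0
    for i in range(len(x_vec)):
        s += math.cos(x_vec[i]) * math.sin(y_vec[i])
    return s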

Parallelization¶

  • set parallel=True option to enable
  • use prange to explicitly parallelize a loop over a range
In [28]:
from numba import jit, prange


@jit(parallel=True)
def r_numba(x_vec, y_vec):
    s = 0
    # numba recognises the += on s inside prange as a parallel reduction
    for i in prange(len(x_vec)):
        s += math.cos(x_vec[i]) * math.sin(y_vec[i])
    return s
In [29]:
r_numba(x, y)
Out[29]:
-1006.4435935996374
In [30]:
%timeit r_numba(x,y)
34.3 ms ± 473 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

NumPy universal functions¶

  • a numpy universal function (ufunc) operates element-wise on arrays
  • with @numba.vectorize you write the function for scalars, and numba compiles it into a ufunc that can be used like the built-in numpy ufuncs
In [31]:
from numba import vectorize, float64


@vectorize([float64(float64, float64)], target="parallel")
def r(x, y):
    return np.cos(x) * np.sin(y)
In [32]:
r(2, 3)
Out[32]:
-0.05872664492762098
In [33]:
r(x, y)
Out[33]:
array([-0.49655387, -0.1621126 , -0.17655024, ...,  0.6104694 ,
        0.54327212,  0.22563231])
In [34]:
np.sum(r(x, y))
Out[34]:
-1006.4435935997466
In [35]:
%timeit np.sum(r(x,y))
57.7 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Advanced features¶

  • Ahead of Time (AoT) compilation
    • the compiled module only depends on NumPy
  • Flexible specializations
    • @generated_jit decorator for compile-time logic, e.g. type specializations
  • Stencil
    • @stencil decorator for creating a stencil to apply to an array
  • C callbacks
    • @cfunc decorator to generate a C callback (e.g. to pass to scipy.integrate; see the sketch after this list)
  • CUDA support
    • compile CUDA kernels to run on a GPU
  • see numba.readthedocs.io for more
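
As a sketch of the C-callback feature (assuming scipy is installed; integrand is a hypothetical name):

import math

from numba import cfunc, float64
from scipy import LowLevelCallable, integrate


@cfunc(float64(float64))
def integrand(x):
    return math.cos(x) * math.sin(x)


# integrand.ctypes is a C function pointer, so quad can evaluate the
# integrand without calling back into the Python interpreter
result, error = integrate.quad(LowLevelCallable(integrand.ctypes), 0.0, 1.0)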