How to speed up python code?

16 minute read

Published: May 14, 2021

Hello everybody! Today I’ll talk about performance in Python and how to improve it when dealing with nested for loops and mathematical operations.

I’ll benchmark an Euler Monte Carlo simulation using four different implementations: 1000, 10000, 100000, 200000 and 500000 paths. The implementations I’ll compare are:

Plain Python code.
Vectorized Numpy implementation.
Numba implementation.
Cython implementation.

Moreover, I’ll explain how to turn the plain Python implementation into the optimized versions. Note that the timing results will correspond to Python 3 Google Compute Engine backend of Google Cloud notebooks.

Let’s get started!

Plain Python code

In order to solve numerically the equation:

$X^{k}_{n+1} = X^{k}_{n} + \mu(t, X_{n})\Delta t + \sum_{j=1}^{p}\sigma^{k, j}(t, X_{n})\Delta W^{j}$

we need to simulate P paths across T times with a Δ𝑡 time step. We have to determine the drift and diffusion functions, which in this case for simplicity will be constants. We also need a random normal number for each path and each time. Putting these things toghether, a Python implementation of an Euler simulation looks like:

import numpy as np

def simulate_python(num_paths, num_time_steps, time_step, drift, diffusion, initial):
    x = np.zeros((num_paths, num_time_steps + 1))
    z = np.random.normal(0, 1, (num_paths, num_time_steps + 1))

    for p in range(num_paths):
        x[p, 0] = initial
        for t in range(1, num_time_steps + 1):
            x[p, t] = x[p, t - 1] * (1 + drift * time_step + diffusion * np.sqrt(time_step) * z[p, t])

    return x

Timing it with the magic command timeit, yields:

%%timeit
simulate_python(num_paths=100000, num_time_steps=252, time_step=1/252, drift=1.0, diffusion=1.0, initial=0.5)

1 loop, best of 5: 50.1 s per loop

Vectorized Numpy implementation

The above implementation has one nested loop written in Python which will be slow. As the simulation of the paths are independent, we can remove the Python loop over the paths and re-write the code to do the loops in C thanks to the numpy vectorization. In order to do that, we can use slicing in the axis corresponding to the paths. We do this in the x and z arrays. The Euler simulation then turns into:

def simulate_vectorized(num_paths, num_time_steps, time_step, drift, diffusion, initial):
    x = np.zeros((num_paths, num_time_steps + 1))
    z = np.random.normal(0, 1, (num_paths, num_time_steps + 1))

    x[:, 0] = np.tile(np.array([initial]), (num_paths, 1)).T

    for t in range(1, num_time_steps + 1):
        x[:, t] = x[:, t - 1] * (1 + drift * time_step + diffusion * np.sqrt(time_step) * z[:, t - 1])

    return x

Now, timing the vectorized implementation we arrive to:

%%timeit
simulate_vectorized(num_paths=100000, num_time_steps=252, time_step=1/252, drift=1.0, diffusion=1.0, initial=0.5)

1 loop, best of 5: 2.38 s per loop

In the vectorized code we use the tile numpy function to set the initial condition for x at time 0 for all paths. Note that the loop over the paths is no longer explicitly written. Thanks to numpy, the loop will be done in C and will have much more speed than the Python one.

Let’s move to another variation of the Euler simulation!

Numba implementation

Using the Numba package we can improve the execution of the simulation by compiling the slow interpreted Python code into fast machine code. This can be done by adding the @jit decorator with the flag of nopython mode enabled. As it is said in Numba web page this allows compiling the decorated function so that it will run entirely without the involvement of the Python interpreter. This mode is the recommended one in order to get the best performance. Taking this into account the Euler simulation code can be written as:

from numba import jit 

@jit(nopython=True, parallel=False)
def simulate_python_jit(num_paths, num_time_steps, time_step, drift, diffusion, initial):
    x = np.zeros((num_paths, num_time_steps + 1))
    z = np.random.normal(0, 1, (num_paths, num_time_steps + 1))

    for p in range(num_paths):
        x[p, 0] = initial
        for t in range(1, num_time_steps + 1):
            x[p, t] = x[p, t - 1] * (1 + drift * time_step + diffusion * np.sqrt(time_step) * z[p, t])

    return x

Testing the performance of the jit-compiled code gives us:

%%timeit
simulate_python_jit(num_paths=100000, num_time_steps=252, time_step=1/252, drift=1.0, diffusion=1.0, initial=0.5)

1 loop, best of 5: 1.42 s per loop

Note that we only needed to add @jit(nopython=True) in the line before the function definition.

Cython implementation

Another option we have in order to improve the performance of the Python loops, is to use Cython which is a superset of Python wich allows to achieve near the C performance by statically declaring the C types involved in our functions or Python classes. This helps the compiler to translate the Cython code to efficient C code. In order to do that we need to create a setup.py file and a .pyx file with the statically typed code.

From the plain Python code we changed some things:

1) We typed inputs, the output and some internal variables of simulate() method. We also turned that python method into a cython method by puting cdef instead of def at the definiton. For the internal variables were typed using cdef succeeded by the corresponding cython type. For the output, and the random numbers, since they are arrays, we used double[:, :] wich means that they are two-dimensional numpy arrays of doubles.

2) Another important performance change we made is to change the sqrt numpy function into the cython sqrt function. That was done by importing sqrt function from libc.math.

3) We also used two decorators before the function definition, which are boundscheck and wraparound. The first one allows to turn off the bounds checking in arrays, while the second one allows to turn off the negative indexing.

Before executing the modified code we have to load Cython. This is done as follows:

%load_ext Cython

After doing that, and taking into account the 3 steps described above, the cythonized Euler simulation code will look like:

%%cython

import numpy as np
from libc.math cimport sqrt
cimport numpy as np
import cython


def simulate_cython(int num_paths, int num_time_steps, double time_step, double drift, double diffusion, double initial):
    return simulate(num_paths, num_time_steps, time_step, drift, diffusion, initial)


@cython.boundscheck(False)
@cython.wraparound(False)
cdef double[:, :] simulate(int num_paths, int num_time_steps, double time_step, double drift, double diffusion, double initial):
    
    cdef double[:, :] x = np.zeros((num_paths, num_time_steps+1))
    cdef double[:, :] z = np.random.normal(0, 1, (num_paths, num_time_steps+1))
    cdef int t, p
    for p in range(num_paths):
        x[p, 0] = initial
        for t in range(1, num_time_steps + 1):
                x[p, t] = x[p, t - 1] * (1 + drift * time_step + diffusion * sqrt(time_step) * z[p, t])

    return x

Finally, timing the Cython code we get:

%%timeit
simulate_cython(num_paths=100000, num_time_steps=252, time_step=1/252, drift=1.0, diffusion=1.0, initial=0.5)

1 loop, best of 5: 1.22 s per loop

Speed comparison

Now, we are going to compare the four implementations for different quantity of paths, and then we will plot it:

import numpy as np

paths = [100, 1000, 10000, 100000, 200000]

python_times = []
for n_paths in paths:
  time_lasted = %timeit -o simulate_python(num_paths=n_paths, num_time_steps=252, time_step=1/252, drift=1.0, diffusion=1.0, initial=0.5)
  python_times.append(time_lasted.best)

vectorized_times = []
for n_paths in paths:
  time_lasted = %timeit -o simulate_vectorized(num_paths=n_paths, num_time_steps=252, time_step=1/252, drift=1.0, diffusion=1.0, initial=0.5)
  vectorized_times.append(time_lasted.best)

jit_times = []
for n_paths in paths:
  time_lasted = %timeit -o simulate_python_jit(num_paths=n_paths, num_time_steps=252, time_step=1/252, drift=1.0, diffusion=1.0, initial=0.5)
  jit_times.append(time_lasted.best)

cython_times = []
for n_paths in paths:
  time_lasted = %timeit -o simulate_cython(num_paths=n_paths, num_time_steps=252, time_step=1/252, drift=1.0, diffusion=1.0, initial=0.5)
  cython_times.append(time_lasted.best)

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 10))
ax.grid(False)
plt.style.use('ggplot')

ax.set_xscale('log')
ax.set_yscale('log')

ax.set_xlabel("log number of paths", fontsize=22)
ax.set_ylabel("log time (s)", fontsize=22)
ax.tick_params(axis='x', labelsize=22)
ax.tick_params(axis='y', labelsize=22)

ax.plot(paths, python_times, marker='o', markersize=3, linestyle='dashed', linewidth=1.4, label='Python')
ax.plot(paths, vectorized_times, marker='o', markersize=3, linestyle='dashed', linewidth=1.4, label='Vectorized')
ax.plot(paths, jit_times, marker='o', markersize=3, linestyle='dashed', linewidth=1.4, label='Jit')
ax.plot(paths, cython_times, marker='o', markersize=3, linestyle='dashed', linewidth=1.4, label='Cython')
plt.legend(prop={'size': 15})

We see that we have a speed up of at most 100x by using Cython, Jit or vectorizing the initial python code. The vectorized version seems to be a bit slower than Jit and Cython for low and high quantities of paths.

To sum up, we showed different ways to improve mathematical python code that has nested for loops. Using Jit, Cython or vectorizing the code was a good alternative to produce more efficient code.

Share on

Twitter Facebook LinkedIn

Martín Kuperman