Fast Reductions in xtensor with SIMD

xtensor has a wealth of mathematical functions available already: we support a lot of the NumPy API right there in C++.

One of the areas where we were a bit slower than NumPy in the past were reductions. Reductions are operations where the number of dimensions is reduced by aggregating the values over one or multiple dimension, for example by summing them up. As you may, or may not know, xtensor has a lazy execution model. This means that when you are calling auto s = xt::sum(a); nothing is computed! Only when you either assign this expression to an in-memory datastructure, or when you access an element, the sum get’s evaluated. This is great for a number of use-cases, for example when you only want to compute the sum of one row of a very large array.

However, it turns out that there are a few optimizations that one can do when a greedy execution model is wanted (and which NumPy does). Greedy in this context means that we execute the sum as we call it. When doing a greedy reduction, we can optimize the order of traversal of the summed-up array to enable high-performance SIMD (auto-)vectorization of the operation.

In this blogpost I will highlight why our greedy reduction is faster than the lazy one, and how you can implement your own reduction functions utilizing the xtensor infrastructure.

In the following image a 2D array is shown, with a row-wise memory layout. Each row contains four values. Below the 2D view, the actual 1D in-memory layout is shown. Overall, this looks like a proper reduction over axis 1!

The reduction loop in pseudo-code looks something like this:

for (i in axis0):
for (j in axis1):
result[i] += result[i + j]
Memory layout and row-wise reduction order

However, when we naively reduce over axis 0 things get a bit more confusing: we select a lot of values in memory which are not next to each other! Recent processors with the SIMD instruction set (and more specifically AVX) can load up to 4 neighboring values to memory at once, and then operate on them. This is unfortunately totally useless in this reduction scheme, as we pick values that are far apart from each other.

The loop for this reduction scheme would look something like:

for (i in axis0):
for (j in axis1):
result[i] += input[j * 4]
Naive reduction scheme along a column

In order to gain performance and use SIMD instructions, we need to optimize the reduction scheme. This results in increasing the index of the in-buffer and the out-buffer at the same time, distributing values as they are aligned in memory!

The loop now looks like this:

for (i in axis0):
for (j in axis1):
result[i + j] += input[i + j]
Auto-vectorization optimized reduction

We employ a couple more tricks in the xtensor fast reducers to merge loops when multiple axes are given as argument etc. but for this, it’s better to take a look at the code.

To quickly highlight how you can benefit from these fast reduction schemes for multidimensional arrays, the following is a quick tutorial on how to write your own reduction functions in xtensor. The function we’ll try to write is count_nonzeros , a function that also exists in NumPy. The function counts the number of non-zero entries in a specified axis.

So, our reduction function, in pseudo-code should look like this:

non_zero_reduction(red, rhs):
if (val != 0)
return red + 1
else
return red

By default, reductions in xtensor are initialized by the identity function (ie. the initial value of red in the function above is the same as the first value of the axis). However, since we want to start counting at 0 or 1 depending on if the first value is non_zero, we have to write a custom initialization function. In pseudo-code:

non_zero_initialization(rhs):
if (rhs != 0)
return 1
else
return 0

The following is the same functionality in C++

I work as a scientific and robotics software developer for QuantStack in Paris and Berlin. We do Open Source for a living!