Faster xtensor!
Recently, we’ve landed some changes that make xtensor substantially faster. With this series of blog posts, we want to highlight some of the technical achievements of the past two months and before.
Allocations on the stack
Knowing at compile time the size of the memory to be allocated can improve runtime performance a lot: the compiler can reserve the required space on the stack instead of having to call malloc to get it from the heap. This is especially significant for small objects that are created frequently, such as shapes. We have introduced a new std::vector-like type called svector, which keeps a small amount of storage on the stack and only requests memory from the heap when it outgrows that storage. For xarray's shapes we have therefore set the default to svector<std::size_t, 4>, so that allocating a shape with up to 4 dimensions is almost free. This shape container was inspired by xtensor contributor Ullrich Köthe's work on Vigra and tiny_array, as well as LLVM's SmallVector.
xtensorf, a new container
A major improvement for small arrays and matrices is the arrival of xtensorf (note the f at the end). While xtensor is already a fast data structure when you know the number of dimensions of your container, xtensorf takes it up a notch: you define the shape as part of the type, and the size and the strides are computed at compile time, making the allocation of this container virtually free. Of course, you lose the ability to resize or reshape. But other than that, you can freely mix and match the fixed-shape xtensorf with xarray or xtensor types.
xsimd
We have a separate package called xsimd which wraps the low-level compiler SIMD intrinsics in nice C++ wrappers (SIMD stands for single instruction, multiple data). SIMD instruction sets contain special instructions that allow the processor to, e.g., multiply 2, 4, or 8 values at the same time (depending on the hardware and the C++ type).
But beyond wrapping basic operations such as multiplication and addition, the xsimd package contains many optimized implementations of more complicated functions such as sin, cos, erfc, and exp, using algorithms that compute these functions on more than one value at a time. We have to give a big shoutout to boost.SIMD, on whose gigantic shoulders we're standing here.
Using xsimd to compute the sin of 100 values, you can expect a speedup of around 4–8 times!
You can enable xsimd in xtensor by adding the -DXTENSOR_ENABLE_XSIMD definition in your CMake configuration:
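A minimal CMakeLists.txt fragment might look like this (my_target is a placeholder for your own executable or library target):

```cmake
# Enable xsimd-accelerated kernels in xtensor (target name is a placeholder)
target_compile_definitions(my_target PRIVATE XTENSOR_ENABLE_XSIMD)
```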
Note that the xsimd headers have to be available on your system; installation instructions are here: http://xsimd.readthedocs.io/
xtensor-blas with BLAS
We have a dedicated xtensor package to bind to BLAS libraries and use their highly optimized routines to compute things like matrix-matrix and matrix-vector products. xtensor-blas ships the fabulous FLENS library, which contains pure C++ implementations of all BLAS routines. But by linking against one of the available BLAS libraries (such as OpenBLAS or Intel MKL), one can achieve greatly increased speed for the dot product and all other routines.
In order to link against a BLAS library, use the following cmake:
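A sketch of the linking setup, using CMake's standard FindBLAS/FindLAPACK modules (my_target is a placeholder for your own target):

```cmake
# Locate an installed BLAS/LAPACK (e.g. OpenBLAS or Intel MKL) and link it
find_package(BLAS REQUIRED)
find_package(LAPACK REQUIRED)
target_link_libraries(my_target PRIVATE ${BLAS_LIBRARIES} ${LAPACK_LIBRARIES})
```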
xtensor-blas can be installed with conda or from source. Instructions are here: http://xtensor-blas.readthedocs.io/
xtensor immediate vs. lazy reductions
Like any good scientist, xtensor favors lazy evaluation by default. However, for reductions, quite good speedups can be achieved by performing the operation immediately. This is due to a couple of factors, but a big one is memory locality: by reducing immediately and choosing a clever reduction scheme, the entire memory block only has to be iterated over once. On average, the immediate reduction mode is 2–4 times faster than the lazy one (depending on the reduction axes). More about this in this blog post.
To use the immediate reduction mode, pass the corresponding evaluation tag to the reducer (note that the return value is then an xarray, and not a lazy xfunction).