Building an Open Source, Continuous Benchmark System
At QuantStack we strive to write high-performance open source software, such as xtensor and xsimd. A key component of modern software development is continuous integration testing. There is an abundance of free CI services (Travis, AppVeyor, Azure…), but running benchmarks continuously is more complicated — that’s why we built our own cheap open source solution on top of Concourse and OpenStack.
Concourse is a continuous integration system, and we use it for two purposes: multi-project continuous integration, and continuous benchmarking. It's easy to set up on a low-cost server using docker-compose. One can, for example, run it on the cheapest OVH server, which costs around 3 euros a month — but you can run it on any cheap server that you have root access to.
Each job in Concourse runs in its own Docker environment. We have gone the easy way and run the Concourse "worker" on the same server, started from the same docker-compose file. In theory, Concourse supports multiple workers on distributed servers, but we haven't explored that possibility yet.
You can check out our live Concourse CI here: http://22.214.171.124:8080/
A key notion in Concourse is the pipeline. Pipelines determine how the "build process" flows and what happens next. For multi-project integration, we trigger a rebuild of the entire xstack whenever one of our core dependencies changes. For example, our lowest-level library xtl is used in xtensor. If a commit lands on xtl master, we trigger a build of xtensor, and in turn builds of xtensor-python, xtensor-fftw, and all other dependent libraries. This increases our test coverage drastically, since we test against all client libraries at once, and it gives us a lot more confidence when releasing a new xtensor version.
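Such a downstream-trigger pipeline can be sketched in Concourse's YAML format roughly like this (the repository URI and task file below are placeholders, not our actual configuration):

```yaml
resources:
- name: xtl
  type: git
  source:
    uri: https://github.com/QuantStack/xtl   # placeholder URI

jobs:
- name: build-xtensor
  plan:
  - get: xtl
    trigger: true          # any new commit on xtl master triggers this job
  - task: build-and-test
    file: ci/build-xtensor.yml   # placeholder task definition
```

The `trigger: true` on the `get` step is what makes a new xtl commit kick off the dependent builds; chaining further jobs (xtensor-python, xtensor-fftw, …) works the same way with `passed:` constraints.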
A problem that is harder to tackle than continuous integration is continuous benchmarking, because it requires a stable environment (the same hardware every time) that no other process uses while the benchmarks run. This makes it necessary to either obtain dedicated hardware, or boot up a server somewhere.
Thanks to cloud computing, booting up a server somewhere has never been easier, and it is cheap, since we boot one up only for short "bursts" of time, whenever someone pushes a new commit to master! We use OpenStack on OVH.com here, since OVH is a very reliable European provider and supports OpenStack as a first-class citizen. So every time a new benchmark should run, we boot up a dedicated machine on the OVH cloud using the OpenStack Python client from the build script:
openstack server create bench --flavor b2-7 --image ubuntu --key-name bench_key
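Before logging in, the build script has to wait until the machine has actually finished booting. One way to do that is to poll `openstack server show bench -f json` and check the reported status; a minimal sketch of such a check (the helper name is ours, not part of the OpenStack client):

```python
import json

def server_is_active(show_output):
    """Parse the JSON printed by `openstack server show bench -f json`
    and report whether the server has finished booting."""
    info = json.loads(show_output)
    return info.get("status") == "ACTIVE"
```

The build script can simply call `openstack server show` in a loop and sleep until this returns True.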
Once the machine is ready, the CI process logs in over SSH using the private key matching the key name we supplied above. All benchmarking commands are then executed on the OpenStack server: downloading and installing xtl, xsimd and xtensor, and running the benchmarks. Through our choice of flavor we make sure that we obtain a consistent benchmarking environment with 4 dedicated CPUs. Additionally, we attach a persistent volume on which we store the results of each benchmark run.
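The SSH step can be sketched like this; the user name, key path, and build steps are placeholders here (the real scripts live in the CI repository):

```python
import subprocess

def remote_command(steps):
    """Chain the individual build steps so a failure aborts the whole run."""
    return " && ".join(steps)

def run_benchmarks(host, key_file, steps):
    """Execute all benchmarking commands on the freshly booted server."""
    cmd = ["ssh", "-i", key_file, "-o", "StrictHostKeyChecking=no",
           "ubuntu@" + host, remote_command(steps)]
    return subprocess.run(cmd, check=True)
```

Running everything as one `&&`-chained command means a failed install stops the run instead of benchmarking a half-built stack.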
We are using the fantastic ASV (Airspeed Velocity) Python package, which not only measures the execution times but also publishes the measurements (and their history) to a dedicated GitHub page.
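For illustration, a minimal asv benchmark file looks roughly like this (the class and data here are made up; asv discovers methods whose names start with `time_` and times them automatically):

```python
# benchmarks/bench_sum.py -- a hypothetical asv benchmark suite
class TimeSuite:
    def setup(self):
        # setup() runs before each timing and is excluded from the measurement
        self.data = list(range(10000))

    def time_sum(self):
        # asv times this method repeatedly and records the result
        sum(self.data)
```

`asv run` then executes the suite for each commit and `asv publish` renders the history as the HTML pages linked below.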
Once this is done, the dedicated OpenStack machine is shut down again. This saves a good deal of money, since we run benchmarks at most once or twice a day and only pay for the hours the machine is actually up.
You can find our continuous benchmarking results here: https://wolfv.github.io/xtensor-asv (expect them to improve soon!)
How can I do this for Python/Julia/R/C++?
The good news is that the continuous benchmarking code is language agnostic — you can use it from Python, Julia, R and (like us) C++.
The entire infrastructure is open source, and we documented everything necessary in the README of the wolfv/xstack-ci repository on GitHub.
Using ASV with Python is extremely nice (one just needs to add some benchmark_... functions to the Python code). Languages other than Python are slightly more tedious (we had to write a small wrapper to import JSON results from Google Benchmark into ASV), but it can be done, and I am sure we can all help improve ASV to, for example, automatically import JSON or CSV results.