Making conda fast again
Explaining how we’ve created the mamba prototype, a solver for conda environments that is hopefully fast enough to support a conda-forge with hundreds of thousands of packages.
You might have seen the announcement on Twitter: at QuantStack we’ve been working on making a prototype of a conda-compatible package manager called mamba. Conda is a great tool to distribute data science packages. The community-led conda-forge comprises tons of awesome packages. The Anaconda company supplies us with recent and well integrated compilers. And conda-build is simply amazing to build binaries across different platforms (Windows, Linux and OS X). At QuantStack we use conda all the time to package Python, C++, Julia and R packages, and ship them to clients around the world.
However, due to the growth of conda-forge (it has over 60,000 packages for Linux right now), many users have become all too familiar with the “conda: Solving environment” spinner. It has been frustratingly slow for a while now.
At QuantStack, one of our main areas of expertise is building high-performance applications for customers, mainly using C++.
To make conda faster, we propose to:
- Build a Python extension in C++ using pybind11, compiled with all optimizations enabled
- Use the existing libsolv library, which powers package managers like Fedora’s DNF and openSUSE’s zypper and (like conda) performs SAT solving to satisfy all package dependencies correctly
- Parse repodata.json (already 35 MB of JSON for conda-forge) with simdjson, a library built for high-speed JSON parsing
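To give a feel for the data involved, here is a minimal sketch of reading a repodata.json document and grouping its records by package name. It uses Python's standard json module purely for illustration; mamba does this step in C++ with simdjson. The sample records and the helper name index_by_name are made up for this example.

```python
import json
from collections import defaultdict

# A tiny, made-up excerpt in the repodata.json layout.
# Real files for conda-forge contain tens of thousands of records.
SAMPLE_REPODATA = """
{
  "info": {"subdir": "linux-64"},
  "packages": {
    "numpy-1.16.2-py37_0.tar.bz2": {
      "name": "numpy", "version": "1.16.2", "build": "py37_0",
      "build_number": 0, "depends": ["python >=3.7,<3.8"]
    },
    "python-3.7.2-h0_0.tar.bz2": {
      "name": "python", "version": "3.7.2", "build": "h0_0",
      "build_number": 0, "depends": []
    }
  }
}
"""


def index_by_name(repodata_text):
    """Group package records by name (hypothetical helper, not mamba's API)."""
    repodata = json.loads(repodata_text)
    index = defaultdict(list)
    for filename, record in repodata["packages"].items():
        index[record["name"]].append(
            (record["version"], record["build"], filename)
        )
    return dict(index)


index = index_by_name(SAMPLE_REPODATA)
print(sorted(index))
```

Every record has to be visited once before solving can start, which is why parsing speed matters for a 35 MB file.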
With the prototype, we manage to solve environments in seconds, as demonstrated in the following video:
This prototype is already available on conda-forge. Existing conda users can install it easily by executing
conda install mamba -c conda-forge/label/mamba-alpha -c conda-forge
The source code for all of mamba can be found on GitHub: https://github.com/QuantStack/mamba
The code reuses as much from conda as possible. We re-implemented only the repository parsing and the solving. Thanks to the existing libsolv abstractions, mamba totals roughly 300 lines of Python, mostly adapted from conda, and 600 lines of C++ for parsing the JSON and adding all rules to libsolv. We try to keep this library as small as possible, which also makes it easier to debug and reason about.
We’re currently ironing out some low-hanging bugs and are actively looking into ways to further fund this work. We already have some promising leads, but if you know of an organization or company willing to sponsor some days of development for these tools, that would be great. We’re doing this with the goal of upstreaming the work into the original conda package manager at some point in the future.
Until this happens, the Anaconda team has also released a very interesting blog post with tips on how to make conda faster: https://www.anaconda.com/understanding-and-improving-condas-performance/
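For readers wondering what “adding rules” to a solver means, the following toy sketch may help. It is not libsolv and not mamba’s code: it expresses a hypothetical four-package repository as boolean rules and finds a valid install set by brute force, which only works for tiny inputs. The repository contents, function names, and the naive string comparison of versions are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical mini-repository: (name, version) -> list of dependencies,
# each dependency being a (name, allowed_versions) pair.
REPO = {
    ("app", "1.0"): [("lib", {"1.0", "2.0"})],
    ("app", "2.0"): [("lib", {"2.0"})],
    ("lib", "1.0"): [],
    ("lib", "2.0"): [],
}


def build_rules(repo, requested):
    """Return predicates that a set of installed packages must satisfy."""
    rules = []
    # Rule 1: at most one version of each package name may be installed.
    for name in {name for name, _ in repo}:
        rules.append(lambda inst, n=name: sum(1 for p in inst if p[0] == n) <= 1)
    # Rule 2: an installed package needs a matching candidate per dependency.
    for pkg, deps in repo.items():
        for dep_name, versions in deps:
            candidates = {(dep_name, v) for v in versions if (dep_name, v) in repo}
            rules.append(lambda inst, p=pkg, c=candidates: p not in inst or bool(inst & c))
    # Rule 3: every requested name must be installed in some version.
    for name in requested:
        rules.append(lambda inst, n=name: any(p[0] == n for p in inst))
    return rules


def solve(repo, requested):
    """Try every subset of packages; a real SAT solver is far smarter."""
    rules = build_rules(repo, requested)
    packages = sorted(repo)
    solutions = []
    for mask in product([False, True], repeat=len(packages)):
        installed = frozenset(p for p, on in zip(packages, mask) if on)
        if all(rule(installed) for rule in rules):
            solutions.append(sorted(installed))
    if not solutions:
        return None
    # Prefer the fewest packages, then the highest versions
    # (plain string comparison here, which real version ordering is not).
    smallest = min(len(s) for s in solutions)
    return max(s for s in solutions if len(s) == smallest)


print(solve(REPO, ["app"]))
```

The brute-force loop grows exponentially with the number of packages, which is exactly the cost that a dedicated SAT-based solver such as libsolv is designed to avoid.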
                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /█████████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
We still have some work to do:
- libsolv had never been run on Windows until we at QuantStack made a Windows port a week ago. We’re currently upstreaming the changes. In case you know the equivalent of fcntl(store->pagefd, F_SETFD, FD_CLOEXEC); on Windows, we’d be glad to hear from you on this PR: https://github.com/openSUSE/libsolv/pull/306
- Thankfully, the libsolv maintainers (especially Michael Schroeder) have already implemented conda version matching exactly to the Python specifications (https://github.com/openSUSE/libsolv/commit/67d113f336327f3e1adc384bee2990951b2b13c1)! However, we have not yet had the time to make use of it. We definitely need to integrate this work to get the exact version ordering as expected from conda.
- We need to verify that the conda test suite is passing so that we get a chance at upstreaming this work eventually. This includes evaluating the optimization strategies used by libsolv vs conda.
- Cache parsed repository data into .solv files, the libsolv binary format. Using this caching format makes repo loading a matter of milliseconds.
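The caching idea in the last item can be sketched in a few lines. This is only an illustration of the pattern (parse once, store a binary copy, reuse it while it is newer than the JSON): pickle stands in for the .solv format here, and the function name, file names, and sample record are assumptions rather than anything mamba or libsolv provides.

```python
import json
import os
import pickle
import tempfile


def load_repodata(json_path, cache_path):
    """Return (records, from_cache); pickle is a stand-in for .solv files."""
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(json_path)
    )
    if cache_is_fresh:
        # Fast path: skip JSON parsing and read the binary cache directly.
        with open(cache_path, "rb") as fh:
            return pickle.load(fh), True
    # Slow path: parse the JSON, then write the cache for next time.
    with open(json_path, "r", encoding="utf-8") as fh:
        records = json.load(fh)["packages"]
    with open(cache_path, "wb") as fh:
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return records, False


# Demo with a throwaway directory and a made-up single-package repodata.
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "repodata.json")
    cache_path = os.path.join(tmp, "repodata.cache")
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump({"packages": {"a-1.0-0.tar.bz2": {"name": "a", "version": "1.0"}}}, fh)
    first, first_cached = load_repodata(json_path, cache_path)
    second, second_cached = load_repodata(json_path, cache_path)

print(first_cached, second_cached)
```

The first call parses the JSON and writes the cache; the second call finds a fresh cache and skips parsing. Comparing modification times is the simplest possible invalidation rule and is chosen here only to keep the sketch short.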
QuantStack is located in the center of Europe (Paris). We build Open Source Software for a living — from creating fresh conda packages to robot applications, from high performance computing to interactive C++ and Jupyter widgets. If you’re interested in our services, do not hesitate to drop us a line. http://quantstack.net/