Trimming down pyarrow’s conda footprint (Part 2 of X)
We have again reduced the footprint of creating a conda environment with pyarrow.
This time we have done some detective work on the package contents and removed contents from
pyarrow that are definitely not needed at runtime.
In the last blog post about the pyarrow environment size, we already halved the size of the environment.
This time the improvement is a bit more than 10%, 48 MiB in absolute terms.
Still, this is a considerable reduction, especially looking forward, as we want to trim the basic
pyarrow installation down even more.
Measuring the status quo
To see how well we can improve on the status quo, we will be measuring the size of the conda environments of different use cases:

- A basic Python installation from conda-forge: `conda create … python=3.8 -c conda-forge`
- Pandas without any optional dependencies: `conda create … python=3.8 pandas -c conda-forge`
- PyArrow with the same Python version: `conda create … python=3.8 pyarrow -c conda-forge`
- Pandas with the necessary parts of pyarrow to read Parquet files (for now): `conda create … python=3.8 pyarrow pandas -c conda-forge`
- Nightly builds of pyarrow: `conda create … python=3.8 pyarrow -c arrow-nightlies -c conda-forge`
Overall this results in a bash script that creates these environments using the current latest versions from conda-forge:
```bash
#!/bin/bash
set -e

TIMESTAMP=$(date +%Y%m%d%H%M%S)
mkdir $(pwd)/$TIMESTAMP
conda create -p $(pwd)/$TIMESTAMP/baseline -y python=3.8 --override-channels -c conda-forge
conda create -p $(pwd)/$TIMESTAMP/pandas -y python=3.8 pandas --override-channels -c conda-forge
conda create -p $(pwd)/$TIMESTAMP/pyarrow -y python=3.8 pyarrow --override-channels -c conda-forge
conda create -p $(pwd)/$TIMESTAMP/parquet -y python=3.8 pyarrow pandas --override-channels -c conda-forge
conda create -p $(pwd)/$TIMESTAMP/pyarrow-nightly -y python=3.8 pyarrow --override-channels -c arrow-nightlies -c conda-forge
du -schl $(pwd)/$TIMESTAMP/* 2>/dev/null
```
With the changes of the previous blog post in place, we have a look at the above-described conda environments at timestamp
20200825110024, the state after the last blog post.
The next attempt to minimise the size is to look at the largest packages by size in the pyarrow environment.
We can therefore use the DataFrame returned by the
gather_files function we defined in the previous blog post.
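In case you don't have the previous post at hand, here is a minimal sketch of what such a gather_files helper could look like. It is a hypothetical reconstruction built on the `conda-meta/*.json` metadata that each conda package leaves in the environment, not the exact code from the previous post:

```python
import json
import os

import pandas as pd


def gather_files(prefix):
    """Collect every file of a conda environment with its size and owning
    package, based on the conda-meta/*.json metadata each package ships."""
    rows = []
    meta_dir = os.path.join(prefix, "conda-meta")
    for entry in os.listdir(meta_dir):
        if not entry.endswith(".json"):
            continue
        with open(os.path.join(meta_dir, entry)) as f:
            meta = json.load(f)
        # Each metadata file lists the package name and the files it installed.
        for rel_path in meta.get("files", []):
            full_path = os.path.join(prefix, rel_path)
            if os.path.exists(full_path):
                rows.append({
                    "package": meta["name"],
                    "name": rel_path,
                    "size": os.path.getsize(full_path),
                })
    return pd.DataFrame(rows, columns=["package", "name", "size"])
```

Calling `gather_files` on an environment prefix yields one row per installed file, which is all we need for the groupby queries below.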
Using

```python
df.groupby("package")["size"].sum().sort_values().tail(10) // 1024 // 1024
```

we get the 10 largest packages with their size in MiB:

| package | size in MiB |
| --- | --- |
To visualise the contents of the package, we re-use the
plot_size_by_suffix function to show the different file types that make up the individual packages.
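For reference, a rough sketch of how such a plot_size_by_suffix helper could be built; the split into an aggregation step and a plotting step is my assumption, not the original code from the previous post:

```python
import os

import pandas as pd


def size_by_suffix(df, package):
    """Sum the file sizes of one package per file suffix, in MiB."""
    pkg = df[df["package"] == package].copy()
    # "" is used as the suffix for files without an extension.
    pkg["suffix"] = pkg["name"].map(lambda name: os.path.splitext(name)[1])
    return pkg.groupby("suffix")["size"].sum().sort_values() / (1024 * 1024)


def plot_size_by_suffix(df, package):
    """Draw a horizontal bar chart of the per-suffix sizes (needs matplotlib)."""
    ax = size_by_suffix(df, package).plot.barh(
        title=f"{package}: size by file suffix"
    )
    ax.set_xlabel("size in MiB")
    return ax
```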
From the above images, we can see that in all those packages, most of the space is used for shared libraries or Python code.
There are two notable exceptions though.
pyarrow contains a vast array of different file types, hinting that we are shipping some intermediate build files. We will take a closer look later.
First, we should have a look at what is causing
thrift-cpp’s chart to report most of its space usage as “other”.
thrift-cpp is “other”
Given the DataFrame of all files in the environment, we can query it for the 10 largest files in the
thrift-cpp package using the following pandas code:
```python
df_thrift = df.query("package == 'thrift-cpp'").sort_values(by="size")
df_thrift.tail(10)
```
Most of the files are no surprise: shared libraries and the headers that define what is in those shared libraries. One large outlier is the
thrift binary though.
This is the
thrift compiler that turns a Thrift definition into source code.
While this is needed during build, it is not relevant at runtime.
We work around this issue by splitting
thrift-cpp into two conda packages:
libthrift, which contains everything needed at runtime, and
thrift-compiler, which only contains the compiler.
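Schematically, such a split can be expressed as a multi-output conda recipe. The following meta.yaml fragment only illustrates the idea; the file globs and output names are assumptions, not the actual feedstock change:

```yaml
outputs:
  # Runtime package: shared libraries and headers only.
  - name: libthrift
    files:
      - lib/libthrift*
      - include/thrift/**
  # Build-time package: just the code generator binary.
  - name: thrift-compiler
    files:
      - bin/thrift
```

With this split, packages that merely link against Thrift can depend on libthrift and skip the compiler entirely.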
pyarrow contains a vast set of file types
It is not necessarily bad to have a vast set of file types in a Python package with native code.
pyarrow is using Cython and also supports building Cython modules on top of it.
We should just be careful that the package only ships with the files that are needed for the end-users and not any temporary files generated during the build.
As with the
thrift-cpp package, we take a look at the list of the largest files, in this case the top 30, as we are interested in the variety of files.
```python
df_pyarrow = df.query("package == 'pyarrow'").sort_values(by="size")
df_pyarrow['name'] = df_pyarrow['name'].str.slice(len("lib/python3.8/site-packages/pyarrow/"))
df_pyarrow.tail(30)
```
This actually contains many groups of files that shouldn’t be in the final package:
The largest consumer of space is the
tests/ folder, including the test data.
While shipping tests is typical for a lot of packages, they take up more space in the
pyarrow package than they do in other packages.
To get rid of them in the main package while still giving end-users the possibility to run the tests locally to check their installation, we have introduced a new package,
pyarrow-tests, in the PR upstream and on conda-forge.
There are two other bits that contribute a bit to the size but are also conceptually wrong in that they install things that are already installed by the arrow-cpp package.
This happens because
setup.py also caters for the case where the Python package is installed in a stand-alone fashion (e.g. when you install the pre-built wheels).
Then you also need to ship the relevant C++ parts.
But as we have the separate
arrow-cpp package in the
conda ecosystem, we don’t have a need for that in the
pyarrow conda package.
The main thing here is that we re-ship the C++ headers in the
…/site-packages/pyarrow/include/arrow folder while they are already installed by the arrow-cpp package.
To avoid installing them, we introduced a new build option,
PYARROW_BUNDLE_ARROW_CPP_HEADERS, that we can use to disable the vendoring of the includes in the Python package in the upstream PR and also backported that to the feedstock.
Additionally, we also ship the
plasma-store-server binary in both the
arrow-cpp and the pyarrow package.
As it is built as part of the C++ build, its natural home is the C++ package.
In the conda setting, it should also be placed on the
PATH and thus reside in the environment’s bin/ directory.
As we changed the default location, we also adjusted the Python code that searches for it and backported that change into the conda package for the current release.
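The adjusted lookup could look roughly like the following sketch; the function name, the bundled_dir parameter, and the fallback order are illustrative assumptions, not the actual pyarrow code:

```python
import os
import shutil


def find_plasma_store_server(bundled_dir=None):
    """Locate the plasma-store-server executable.

    Prefer a binary bundled alongside the Python package (the wheel case),
    then fall back to searching the PATH, where the conda package now
    installs it.
    """
    if bundled_dir:
        candidate = os.path.join(bundled_dir, "plasma-store-server")
        if os.path.isfile(candidate):
            return candidate
    # shutil.which returns None when the binary is not on the PATH.
    return shutil.which("plasma-store-server")
```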
With both pyarrow and thrift-cpp cleaned up, at timestamp
20201026200633, we now observe the following sizes. Note that meanwhile new versions of most packages have been released, hence the increase in the size of the
pandas environment, which we otherwise didn’t touch.
With the largest packages now cleaned from “suspicious” contents, we won’t see any significant changes anymore by looking at individual files.
Rather, the next step will be to look at splitting packages, partly by their build- vs. run-time components (a discussion on that has started in CFEP-20) and partly into their functional components.
The latter will be especially important for
arrow-cpp, as we often don’t need heavyweights like Gandiva.
Splitting this off should give us a drastic improvement in the Parquet reading case above.
Title picture: Photo by ANDI WHISKEY on Unsplash