- Uwe's Blog

Automating miniforge updates using GitHub Actions
· 04 May 2021
miniforge and its variants miniforge-pypy and mambaforge-* are the base installers for using conda with conda-forge as the default source for packages. They will provide you with a basic conda installation to get started. This means that as part of that, the newest installers should also bring the newest...
The implications of pickling ML models
· 26 Apr 2021
When you have trained a machine learning model (pipeline), you will make predictions directly afterwards to assess its quality. When actually using the model for something useful, we also want to make predictions with it at a later point in time. This forces us to store the model on disk and think of a way to serialise it.
Deploying conda environments in (Docker) containers - The Cheatsheet!
· 03 Mar 2021
Deploying conda environments inside a container looks like a straightforward conda install. But with a bit more love for details, you can optimise the process so that the build is faster and the resulting container much smaller.
Deploying conda environments in (Docker) containers - how to do it right
· 01 Mar 2021
Deploying conda environments inside a container looks like a straightforward conda install. But with a bit more love for details, you can optimise the process so that the build is faster and the resulting container much smaller.
Apache Arrow on the Apple M1
· 11 Jan 2021
In the previous blog post I explained how I got a well-working setup on my M1 MacBook. With that in place, I mostly worked on getting my main work setup running. But as a core Apache Arrow developer, I was also very eager to go the extra mile and get Arrow (the C++ and Python part) working on the M1....
The first two weeks with the Apple M1
· 04 Jan 2021
Apple recently released new computers that contain their new M1 processors. I was quite excited about them because of the promises made by various benchmarks regarding performance and energy consumption, but also because it is a new platform. Most things won’t work there and some assumptions about how we work today have to change if you want to use...
Fast JDBC access in Python using pyarrow.jvm (2020 edition)
· 30 Dec 2020
About a year ago, I benchmarked access to databases through JDBC in Python. Recently, the maintainer of jpype gave me a heads-up that they significantly improved performance on their side. While this is actually the library I’m comparing my pyarrow.jvm-based approach to, I have a high appreciation for any performance tuning that is...
Calculating Levenshtein distances with fletcher
· 08 Dec 2020
Levenshtein distance is a typical measure to compare two different strings. It gives you the minimal number of add, remove and replace operations to transition from one string to another.
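To make the definition concrete, here is a minimal sketch of the distance in plain Python. It is the textbook dynamic-programming formulation, not the fletcher-based implementation the post describes, and the function name `levenshtein` is chosen here purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of single-character add, remove and replace
    operations needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially the empty prefix) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i removals.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            add_cost = current[j - 1] + 1
            remove_cost = previous[j] + 1
            # Replacing costs nothing when the characters already match.
            replace_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(add_cost, remove_cost, replace_cost))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` evaluates to 3: two replacements and one addition.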
Trimming down pyarrow’s conda footprint (Part 2 of X)
· 28 Oct 2020
We have again reduced the footprint of creating a conda environment with pyarrow. This time we have done some detective work on the package contents and removed contents from thrift-cpp and pyarrow that are definitely not needed at runtime.
Removing Python as a dependency of R
· 19 Oct 2020
Surprisingly, Python was a runtime dependency of R on conda-forge. As R doesn’t need Python to run, this was a bit weird. We got rid of this by splitting up the GLib package.
Trimming down pyarrow’s conda footprint (Part 1 of X)
· 08 Sep 2020
We have substantially reduced the footprint of creating a conda environment with pyarrow. While working on this, we have also substantially reduced the size of a base Python installation from conda-forge. All this was done without disabling any functionality. We reduced the size of a conda environment for pyarrow by nearly 50% and reduced the “pyarrow tax” for...

Building R Arrow on Windows: A tale of two compilers
· 14 Jun 2020
Windows support for Apache Arrow is pretty good. There are Python wheels, Python conda packages and a binary build for R on CRAN. One thing that has been missing for a long time, though, is a conda package for R Arrow on Windows. Thanks to a lot of experimentation and some important suggestions by Isuru Fernando (Thanks!), we...

The one pandas internal I teach all my new colleagues: the BlockManager
· 24 May 2020
When new members join our team, they are usually already fluent in data analysis with pandas and know their way around the typical quirks. They know that they should use vectorised functions where possible and avoid using apply with a slow Python callable. There are two main reasons I teach them the BlockManager quite...

Fletcher 0.3: A status report on the mission to get pandas hooked on Apache Arrow
· 25 Feb 2020
It has now been nearly two years since the idea came up to use pandas’ new ExtensionArray interface to provide columns in pandas that are backed by Apache Arrow. fletcher was started as a prototype project to show how this idea can be brought together. Since then there has been quite...

Fast JDBC access in Python using pyarrow.jvm
· 17 Nov 2019
While most databases are accessible via ODBC, where we have an efficient way via turbodbc to turn results into a pandas.DataFrame, there are nowadays a lot of databases that either come solely with a JDBC driver or whose non-JDBC drivers are not part of a free or open-source offering. To access these databases, you can use
