Uwe’s Blog

My writing about data engineering, open-source development, general programming and thoughts about engineering culture.

  • Let people invite themselves to Google Calendar entries using AppScript

    If you want to organise an event with a group of people within your Google Workspace, you can invite the whole workspace or ask around who wants to attend. It has been the norm at my current workplace to post in Slack and let people react with an emoji if they wish to attend. This was convenient as any attendee...

  • The implications of pickling ML models

    When you have trained a machine learning model (pipeline), you will make predictions directly afterwards to assess its quality. To actually use the model for something useful, you will also want to make predictions with it at a later point in time. This forces you to store the model on disk and think of a way to serialise it.

  • Apache Arrow on the Apple M1

    In the previous blog post I explained how I got a well-working setup on my M1 MacBook. With that in place, I mostly worked on getting my main work setup running. But as a core Apache Arrow developer, I was also very eager to go the extra mile and get Arrow (the C++ and Python parts) working on the M1....

  • The first two weeks with the Apple M1

    Apple recently released new computers that contain their new M1 processors. I was quite excited about them because of the promises made by various benchmarks regarding performance and energy consumption, but also because it is a new platform. Most things won’t work there and some assumptions about how we work today have to change if you want to use...

  • Calculating Levenshtein distances with fletcher

    Levenshtein distance is a typical measure to compare two different strings. It gives you the minimal number of add, remove and replace operations to transition from one string to another.
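
    As a rough illustration of the measure described above, here is a minimal pure-Python sketch using the classic dynamic-programming approach. It is not how fletcher computes the distance; the function name `levenshtein` is a hypothetical helper chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimal number of add, remove and replace
    operations needed to turn string `a` into string `b`."""
    # previous[j] holds the distance between the first i-1 characters
    # of `a` and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of `a` into "" takes i removals.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # remove char_a
                    current[j - 1] + 1,      # add char_b
                    previous[j - 1] + cost,  # replace, or keep on a match
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

    The sketch keeps only two rows of the distance table in memory, so its memory use grows with the length of `b` rather than with the product of both lengths.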

  • Trimming down pyarrow’s conda footprint (Part 2 of X)

    We have again reduced the footprint of a conda environment with pyarrow. This time we did some detective work on the package contents and removed files from thrift-cpp and pyarrow that are definitely not needed at runtime.

  • Removing Python as a dependency of R

    Surprisingly, Python was a runtime dependency of R on conda-forge. As R doesn’t need Python to run, this was a bit weird. We got rid of it by splitting up the GLib package.

  • Trimming down pyarrow’s conda footprint (Part 1 of X)

    We have substantially reduced the footprint of creating a conda environment with pyarrow. While working on this, we have also substantially reduced the size of a base Python installation from conda-forge. All this was done without disabling any functionality. We reduced the size of a conda environment for pyarrow by nearly 50% and reduced the “pyarrow tax” for...

  • Building R Arrow on Windows: A tale of two compilers

    Windows support for Apache Arrow is pretty good. There are Python wheels, Python conda packages and a binary build for R on CRAN. One thing that has long been missing, though, is a conda package for R Arrow on Windows. Thanks to a lot of experimentation and some important suggestions by Isuru Fernando (Thanks!), we...

  • The one pandas internal I teach all my new colleagues: the BlockManager

    When new members join our team, they are usually already fluent in data analysis with pandas and know their way around the typical quirks. They know that they should use vectorised functions where possible and avoid using apply with a slow Python callable. There are two main reasons I teach them the BlockManager quite...