# Writing a boolean array for pandas that can deal with missing values

·When working with missing data in `pandas`

, one often runs into issues as the main way is to convert data into `float`

columns.
`pandas`

provides efficient/native support for boolean columns through the `numpy.dtype('bool')`

.
Sadly, this `dtype`

only supports `True/False`

as possible values and no possibility for storing missing values.
Additionally, `numpy`

uses a whole byte to store the `True/False`

information while a single bit would be sufficient.

## The current way in pandas

By intuition, one would simply create a boolean array with missings by calling either `pandas.Series([True, None, False])`

or `pandas.Series([True, numpy.nan, False])`

.
This sadly creates a `Series`

with an `object`

dtype.
While already having an `object`

dtype means that all operations are run in native Python and not accelerated by `numpy`

, the behaviour of reduction functions like `any()`

is feeling a bit weird at first.
The actual algorithm implements `Series([x, y, z]).any()`

as `x or y or z`

which seems reasonable but doesn’t match the docstring of `pandas.Series.any`

dealing with missing data.
The result of `any()`

should always be a boolean, irrespectively if the data contains missing values:

skipna : bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

In the case of `pd.Series([False, np.nan]).any(skipna=False)`

, we get the surprising result of `False or np.nan = np.nan`

.
The behaviour of `any()`

using `None`

as a missing value representation it even less intuitive as the operation is not commutative.
I’ve opened a discussion about this behaviour upstream and it may be changed soon to match the docstring but this still won’t get us a fast solution.

As previously mentioned, for dealing with missing values in `pandas`

, you normally have to resort to `float`

columns.
The same is true in the case of booleans with missing values.
Using `pd.Series([False, np.nan]).astype(float).any(skipna=False)`

, we get `True`

as the result.
This is what `pandas`

docstring also describes as the correct behaviour of `any()`

on arrays with missing values.
While we now have a working solution, we ended up using the wrong data type (`float`

instead of `bool`

) and we waste even more space for storing a tri-state (true/false/missing).

## Making it more efficient

Some time ago, I have started a prototype(!) implementation of a `pandas.ExternsionArray`

implementation backed by Apache Arrow.
This project is called fletcher and is available via PyPI and conda-forge.
Apache Arrow uses two memory buffers to represent a column for booleans with missing values.
It uses one bitmask for storing whether a row is missing or has a value and a second bitmap whether the value is true or false.
This then uses 2 bit instead of 8 bit (`numpy.dtype('bool')`

) or even 16-64bit in the case of a float column.

While we don’t present a boolean array that is on-par with the functionality you would get with a float-typed column, we want to have a minimally working, efficient boolean column.
`fletcher`

in its current form already supports the creation of a column and the proxing of basic pandas functionality that is independent of the data type.
It doesn’t support type-specific functionality though.
Thus we need to implement the `any()`

and `all()`

reduction on this column.
As Apache Arrow does not yet support these operations as a built-in, we need to implement them on our own.

### Testing the implementation

Following good software engineering practices, we are going to implement this using test-driven development.
As we already have a working implementation and would like to implement an alternative one using a more efficient storage, we can use the existing one as a reference for checking the correctness of our new implementation.
This is a perfect case for using property-based testing as we are able to retrieve the correct result for a given input data.
As a helpful tool to implement property-based testing on Python, we are using `hypothesis`

.

For writing such a unit test, we first need to take care of the creation of random input data.
`pandas.Series.any`

takes two relevant arguments in our case, the data itself and a flag `skipna`

whether to skip missing values or consider them as part of the algorithm.
The generation of the `skipna`

parameter is straight-forward though the `st.booleans()`

strategy which either emits `True`

or `False`

.
For our data parameter, there is no predefined strategy that would generate the expected data.
Thus we need to construct a combined strategy out of `hypothesis`

existing strategies.
Using `st.one_of(st.booleans(), st.none())`

we can create a strategy that either returns `True`

or `False`

or `None`

as a missing value.
By wrapping this into `st.lists(…)`

, we generate a list of these values instead of a single scalar.

```
import hypothesis.strategies as st
from hypothesis import example, given
@given(data=st.lists(st.one_of(st.booleans(), st.none())), skipna=st.booleans())
```

`hypothesis`

is really good at generating edge cases for the given input types but we also want to be explicit in testing them.
Thus we can use its `example`

decorator to enforce that the empty list is always passed in as test data.

```
@example([], False)
@example([], True)
```

Now that the random input generation is solved, we need to take care to run the code and test its properties.
Testing the properties is simple as we can test all properties of the result by comparing it against a reference result.
As mentioned in the first section, for dealing with boolean arrays with missing values in pandas, we need a `float`

-typed `Series`

.

```
import pandas as pd
import pyarrow as pa
arrow = pa.array(data, type=pa.bool_())
pandas = pd.Series(data).astype(float)
assert any_op(arrow, skipna) == pandas.any(skipna=skipna)
```

One speciality of the `fletcher`

implementation is that it doesn’t only work on continuous arrays but also on chunked arrays.
As such, we will have code in the later implementation that reduces the result of `any(…)`

on separate memory regions together.
This is an edge case our above defined data generation doesn’t cover.
Thus we take the random data and split it in half to get a chunked array and test this also leads to the very same result.

```
if len(data) > 2:
arrow = pa.chunked_array(
[data[: len(data) // 2], data[len(data) // 2 :]], type=pa.bool_()
)
assert any_op(arrow, skipna) == pandas.any(skipna=skipna)
```

Putting all pieces together, we get the following unit test:

```
@given(data=st.lists(st.one_of(st.booleans(), st.none())), skipna=st.booleans())
@example([], False)
@example([], True)
def test_any_op(data, skipna):
arrow = pa.array(data, type=pa.bool_())
pandas = pd.Series(data).astype(float)
assert any_op(arrow, skipna) == pandas.any(skipna=skipna)
# Split in the middle and check whether this still works
if len(data) > 2:
arrow = pa.chunked_array(
[data[: len(data) // 2], data[len(data) // 2 :]], type=pa.bool_()
)
assert any_op(arrow, skipna) == pandas.any(skipna=skipna)
```

### Implementing `any(…)/all(…)`

reductions

With the unit test in place, we can now start implementing the functions. The first major difference to NumPy booleans is that we don’t have a bytemask but a bitmask. Thus means that every byte in memory contains eight values. As we will be working value-by-value we first need to write some code that given a position selects the correct memory address and from the byte there the correct bit. As Arrow columns store the value and the missing flag in two separate memory buffers, we can use the same index but different base address to select these values.

```
byte_offset = i // 8
bit_offset = i % 8
mask = np.uint8(1 << bit_offset)
valid = valid_bits[byte_offset] & mask
value = data[byte_offset] & mask
```

The all implementation is quite straight-forward as adhering to the `pandas`

definition of `skipna`

means that the value of it is actually irrelevant.
Thus we come to the following solution.
The only important aspect here is that we give `numba`

the hint that `valid`

and `value`

are boolean variables.

```
@numba.njit(locals={"valid": numba.bool_, "value": numba.bool_})
def _all_op(length, valid_bits, data):
# This may be specific to Pandas but we return True as long as there is not False in the data.
for i in range(length):
…fetch valid and value…
if valid and not value:
return False
return True
```

For the end-user facing operation, we provide a function that takes a `pyarrow.Array`

instead of the individual memory buffers.
Unpacking those can be done by calling `arr.buffers()`

where in the case of a `BooleanArray`

the first is the valid bitmap (this is true for all Arrow arrays) and the second the memory buffer holding the actual values.
Additionally, we don’t only support passing in `pyarrow.Array`

objects but also `pyarrow.ChunkedArray`

objects that are made up of a list of `pyarrow.Array`

objects. Here we need to call the `all_op`

on a per-array basis and merge the results at the end again.

```
def all_op(arr, skipna):
if isinstance(arr, pa.ChunkedArray):
return all(all_op(chunk, skipna) for chunk in arr.chunks)
# skipna is not relevant in the Pandas behaviour
return _all_op(len(arr), *arr.buffers())
```

The `any`

operation is slightly more complex as we here also need to take care of the differing behaviour depending on the setting of `skipna`

.
Thus the user-facing function dispatches to different jitted-functions depending on whether we want to take missing values into account.

```
def any_op(arr, skipna):
if isinstance(arr, pa.ChunkedArray):
return any(any_op(chunk, skipna) for chunk in arr.chunks)
if skipna:
return _any_op_skipna(len(arr), *arr.buffers())
return _any_op(len(arr), *arr.buffers())
```

In the case of `skipna=True`

, the we return `True`

if one element satisfies the condition `valid and value`

.
For `skipna=False`

we interpret missing values as `True`

(as defined in `pandas`

documentation) and thus the `any_op`

is `True`

whether there is an entry for which `(valid and value) or (not valid)`

holds.

#### A last, small special case

There is sadly one more thing to handle which is that Arrow allows to omit the valid bitmap in the case when there are no nulls (i.e. `null_count == 0`

).
This helps to save a bit of memory in the quite-often case of no missing data (in our case this amounts to a saving of half of the used memory) but sadly introduces another branch in our code.
For the all operation, we can use the same code as in the case where we have a valid bitmap but instead of accessing it, we simply assume that every entry is valid.
Therefore the non-null case of the all operation can be implemented using:

```
@numba.njit(locals={"value": numba.bool_})
def _all_op_nonnull(length, data):
for i in range(length):
byte_offset = i // 8
bit_offset = i % 8
mask = np.uint8(1 << bit_offset)
value = data[byte_offset] & mask
if not value:
return False
return True
```

### Performance

While we already can fallback in `pandas`

to using float{32,64} columns to correctly work with boolean columns with missings, one thing we also want to achieve is a better performance.
As boolean operations are quite simple, the best impact in performance is the reduced memory usage we get by using Arrow.

To compare the performance, we have written some simple benchmarks using airspeed velocity.
In the case of *all*, we have the taken the most complex version where all entries are `True`

and thus the whole array must be scanned to compute a valid result.
We do the same most-expensive version for *any* but here we set all entries to `False`

.
To also test the performance difference with having missing data, we added another pair of benchmarks where we set the last entry to None/NAN/missing.
For NumPy this makes no difference but in the Arrow case, the valid bitmap is actually allocated and must then be checked for every entry.

Running the benchmarks on my MacBook with Python 3.7 gives the following results:

```
% asv run --python=same -b 'boolean*'
· Discovering benchmarks
· Running 12 total benchmarks (1 commits * 1 environments * 12 benchmarks)
[ 0.00%] ·· Benchmarking existing-py_Users_uwe_miniconda3_envs_fletcher_bin_python
[ 4.17%] ··· Running (boolean.BooleanAll.time_fletcher--)......
[ 29.17%] ··· Running (boolean.BooleanAny.time_numpy--)..
[ 54.17%] ··· boolean.BooleanAll.time_fletcher 16.2±0.6ms
[ 58.33%] ··· boolean.BooleanAll.time_fletcher_withna 16.3±0.4ms
[ 62.50%] ··· boolean.BooleanAll.time_numpy 35.6±1ms
[ 66.67%] ··· boolean.BooleanAll.time_numpy_withna 37.8±2ms
[ 70.83%] ··· boolean.BooleanAny.time_fletcher 15.9±0.1ms
[ 75.00%] ··· boolean.BooleanAny.time_fletcher_withna 16.5±0.8ms
[ 79.17%] ··· boolean.BooleanAny.time_numpy 35.9±0.6ms
[ 83.33%] ··· boolean.BooleanAny.time_numpy_withna 35.2±0.2ms
[ 87.50%] ··· boolean.BooleanAny.track_size_fletcher 2097152
[ 91.67%] ··· boolean.BooleanAny.track_size_fletcher_withna 4194304
[ 95.83%] ··· boolean.BooleanAny.track_size_numpy 67108864
[100.00%] ··· boolean.BooleanAny.track_size_numpy_withna 67108864
```

From the results, we can see that runtime-wise, we are at least **2.1x** times faster than the NumPy implementation in any case and can achieve an increase of upto 2.32x in speed in the most extreme case.
More remarkable though is the improvement in memory usage.
We have been fair to NumPy/pandas and already used `float32`

instead of the typically allocated `float64`

type.
Still, for the case with no missing data, we can save a factor of **32** on RAM whereas with missings, we still have a factor of **16** while getting correct results.
The `np.bool`

dtype would also have saved us a factor of 4 but as mentioned in the beginning, the results with missing data where inconsistent and thus sometimes incorrect.

### Conclusion

In this blog post, we have shown on how a simple operation on a `pyarrow.Array`

or `pyarrow.ChunkedArray`

can be implemented using `numba`

and `hypothesis`

for property based testing.
This was especially helpful as there is already an existing but memory-wasting implementation available.
We could use that implementation to verify that our new implementation has the same behaviour with a different underlying storage.

With the Arrow-based implementation we now have a custom boolean type that uses only 2 bits of storage per row instead of the existing implementation that need between 8 and 64 bit and at the same type achieve a speedup of **2** in the runtime.

The work described here has already been merged into `fletcher`

1, 2, 3 and released as `0.2.0`

on PyPI and conda-forge.