Python property-based testing

In Python, one of the most established frameworks for property-based testing is Hypothesis. It has been pre-installed in your user sessions on the practicals platform, along with the pytest framework, which provides a convenient way to run your tests.

To demonstrate how Hypothesis works, we will reimplement a few classical linear algebra primitives using Python lists of floating-point numbers, along with a few other popular algorithms. Along the way, we will hopefully gain some appreciation of all the work that NumPy saves us when we use it instead of rolling our own code.

A first example

Let’s start with a simple vector sum implementation (that you documented earlier!):

def vector_sum(vec, vec2):
    output = []
    for i in range(len(vec)):
        output.append(vec[i] + vec2[i])
    return output

We can write a property-based test for it using Hypothesis as follows:

from hypothesis import given
from hypothesis.strategies import floats, lists

@given(lists(floats()), lists(floats()))
def test_sum(vec, vec2):
    result = vector_sum(vec, vec2)
    for i in range(len(result)):
        assert result[i] == vec[i] + vec2[i]

Notice how we design the test as a function that takes certain inputs, whose nature is specified via the @given decorator.
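
Each argument to @given is a strategy describing how to generate one input. To get a feel for what a strategy produces, you can sample it with its .example() method, which Hypothesis provides for interactive exploration (not for use inside tests):

```python
from hypothesis.strategies import floats, lists

# Strategy describing "a list of arbitrary floats", as used in the test.
strategy = lists(floats())

# .example() draws one sample; the output varies from run to run and
# may include surprises such as inf or NaN.
print(strategy.example())
```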

If we execute the test defined above, it fails:

$ pytest sum.py
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.2.2, pluggy-1.0.0
rootdir: /tp/homes/tmp_user_1/tests
plugins: hypothesis-6.70.0
collected 1 item                                                               

sum.py F                                                                 [100%]

=================================== FAILURES ===================================
___________________________________ test_sum ___________________________________

    @given(lists(floats()), lists(floats()))
>   def test_sum(vec, vec2):

sum.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sum.py:12: in test_sum
    result = vector_sum(vec, vec2)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

vec = [0.0], vec2 = []

    def vector_sum(vec, vec2):
        output = []
        for i in range(len(vec)):
>           output.append(vec[i] + vec2[i])
E           IndexError: list index out of range
E           Falsifying example: test_sum(
E               vec=[0.0],
E               vec2=[],
E           )

sum.py:7: IndexError
=========================== short test summary info ============================
FAILED sum.py::test_sum - IndexError: list index out of range
============================== 1 failed in 1.06s ===============================

What the failure message tells us is that when called with one list of length 1 and one empty list, our vector_sum() crashes at the line output.append(vec[i] + vec2[i]). That’s because, as currently written, the code assumes that both input lists have the same length (i.e. both input vectors have the same dimensionality), which is not true of an arbitrary pair of vectors.

Such assumptions are really easy to make when writing code on a bad day, and they are hard to find because once the assumption is ingrained in your brain, any hand-written test you write is likely to use inputs that follow the same assumption. But here, it only took a bit of randomness to expose it. Such is the power of property-based testing.

Now that we have found an unexpected input configuration, we need to think about how we want to handle it. Passing in two vectors of different dimensionality looks like a usage error from the caller of the function, so perhaps just crashing with a more self-explanatory error message is enough?

def vector_sum(vec, vec2):
    assert len(vec) == len(vec2), "Can't sum vectors of different lengths"
    output = []
    for i in range(len(vec)):
        output.append(vec[i] + vec2[i])
    return output

With this change, our test would still fail, but with a nicer error message. We also need to teach the test that failing with an AssertionError is the expected behavior when vectors of different lengths are passed in. And while we’re at it, we will also add a check that ensures the output has the same length as the inputs:

@given(lists(floats()), lists(floats()))
def test_sum(vec, vec2):
    try:
        result = vector_sum(vec, vec2)
    except AssertionError:
        # We only expect an assertion failure if the lengths differ
        assert len(vec) != len(vec2)
        # In that case, the test can't continue: there is no result
        return

    # If the test passes, then all input and output lengths should match
    assert len(vec) == len(vec2)
    assert len(vec) == len(result)

    # Value check is as before
    for i in range(len(result)):
        assert result[i] == vec[i] + vec2[i]

Gimme gimme gimme a NaN

And with these changes, we get… a rather different test failure:

$ pytest sum.py
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.2.2, pluggy-1.0.0
rootdir: /tp/homes/tmp_user_1/tests
plugins: hypothesis-6.70.0
collected 1 item                                                               

sum.py F                                                                 [100%]

=================================== FAILURES ===================================
___________________________________ test_sum ___________________________________

    @given(lists(floats()), lists(floats()))
>   def test_sum(vec, vec2):

sum.py:12: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

vec = [0.0], vec2 = [nan]

    @given(lists(floats()), lists(floats()))
    def test_sum(vec, vec2):
        try:
            result = vector_sum(vec, vec2)
        except AssertionError:
            # We only expect an assertion failure if the lengths differ
            assert len(vec) != len(vec2)
            # In that case, the test can't continue: there is no result
            return
    
        # If the test passes, then all input and output lengths should match
        assert len(vec) == len(vec2)
        assert len(vec) == len(result)
    
        # Value check is as before
        for i in range(len(result)):
>           assert result[i] == vec[i] + vec2[i]
E           assert nan == (0.0 + nan)
E           Falsifying example: test_sum(
E               vec=[0.0],
E               vec2=[nan],
E           )

sum.py:27: AssertionError
=========================== short test summary info ============================
FAILED sum.py::test_sum - assert nan == (0.0 + nan)
============================== 1 failed in 1.67s ===============================

Here, the problem is that floating-point math is complicated (a topic for another course). If we claim to accept any list of floating-point numbers as input, then we must account not just for “normal” numbers, but also for the special numbers that the IEEE-754 standard defines, such as infinities and NaN (Not a Number). Alas, the latter does not honor normal math identities like x == x.
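
The oddity is easy to reproduce on its own, using nothing beyond the standard library:

```python
import math

nan = float("nan")

# NaN compares unequal to everything, including itself...
print(nan == nan)  # False

# ...so detecting it requires a dedicated predicate.
print(math.isnan(nan))  # True

# Infinities are also easy to produce, and mixing them yields NaN:
print(math.inf - math.inf)  # nan
```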

Now we need to think about what we want, as there are several reasonable ways to handle this:

  • We could check inputs and fail with an error if they cannot be summed with reasonable results. This would be appropriate in projects where we care a lot about not performing meaningless computations (and eventually emitting meaningless results), but checking inputs may be costly if we expect high-dimensional input vectors.
  • We could warn users that they cannot expect sane results if the function is fed with special IEEE-754 numbers. In that case, rather than modifying the code of the vector_sum() function, we would modify its documentation to clarify the situation, then modify test_sum() to ignore results on pathological input.
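
For reference, the first option could look something like the sketch below. The name vector_sum_checked is ours, and this variant is not used in the rest of this section; the extra pass over the inputs is also what makes this option costly for high-dimensional vectors.

```python
import math

def vector_sum_checked(vec, vec2):
    """Hypothetical variant of vector_sum() that rejects bad inputs up front."""
    assert len(vec) == len(vec2), "Can't sum vectors of different lengths"
    # This check is O(n), which is the cost discussed above
    assert all(math.isfinite(x) for x in vec) \
        and all(math.isfinite(x) for x in vec2), \
        "Inputs must only contain finite numbers"
    return [a + b for a, b in zip(vec, vec2)]
```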

Since we have already shown an example of the first strategy (error-checking in the function being tested), we will now show an example of the second strategy:

import math
from hypothesis import given
from hypothesis.strategies import floats, lists

# Will produce unexpected results on special IEEE-754 numbers: inf, NaN...
def vector_sum(vec, vec2):
    assert len(vec) == len(vec2), "Can't sum vectors of different lengths"
    output = []
    for i in range(len(vec)):
        output.append(vec[i] + vec2[i])
    return output

@given(lists(floats()), lists(floats()))
def test_sum(vec, vec2):
    try:
        result = vector_sum(vec, vec2)
    except AssertionError:
        assert len(vec) != len(vec2)
        return

    assert len(vec) == len(vec2)
    assert len(vec) == len(result)

    for i in range(len(result)):
        # Do not check results on non-finite numbers (NaN and inf)
        if math.isfinite(vec[i]) and math.isfinite(vec2[i]):
            assert result[i] == vec[i] + vec2[i]

After this change, the test passes, showing that the vector_sum() function now behaves in the way that the test expects it to:

$ pytest sum.py
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.2.2, pluggy-1.0.0
rootdir: /tp/homes/tmp_user_1/tests
plugins: hypothesis-6.70.0
collected 1 item                                                               

sum.py .                                                                 [100%]

============================== 1 passed in 1.18s ===============================

So, what can we make of all this?

  • Testing is not just about finding bugs in the function being tested. It’s also about forming a better understanding of the problem we’re facing, and making conscious choices about what inputs we want to handle and how. A test failure indicates that the test assertions do not accurately reflect how the function being tested works, but this mismatch can be a problem in either the function or the test code.
  • Property-based testing is very effective at forcing us to think about uncomfortable edge cases that our brain would rather ignore, like special floating-point numbers.
  • By freeing us from the inefficient human generation of interesting test inputs, property-based testing lets us focus on more interesting questions about how our function should behave, which is what testing should be about.
  • Property-based testing is not a panacea, however. In this example, it was attention to detail that led us to notice that, as initially formulated, the test did not check that the output was as long as the inputs. Similarly, if a function has a very narrow edge case that is only triggered by very specific inputs, random search is unlikely to hit that configuration. In this case, you may need to hint the algorithm, or outright force it to go through known-tricky inputs.
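
For the last point, Hypothesis provides the @example decorator, which guarantees that the listed inputs are always tried in addition to the randomly generated ones. A sketch, using a compact version of the length-checking vector_sum() from above:

```python
from hypothesis import example, given
from hypothesis.strategies import floats, lists

def vector_sum(vec, vec2):
    assert len(vec) == len(vec2), "Can't sum vectors of different lengths"
    return [a + b for a, b in zip(vec, vec2)]

@given(lists(floats()), lists(floats()))
@example([0.0], [])              # the length mismatch found earlier
@example([0.0], [float("nan")])  # the NaN input found earlier
def test_sum(vec, vec2):
    try:
        result = vector_sum(vec, vec2)
    except AssertionError:
        assert len(vec) != len(vec2)
        return
    assert len(result) == len(vec)
```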

Exercise

You will find the above property-based testing example in the ~/robust-programming/tests subdirectory of your home directory on the practicals server, where it is called sum.py. Alongside it, you will find incomplete files for the following three exercises on property-based testing:

  • Sorting a list: sort_list.py
  • Dot product: dot.py
  • Condorcet method: condorcet.py

The goal of these exercises is for you to complete them. You will probably not have time to finish all of them, so choose based on your level.

Beginner: sorting a list

Let’s suppose we write a function that takes a list of integers as input, and returns a sorted version of the list (yes, I know there are ready-made functions for that…):

def sort_list(alist: list) -> list:
    """Sort the input list in a naive way."""
    if len(alist) <= 1:
        return alist

    out = []
    alist_tmp = alist.copy()
    while alist_tmp:
        smallest = min(alist_tmp)
        out.append(smallest)
        alist_tmp.remove(smallest)
    return out

Exercise: find and implement properties that characterise the sort function.
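
To get you started, here is one property among the several you should find: the output must be ordered. The test below is only a starting sketch; the other properties (length, element content, and so on) are left for you.

```python
from hypothesis import given
from hypothesis.strategies import integers, lists

def sort_list(alist: list) -> list:
    """Sort the input list in a naive way."""
    if len(alist) <= 1:
        return alist
    out = []
    alist_tmp = alist.copy()
    while alist_tmp:
        smallest = min(alist_tmp)
        out.append(smallest)
        alist_tmp.remove(smallest)
    return out

# One property among several: every element of the output must be
# less than or equal to its successor.
@given(lists(integers()))
def test_output_is_ordered(alist):
    result = sort_list(alist)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]
```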

Medium: dot product of two vectors

Write a function that computes the dot product of two vectors, defined as lists of floating point numbers as above, then write some tests that assert that it does so correctly. To avoid duplicating the full dot product computation in the test (which is unlikely to help you find any bugs, since the test and the function being tested would share any misunderstanding from your side), try the following strategy:

  1. Start by generating a vector where only one component is not zero, and compute the dot product of that vector with itself. The result is the squared norm of the vector, which in this scenario should be very easy to predict. You may find the following test skeleton useful, feel free to check out the Hypothesis documentation to learn more about the various functions that are being used in the @given decorator.

    from hypothesis import given
    from hypothesis.strategies import floats, integers, just, tuples
    
    max_len = 1024
    
    @given(
        integers(min_value = 1, max_value = max_len).flatmap(
            lambda length: tuples(
                floats(),
                integers(min_value = 0, max_value = length - 1),
                just(length)
            )
        )
    )
    def test_norm2_easy(params):
        coord, idx, length = params
        assert length >= 1
        assert idx < length
    
  2. Now generate two vectors of this sort and compute their dot product. Again, the result should be easy to predict without writing down the full dot product computation in your test. You may want to extract the logic for generating one vector into a reusable building block to avoid copy-and-paste, which you can do as follows:

    easy_vector_params = integers(min_value = 1, max_value = max_len).flatmap(
        lambda length: tuples(
            floats(),
            integers(min_value = 0, max_value = length - 1),
            just(length)
        )
    )
    
    @given(easy_vector_params, easy_vector_params)
    def test_dot_easy(params1, params2):
        pass
    
  3. Now generate one arbitrary vector, as done in the vector addition example above, and compute its dot product with itself. This gives you its squared norm, as before. Test that this squared norm is greater than the square of the largest vector component.

  4. Finally, generate two arbitrary vectors, and check that their dot product follows the Cauchy-Schwarz inequality: pow(dot(v1, v2), 2) <= dot(v1, v1) * dot(v2, v2).

Note that floating-point rounding errors differ between multiple ways to do the same computation, so you may need to use approximate comparisons like math.isclose() in some of your tests.
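
As a hint for step 1, the (coord, idx, length) tuple produced by the skeleton’s strategy can be turned into an actual vector with a small helper (the name make_easy_vector is ours):

```python
def make_easy_vector(coord, idx, length):
    """Build a vector of `length` zeros, with `coord` at position `idx`."""
    vec = [0.0] * length
    vec[idx] = coord
    return vec

# For such a vector, the dot product with itself reduces to a single
# term, so the expected squared norm is simply coord * coord.
print(make_easy_vector(3.0, 1, 4))  # [0.0, 3.0, 0.0, 0.0]
```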

Hard: intransitivity of the Condorcet method

Condorcet’s method is a voting system in which each voter draws up an ordered list of candidates. The only winner, if there is one, is the candidate who, compared in turn with each of the other candidates, proves to be the preferred candidate every time.

The curiosity of this system is that there is no guarantee that a candidate satisfying the victory criterion exists. In other words, this system is not transitive: if A beats B, and B beats C, A does not necessarily beat C. This is Condorcet’s paradox, and we propose to demonstrate it with Hypothesis.

Exercise: Write a test function which claims that Condorcet’s system is transitive, and show that this test fails.
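
Before writing the test, it may help to see the paradox on a concrete instance. The classic three-voter configuration below (candidate names and helper names are ours) produces a preference cycle under pairwise majority comparison:

```python
# Each ballot ranks the candidates from most to least preferred.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(ballots, x, y):
    """True if a strict majority of ballots ranks x above y."""
    votes = sum(1 for b in ballots if b.index(x) < b.index(y))
    return votes > len(ballots) / 2

# Pairwise comparisons form a cycle, so no candidate beats all others:
print(majority_prefers(ballots, "A", "B"))  # True
print(majority_prefers(ballots, "B", "C"))  # True
print(majority_prefers(ballots, "C", "A"))  # True
```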

Free practice

You probably have plenty of code of your own, so why not try adding property-based tests to your existing test suites?

Hypothesis, NumPy & pandas

You can integrate strategies for NumPy and pandas into Hypothesis quite naturally: https://hypothesis.readthedocs.io/en/latest/numpy.html
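
For instance, the hypothesis.extra.numpy module provides an arrays() strategy. A minimal sketch (the test name and the checked property are ours; bounded elements keep us away from the overflow and NaN complications discussed earlier):

```python
import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import floats

# Bounded, finite elements: no NaN, no infinity, no overflow.
elements = floats(min_value=-1e6, max_value=1e6)

# Generate 1-D float64 arrays of length 3.
@given(arrays(np.float64, 3, elements=elements))
def test_numpy_sum(arr):
    assert np.isclose(arr.sum(), arr[0] + arr[1] + arr[2])
```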

Hypothesis & pytest

Hypothesis ships with a pytest plugin. Everything is fairly transparent (once you have both libraries installed, it works by itself), and it allows you to insert your property-based tests naturally into your existing test suites.

Health checks

Hypothesis not only exposes tools for carrying out tests, but also offers features for detecting poor programming practices that can hinder the proper maintenance and execution of code:

  • Tests that require overly complex data and slow down execution.
  • Tests that do not explore the parameter space sufficiently.
  • Faulty recursive strategies (e.g. ones that diverge).
  • Tests that have no chance of finishing before the end of the world.

See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information.
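
Individual health checks can be tuned per test when you have a good, documented reason to, via the @settings decorator. A sketch (the test itself is a placeholder):

```python
from hypothesis import HealthCheck, given, settings
from hypothesis.strategies import integers, lists

# suppress_health_check disables specific checks for this test only;
# deadline=None lifts the per-example time limit.
@settings(suppress_health_check=[HealthCheck.too_slow], deadline=None)
@given(lists(integers()))
def test_with_relaxed_checks(alist):
    assert len(sorted(alist)) == len(alist)
```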