Introduction

This website hosts the instructions for the practical exercises of the “Robust Programming” computing training. Please use the arrows or the menu on the left to navigate between pages.

Goal

The aim of this training is to discover a selection of software engineering concepts and methodologies, and to practise using selected tools in order to write code that is better tested, better documented, and easier to understand and maintain. We will cover:

  • automatic rule checks
  • documentation
  • tests
  • automation of workflows

Languages

For most practicals, Python and C++ versions of each exercise are provided. Unless otherwise stated, you are not expected to complete both exercises (though you can if you are fast enough). The intent is rather that you pick the language you are most comfortable with, so that you can fully focus on the new concepts introduced by this course.

For some exercises, you will find one tab per language:

print("Hello world")
#include <iostream>

int main() {
    std::cout << "Hello World";
    return 0;
}

Virtual machine

We deployed virtual machines in the cloud at VirtualData. To ensure that everyone works with the same environment during the practicals, it is strongly suggested that you do them on the Linux server to which you have been given access. All instructions given in this document assume that you are doing so, and may not work as expected on another computer. Simply connect to the virtual machine using ssh:

ssh <username>@<IP>

<username>, <IP>, and the corresponding password will be given during the lecture. More information on the setup can be found in the platform section.

Platform

We deployed virtual machines in the cloud at VirtualData. To ensure that everyone observes the same system performance phenomena during the practicals, it is strongly suggested that you do them on the Linux server to which you have been given access. All instructions given in this document assume that you are doing so, and may not work as expected on another computer.

Network connection

At least one of the academic institutions with which you work likely provides you with access to eduroam. Be sure to set it up and try it out in advance: you may need help from your local computing service as the configuration procedure is institution-specific. Here’s the documentation for IJCLab members.

In case you experience last-minute issues with eduroam, you will alternatively be able to use our public Wi-Fi network. You will need to be physically at IJCLab to set it up, so no advance preparation is possible in this configuration.

Server configuration

The practical servers are configured in the following manner:

  • Each server is running an AlmaLinux OS (Alma 9).
  • The username to connect is student, and you should use the password given at the start of the lecture.
  • All necessary tools to compile programs and execute scripts are installed and accessible on the server. If this is not the case, please ask one of the teachers.
  • When you connect to the server, you are automatically put in a Python virtual environment (called robprog). This environment contains all necessary packages to perform the work. If you deactivated it, just log out from the server and reconnect, or execute source /opt/.venv/bin/activate.

Finally, keep in mind that:

  • You should not use the server for other purposes than this lecture.
  • The server will be available until the 12th of April, after which it will be reallocated to other work and all data will be removed. This should give you enough time to finish the second homework, but please remember to scp any file you care about off the server!

SSH shortcuts

At the beginning of the course, you were provided with the IP address of the server that you will be working on, as well as your user name on it. However, IP addresses are unwieldy to remember and user names are tedious to type on every connection, so you may want to add something like the following to your SSH config file (~/.ssh/config on Linux and macOS)…

# Robust programming training
Host robust-programming
    User <username>
    HostName <server IP address>

…where you replace the <bits between brackets> with the matching information from the registration e-mail that you have received. Once that is done, ssh robust-programming will connect you directly to the practicals platform with the right user name, and will only prompt you for your password.

Source code

First of all, you should download the source code for the practicals. It is located inside this archive, which you can download and unpack in your home directory using this script:

#!/bin/bash

FOLDER_NAME="robust-programming"
BASE_URL="https://informatique-des-deux-infinis.pages.in2p3.fr/pheniics"

cd ~

# Keep a copy of the current
# folder if it exists
if test -d ${FOLDER_NAME}; then
  SUFFIX=$(date +"%Y%m%d_%s")
  BACKUP=${FOLDER_NAME}_${SUFFIX}
  echo "Found existing ${FOLDER_NAME} folder"
  echo "Keeping a copy under ${BACKUP} and redownloading it..."

  mv ${FOLDER_NAME} ${BACKUP}
fi
curl ${BASE_URL}/${FOLDER_NAME}/${FOLDER_NAME}.tar.gz | tar -xz

If, for some reason, you get one of these files into an undesirable state (e.g. you accidentally delete it), you can get back a fresh copy in its initial state by re-running the script. It will save your current work and re-download the robust-programming directory, in which you will find the original source files.

In the end, your home directory will contain a ~/robust-programming directory with the source code for all practicals. For demonstrations, this is the final source code as it stands at the end of the demonstration. You can compile C++ programs using the provided Makefile:

make -j$(nproc)

Some Makefiles also have additional options, such as make test for running the test binaries (if any) or make doc for building Doxygen documentation. Please feel free to check the contents of the Makefiles for more info about what they do.
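In short, the typical invocations look like this (the test and doc targets are only present in some Makefiles):

make -j$(nproc)   # build the C++ programs
make test         # run the test binaries, if the Makefile defines them
make doc          # build the Doxygen documentation, if the Makefile defines it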

Local file access

One drawback of using an SSH-based remote configuration is that it seemingly forces you into using archaic terminal-based code editors like vi or emacs, instead of more modern alternatives like Sublime Text or Visual Studio Code. Thankfully, there are ways to mount a remote directory that you have access to via SSH as a local directory on your laptop, so that you can manipulate files in it locally with your preferred text editor.

  • For a general way to do this, please check out this sshfs tutorial that Hadrien wrote as part of the documentation of srv-calcul-ambulant, another compute server that he administrates. Translating these instructions for a different server should be straightforward (you only need to change the server name, as sketched below), but do get back to us if you encounter any issue.
  • If you are using Visual Studio Code specifically, it has its own support for operating over SSH connections that you may prefer over sshfs.
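For instance, with the SSH shortcut defined earlier, a typical sshfs session could look like the following sketch (the local mount point ~/robust-vm is just an arbitrary example):

# Create an empty local directory to serve as a mount point (name is arbitrary)
mkdir -p ~/robust-vm

# Mount the remote home directory there, reusing the SSH config alias
sshfs robust-programming: ~/robust-vm

# ...edit the files under ~/robust-vm with your favourite editor...

# Unmount when you are done (on macOS, use `umount ~/robust-vm` instead)
fusermount -u ~/robust-vm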

Of course, this access is only good for editing source code, and you should not try to build programs or execute them on your local machine: you may not have the required prerequisites installed and, even if you do, your local system is probably not as well tuned for reliable and reproducible performance benchmarking as the remote one.

Automatic rule checkers

The best way to get a team of developers to agree on some common rules is to have a tool that automatically enforces the rules. Failing that, the second best option is to have a tool that automatically detects rule violations, and then lets the programmer address them.

C++ and Python both have a long history, a large user base, and unfortunate design choices that make it easy for developers to accidentally write incorrect code. As a result, many tools exist to automatically detect clear-cut problems, as well as suspicious code constructs that may or may not be a problem, but will likely make the code harder to understand and maintain in any case. Finally, another common target of automation is the enforcement of a consistent coding style across the entire codebase, which makes code easier to read.

All of these tools, and the kind of problems that they detect and correct, are heavily language-dependent. As a result, tools for each language will be introduced in a separate section. Please choose the language whose tooling you want to study:

Linting your Python code

In this section, we will learn how to use linters, formatters, and pre-commit hooks to improve the quality of your code and automate the process of checking for errors and enforcing coding standards.

The Style Guide for Python Code is called PEP8 (or see this stylized presentation). Rules are proposed by the community of users, and subject to change. There are hundreds of rules as of 2024, and it would not be feasible (or even desirable) to check them manually each time we write code.

Linters: the flake8 example

A linter is a tool that analyzes your code for potential errors, bugs, and stylistic issues. It can help you catch mistakes early in the development process and ensure that your code is consistent and maintainable. There are many linters available for Python, but one of the most popular is flake8. flake8 is a command-line tool that runs several other linters, including pyflakes and pycodestyle, and reports any errors or warnings that it finds.

You can run flake8 from the command line to analyze your code. For example, to check all Python files in the current directory and its subdirectories, you can use the command flake8 . (see the documentation for more options/configuration).

flake8 will report any errors or warnings that it finds, along with the line number and a brief description of the issue. Fix the issues reported by flake8 and run it again to ensure that your code is free of errors and warnings.

Exercise: run flake8 on the module ~/robust-programming/tests/bad_quality_flake8.py and fix the errors:

flake8 ~/robust-programming/tests/bad_quality_flake8.py

Note that we installed flake8 as well as some of its extensions to ensure wider rule coverage:

flake8==7.0.0
flake8-docstrings==1.7.0
pep8-naming==0.13.3

Formatters: the black example

Complying with a coding style can be a tedious task, and one that a linter won't help you with. A formatter can automate this process by reformatting your code to comply with a specific coding style.

One of the most popular formatters for Python is black. black reformats your entire file in place to comply with the black coding style (a strict subset of PEP 8). It is a great tool to enforce consistent formatting across your code base.

Exercise: first run black on the module ~/robust-programming/tests/bad_quality_black.py with the options --check and --diff:

black --check --diff ~/robust-programming/tests/bad_quality_black.py

--check runs black without overwriting the file, and --diff shows you what would have been changed: + lines are additions, and - lines are deletions. Once you understand the changes better, you can directly apply black to reformat the file:

black ~/robust-programming/tests/bad_quality_black.py

Linters vs formatters

Should you use a linter or a formatter? In practice, both! They are usually very complementary, and your code will only be better for using both. Note, however, that they can sometimes contradict each other; see e.g. the black FAQ for how to remedy this problem.

When to run linters and formatters?

You can run your favourite linter and formatter on your local machine each time you make changes but, let's face it, that will become annoying very quickly, and there is a high chance that you will forget to do it. Instead, you can automate the rule checks using pre-commit hooks and continuous integration jobs.

Pre-commit hooks

Pre-commit hooks are scripts installed in your .git/hooks directory that are run before each commit. They can be used to automate the process of checking for errors and enforcing coding standards, and can help you catch mistakes early in the development process.

In Python, the pre-commit package can be used to manage pre-commit hooks. It allows you to define a set of hooks in a configuration file, and then run them automatically before each commit.
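For illustration, a minimal .pre-commit-config.yaml wiring up the black and flake8 hooks could look like the following sketch (the rev values are pinned versions that you would adjust to whatever you actually use):

# .pre-commit-config.yaml -- hypothetical minimal configuration
repos:
  - repo: https://github.com/psf/black
    rev: 24.2.0   # example pin, adjust to your installed version
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 7.0.0    # example pin, adjust to your installed version
    hooks:
      - id: flake8

After writing this file at the root of your repository, pre-commit install sets up the hook in .git/hooks, and pre-commit run --all-files lets you check the whole repository once by hand.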

Continuous integration

Pre-commit hooks are great, but they assume that you have your development environment installed on your local machine, which is not always the case. So instead of enforcing checks before committing the code, you can perform the checks after each push to the repository. This is called continuous integration, and you will learn more about it at the end of this lecture (see continuous integration).

C++ automatic rule checkers

Compiler warnings

One advantage of using an ahead-of-time compiler, as C++ programmers normally do, is that the compiler can check some aspects of program correctness without even running the program. Unfortunately, for historical and cultural reasons, many of these checks are disabled by default.

Compiler warnings, also known as “diagnostics” or “lints”, usually point at code which is legal according to the C++ standard, but invokes treacherous language corner cases that are unlikely to match programmer intent or comprehension. Code which triggers compiler warnings should normally be reviewed carefully for correctness, and ultimately rewritten into a more readable/correct form that does not trigger the warning.

In this section, we will explore some useful options of the GCC and clang compilers that you may want to enable to let them detect more issues with your code. We will focus on the behavior of GCC, which is largely imitated by clang. But if you are using another compiler, such as Microsoft Visual C++, you may want to check its documentation.

-Wall

Contrary to what its name suggests, this GCC and clang compiler option does not enable all error-checking features. It does, however, enable a set of compiler warnings that the GCC and clang developers consider to provide a good trade-off between protection and false positives.

Here are some examples of common problems that you can detect with -Wall:
  • Comparison of a char* pointer to a hardcoded string: The following code will compare the memory addresses of two strings, and not their contents as the developer likely intended, a classic mistake in C and old-style C++.
void charptr_eq_str(const char* s) {
    if (s == "reference string") { /* ... */ }
}
  • Statically out-of-bounds array accesses: A C++ program that performs reads or writes beyond the end of an array invokes undefined behavior. This is the worst kind of error that a C++ programmer can make, because a compiler is allowed to assume that it does not happen when optimizing code, and thus unpredictable chaos will ensue at execution time. Using analysis that is carried out during the code optimization process (i.e. only when optimizations are enabled), GCC can detect this error in easy cases where all relevant parameters are known at compile time, as in the following example:
float numbers[42];
for (int i = 1; i <= 42; ++i) {  // Whoops! This is C++ not Fortran/Julia
    numbers[i] = i;
}
  • Misleading indentation: One unfortunate design decision of C++ is that control flow operations like if and while do not start a new code block. As a result, it is very easy to accidentally write code like the following, which -Wall will rightly report as suspicious:
if (condition)
    foo();  // Runs only if condition is true
    bar();  // Runs always, no matter if condition is true or false
  • Missing enum variant in switch: One problem with enums and switch statements is that it is very easy to start from correct usage…
enum class Color {
    Red,
    Green,
    Blue
};

// ... much later, possibly in a different source file ...

void print_color(Color color) { 
    switch (color) {
      case Color::Red:
        std::cout << "Red" << std::endl;
        break;
      case Color::Green:
        std::cout << "Green" << std::endl;
        break;
      case Color::Blue:
        std::cout << "Blue" << std::endl;
    }
}

…and later break the code by adding new enum variants without updating switch statements:

enum class Color {
    Red,
    Green,
    Blue,
    Pink  // This is new and support must be added to the switch statement
};

With -Wall, GCC will correctly point out that the switch statement must be updated.

-Wextra

As previously mentioned, the -Wall option only enables some of the available compiler warnings. More warnings can be enabled by additionally using the -Wextra option, which historically covered warnings that the compiler authors considered to provide a less favorable true-vs-false-positive compromise.

Most programmers, however, would consider the warnings which are enabled by -Wextra nowadays to point at code patterns that are still highly objectionable and worth eliminating.

Here are some examples of common problems that you can detect with -Wextra:
  • Implicit switch fallthrough: Starting from the incorrect switch statement example from the -Wall section, this would be one incorrect way to fix it:
void print_more_colors(Color color) { 
    switch (color) {
      case Color::Red:
        std::cout << "Red" << std::endl;
        break;
      case Color::Green:
        std::cout << "Green" << std::endl;
        break;
      case Color::Blue:
        std::cout << "Blue" << std::endl;
      case Color::Pink:  // This breaks the former Color::Blue case
        std::cout << "Pink" << std::endl;
    }
}

This is incorrect because C++ switch statements infamously fall through by default, which means that if the switch statement is invoked with color set to Color::Blue, the program will continue executing code for other cases (in this case Color::Pink), and thus output both “Blue” and “Pink” in succession. With -Wextra, GCC will rightly point out that a break; statement should be inserted before the start of the Color::Pink case.

  • Missing field initializations: If you have a struct with multiple fields such as…
struct MultiField {
    int a;
    float b;
    int c;
};

…then C++ allows only some of the fields to be initialized during construction:

MultiField make_multi_field() {
    return { 12, 3.4 };
}

When you do this, only the listed fields receive the given values; the remaining ones are silently value-initialized (set to zero), which is rarely what the surrounding code actually intends. This surprising behavior commonly happens as a result of adding new struct fields without updating code that instantiates the struct, which is why GCC warns about this pattern in -Wextra mode.

  • Signed/unsigned integer type confusion: The C++ community has had many heated debates about where signed vs unsigned integer types should be used. One thing remains constant, however: programmers often pick the inappropriate integer type for their application, with disastrous results. Here are two examples that are detected by GCC in -Wextra mode:
void suspicious_int_conditions(int s, unsigned u) {
    // You probably meant to use a signed integer here
    if (u < 0) { /* Never executed */ }

    // The signed integer is reinterpreted as unsigned, with negative numbers
    // mapped into positive ones. This is likely to result in surprising outcome.
    while (s < u) { /* ... */ }
}

Hopefully, you will agree that there is little point in using only -Wall without -Wextra these days, and both options should normally be used.

Other correctness-oriented warnings

At this point, it will probably not surprise you that the -Wall -Wextra combination still does not enable the full error-checking power of GCC and clang. Among the many other available options, you may specifically want to enable…

  • -Wshadow: This detects situations where a local variable has the same name as another variable in an enclosing scope (e.g. a global variable), which is well-defined by the language (the local variable takes precedence) but can easily confuse a reader who is aware of the other variable (see the sketch after this list).
  • -Wnon-virtual-dtor: This detects situations where a class has virtual methods, and is thus meant to be used via a pointer-to-base-class, but does not have the virtual destructor that this usage normally requires for correct resource cleanup.
  • -Woverloaded-virtual: This detects situations where code creates an overload of a virtual method, rather than an override, which is usually a mistake coming from an incorrect or outdated copy of the virtual method’s signature.
  • -Wconversion detects risks of loss of information caused by implicit conversions (e.g. trying to store a size_t into a char will normally lead to truncation).
  • -Wduplicated-cond detects the if (cond) { /*...*/ } else if (cond) { /*...*/ } duplicate condition pattern, which usually comes from copy-and-paste errors.
  • -Wduplicated-branches detects the if (cond1) { stuff(); } else { stuff(); } duplicate branch pattern, which also normally comes from copy-and-paste errors.
  • -Wnull-dereference detects simple attempts to dereference a null pointer, which is undefined behavior and thus strictly forbidden in any C++ program.
  • -Wdouble-promotion detects situations where a computation was seemingly intended to be carried out in single precision, but is actually carried out in (potentially slower) double precision due to the use of double-precision literals.
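As a small illustrative sketch (the function and variable names are made up), the following code triggers both -Wshadow and -Wconversion when compiled with those options enabled:

#include <cstddef>

// Illustrative sketch: names are made up for the example
int scale = 2;  // Global configuration value

long weighted_total(std::size_t count) {
    int scale = 3;           // -Wshadow: shadows the global `scale`
    char truncated = count;  // -Wconversion: a size_t may not fit in a char
    return scale * truncated;
}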

-Wpedantic aka -pedantic

While both -Wall and -Wextra are mainly about correctness, this option is mainly about portability. By default, GCC and clang both provide a number of “GNU” extensions to the C and C++ standards which are not available in other compilers, including but far from limited to…

  • Mathematical constants like M_PI in <cmath>/<math.h>.
  • Support for the void main() main function signature.
  • C-style arrays whose length is 0 or determined at runtime.
  • __restrict__ manual alias analysis optimizations.
  • Hardware-agnostic SIMD support.

Many of these extensions are extremely widely used and could be considered bugfixes for shortcomings of the C++ standard. Nonetheless, they are not part of the standard, are not available in other compilers, and a program that uses them is technically not a true C++ program.
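For example, the run-time-sized C-style arrays mentioned above compile silently by default, but g++ -Wpedantic warns that ISO C++ forbids variable length arrays. The function below is just an illustrative sketch of this situation:

// Illustrative sketch: fills and sums a run-time-sized C-style array
int sum_first(int n) {
    int values[n];  // Length only known at run time: a GNU extension ("VLA")
    for (int i = 0; i < n; ++i) {
        values[i] = i;
    }
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += values[i];
    }
    return sum;
}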

The synonymous -pedantic and -Wpedantic command line options make the compiler warn about the use of some of these extensions. If you care in any capacity about the portability of your code to compilers other than GCC and clang, including the Microsoft, NVidia and Intel compilers, you should consider using these options in order to reduce the risk of accidentally relying on GCC- or clang-specific features.

Beware, however, that -Wpedantic does not detect usage of all compiler extensions and is therefore not a substitute for running test builds for all compilers you care about in your project’s continuous integration infrastructure.

-Weverything

This clang-specific command line option is what most programmers agree -Wall should have been. It enables every single clang warning. It is commonly combined with a set of -Wno-xyz options (e.g. -Wno-shadow), which disable specific compiler warnings that the developers themselves (not the compiler authors) deemed to have an unfavorable cost/benefit tradeoff for their project.

-Werror

Because most compiler warnings point at code constructs that are suspicious, but not illegal according to the language rules, warnings do not normally cause compilation to fail.

For everyday development builds, this is undesirable: it commonly leads to warnings being ignored by developers, or to scripts continuing to do things like running tests on a build that does not even pass the compiler's internal checks, burying the outcome of these checks deep inside old build logs.

In these situations, the -Werror compiler option, which turns warnings into hard errors that fail the build, can be used to ensure that the build fails promptly when warnings are present, without burying the warnings below thousands of lines of unrelated script output.

However, -Werror should not be made a mandatory part of every non-development build of the project, because it means that a single compiler update that adds a new warning is all it takes to break the build for everyone, including mere users of the project who are not working on it and would not want to write patches to fix the warnings.

Exercise

The analyzers/mdbook_snippets.cpp file contains all the source code examples of this chapter, and can be compiled using the Makefile in the same directory using the make command.

Try to modify the Makefile’s first line (CXXFLAGS variable definition) to enable some of the warnings discussed in this chapter. Then rebuild using this new configuration (you may need to run make clean before make), and see how the compiler starts to report the various issues that were discussed previously.

clang-format

Modern programming languages allow a single program to be written in many different ways, with different amounts of spaces, line feeds, parentheses, and so on. This creates a conundrum:

  • On one hand, programmers who learn/work on their own tend to develop, over time, a distinctive way to lay out code in the source file that they are most comfortable with.
  • On the other hand, programs which are written by many different programmers are a lot more readable when everyone agrees to follow a common style, than when each individual developer goes and lays out the code in their own preferred/signature style.

The universally accepted solution to this problem is to standardize on a common coding style whose rules are simple enough to be easily machine-enforced. Tools can then be used to automatically translate the entire project’s code to this style, both at adoption time and whenever new code gets integrated into the common repository.

It is inevitable that the simple rules enforced by the tool will result in imperfect results that everyone agrees looks bad from time to time. To some extent, this can be addressed by making the rules more complex, which is why automatic code formatters tend to have a bewildering number of configuration options.

One should not forget, however, that attempts to improve code formatting often translate into hours of team arguments and configuration trial and error with near zero useful code production. The best is the enemy of the good, and staying as close as possible to the default tool configuration, however perfectible it may be, has the major advantage of quickly giving you the main benefit of automatic code formatting (standardizing the project’s coding style with minimal manual effort) without paying too many costs in return (endless team arguments about ideal code style).

In C++, the generally agreed standard for automatic code formatting is the clang-format utility. It internally uses the clang compiler, but can be used on any project, even if the build is done using GCC or MSVC, as long as clang still manages to make basic sense of the source code. It is very easy to use: just point it at a source file with the -i flag for in-place reformatting, and it will adjust the formatting according to the default style:

clang-format -i source.cpp

With a bit of extra work, you can also configure your code editor to automatically reformat source files whenever they are saved, or configure version control systems like git so that each modified source file is reformatted before creating a commit. Finally, as a last resort, your project's continuous integration should check that the source files are formatted correctly before code is integrated into the project's master branch. We'll talk more about continuous integration later in this course.
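For such a check, recent clang-format versions can report formatting problems without touching the files; something along these lines should work (availability of the options depends on your clang-format version):

# Report the files that would be reformatted and exit with an error if any
# change would be needed, without modifying anything
clang-format --dry-run --Werror source.cpp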

Exercise

Because it is used as part of a static site generation pipeline and must match the indentation expectations of the underlying HTML generation tool, the mdbook_snippets.cpp file has a very unusual style that you may find hard to read.

Try to automatically reformat it using clang-format, then visually inspect the result using your favorite text editor. Are there any formatter decisions that you find yourself disagreeing with?

Dynamic analysis

clang-format and compiler warnings are two examples of static program analysis: they can analyze and transform source code without running the underlying program, and reach conclusions that are valid no matter which input values the program is called with, or which operating system and hardware is used to execute the program. Ultimately, however, static analysis will not find many classes of problems in your code, for two key reasons.

First, static analysis is hard. When working on static analysis tools, it is surprisingly easy to encounter edge cases where a precise analysis algorithm, one which tries to always return the correct result, would take an extremely long time to run or fail to reach a conclusion at all.

Second, some information that is critical to program correctness is not known until runtime. For example, if a C++ program indexes an array with an integer that is specified by the user through the command line interface or API, whether that array access is in bounds (correct well-defined behavior) or out of bounds (incorrect undefined behavior) cannot be determined at compile time. A static analyzer could pessimistically flag this code pattern as potentially flawed, and some do. But this would inevitably have false positives, which would reduce programmer trust in the analyzer.

The solution to both of these problems is to stop relying on static analysis alone, and instead complement it with dynamic analysis, which analyzes the runtime behavior of the program during the execution of one or more specific test jobs, in a specific environment and with specific inputs.

Of course, one general drawback of dynamic analysis is that it will only find errors in that particular execution configuration. If your program is susceptible to out-of-bounds array accesses, but your test workloads do not trigger any such access, then dynamic analysis will not report the problem. In other words, dynamic analysis is only as good as your test suite is.

Valgrind

Valgrind uses a form of dynamic analysis called dynamic binary instrumentation. You can think of it as running your program in a kind of x86 CPU emulator (technically a JIT) that supports injection of arbitrary checks every time an instruction is executed, a basic block is entered/exited, etc.

One advantage of this approach is that it can be applied to any program without any preparation (although availability of debug info will greatly improve error messages). The drawbacks are that…

  • Some information from the source code (e.g. various categories of undefined behavior) is missing in the final binary, which limits the amount of problems that can be detected.
  • Instrumentation must be shoehorned into an existing binary, without invalidating code and data pointers or modifying existing control flow. As a result, it cannot be implemented very efficiently, which commonly results in execution being slowed down by ~30x. You will want to account for this by running valgrind-instrumented programs on smaller workloads.

To keep the implementation simple, valgrind also serializes execution of all program threads onto a single OS thread. This means that parallel programs will not get any speedup from the availability of multiple CPU cores, resulting in an even greater valgrind-induced slowdown.

For program correctness checking purpose, the most useful valgrind tool is Memcheck, which is conveniently the one that is enabled by default when you run valgrind without any extra option. Memcheck checks for various memory usage errors, including…

  • Out-of-bounds accesses aka “buffer overflow”.
  • Use of uninitialized memory, which is undefined behavior.
  • Use-after-free of heap-allocated data.
  • Invalid frees, aka trying to free a pointer that was not previously allocated by the matching allocator (or was already freed before).
  • Memory leaks, aka allocating data and never freeing it.

To reduce overhead and false positives, some checks are disabled by default. See the command-line options documentation for more info about these and how they can be enabled when needed.

The easiest way to run your program through memcheck is to first make sure that your program is built with debug symbols (enabled by GCC/clang option -g) and then run it through valgrind using this sort of command line invocation:

# Simple version: just run "your_program" under valgrind's watch.
valgrind ./your_program

# More complex version, with extra arguments to both valgrind and your program
#
# To make the command more readable, using -- to separate valgrind's options
# from the command line that valgrind is monitoring is recommended.
valgrind --leak-check=full --track-origins=yes -- ./your_program arg1 arg2

Sanitizers

Some problems cannot be found by binary instrumentation like valgrind, either because the relevant information is not present in final binaries or because the slowdown brought by the dynamic instrumentation is so great that getting to the point of program execution where the error is triggered would take an excessive amount of time and machine resources.

In those situations, better results can be obtained by using a more invasive form of dynamic analysis, where the compiler injects supplementary error-checking code during program compilation. In C++, this is most commonly done using the Sanitizer tool family:

  • AddressSanitizer and MemorySanitizer cover memory errors similar to those found by valgrind’s Memcheck, but at a much lower runtime performance cost.
  • ThreadSanitizer and UndefinedBehaviorSanitizer cover many more cases of undefined behavior that Valgrind cannot reliably detect because un-instrumented binaries do not contain enough information about what the original source code was doing.
  • Other sanitizers are available, but they cover quite specialized failure modes and require more involved setup, so they are only used in situations where the high cost of failure justifies extra effort (e.g. security-critical software that processes untrusted user input).

In exchange for detecting more bugs and having a reduced impact on program performance than valgrind, sanitizers bring several limitations of their own:

  • Due to their deep integration with compiler internals, they work best when used with the compiler and libc/STL implementation that they were developed against, typically clang/libc++. Other compilers like GCC may try to support them too, but will often do so less reliably (expect hard-to-debug crashes and some missed error detection).
  • Undefined behavior is a program-wide property: it may not directly happen in the code that you wrote, but instead in libc or other libraries that your program uses, as a result of an invalid API request. Complete error detection therefore requires recompiling every single dependency of the program, down to libc, with sanitizer instrumentation. Source-based package managers like Spack may help with this, but in general, achieving good sanitizer coverage in larger software projects is best done with help from the project's computing infrastructure experts.

To use sanitizers, just pass the right -fsanitize=xyz option to your compiler during build, as documented in your compiler’s manual. Beware that some sanitizers are incompatible with each other, and even if they work combining them will result in increased runtime resource usage, so you may need multiple full-stack software builds to cover all the sanitizers that you are interested in. Use of a dedicated highly specced build & test server is therefore recommended.
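As a sketch of what this looks like for a small single-file program (the source and binary names are placeholders), one possible clang invocation combining AddressSanitizer and UndefinedBehaviorSanitizer is:

# Build with ASan + UBSan instrumentation; -g and -fno-omit-frame-pointer
# improve the quality of the error reports (file names are placeholders)
clang++ -g -O1 -fno-omit-frame-pointer -fsanitize=address,undefined \
    your_program.cpp -o your_program

# Errors are then reported while the instrumented program runs
./your_program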

It’s also worth noting that by default sanitizers don’t always stop the program when an error is detected, which can result in duplicated error reporting and misses in automatic test runs. You can enforce fail-on-error mode on operating systems where it is not the default using environment variables, e.g.

export ASAN_OPTIONS=abort_on_error=1:halt_on_error=1
export UBSAN_OPTIONS=abort_on_error=1:halt_on_error=1

Exercise

As previously discussed, the ability of GCC’s warnings to detect undefined behavior, like incorrect memory usage, is unfortunately very limited.

In particular, the out-of-bounds array access that the fortran_iteration() function performs is not detected when compiler optimizations are not enabled.

Experiment with valgrind and sanitizers, and see if you can find a tool configuration that succeeds at detecting this out-of-bounds memory access.

Conclusion

Where tools won’t help

It is easy to get carried away and add lots and lots of automated tools to your software build and test infrastructure, in the hope of finding more and more errors, but keep in mind that…

  • Each tool you add slows down your automated test routine and requires additional long-term maintenance effort. Because of this, you should normally only use one tool per category of error that you want to detect. Prefer a few good tools over many bad ones.
  • Tools are mainly focused on helping you follow the language rules and avoid suspicious constructs (areas of the language that are known to be hard to use, or useless code that suggests a typo has been made). Tools will not normally find more advanced logic errors, such as incorrect math, hardcoded reduced-precision math constants (e.g. #define PI 3.14) or use of exact comparison for validation of floating-point results.
  • To detect these sorts of errors and make the most of dynamic analysis tools, you will want to have a good test suite that covers as much of your code as possible, in as fine-grained a way as possible. We will explore this further in the remainder of this course.

Want more tools anyway?

Here are some other tools that you may want to experiment with, which may find extra issues beyond the ones covered by the tools we discussed here.

Documentation

It is important to document how software works, because it allows people to understand how it works a lot faster than by trying to make sense of the source code. We’re not only talking about other people here, by the way. When you get back to working on some software project after a break of several months, you will also thank yourself for having written good documentation.

Software can be provided with many different flavors of documentation, including…

  • API reference: Summarizes what individual data types are meant to store, what functions take as input and emit as output, what are the error conditions…
  • Tutorials/Examples: Introduces a new user to the software, what it does, and how it’s used in some concrete, hopefully common scenarios.
  • Architecture/Design overview: In larger software, explains how the various sub-parts fit together and why they were designed that way.
  • Contribution guide: In open-source projects, tells newcomers about how to best submit new contributions to the software.
  • Data management plan: In applications that are meant to store user data for extended periods of time, explains how data gets there, how long it’s kept in the system, how it’s protected from system failures…

Documentation is not even necessarily a written document that’s separate from the source code. Some aspects of the source code, such as the name and type of code entities, can also provide basic documentation to the reader of the code. For example, compare the following two functions:

# Would you rather use this function?
def do_stuff(x, y):
    return x[y]


# ...or this function?
def get_phone_number(
    phone_book: dict[Name, PhoneNumber],
    full_name: Name
) -> PhoneNumber:
    return phone_book[full_name]

// Would you rather use this function?
auto do_stuff(auto&& x, auto&& y) {
    return x[y];
}


// ...or this function?
PhoneNumber get_phone_number(const std::map<Name, PhoneNumber>& phone_book,
                             const Name& full_name) {
    return phone_book.at(full_name);
}

Even though do_stuff and get_phone_number can be used to do the same thing, you will hopefully agree that get_phone_number is a lot easier to understand, because it is more specific about what it does. There is a more general lesson to be learned here: your code should only be as generic as it needs to be for the task at hand, because specific programs are easier to understand, document, test and maintain than generic ones. Please resist the temptation to write code which solves problems that you do not have!

In this practical, we will cover the two most common forms of documentation, namely API documentation and tutorials. API documentation tools are language-specific so you will need to pick which language you want to study here:

Python API documentation

Docstrings

A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object.

>>> import numpy as np
>>> print(np.sum.__doc__)

    Sum of array elements over a given axis.

    Parameters
    ----------
    a : array_like
        Elements to sum.
    axis : None or int or tuple of ints, optional
        Axis or axes along which a sum is performed.  The default,
        axis=None, will sum all of the elements of the input array.  If
        axis is negative it counts from the last to the first axis.

        .. versionadded:: 1.7.0

        If axis is a tuple of ints, a sum is performed on all of the axes
        specified in the tuple instead of a single axis or all the axes as
        before.
    dtype : dtype, optional
        The type of the returned array and of the accumulator in which the
        elements are summed.  The dtype of `a` is used by default unless `a`
        has an integer dtype of less precision than the default platform
        integer.  In that case, if `a` is signed then the platform integer
        is used while if `a` is unsigned then an unsigned integer of the
        same precision as the platform integer is used.
    out : ndarray, optional
        Alternative output array in which to place the result. It must have
        the same shape as the expected output, but the type of the output
        values will be cast if necessary.
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the input array.

        If the default value is passed, then `keepdims` will not be
        passed through to the `sum` method of sub-classes of
        `ndarray`, however any non-default value will be.  If the
        sub-class' method does not implement `keepdims` any
        exceptions will be raised.
    initial : scalar, optional
        Starting value for the sum. See `~numpy.ufunc.reduce` for details.

        .. versionadded:: 1.15.0

    where : array_like of bool, optional
        Elements to include in the sum. See `~numpy.ufunc.reduce` for details.

        .. versionadded:: 1.17.0

    Returns
    -------
    sum_along_axis : ndarray
        An array with the same shape as `a`, with the specified
        axis removed.   If `a` is a 0-d array, or if `axis` is None, a scalar
        is returned.  If an output array is specified, a reference to
        `out` is returned.

    See Also
    --------
    ndarray.sum : Equivalent method.

    add.reduce : Equivalent functionality of `add`.

    cumsum : Cumulative sum of array elements.

    trapz : Integration of array values using the composite trapezoidal rule.

    mean, average

    Notes
    -----
    Arithmetic is modular when using integer types, and no error is
    raised on overflow.

    The sum of an empty array is the neutral element 0:

    >>> np.sum([])
    0.0

    For floating point numbers the numerical precision of sum (and
    ``np.add.reduce``) is in general limited by directly adding each number
    individually to the result causing rounding errors in every step.
    However, often numpy will use a  numerically better approach (partial
    pairwise summation) leading to improved precision in many use-cases.
    This improved precision is always provided when no ``axis`` is given.
    When ``axis`` is given, it will depend on which axis is summed.
    Technically, to provide the best speed possible, the improved precision
    is only used when the summation is along the fast axis in memory.
    Note that the exact precision may vary depending on other parameters.
    In contrast to NumPy, Python's ``math.fsum`` function uses a slower but
    more precise approach to summation.
    Especially when summing a large number of lower precision floating point
    numbers, such as ``float32``, numerical errors can become significant.
    In such cases it can be advisable to use `dtype="float64"` to use a higher
    precision for the output.

    Examples
    --------
    >>> np.sum([0.5, 1.5])
    2.0
    >>> np.sum([0.5, 0.7, 0.2, 1.5], dtype=np.int32)
    1
    >>> np.sum([[0, 1], [0, 5]])
    6
    >>> np.sum([[0, 1], [0, 5]], axis=0)
    array([0, 6])
    >>> np.sum([[0, 1], [0, 5]], axis=1)
    array([1, 5])
    >>> np.sum([[0, 1], [np.nan, 5]], where=[False, True], axis=1)
    array([1., 5.])

    If the accumulator is too small, overflow occurs:

    >>> np.ones(128, dtype=np.int8).sum(dtype=np.int8)
    -128

    You can also start the sum with a value other than zero:

    >>> np.sum([10], initial=5)
    15

All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings. Public methods (including the __init__ constructor) should also have docstrings. A package may be documented in the module docstring of the __init__.py file in the package directory.

String literals occurring elsewhere in Python code may also act as documentation. They are not recognized by the Python bytecode compiler and are not accessible as runtime object attributes (i.e. not assigned to __doc__), but two types of extra docstrings may be extracted by software tools:

  • String literals occurring immediately after a simple assignment at the top level of a module, class, or __init__ method are called “attribute docstrings”.
  • String literals occurring immediately after another docstring are called “additional docstrings”.

More information at https://peps.python.org/pep-0257/.
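To make these placement rules concrete, here is a small sketch (the Sensor class is made up for the example) showing a module docstring, class and constructor docstrings, and an attribute docstring:

"""Module docstring: a one-line summary of what this module provides."""


class Sensor:
    """Class docstring: what a Sensor object represents (made-up example)."""

    def __init__(self, name):
        """Constructor docstring: how instances are created."""
        self.name = name
        """Attribute docstring: only extracted by documentation tools."""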

Google vs NumPy vs others

How should docstrings be formatted? There are many variants that you can choose from; the two most popular styles are the Google style and the NumPy style.

The main difference between the two styles is that Google uses indentation to separate sections, whereas NumPy uses underlines.
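As a quick side-by-side sketch (the scale functions are made up for the example), the same documentation written in both styles looks like this:

# Made-up example functions, documented in the two styles
def scale_google(values, factor):
    """Multiply every element of a list by a factor (Google style).

    Args:
        values (list of float): Input values.
        factor (float): Multiplicative factor.

    Returns:
        list of float: The scaled values.
    """
    return [v * factor for v in values]


def scale_numpy(values, factor):
    """Multiply every element of a list by a factor (NumPy style).

    Parameters
    ----------
    values : list of float
        Input values.
    factor : float
        Multiplicative factor.

    Returns
    -------
    list of float
        The scaled values.
    """
    return [v * factor for v in values]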

Doctest

A doctest is a way to embed test cases within the documentation of a function or module in Python. The tests are written in the form of interactive Python sessions, and they are used to make sure that the code examples in the documentation are accurate. See, for example, the Examples section of the np.sum docstring shown above.

Let’s take a simple Python function that calculates the area of a rectangle:

import sys
import doctest

def area(width, height):
    """Calculate the area of a rectangle.

    >>> area(5, 10)
    50
    >>> area(2, 3)
    6
    """
    return width * height

if __name__ == "__main__":
    # Exit with a non-zero status code if any doctest failed
    sys.exit(doctest.testmod().failed != 0)

The tests are executed using:

python mymodule.py

This will run all of the doctests in the current module and report any failures. Alternatively you can run it using the doctest module from the command line:

python -m doctest mymodule.py

Exercise

Add a docstring to this function, documenting its arguments and its return value. Do not hesitate to also add type hints to the function definition. You can add any useful notes as well (see the example above), and finally add a doctest to it:

def vector_sum(vec, vec2):
    output = []
    for i in range(len(vec)):
        output.append(vec[i] + vec2[i])
    return output

Use generators!

Writing documentation is a time-consuming task that is easily skipped if we do not get some help. Typically, we recommend using an extension of your favourite code editor to automatically generate template docstrings (e.g. autoDocstring in VSCode and VSCodium), or using an integrated chatbot based on large language models (e.g. Microsoft Copilot). In the latter case, please be sure to check the output carefully: it may seem sensible at first glance, but turn out to be completely wrong on closer examination.

C++ API documentation

Introduction

In C++, the de-facto standard for API documentation is Doxygen. Like many API documentation tools, it works by parsing the program’s source code, looking for specially formatted code comments that precede the declaration of the code entity (type, function…) that is being documented.

// The following code comment is parsed by Doxygen as a function description,
// since it comes right before a function declaration.
//
// Notice how the \param directive can be used to document individual function
// parameters. You should not overuse this functionality: keep it for situations
// where the function documentation is not, on its own, complete enough for
// users to infer parameter semantics.

/*!
 * Compute the squared Euclidean norm of `vec`.
 *
 * In linear algebra, the norm of a vector quantifies how far away from the
 * origin the vector extends in the underlying vector space. Among possible
 * norms, one will most commonly use the Euclidean norm or 2-norm, which is
 * the square root of the sum of squared vector coordinates.
 *
 * In low-dimensional spaces, computing the square root is much more expensive
 * than other parts of the norm computation. Therefore, it is recommended to use
 * the squared norm over the actual Euclidean norm whenever your use case allows
 * for it.
 *
 * \param[in] vec Coordinates of a vector in an n-dimensional Euclidean space.
 */
double norm2(const std::vector<double>& vec);

This is a good design choice for an API documentation system because it brings the documentation as close as possible to the code entity that is being documented, which reduces the odds that as the code evolves, developers will forget to update associated API documentation.

However, this also means that Doxygen is a poor fit for documentation which is not as closely related to the source code, such as user tutorials, as using it for this purpose would drown the source code in heaps of long comments which are not strongly related to the program's main job. We will later discuss another documentation tool, mdBook, which is better suited to this sort of documentation.

Setting up Doxygen

First of all, you need to install the doxygen command line tool. For the purpose of this training, we have done it for you on the virtual machines on which the practicals are run. But in most Linux distributions, this is most easily done by using the doxygen package from your distribution's standard package manager. Other operating systems may require finding an alternate source of pre-built binaries, or even building Doxygen from source.

After installing Doxygen, the next step will be to create a configuration file for your project. This is most easily done by using the doxygen -g command, which will create a Doxyfile in the current directory that contains all configuration options, pre-set to mostly reasonable default values. The configuration file uses a simple KEY = VALUE format, supplemented by comments that explain what each configuration option does.

Options that you may want to adjust away from the defaults include…

  • EXTRACT_PACKAGE = YES: This ensures that every source file gets documented, even in absence of a file-wide documentation comment. This avoids the puzzling situation where you added documentation to some code entity but it doesn’t appear in the Doxygen output. In programming languages with a source/header dichotomy like C++, this setting is best complemented with removing every source file extension from the FILE_PATTERNS list.
  • QT_AUTOBRIEF = YES: This ensures that the first sentence of an entity’s documentation is interpreted as a summary, which is a good practice in general, and better than the visually noisy alternative of needing explicit \brief tag inside of each and every doc comment.
  • SHOW_INCLUDE_FILES = NO: This suppresses the default Doxygen behavior of listing every header included by a source file, which is usually more noisy than informative.
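Put together, the corresponding part of the Doxyfile would look like this (values taken from the list above):

# Doxyfile excerpt (KEY = VALUE format) matching the adjustments listed above
EXTRACT_PACKAGE    = YES
QT_AUTOBRIEF       = YES
SHOW_INCLUDE_FILES = NO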

Once Doxygen is configured, running doxygen in your source directory will produce, among other things, an html directory containing a set of HTML pages with your new source code documentation. It is advised to add this directory, along with other automatically generated Doxygen output, to your project’s .gitignore version control exclusion rules.

Another optional step which you may want to take is to set up your build system so that documentation can be built together with the source code in a single command. This ensures that the documentation is regularly built in your everyday development workflow, and errors in the documentation are therefore noticed quickly.

The apidoc/ subdirectory of the source code examples demonstrates a full setup that uses the aforementioned configuration, based on the GNU Make build system. Running make in this directory (or the parent directory) will automatically rebuild both the provided C++ programs and the associated Doxygen documentation.
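If you want to wire this up yourself rather than reuse the apidoc/ setup, a hypothetical Makefile fragment along these lines (program and file names are made up) is enough to rebuild code and documentation with a single make:

# Hypothetical Makefile fragment; recipe lines must be indented with a tab
all: my_program doc

my_program: main.cpp
	$(CXX) $(CXXFLAGS) -o $@ $<

doc:
	doxygen Doxyfile

.PHONY: all doc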

Writing the API documentation

A complete guide to Doxygen comments would be out of scope for this quick tutorial, so we will refer you to the official manual for that. But here is a complete header file that demonstrates some key features of Doxygen that you may be interested in when starting out:

#pragma once


// It is a good idea to start a header file with a Doxygen comment that explains
// what it is about, what common theme unifies all functionality implemented by
// the underlying code module.
//
// If you can't find a common theme that can be summarized in a single short
// sentence, your code module is probably trying to do too many things and would
// benefit from being split into several simpler code modules.
//
// Notice the following Doxygen syntax points:
//
// - Comments are flagged for Doxygen consumption using a marker character, in
//   this course we are using the Qt style of Doxygen comment where the marker
//   character is an exclamation mark.
// - We need to clarify that we are documenting the entire source file, rather
//   than some specific code object, using the \file directive.
// - Thanks to the QT_AUTOBRIEF = YES configuration option, the first sentence
//   of a Doxygen comment is automatically treated as a summarized description
//   of the entity that is being documented.

/*!
 * \file
 *
 * This file demonstrates basic usage of Doxygen for API documentation purpose.
 *
 * We will use it to demonstrate notions such as file-wide documentation,
 * function documentation, type documentation, and so on.
 */

#include <vector>


// The following code comment is parsed by Doxygen as a function description,
// since it comes right before a function declaration.
//
// Notice how the \param directive can be used to document individual function
// parameters. You should not overuse this functionality: keep it for situations
// where the function documentation is not, on its own, complete enough for
// users to infer parameter semantics.

/*!
 * Compute the squared Euclidean norm of `vec`.
 *
 * In linear algebra, the norm of a vector quantifies how far away from the
 * origin the vector extends in the underlying vector space. Among possible
 * norms, one will most commonly use the Euclidean norm or 2-norm, which is
 * the square root of the sum of squared vector coordinates.
 *
 * In low-dimensional spaces, computing the square root is much more expensive
 * than other parts of the norm computation. Therefore, it is recommended to use
 * the squared norm over the actual Euclidean norm whenever your use case allows
 * for it.
 *
 * \param[in] vec Coordinates of a vector in an n-dimensional Euclidean space.
 */
double norm2(const std::vector<double>& vec);


// Doxygen can also document much more complex code entities, such as class
// hierarchies.

/*! This is the base class of all evil
 *
 * It has no destructor because evil cannot be destroyed, only displaced.
 */
class EvilBase {
  public:
    /*! Automatically set up the evilness */
    EvilBase() = default;

    /*! Query the current level of evilness */
    int evilness() const { return m_evilness; }

  protected:
    /*! Manually configure the level of evilness (for experts only) */
    explicit EvilBase(int evilness) : m_evilness(evilness) {}

  private:
    /*! Current level of evilness */
    int m_evilness = 42;
};


/*! Travel management system powered by the souls of unborn children
 *
 * Hand washing is strongly advised after prolonged contact with instances of
 * this class. In case of accidental ingestion, seek immediate medical
 * assistance.
 */
class Notilus : public EvilBase {
  public:
    /*! Release the proverbial Kraken */
    Notilus() : EvilBase(666) {}
};

Exercise

Pick any C++ codebase (either one of the source directories supplied as part of this course, other than apidoc obviously, or possibly a simple program of yours) and try to set up Doxygen and write some documentation for some of the public entities in the source code.

Conclusion

As you can see, writing API documentation is not terribly difficult, and can be a significant asset to people getting familiar with your program if done right.

However, there is more to documentation than API docs. For example, while it is technically possible to use Doxygen to document higher-level architecture or write tutorials, as done by libraries like hwloc or Eigen, the quirks of Doxygen make this needlessly difficult. Nicer toolchains exist for standalone documentation, as we are going to see in the next chapter on mdbook.

Tutorial-style docs with mdBook

Introduction

API documentation tools are useful when documenting code. But if the document that you are writing is mainly structured prose with a small-ish amount of code in it, like the text that you are currently reading or user tutorials in general, there are nicer alternatives.

Static Site Generators (SSGs) that generate HTML from Markdown have gained tremendous popularity for this application in recent years because…

  • Starting from a plain text format makes it much easier to use version control systems.
  • Targeting HTML lets you leverage the huge ecosystem of web tools and hosting providers.
  • Among the various plain text formats that compile to HTML, Markdown is by far the most popular option, and arguably one of the easiest to learn.

Unfortunately, Markdown by itself (including its CommonMark standardized dialect) does not specify enough functionality to ease the writing of complex, multi-page documents. Hence we often need to tie ourselves to a specific renderer that extends Markdown enough to make it useful. In this practical, we will use mdBook for this purpose.

Getting started with mdBook

There are many ways to install mdBook on a machine. For the sake of simplicity, mdBook is already installed on the virtual machines that you are using.

Initialisation of the book

The mdbook init command will create a new directory containing an empty book for you to get started. Give it the name of the directory that you want to create:

mdbook init my-first-book

It will ask a few questions before generating the book. After answering the questions, you can change the current directory into the new book, and inspect the directory:

$ tree -L 2 .
.
├── book
├── book.toml
└── src
    ├── chapter_1.md
    └── SUMMARY.md

2 directories, 3 files

The file book.toml contains the settings describing how to build your book. It is written in the TOML format. The default settings are usually good enough to get you started. If you are interested in exploring more features and options that mdBook provides, check out the Configuration chapter for more details.

The folder src contains the Markdown sources of your book. You can add as many files as you want. Note that the SUMMARY.md file defines the table of contents: you can have many files in src, but only those listed in SUMMARY.md will actually appear in the rendered book.
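
For illustration, a minimal SUMMARY.md could look like the following sketch (the second page is a placeholder; only chapter_1.md exists in a freshly initialised book):

# Summary

- [Chapter 1](./chapter_1.md)
- [My notes](./my_notes.md)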

Finally the folder book, initially empty, will contain the rendered book.

Rendering a book

There are several ways to render a book, but one of the easiest methods is to use the serve command, which will build your book and start a local webserver. As you are on a virtual machine, you have to specify the hostname as well as an open port:

mdbook serve --hostname <IP> --port <PORT>

where <IP> is the IP of your machine, and <PORT> is set to 24000 here (we opened this port for you). Just follow the URL displayed in the terminal to read your book. You can leave the server running while you edit the content of the book: mdBook will automatically rebuild the output and refresh your web browser. Note that the book folder now contains many more files (recreated each time you make modifications in the src folder).

Publishing a book

Once you have written your book, you may want to host it somewhere for others to view. The first step is to build the output of the book. This can be done with the mdbook build command in the same directory where the book.toml file is located:

mdbook build

This will update the directory named book which contains the HTML content of your book. You can then place this directory on any web server to host it.

For more information about publishing and deploying, check out the Continuous Integration section from the mdBook documentation, and we will play with a concrete example at the end of this training (Continuous integration).

Exercise: build a logbook

Build a new book, whose sections follow the sections of this training. In each section, you will write your findings, notes and feedback about this training :-)

Testing your code: the basics

Testing a code… a very vast subject. For some it is a fuzzy concept, borderline boring, for others it is the heart of programming; but everyone agrees that testing a code correctly is not an easy task.

In most languages, there are libraries to facilitate the writing of tests; still, it remains up to you to make the tests relevant and effective in order to understand the limits and flaws of your implementation and design. And then there is the age-old question: have I tested my code enough?

It might not be an existential question if you’re writing a little library for yourself only (though…), but the question quickly becomes legitimate. Keep in mind that most big software projects started out as small projects that grew unexpectedly. Here is how the creator of Linux described the first public version of the now-ubiquitous operating system:

I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones.

In this practical, we will cover two common tools to automate the execution of tests. As these tools are language-specific you will need to pick which language you want to study here:

Basic testing with pytest

In Python, there are libraries to facilitate the writing of tests. You have seen in the documentation section the use of doctest to write simple tests attached to the documentation of a method or a class. While doctest provides a first layer of tests, called unit tests, it is not well suited to more complex tests, outside the scope of a single object.

The other widely used test framework in Python is pytest. It is a powerful tool for discovering and running tests, and it has a number of useful features, such as the ability to rerun only the tests that failed during the last run, and support for running tests in parallel.
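
For instance (as a sketch; the second command relies on the pytest-xdist plugin, which may or may not be installed in your environment):

# Re-run only the tests that failed during the previous run
pytest --lf

# Run tests in parallel across available CPU cores (needs pytest-xdist)
pytest -n auto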

Basic usage

You will find at ~/robust-programming/tests/test_addition.py a simple test for the add function:

def add(a: int, b: int):
    """ Add two integers

    Parameters
    ----------
    a: int
        First integer
    b: int
        Second integer

    Returns
    -------
    out: int
        Sum of `a` and `b`

    Examples
    --------
    >>> add(2, 3)
    5
    """
    return a + b

def test_addition():
    assert add(2, 3) == 5
    assert add(-2, 3) == 1
    assert add(2, -3) == -1
    assert add(0, 0) == 0

In this script we define the function together with its corresponding test suite, although in practice you will decouple the code and the tests into different files (even different directories) and import the necessary modules in the test files. Just execute this test and observe the result:

pytest ~/robust-programming/tests/test_addition.py


============================ test session starts =============================
platform linux -- Python 3.9.12, pytest-7.1.0, pluggy-1.0.0
Using --randomly-seed=1793550101
rootdir: /home/peloton/codes/robust-programming/practicals
plugins: asdf-2.14.3, typeguard-4.1.5, anyio-3.5.0, cov-4.0.0, factoryboy-2.5.1, rerunfailures-13.0, randomly-3.15.0
collected 1 item                                                             

code/tests/test_addition.py .                                          [100%]

============================= 1 passed in 0.16s ==============================

There are several pieces of information in this report:

  • platform: it gives you information on which platform (linux here) the tests ran, and versions for Python and pytest used. Report this information for example in bug reports or merge requests to ease the reproducibility of results.
  • seed: the initial seed used when running the test (useful if your program uses randomly generated numbers, to allow reproducing test runs).
  • plugins: additional plugins loaded (might not be used)
  • collected item: number of tests run. Notice that we check 4 assertions, but the number of collected items is 1: pytest reports the number of test functions.
  • passed items: Summary of success and failure. In this case, our test passes.

Exercise: Change one value of a test to fail it on purpose, re-run pytest, and observe the report.

Use with doctest

You can also run doctest within pytest. Just add the option --doctest-modules when running the test:

pytest --doctest-modules ~/robust-programming/tests/test_addition.py

Note that in this case, pytest collected 2 items (our manually defined test plus the doctest attached to the function add).

Parametrize tests

Our current way of defining tests is not very compact (a series of copy/pastes with different values). You can instead parametrize your test by templating the operation to perform and specifying the list of input/output values to test.
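
As a hint, here is a minimal sketch of the decorator applied to a hypothetical multiply function (deliberately not the add function of the exercise):

import pytest

def multiply(a: int, b: int) -> int:
    return a * b

# One test function, several input/output combinations
@pytest.mark.parametrize(
    "a, b, expected",
    [
        (2, 3, 6),
        (-2, 3, -6),
        (0, 5, 0),
    ],
)
def test_multiply(a, b, expected):
    assert multiply(a, b) == expected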

Exercise: Rewrite the test by using the pytest.mark.parametrize decorator.

Handle exceptions

Being aware of the limitations of your code is as important as knowing what your code does under normal circumstances. Hence it is a good practice to identify and test exceptions raised by your code.

The pytest.raises function allows you to test that a specific exception is raised when certain code is run. This can be useful for verifying that your code is handling errors and exceptions as expected.
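
As a minimal sketch (deliberately unrelated to the exercise below), here is how pytest.raises can check that Python's built-in int() rejects a non-numeric string:

import pytest

def test_invalid_conversion():
    # int() raises a ValueError when its argument cannot be parsed as an integer
    with pytest.raises(ValueError):
        int("not a number")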

Exercise: Implement a test to check that the function add raises a TypeError if one of the two arguments is a string.

Coverage

One of the advantages of using a tool such as pytest is the ability to compute the coverage of our test suite, that is, the percentage of our code that is exercised by the tests. To enable the coverage report, you can run pytest with the options --cov and --cov-report:

pytest --cov=. --cov-report=term ~/robust-programming/tests/test_addition.py

---------- coverage: platform linux, python 3.9.12-final-0 -----------
Name                          Stmts   Miss  Cover
-------------------------------------------------
code/tests/test_addition.py       7      0   100%
-------------------------------------------------
TOTAL                             7      0   100%

Covered in this context means being hit during the execution of the test suite – which does not necessarily mean it is tested meaningfully.

Note that if you want to inspect in detail which lines of your code are covered or not, you will prefer the argument --cov-report=html. pytest will create a folder htmlcov containing the source of a web page with the line-by-line coverage report for each module.

Free practice

Write tests for your code, and execute them with pytest. You might want to explore other functionalities of pytest such as mock.
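
For instance, here is a small sketch using pytest's built-in monkeypatch fixture (a lightweight relative of mock) to temporarily override an environment variable; the get_data_dir function is purely hypothetical:

import os

def get_data_dir():
    return os.environ.get("DATA_DIR", "/tmp")

def test_get_data_dir(monkeypatch):
    # monkeypatch temporarily patches attributes and environment variables,
    # and restores them automatically at the end of the test
    monkeypatch.setenv("DATA_DIR", "/custom/path")
    assert get_data_dir() == "/custom/path"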

Basic testing with Catch2

Introduction

The easiest way to test software is to write a bunch of assertions. These are simple statements which assert that, given certain predetermined inputs, your code behaves in a certain way (produces a certain result, throws an exception…).

It is certainly possible to write basic assertions by hand…

// Assume this is the code we want to test
int triple(int x) {
    return 3 * x;
}

// Somewhere else in the source tree, we might write this test program:

#include <exception>
#include <iostream>
#include <sstream>

int main() {
    const int result1 = triple(0);
    if (result1 != 0) {
        std::ostringstream s;
        s << "expected triple(0) to be 0, but got " << result1;
        throw std::runtime_error(s.str());
    }

    const int result2 = triple(1);
    if (result2 != 3) {
        std::ostringstream s;
        s << "expected triple(1) to be 3, but got " << result2;
        throw std::runtime_error(s.str());
    }

    const int result3 = triple(2);
    if (result3 != 6) {
        std::ostringstream s;
        s << "expected triple(2) to be 6, but got " << result3;
        throw std::runtime_error(s.str());
    }

    std::cout << "All tests completed successfully" << std::endl;
}

…but as you can see, this gets verbose quickly, and we have not even started introducing other commonly desired test harness features like…

  • Being able to run all tests even if some of them fail, and report on all failing tests at the end.
  • Being able to selectively run some tests, without running the other ones.

Therefore, you will soon want to introduce some abstractions that reduce the amount of code you need to write for each test. Thankfully, you do not need to write this code yourself, as there are many high-quality test frameworks available for C++. In this practical, we will use Catch2.

Using it, the above test program becomes…

#include <catch2/catch_test_macros.hpp>
#include <catch2/catch_session.hpp>

TEST_CASE("triple() multiplies a number by 3") {
  REQUIRE(triple(0) == 0);
  REQUIRE(triple(1) == 3);
  REQUIRE(triple(2) == 6);
}

int main(int argc, char* argv[]) {
  return Catch::Session().run(argc, argv);
}

…which as you can see is a lot more readable, focused on the task at hand, and yet more featureful (you get the aforementioned test harness features for free, among others).

Basic Catch2 usage

In Catch2, as in many other test frameworks, testing is done using test cases, which are code blocks described by a string and containing one or more assertions. If one of the assertions fails, you will automatically get an error message telling you which test case(s) failed and why:

Randomness seeded to: 3094418497

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_triple_catch.bin is a Catch2 v3.5.3 host application.
Run with -? for options

-------------------------------------------------------------------------------
triple() multiplies a number by 3
-------------------------------------------------------------------------------
test_triple_catch.cpp:7
...............................................................................

test_triple_catch.cpp:9: FAILED:
  REQUIRE( triple(1) == 3 )
with expansion:
  2 == 3

===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed

Test cases terminate on the first failed assertion, so if you want to test multiple mathematically unrelated properties, it is a good practice to test them using separate test cases. But for any nontrivial test, doing this naively means duplicating common setup code. The standard way to avoid this in Catch2 is to use sections:

#include <catch2/catch_test_macros.hpp>
#include <catch2/catch_session.hpp>

#include <cstddef>
#include <vector>

TEST_CASE("std::vector size/capacity matches expectations") {
  // Common setup
  std::vector<int> v(42);
  const size_t initial_capacity = v.capacity();

  // Common expectations
  REQUIRE(v.size() == 42);
  REQUIRE(initial_capacity >= 42);

  // Section that tests push_back
  SECTION("adding an element increases the size, may affect capacity") {
    v.push_back(123);
    REQUIRE(v.size() == 43);
    if (initial_capacity >= 43) {
      REQUIRE(v.capacity() == initial_capacity);
    } else {
      REQUIRE(v.capacity() >= 43);
    }
  }

  // Section that tests pop_back
  SECTION("removing an element decreases the size, capacity is unaffected") {
    v.pop_back();
    REQUIRE(v.size() == 41);
    REQUIRE(v.capacity() == initial_capacity);
  }

  // Section that tests clear
  SECTION("clearing the vector zeroes the size, capacity is unaffected") {
    v.clear();
    REQUIRE(v.size() == 0);
    REQUIRE(v.capacity() == initial_capacity);
  }
}

int main(int argc, char* argv[]) {
  return Catch::Session().run(argc, argv);
}

From this code, Catch2 will run the test case three times, once per section: each run performs the same initial setup (and initial assertion checks), then tests how a different method of std::vector affects the vector’s size and capacity.

For more information on the functionality provided by Catch2 and the way it is used, please refer to the library’s reference documentation.

Exercise

Set up Catch2 on one of the example programs provided as part of this course, or a program of your own, and write some test cases with assertions. If you are not using the provided Makefiles, bear in mind that you will need to add -lCatch2 to your linker flags in order to link to the Catch2 library.

Property-based testing

A time-honored technique to avoid introducing bugs while modifying software is to set up automated tests which assert that code still does what it's supposed to do (e.g. produces expected results). But it is not so easy to get this right:

  • A test is most useful when it targets a small software component (e.g. single function, class…) because when that fails, it gives you a precise indication of where the problem is, which saves you debugging time. But that means you need to write a lot of tests, which is tedious.
  • Because writing tests is tedious, manually written tests tend to be very unimaginative: they always test against the same inputs, do not try many different configurations… and that increases the odds that bugs which only affect more exotic edge cases will be missed.
  • The above problem is magnified by the fact that in most scientific projects, the same person writes the code being tested and the test that exercises it: any edge case you missed while writing the code, you are likely to also miss when writing the associated test.

Thankfully, there is a way to make fine-grained tests a little easier to write and much more imaginative, and that is to write tests that operate over random inputs, and assert properties that should be true of any output. This is called property-based testing.

Defining a property

The concept of property-based testing is borrowed from functional programming. To summarize, the idea is no longer to predict the output knowing a particular input, but to determine and test the properties of our function. But what do we mean by property? Mathematically, a property p for all elements of a set X is usually defined as a function p: X → {true, false}. In practice, however, finding such properties is not always obvious.

Let’s reason about a simple example: a function that takes two numbers and adds them:

def add_numbers(a: int, b: int) -> int:
    return a + b

What particular property (or properties) characterizes the add_numbers function?

Looking at the function, we see that it is based around the addition operator. Unlike subtraction, addition is commutative, so add_numbers(a, b) == add_numbers(b, a). But we could have introduced a bug by writing the operator * instead of +, and the test would still pass… So what other property distinguishes addition from multiplication? The neutral element! add_numbers(a, 0) == a is not satisfied by multiplication.

We have therefore isolated at least two properties which allow us to test our function (its implementation) without having to resort to explicit input/output values (well almost, see later). Finding properties is the real difficulty of this type of test. Here are some properties that often come up in our scientific developments (the list is endless!):

  • The invariants (e.g. size of a collection when applying a map, or the contents of a collection when sorting it)
  • Inverses (invertible function)
  • Idempotence (applying the function twice yields the same result as applying it once)
  • Properties of internal composition laws: associativity, commutativity…
  • Structural induction (solve a smaller problem first)
  • Hard to prove, easy to verify (think of a maze)
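
To make this concrete, here is a minimal sketch (plain Python, no framework yet) of how the two add_numbers properties identified above could be checked against randomly drawn integers:

import random

def add_numbers(a: int, b: int) -> int:
    return a + b

for _ in range(1000):
    a = random.randint(-10**9, 10**9)
    b = random.randint(-10**9, 10**9)
    # Commutativity: the order of the arguments does not matter
    assert add_numbers(a, b) == add_numbers(b, a)
    # Neutral element: adding 0 changes nothing (this rules out multiplication)
    assert add_numbers(a, 0) == a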

A less trivial example

Let us assume, for example, that you are writing a C++ routine that takes an integer as input and decomposes it into prime factors. It could have the following signature:

std::vector<unsigned> prime_factors(unsigned input);

What can we tell about such a function without looking at the implementation?

  • All output integers should be prime numbers.
  • The product of all output integers should be equal to input.

So we can write a test that, given random unsigned integers, passes them through prime_factors and asserts that the result has these two properties:

#include <cstdlib>    // for rand()
#include <stdexcept>
#include <string>
#include <vector>

// prime_factors() is assumed to be declared as shown above
std::vector<unsigned> prime_factors(unsigned input);

void assert(bool expectation, std::string message) {
    if (!expectation) {
        throw std::runtime_error(message);
    }
}

bool is_prime(unsigned number) {
    for (unsigned i = 2; i <= number / 2; ++i) {
        if (number % i == 0) {
            return false;
        }
    }
    return true;
}

int main() {
    for (int i = 0; i < 100; ++i) {
        // NOTE: This does not cover all possible values of "unsigned", in the
        //       exercises we will use a better randomness source that does.
        unsigned input = rand();

        std::vector<unsigned> factors = prime_factors(input);

        unsigned product = 1;
        for (const unsigned factor: factors) {
            assert(is_prime(factor), "Returned factor is not a prime number");
            product *= factor;
        }
        assert(product == input, "Factor product is not original input");
    }
}

And that, in a nutshell, is the idea of property-based testing. Of course, real world property-based testing frameworks have more features, such as printing out which inputs made the test fail or searching for the simplest input that makes the test fail. But the basic principle is simple: unless you are explicitly testing for a known tricky edge case, write your tests to operate over random inputs and assert things that should be true of any output.

Now that you are hopefully convinced that this is a testing technique worth learning, please choose in which programming language you would like to experiment with it:

Python property-based testing

In Python, one of the most established frameworks for property-based testing is Hypothesis. It has been pre-installed on your user sessions of the practicals platform, along with the pytest framework which provides a convenient way to run your tests.

To demonstrate how Hypothesis works, we will reimplement a few classical linear algebra primitives using Python lists of floating point numbers, as well as other popular methods, hopefully getting along the way some appreciation of all the work that is saved when we use NumPy instead of rolling our own code.

A first example

Let’s start with a simple vector sum implementation (that you documented earlier!):

def vector_sum(vec, vec2):
    output = []
    for i in range(len(vec)):
        output.append(vec[i] + vec2[i])
    return output

We can write a property-based test for it using Hypothesis as follows:

from hypothesis import given
from hypothesis.strategies import floats, lists

@given(lists(floats()), lists(floats()))
def test_sum(vec, vec2):
    result = vector_sum(vec, vec2)
    for i in range(len(result)):
        assert result[i] == vec[i] + vec2[i]

Notice how we design the test as a function that takes certain inputs, whose nature is specified via the @given decorator.

In any case, if we execute the test defined above, it fails:

$ pytest sum.py
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.2.2, pluggy-1.0.0
rootdir: /tp/homes/tmp_user_1/tests
plugins: hypothesis-6.70.0
collected 1 item                                                               

sum.py F                                                                 [100%]

=================================== FAILURES ===================================
___________________________________ test_sum ___________________________________

    @given(lists(floats()), lists(floats()))
>   def test_sum(vec, vec2):

sum.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sum.py:12: in test_sum
    result = vector_sum(vec, vec2)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

vec = [0.0], vec2 = []

    def vector_sum(vec, vec2):
        output = []
        for i in range(len(vec)):
>           output.append(vec[i] + vec2[i])
E           IndexError: list index out of range
E           Falsifying example: test_sum(
E               vec=[0.0],
E               vec2=[],
E           )

sum.py:7: IndexError
=========================== short test summary info ============================
FAILED sum.py::test_sum - IndexError: list index out of range
============================== 1 failed in 1.06s ===============================

What the failure message tells us is that if it is called with one list of length 1 and one empty list, our vector_sum() crashes at the line output.append(vec[i] + vec2[i]). That’s because as currently written, the code assumes that both input lists are of the same length (i.e. both input vectors have the same dimensionality), which is not true of an arbitrary pair of vectors.

Such assumptions are really easy to make when writing code on a bad day, and they are hard to find because as soon as the assumption is ingrained in your brain, any hand-written test you write is likely to use an input that follows the same assumption. But here, it only took a bit of randomness to expose it. Such is the power of property-based testing.

Now that we found an unexpected input configuration, we need to think about how we want to handle it. Passing in two vectors of different dimensionality looks like a usage error from the caller of the function, so perhaps just crashing with a more self-explanatory error message is enough?

def vector_sum(vec, vec2):
    assert len(vec) == len(vec2), "Can't sum vectors of different lengths"
    output = []
    for i in range(len(vec)):
        output.append(vec[i] + vec2[i])
    return output

With this change, our test would fail with a nicer error message, but it would still fail. We also need to teach the test that failing with an AssertionError is the expected behavior when vectors of different length are passed in. And while we’re at it, we will also add a check that ensures the output has the same length as the inputs:

@given(lists(floats()), lists(floats()))
def test_sum(vec, vec2):
    try:
        result = vector_sum(vec, vec2)
    except AssertionError:
        # We only expect an assertion failure if the lengths differ
        assert len(vec) != len(vec2)
        # In that case, the test can't continue: there is no result
        return

    # If the test passes, then all input and output lengths should match
    assert len(vec) == len(vec2)
    assert len(vec) == len(result)

    # Value check is as before
    for i in range(len(result)):
        assert result[i] == vec[i] + vec2[i]

Gimme gimme gimme a NaN

And with these changes, we get… a rather different test failure:

$ pytest sum.py
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.2.2, pluggy-1.0.0
rootdir: /tp/homes/tmp_user_1/tests
plugins: hypothesis-6.70.0
collected 1 item                                                               

sum.py F                                                                 [100%]

=================================== FAILURES ===================================
___________________________________ test_sum ___________________________________

    @given(lists(floats()), lists(floats()))
>   def test_sum(vec, vec2):

sum.py:12: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

vec = [0.0], vec2 = [nan]

    @given(lists(floats()), lists(floats()))
    def test_sum(vec, vec2):
        try:
            result = vector_sum(vec, vec2)
        except AssertionError:
            # We only expect an assertion failure if the lengths differ
            assert len(vec) != len(vec2)
            # In that case, the test can't continue: there is no result
            return
    
        # If the test passes, then all input and output lengths should match
        assert len(vec) == len(vec2)
        assert len(vec) == len(result)
    
        # Value check is as before
        for i in range(len(result)):
>           assert result[i] == vec[i] + vec2[i]
E           assert nan == (0.0 + nan)
E           Falsifying example: test_sum(
E               vec=[0.0],
E               vec2=[nan],
E           )

sum.py:27: AssertionError
=========================== short test summary info ============================
FAILED sum.py::test_sum - assert nan == (0.0 + nan)
============================== 1 failed in 1.67s ===============================

Here, the problem is that floating point math is complicated (a topic for another course). If we claim to accept any list of floating-point numbers as input, then we must account not just for “normal” numbers, but also for special numbers that the IEEE-754 standard defines, such as infinities and NaN (Not a Number). Alas, the latter does not honor normal math identities like x == x.

Now we need to think about what we want, as there are several reasonable ways to handle this:

  • We could check inputs and fail with an error if they cannot be summed with reasonable results. This would be appropriate in projects where we care a lot about not performing meaningless computations (and eventually emitting meaningless results), but checking inputs may be costly if we expect high-dimensional input vectors.
  • We could warn users that they cannot expect sane results if the function is fed with special IEEE-754 numbers. In that case, rather than modifying the code of the vector_sum() function, we would modify its documentation to clarify the situation, then modify test_sum() to ignore results on pathological input.

Since we have already shown an example of the first strategy (error-checking in the function being tested), we will now show an example of the second strategy:

import math
from hypothesis import given
from hypothesis.strategies import floats, lists

# Will produce unexpected results on special IEEE-754 numbers: inf, NaN...
def vector_sum(vec, vec2):
    assert len(vec) == len(vec2), "Can't sum vectors of different lengths"
    output = []
    for i in range(len(vec)):
        output.append(vec[i] + vec2[i])
    return output

@given(lists(floats()), lists(floats()))
def test_sum(vec, vec2):
    try:
        result = vector_sum(vec, vec2)
    except AssertionError:
        assert len(vec) != len(vec2)
        return

    assert len(vec) == len(vec2)
    assert len(vec) == len(result)

    for i in range(len(result)):
        # Do not check results on non-finite numbers (NaN and inf)
        if math.isfinite(vec[i]) and math.isfinite(vec2[i]):
            assert result[i] == vec[i] + vec2[i]

After this change, the test passes, showing that the vector_sum() function now behaves in the way that the test expects it to:

$ pytest sum.py
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.2.2, pluggy-1.0.0
rootdir: /tp/homes/tmp_user_1/tests
plugins: hypothesis-6.70.0
collected 1 item                                                               

sum.py .                                                                 [100%]

============================== 1 passed in 1.18s ===============================

So, what can we make of all this?

  • Testing is not just about finding bugs in the function being tested. It’s also about forming a better understanding of the problem we’re facing, and making conscious choices about what inputs we want to handle and how. A test failure indicates that the test assertions do not accurately reflect how the function being tested works, but this mismatch can be a problem in either the function or the test code.
  • Property-based testing is very effective at forcing us to think about uncomfortable edge cases that our brain would rather ignore, like special floating-point numbers.
  • By freeing us from inefficient human generation of interesting test inputs, property-based testing lets us focus on more interesting questions about how our function should behave, which is what testing should be about.
  • Property-based testing is not a panacea, however. In this example, it is attention to detail that led us to notice that, as initially formulated, the test did not check that the output was as long as the inputs. Similarly, if a function has a very narrow edge case that is only triggered by very specific inputs, random search is unlikely to hit that configuration. In this case, you may need to hint the algorithm or outright force it to go through known-tricky inputs.

Exercise

You will find the above property-based testing example in the ~/robust-programming/tests subdirectory of your home directory on the practicals server, where it is called sum.py. Alongside it, you will find incomplete files for the following three exercises on property-based testing:

  • Sorting a list: sort_list.py
  • Dot product: dot.py
  • Condorcet method: condorcet.py

The goal of these exercises is for you to fill them in. You will probably not have time to complete all of them, so choose based on your level.

Beginner: sorting a list

Let’s suppose we write a function that takes a list of integers as input, and returns a sorted version of the list (yes, I know there are ready-made functions for that…):

def sort_list(alist: list) -> list:
    """ sort the input list in a naive way """
    if len(alist) <= 1:
        return alist

    out = []
    alist_tmp = alist.copy()
    while alist_tmp:
        out.append(min(alist_tmp))
        alist_tmp.pop(alist_tmp.index(min(alist_tmp)))
    return out

Exercise: find and implement properties that characterise the sort function.

Medium: dot product of two vectors

Write a function that computes the dot product of two vectors, defined as lists of floating point numbers as above, then write some tests that assert that it does so correctly. To avoid duplicating the full dot product computation in the test (which is unlikely to help you find any bugs, since the test and the function being tested would share any misunderstanding from your side), try the following strategy:

  1. Start by generating a vector where only one component is not zero, and compute the dot product of that vector with itself. The result is the squared norm of the vector, which in this scenario should be very easy to predict. You may find the following test skeleton useful, feel free to check out the Hypothesis documentation to learn more about the various functions that are being used in the @given decorator.

    from hypothesis import given
    from hypothesis.strategies import floats, integers, just, tuples
    
    max_len = 1024
    
    @given(
        integers(min_value = 1, max_value = max_len).flatmap(
            lambda length: tuples(
                floats(),
                integers(min_value = 0, max_value = length - 1),
                just(length)
            )
        )
    )
    def test_norm2_easy(params):
        coord, idx, length = params
        assert length >= 1
        assert idx < length
    
  2. Now generate two vectors of this sort and compute their dot product. Again, the result should be easy to predict without writing down the full dot product computation in your test. You may want to extract the logic for generating one vector into a reusable building block to avoid copy-and-paste, which you can do as follows:

    easy_vector_params = integers(min_value = 1, max_value = max_len).flatmap(
        lambda length: tuples(
            floats(),
            integers(min_value = 0, max_value = length - 1),
            just(length)
        )
    )
    
    @given(easy_vector_params, easy_vector_params)
    def test_dot_easy(params1, params2):
        pass
    
  3. Now generate one arbitrary vector, as done in the vector addition example above, and compute its dot product with itself. This gives you its squared norm, as before. Test that this squared norm is greater than or equal to the square of the largest vector component.

  4. Finally, generate two arbitrary vectors, and check that their dot product follows the Cauchy-Schwarz inequality: pow(dot(v1, v2), 2) <= dot(v1, v1) * dot(v2, v2).

Note that floating-point rounding errors differ between multiple ways to do the same computation, so you may need to use approximate comparisons like math.isclose() in some of your tests.
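
For example, here is a self-contained illustration of the difference between exact and approximate floating-point comparison:

import math

# Two ways of computing the same quantity that differ only by rounding error
a = 0.1 + 0.2
b = 0.3
assert a != b              # exact comparison fails because of rounding
assert math.isclose(a, b)  # approximate comparison succeeds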

Hard: intransitivity of the Condorcet method

Condorcet’s method is a voting system in which each voter draws up an ordered list of candidates and the only winner, if there is one, is the candidate who, compared in turn with each of the other candidates, would each time prove to be the preferred candidate.

The curiosity of this system is that there is no guarantee that there will be a candidate who satisfies the victory criterion. In other words, this system is not transitive: if A is better than B, and B is better than C, A is not necessarily better than C. This is Condorcet’s paradox, and we propose to demonstrate it with Hypothesis.

Exercise: Write a test function which claims that Condorcet’s system is transitive, and show that this test does not work.

Free practice

You probably have plenty of code of your own, so why not try implementing property-based tests in your existing test suites?

Hypothesis, numpy & pandas

You can integrate strategies for NumPy and pandas into Hypothesis quite naturally: https://hypothesis.readthedocs.io/en/latest/numpy.html
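
As a small sketch (the property tested here is only an illustration), a NumPy array strategy can be used like any other Hypothesis strategy:

import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import floats

# Strategy producing 1-D float64 arrays of length 3, without NaN or infinities
finite_vectors = arrays(np.float64, 3,
                        elements=floats(allow_nan=False, allow_infinity=False))

@given(finite_vectors, finite_vectors)
def test_addition_is_commutative(v1, v2):
    assert np.array_equal(v1 + v2, v2 + v1)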

Hypothesis & pytest

There is a Hypothesis plugin for pytest. Everything is fairly transparent (as soon as you have the two libraries installed, it works by itself), and it allows you to insert your property-based tests naturally into your existing test suites.

Health checks

Hypothesis not only exposes tools for carrying out tests, but also offers features for detecting poor programming practices that can hinder the proper maintenance and execution of code:

  • Tests that require overly complex data and slow down execution.
  • Tests that do not explore the parameter space sufficiently.
  • Faulty recursion strategies (i.e. that diverge too much).
  • Tests that have no chance of finishing before the end of the world.

See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information.
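
If a health check fires for a test that legitimately needs slow or bulky inputs, you can explicitly relax it. The following sketch (with an artificial test body) shows the corresponding settings decorator:

from hypothesis import HealthCheck, given, settings
from hypothesis.strategies import floats, lists

# Explicitly relax health checks for a test known to draw large inputs
@settings(deadline=None, suppress_health_check=[HealthCheck.too_slow])
@given(lists(floats(), max_size=10_000))
def test_with_large_inputs(vec):
    assert len(vec) <= 10_000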

C++ property-based testing

There are several C++ test frameworks for property-based testing, and a clear winner has yet to emerge. Here we will be using the RapidCheck framework, which comes pre-installed on your user sessions of the practicals platform.

To demonstrate how RapidCheck works, we will reimplement a few classical linear algebra primitives using std::vector<float>, hopefully getting along the way some appreciation of all the work that is saved when we use a linear algebra library like Armadillo instead of rolling our own code.

A first example

Let’s start with a simple vector sum implementation:

#include <stdlib.h>
#include <vector>

std::vector<float> vector_sum(const std::vector<float>& vec,
                              const std::vector<float>& vec2) {
    std::vector<float> output;
    for (size_t i = 0; i < vec.size(); ++i) {
        output.push_back(vec.at(i) + vec2.at(i));
    }
    return output;
}

We can write a property-based test for it using RapidCheck as follows:

#include <rapidcheck.h>

int main() {
    rc::check(
        "vector_sum sums vectors",
        [](const std::vector<float>& vec, const std::vector<float> &vec2) {
            std::vector<float> result = vector_sum(vec, vec2);
            for (size_t i = 0; i < result.size(); ++i) {
                RC_ASSERT(result[i] == vec[i] + vec2[i]);
            }
        }
    );
}
If the syntax used for the second parameter of the rc::check function surprises you, click this text to learn more about it.

This is a lambda expression, a shorthand for defining a function on the spot that has been available since the 2011 version of the C++ standard (C++11). In this particular case, you could have achieved the same result by defining a separate function like this…

void vector_sum_sums_vectors(const std::vector<float>& vec,
                             const std::vector<float>& vec2) {
    std::vector<float> result = vector_sum(vec, vec2);
    for (size_t i = 0; i < result.size(); ++i) {
        RC_ASSERT(result[i] == vec[i] + vec2[i]);
    }
}

int main() {
    rc::check("vector_sum sums vectors", vector_sum_sums_vectors);
}

…but when using frameworks like RapidCheck that rely on you defining a lot of single-use functions, it is nicer to be able to define them right at the only point where they are being used, rather than produce spaghetti code where the reader must constantly jump from one place of the code to another in order to understand what’s going on.

In any case, if we compile and run the test defined above, it fails:

$ make -j$(nproc) && ./sum.bin
Using configuration: seed=15308558600823943072

- vector_sum sums vectors
Falsifiable after 5 tests and 6 shrinks

std::tuple<std::vector<float>, std::vector<float>>:
([0], [])

Exception thrown with message:
vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
Some of your RapidCheck properties had failures. To reproduce these, run with:
RC_PARAMS="reproduce=BchdlNGdvJ3XzVXbgMXdtNHI2V2Y09mczB6rcFH8sLH1g+KXxBP7yRNovyV

What this error message tells us, in a fairly roundabout way, is that if our test is passed one single-element std::vector and one empty std::vector, an exception is thrown because an std::vector is indexed outside of its useful range.

If you cannot guess where that’s coming from, you could locate the point of failure using a debugger like GDB or by adding printouts at strategic places of the code. Whichever way you go, what you will eventually figure out is that the problem lies inside of the vector_sum() function:

std::vector<float> vector_sum(const std::vector<float>& vec,
                              const std::vector<float>& vec2) {
    std::vector<float> output;
    // ERROR: This loop has vec.size() iterations...
    for (size_t i = 0; i < vec.size(); ++i) {
        // ...then it tries to index both vec and vec2 with the corresponding
        // values of i. But what if vec2 has fewer elements than vec?
        output.push_back(vec.at(i) + vec2.at(i));
    }
    return output;
}

Basically, as currently written, the code assumes that the two input vectors, vec and vec2, have the same dimensionality, which may not be the case.

Such assumptions are really easy to make when writing code on a bad day, and they are hard to find because as soon as the assumption is ingrained in your brain, any hand-written test you write is likely to use an input that follows the same assumption. But here, it only took a bit of randomness to expose it. Such is the power of property-based testing.

If you are surprised that this situation leads to an exception being thrown, click this text for more information on what's going on.

The only reason why this out-of-bounds indexing problem was reliably reported with an exception was that I made the choice to use the uncommon at() method, rather than the more usual [] indexing operator. Without this precaution, the program would exhibit undefined behavior, a very problematic situation where program behavior becomes unpredictable because the compiler applies code transformations under the assumption that undefined behavior does not happen.

As a result of these incorrect code transformations, a program that exhibits undefined behavior can do basically anything, including but not limited to…

  • Crashing with an unclear error message such as “Segmentation fault”.
  • Reading invalid data from a memory location it’s not supposed to access, possibly leaking secret user data like passwords to the outside world in the process.
  • Changing memory locations it’s not supposed to write to, leading other parts of the program to misbehave in bizarre ways later on.
  • Executing unexpected instructions injected by an attacker trying to get into your system.
  • Not doing the expected work at all, perhaps jumping to an unexpected stream of instructions somewhere else in the program or perhaps just terminating without having done anything.
  • Or seemingly working as expected today, only to fail tomorrow in a weird way.

To make matters worse, the behavior is not fixed but may change depending on which compiler is used, which compiler flags are set, which operating system is used, or what is the particular state of the environment (e.g. value of environment variables, state of other running processes on the same system) at the time where the program is running.

Because of this variability, undefined behavior can be hard to detect, make sense of and correct. Which is why there is a trend in modern programming languages to make it harder to trigger undefined behavior by removing it from all commonly used primitives and only allowing for its possibility in uncommon and specialized situations.

Unfortunately, legacy programming languages like C++ and Fortran cannot follow this trend as it would break too much existing code. This is a good argument for using more modern programming languages like Python or Rust whenever your project allows for it.

And now that this point is clarified, let us go back to fixing our test.

Now that we found an unexpected input configuration, we need to think about how we want to handle it. Passing in two vectors of different dimensionality looks like a usage error from the caller of the function, so perhaps just crashing with a more self-explanatory error message is enough?

#include <stdexcept>

std::vector<float> vector_sum(const std::vector<float>& vec,
                              const std::vector<float>& vec2) {
    // Detect inhomogeneous input configurations
    if (vec.size() != vec2.size()) {
        throw std::runtime_error(
            "vector_sum(): input vectors don't have the same dimension"
        );
    }

    std::vector<float> output;
    for (size_t i = 0; i < vec.size(); ++i) {
        output.push_back(vec.at(i) + vec2.at(i));
    }
    return output;
}

With this change, our test would fail with a nicer error message, but still fail. We also need to teach the test that throwing an std::runtime_error is the expected behavior when vectors of different lengths are passed to vector_sum(). And while we are at it, we will also add a check that ensures the output has the same length as the inputs:

int main() {
    rc::check(
        "vector_sum sums vectors",
        [](const std::vector<float>& vec, const std::vector<float> &vec2) {
            std::vector<float> result;
            try {
                result = vector_sum(vec, vec2);
            } catch(std::runtime_error&) {
                // We only expect an exception if the lengths differ
                RC_ASSERT(vec.size() != vec2.size());
                // In that case, the test can't continue: there is no result
                return;
            }

            // If the test passes, then all input and output lengths should match
            RC_ASSERT(vec.size() == vec2.size());
            RC_ASSERT(vec.size() == result.size());

            // Rest is as before
            for (size_t i = 0; i < result.size(); ++i) {
                RC_ASSERT(result[i] == vec[i] + vec2[i]);
            }
        }
    );
}

With that, our test now passes:

$ make -j$(nproc) && ./sum.bin
Using configuration: seed=8215553341845074688

- vector_sum sums vectors
OK, passed 100 tests

Does this mean that our function is correct? Not necessarily in an absolute sense. All it means is that after trying 100 random input configurations, our property-based test harness was not able to find an input configuration that fails the test.

There is, however, an input configuration that RapidCheck did not try, which fails this test. Can you find it? Click this text for the answer!

As Hypothesis, a Python cousin of RapidCheck, correctly points out, this vector_sum() version does not pass the test if one of the inputs contains the value NaN, shorthand for Not a Number.

NaN is a special floating-point value (or, more accurately, a set of special floating-point values) that is emitted by floating-point functions when asked to perform a meaningless computation, such as dividing 0 by 0 or computing the square root of a negative number.

Because it is just a value in the output dataset, numerical codebases rarely check for its presence, which means that it commonly ends up in output files. Good numerical libraries must therefore be prepared to handle the presence of NaN in their input. Unfortunately, NaN has very nasty properties as far as mathematics go, one of which being that x == x is false if x is NaN, which commonly breaks sorting algorithms and tests based on floating-point equality.

Here, we could account for the presence of NaN in the test either by moving to bitwise equality, or by ignoring results when inputs contain non-finite floating-point values and stating in the function’s documentation that these are not supported.

The moral of this story is that it is best not to perform property-based testing using a general-purpose random generator that only produces “normal” numbers, but instead use a specialized generator that gives high probability to edge cases that programmers are unlikely to think about.

Unfortunately, RapidCheck disappoints in this respect, showing the relative immaturity of property-based testing in the C++ ecosystem.

So, what can we make of all this?

  • Testing is not just about finding bugs in the function being tested. It’s also about forming a better understanding of the problem we’re facing, and making conscious choices about what inputs we want to handle and how. A test failure indicates that the test assertions do not accurately reflect how the function being tested works, but this mismatch can be a problem in either the function or the test code.
  • Property-based testing is very effective at forcing us to think about uncomfortable edge cases that our brain would rather ignore. And by freeing us from inefficient human generation of interesting test inputs, property-based testing lets us focus on more interesting questions about how our function should behave, which is what testing should be about.
  • Property-based testing is not a panacea, however. In this example, it is attention to detail that led me to notice that, as initially formulated, the test did not check that the output was as long as the inputs. Similarly, if a function has a very narrow edge case that is only triggered by very specific inputs, random search is unlikely to hit that configuration. In this case, you may need to complement property-based testing with manual testing on handpicked inputs, using a framework like Catch2.

Exercise

You will find the above property-based testing example in the robust-programming/tests subdirectory of your home directory on the practicals server, where it is called sum.cpp. You can compile it with make -j$(nproc). Alongside it, you will find an incomplete dot.cpp property-based test. The goal of this exercise is for you to fill it in.

Write a function that computes the dot product of two vectors, defined as lists of floating point numbers as above, then write some tests that assert that it does so correctly. To avoid duplicating the full dot product computation in the test (which is unlikely to help you find any bugs, since the test and the function being tested would share any misunderstanding from your side), try the following strategy:

  1. Start by generating a vector where only one component is not zero, and compute the dot product of that vector with itself. The result is the squared norm of the vector, which in this scenario should be very easy to predict. The following test skeleton is provided to help you get started more quickly, feel free to check out the RapidCheck documentation to learn more about the various functions that are being used.

    #include <rapidcheck.h>
    #include <stdlib.h>
    
    // Data needed to build a vector where exactly one component is not zero
    struct SimpleVectorConfig {
        // Value of the nonzero component
        float component;
    
        // Index of the nonzero component
        size_t idx;
    
        // Length of the vector
        size_t length;
    };
    
    // RapidCheck generator that produces valid SimpleVectorConfigs
    rc::Gen<SimpleVectorConfig> simple_vector_generator() {
        // Good values of "idx" depend on "length", so we must pick "length" first.
        // "mapcat" is how we express this idea of random generation where some
        // generated quantities depend on others in RapidCheck.
        return rc::gen::mapcat(
            // Don't let length grow too big to avoid slow tests / RAM exhaustion.
            rc::gen::inRange<size_t>(1, 1024),
            [](size_t length) {
                // RapidCheck syntax for building a generator of structs with
                // per-member configuration.
                return rc::gen::build<SimpleVectorConfig>(
                    // "component" value simply uses the standard float generator.
                    rc::gen::set(&SimpleVectorConfig::component),
                    // "index" values must be between 0 and "length", right-exclusive.
                    rc::gen::set(&SimpleVectorConfig::idx,
                                 rc::gen::inRange<size_t>(0, length)),
                    // "length" generator re-emits the value already generated above.
                    rc::gen::set(&SimpleVectorConfig::length,
                                 rc::gen::just(length))
                );
            }
        );
    }
    
    int main() {
        rc::check(
            "single-component squared norm",
            [&] {
                SimpleVectorConfig vector_config = *simple_vector_generator();
                // TODO: Replace with dot product test
                RC_ASSERT(vector_config.length >= 1);
                RC_ASSERT(vector_config.idx < vector_config.length);
            }
        );
    }
    
    
  2. Now generate two vectors of this sort and compute their dot product. Again, the result should be easy to predict without writing down the full dot product computation in your test.

  3. Now generate one arbitrary vector, as done in the vector addition example above, and compute its dot product with itself. This gives you its squared norm, as before. Test that this squared norm is greater than or equal to the square of the largest vector component.

  4. Finally, generate two arbitrary vectors, and check that their dot product follows the Cauchy-Schwarz inequality: pow(dot(v1, v2), 2) <= dot(v1, v1) * dot(v2, v2).

Note that floating-point rounding errors differ between multiple ways to do the same computation, so you may need to use approximate comparisons in some of your tests. The following approximate comparison algorithm is suggested:

#include <algorithm>  // for std::max
#include <cmath>

bool is_close(float value,
              float reference,
              float rel_tol = 1e-6,
              float abs_tol = 1e-8) {
    return std::abs(value - reference) < std::max(rel_tol * std::abs(reference),
                                                  abs_tol);
}

Outro

A bit of demystification

In property-based tests, unlike example-based tests (oracle tests), we do not specify any particular values: only the type of the input data and the property to be tested. The framework (Hypothesis/RapidCheck) takes care of actually executing the test. Magic, you might say.

Almost magical. In most cases there is still a fair amount of smoke and mirrors. If we take our function, we clearly cannot test the whole set of possible inputs… Under the hood, there is therefore a random number generator, which stops after a given number of draws (configurable, but finite). So basically I could have generated a set of random integers, written my traditional oracle test, and put a good old for loop in front of it:

import sys
import numpy as np

def add_numbers(a: int, b: int) -> int:
    return a + b

n_iteration = 1000
data_in = np.random.randint(-sys.maxsize - 1, sys.maxsize, (n_iteration, 2))
for a, b in data_in:
    assert add_numbers(a, b) == add_numbers(b, a)

Of course, this hand-rolled approach works for simple cases, but there are plenty of situations where a property-based test has no such simple equivalent.
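For comparison, here is roughly what the same commutativity check looks like with Hypothesis (a sketch reusing the add_numbers function above). Beyond replacing the loop, the framework deliberately injects edge cases (0, ±1, very large values, …) and, when a failure occurs, shrinks it down to a minimal counterexample, which a hand-rolled loop does not give you for free:

from hypothesis import given, strategies as st


def add_numbers(a: int, b: int) -> int:
    return a + b


@given(st.integers(), st.integers())
def test_add_numbers_is_commutative(a, b):
    assert add_numbers(a, b) == add_numbers(b, a)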

Finally, what should I choose for my tests?

Oracle tests or property-based tests? Our answer is: both! If you are looking for motivations to use property-based testing, here are a few relevant ones:

  • Property-based tests are generally more generic than oracle tests.
  • They give a more concise description of the requirements and expected behavior of our code than oracle tests do.
  • As a result, a single property-based test can replace a large number of oracle tests.
  • By delegating input generation to specialized tools, property-based testing is more efficient at finding implementation problems, in particular unhandled edge-case values…
  • Property-based tests require more thinking about what the code should actually do than oracle tests, and therefore encourage better introspection.
  • As a result, the code design is often better.

But it is not always easy or even relevant to find properties, so oracle tests still have their place. It is the combination of both that will make your code even more robust :-) To go further, we invite you to read this quite interesting blog post (based on F#, but the idea is the same).

To finish, something to meditate on: programming is not just a matter of writing lines of code; it is above all about conceiving a design that makes it possible to fulfill your objectives.

Other forms of randomized testing

As you have seen in this chapter, randomizing your tests is a powerful technique for making sure that you do not forget about algorithmic edge cases. Other variations on this idea that you may want to learn about include…

  • Fuzz testing, aka fuzzing: a close cousin of property-based testing, commonly used for security-critical code such as complex file-format parsers that operate on untrusted data from the Internet. In fuzzing, a specialized test harness constantly feeds your code with randomly mutated versions of a correct input, where the mutations are guided by real-time tracking of changes in source code coverage. The goal of the harness is to make your code crash; if a crash does happen, it indicates that your code is not yet fully robust against unexpected inputs, such as those an attacker could craft. (A minimal harness is sketched just after this list.)
  • Mutation testing: a good test suite should catch involuntary code changes that alter program results. Mutation testing verifies this by flipping the idea of property-based testing around: it randomly modifies the program under test and checks that these modifications make the test suite fail. If the test suite still passes, it suggests that some aspects of your program are not yet well covered, since random changes to its behavior go undetected.
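For illustration only (this is not part of the exercises), a coverage-guided fuzzing harness in Python could look roughly like the sketch below, assuming Google's atheris package is installed; json.loads merely stands in for your own parsing code:

import sys

import atheris

# Instrument imports so that code-coverage feedback can guide the mutations.
with atheris.instrument_imports():
    import json  # stand-in for the code under test


def test_one_input(data: bytes):
    try:
        json.loads(data)
    except ValueError:
        # Cleanly rejecting malformed input is fine; crashing is not.
        pass


atheris.Setup(sys.argv, test_one_input)
atheris.Fuzz()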

Continuous integration

Let’s take a breath. In this lecture, you learned about how to perform automatic rule checks, how to document your code, and how to write and execute tests.

There is an adage saying that a developer writes no more than about 10 lines of code per day. It does not seem like much, but if at each step you had to manually perform all the quality checks, write and deploy the corresponding documentation, and design and run new tests to make sure the newly introduced lines do not break the code, you would probably write even fewer than 10 lines a day.

GitLab CI/CD

What if, at each modification, your code were shipped somewhere that has the correct environment to check and execute it, and any errors were reported back to you? Instead of manually performing all the checks, you would outsource most of the work to a remote server that does the boring work on your behalf, leaving you the luxury of focusing on more interesting tasks. Crazy? Possible!

In this section, we will learn how to automate workflows using GitLab CI/CD (continuous integration / continuous deployment). A workflow can be: checking that the code is correctly written (linting), running a test suite, publishing a package to a registry, deploying online documentation, …

Before starting

Connect to the IN2P3 GitLab (https://gitlab.in2p3.fr) and create a new repository. If you have not yet done so, generate an SSH key and add it to your account.

Then clone your newly created repository and create a new, empty file called .gitlab-ci.yml at its root. Also add the robust-programming folder. Your repository should look like this:

(robprog) -bash-5.1$ ls -al
drwxr-xr-x.  4 student users   109 Mar 27 12:18 .
drwx------. 10 student users  4096 Mar 27 20:09 ..
drwxr-xr-x.  8 student users  4096 Mar 27 16:33 .git
-rw-r--r--.  1 student users   894 Mar 27 12:18 .gitlab-ci.yml
drwxr-xr-x.  8 student users   154 Mar 27 16:33 robust-programming

Your first pipeline

A CI pipeline is composed of a set of jobs that must all run successfully to completion. Jobs may be independent or depend on each other. The easiest form of dependency management in GitLab CI is pipeline stages: jobs within a stage can run in parallel, but each stage must run to completion before the next one starts. Let's implement a first job that applies rule checking to our code. Add the following lines to the file:

default:
  image: almalinux:9

stages:
  - check
  - render
  - deploy

The first block specifies the container image, and hence the operating system, that we want to use on the remote server. You can use any image name known to a registry such as DockerHub. The second block declares the different stages of our pipeline (whose jobs we still have to define).

Let’s implement the check stage. Define a new block that contains:

Lint Python:
  image: python:3.9
  stage: check
  before_script:
    - |
      python -m venv .venv
      source .venv/bin/activate
      pip install hypothesis flake8
  script:
    - flake8 practicals/code/tests/sum.py
Lint C++:
  stage: check
  before_script:
    - |
      dnf update -y
      dnf install -y gcc-c++
      dnf install -y clang libubsan libasan clang-tools-extra valgrind
  script:
    - cd robust-programming/analyzers
    - make
    - clang-format -i mdbook_snippets.cpp

Lint Python (or Lint C++) is an (arbitrary) job name attached to the check stage. In the case of the Python job, we specify a different image than the default one (you can override any default within a job). The before_script section is where we install dependencies; the script section runs the actual checks. More information on the available keywords can be found online.

Save your work, commit, and push to your repository. Then go to the GitLab interface, select Build > Pipelines in the left bar, and observe the progression of your pipeline!

Exercise

Implement the other jobs and stages:

  • check: add a job to run tests and report any failures.
  • render: build your documentation written with mdbook in a previous section.
  • deploy: deploy your documentation online using GitLab pages.