Software correctness is becoming an increasingly important concern as our society grows more and more reliant on computer systems, with even the simplest of software errors having devastating financial or security implications. How can we find errors in real-world applications? How can we guarantee that such errors don’t exist? To even begin to answer these questions we need a specification: a description of what it means for a piece of code to be correct, stated either explicitly (e.g. my private data are never leaked to third parties) or implicitly (e.g. this code will not terminate with an uncaught exception). This project’s goal is to develop efficient ways to write, debug, and reason about software and specifications through the use of random testing.

A particularly attractive approach is property-based testing (PBT) in the style of Haskell’s QuickCheck: given a property (an executable specification over structured data), a PBT tool generates (usually randomized) inputs and executes the property repeatedly; should a counterexample be found, it is minimized through a process called shrinking before it is reported to the user. All three components (an executable specification, generation, and shrinking) are necessary for PBT to succeed, and automating each one comes with its own set of challenges.
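Here is a minimal sketch of that workflow using Hypothesis (the property and test name below are purely illustrative): the decorated function is the executable specification, the strategy passed to @given drives input generation, and any counterexample Hypothesis finds is shrunk automatically before being reported.

```python
# A minimal property-based test with Hypothesis: reversing a list twice
# should yield the original list. The strategy st.lists(st.integers())
# generates the inputs; failures are shrunk before being reported.
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_reverse_involutive(xs):
    assert list(reversed(list(reversed(xs)))) == xs
```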

There are two settings in which we explore randomized property-based testing: a popular dynamic programming language (Python) and a proof assistant (Coq). On the one hand, Python poses interesting challenges to traditional PBT, as it is notoriously slow: running tests is no longer cheap, which opens the way for smarter, more symbolic techniques. In this setting we’re looking into improving Hypothesis. On the other hand, random testing within a proof assistant is especially attractive, as it is complementary to formal verification: testing can quickly uncover errors in complex software, but it cannot guarantee their absence; formal verification provides strong correctness guarantees, but is still very expensive in time and effort. In the Coq setting, we’re developing and improving QuickChick.

Generation

In a PBT tool, the generation of inputs is determined by generators: programs that describe a distribution of test data. Much of the time, these generators can be inferred automatically based on existing type information. However, most properties of interest involve sparse preconditions, imposing restrictive constraints on their inputs. For such properties, generating inputs using only type information, and then filtering out any that fail to satisfy a precondition, is a recipe for extremely inefficient testing. For example, if we wanted to test the correctness of compiler optimizations—that is, that a program compiled with and without optimization yields the same output—then we would need to generate programs for which compilation succeeds. If we attempt to generate such programs by generating random ones (based on, e.g., a grammar) and then filtering out those that fail to compile, we would waste the vast majority of testing time generating inputs that will just be discarded. In such cases, for random testing to be effective, users have to provide custom generators for random data that satisfy a given precondition by construction.
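To make the filtering problem concrete on a smaller scale, here is a Hypothesis sketch in which sorted lists stand in for a sparse precondition such as “the program compiles”; the helpers and test names are ours, chosen only for illustration. The first test generates arbitrary lists and discards the unsorted ones, while the second uses a hand-written generator whose outputs satisfy the invariant by construction.

```python
# Filtering vs. generating by construction, illustrated on sorted lists.
from hypothesis import given, assume, strategies as st
import bisect

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

# Filtering: generate arbitrary lists and reject the unsorted ones via
# assume(). A random list of length n is sorted with probability about
# 1/n!, so most of the testing budget is spent on discarded inputs.
@given(st.lists(st.integers()))
def test_insort_filtering(xs):
    assume(is_sorted(xs))
    bisect.insort(xs, 0)
    assert is_sorted(xs)

# By construction: a custom generator whose outputs are always sorted,
# so no generated input is ever discarded.
@st.composite
def sorted_lists(draw):
    xs = draw(st.lists(st.integers()))
    xs.sort()
    return xs

@given(sorted_lists())
def test_insort_by_construction(xs):
    bisect.insort(xs, 0)
    assert is_sorted(xs)
```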

Writing such a generator by hand can be extremely time-consuming and error-prone. We are working on techniques for automating this process. Previous work has focused on deriving a “good” generator from a given precondition in QuickChick. Another exciting approach is to use feedback to guide the generation of data and better explore the input space by mutating “interesting” already-generated inputs. One popular form of such feedback is coverage: coverage-guided fuzzing in the style of AFL has a proven track record of finding crashes in existing software. We use this idea of exploiting coverage information to test general correctness properties, in what we call Coverage Guided, Property Based Testing. We’re also looking at incorporating smarter ways of generating new inputs, such as symbolic execution, as well as different forms of feedback such as combinatorial coverage (Combinatorial Testing), utility functions (Targeted PBT), or input distance (Adaptive Random Testing).
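As a rough illustration of the coverage-guided idea (and not the actual machinery of AFL, Hypothesis, or QuickChick), the sketch below maintains a corpus of inputs, mutates a randomly chosen member, and keeps any mutant that reaches new coverage; run_with_coverage is a hypothetical helper that runs the property on an input and reports whether it passed along with the set of branches it exercised.

```python
# A deliberately simplified coverage-guided, property-based testing loop
# over lists of integers. run_with_coverage(x) is assumed to return a
# pair (passed, covered), where covered is a set of branch identifiers.
import random

def mutate(xs):
    # Toy mutator: tweak one element or insert a fresh one.
    ys = list(xs)
    if ys and random.random() < 0.5:
        ys[random.randrange(len(ys))] += random.randint(-10, 10)
    else:
        ys.insert(random.randrange(len(ys) + 1), random.randint(-100, 100))
    return ys

def coverage_guided_pbt(run_with_coverage, budget=10_000):
    corpus = [[]]          # seed corpus: the empty list
    seen = set()           # all branches covered so far
    for _ in range(budget):
        candidate = mutate(random.choice(corpus))
        passed, covered = run_with_coverage(candidate)
        if not passed:
            return candidate            # counterexample found
        if not covered <= seen:         # reached new coverage
            seen |= covered
            corpus.append(candidate)    # keep the "interesting" input
    return None                         # no counterexample within budget
```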

Executable Specifications

Another component that is necessary for PBT is the property itself. In the particular setting of a proof assistant like Coq, theorems are usually specified using inductive relations. Such relations can describe more behaviors than their traditional functional counterparts, from potentially nonterminating computations to nondeterministic or concurrent semantics. To make use of PBT in such a setting, users currently need to provide an executable variant of their specifications, ideally with a proof linking the two variants together. A goal of this project is to automatically derive an efficient executable property from an inductive specification, as well as a proof that both variants coincide.
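To convey the idea in Python terms (the real setting is Coq, where the inductive relation and the derived checker would come with a machine-checked equivalence proof), the sketch below places a rule-based definition of evenness next to an executable decision procedure and tests that the two coincide; the function names are ours.

```python
# A relational-style definition (mirroring the inductive rules
# "0 is even" and "if n is even then n+2 is even") next to an
# executable variant, with a PBT property linking the two.
from hypothesis import given, strategies as st

def even_rel(n):
    # Direct transcription of the inference rules (deliberately naive).
    if n == 0:
        return True              # rule: even 0
    if n >= 2:
        return even_rel(n - 2)   # rule: even n -> even (n + 2)
    return False

def even_exec(n):
    return n % 2 == 0            # efficient executable checker

@given(st.integers(min_value=0, max_value=500))
def test_even_variants_coincide(n):
    assert even_rel(n) == even_exec(n)
```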

Shrinking

The last aspect of property-based testing that this project will focus on is counterexample minimization. Just as generating data that satisfy a precondition directly can be exponentially more efficient than generating arbitrary data and filtering them, the same holds for minimizing reported counterexamples, which need to satisfy the same invariants. A related problem is counterexample deduplication: when testing a piece of software for a non-trivial amount of time, many different bugs are usually discovered. Being able to characterize which counterexamples correspond to the same underlying problem in the code both reduces the effort required of the end user (who otherwise has to sort through numerous instances of the same issue) and improves the efficiency of testing (by avoiding input data that exhibit identical erroneous behavior). Existing techniques from the coverage-based fuzzing world, like comparing stack-trace hashes or coverage profiles, fall short of correctly discriminating between instances of the same error.
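As a concrete baseline of the kind just criticized, the sketch below buckets failing inputs by a key derived from the exception type and the final frame of the traceback; the helper names are ours, and this is precisely the sort of stack-based signature that can lump together distinct bugs or split a single one.

```python
# Deduplicating counterexamples by a crude stack-based signature.
import traceback
from collections import defaultdict

def failure_key(exc):
    # Characterize a failure by exception type plus the file and line of
    # the last frame in its traceback.
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1] if frames else None
    return (type(exc).__name__,
            last.filename if last else None,
            last.lineno if last else None)

def deduplicate(prop, inputs):
    # Run the property on each input and group the failures by key.
    buckets = defaultdict(list)
    for x in inputs:
        try:
            prop(x)
        except Exception as exc:
            buckets[failure_key(exc)].append(x)
    return buckets
```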

Moreover, there is a unique opportunity to study this problem in the setting of a proof assistant, as we have ground truth: we can establish the correctness of a program via verification and observe the outcome of different techniques in a controlled setting with artificially injected errors. We can then apply the most promising techniques to the more mainstream setting of Python.

Past publications from members of this project: