Bug Bash 2025 Conference Experience

April 12, 2025

The inaugural Bug Bash conference was really special. I’ve been to many conferences, but this was legitimately the first that I felt “a part of,” because the subject matter greatly overlapped with what I’m interested in and what I write about here. There are various combinations of testing conferences, devops conferences, and formal methods conferences, sure, but this still felt like a new stake in the ground. Possibly because of the undeniable connection to deterministic simulation testing, or possibly because it just consisted of a bunch of people on a similar wavelength at the moment. But I’ve personally never been in a room where almost every single person raised their hand when a speaker asked: “who’s familiar with property-based testing?” So it certainly felt like something interesting was in the air.

Branch Coverage Won't Prove The Collatz Conjecture

January 26, 2025

The Collatz conjecture is the prime example of the limitations of thinking in terms of branch coverage. It can be written as a recursive function in 5 lines of code with only three branches. That’s great, except we have no idea if it’s true or not, and no amount of tests can prove either way.

Simulating Some Queues

January 03, 2025

System performance boils down to the timing behavior of various interacting queues. Queues are one of those incredibly simple but powerful concepts, but they have some unintuitive or non-obvious behavior when only thinking about them mathematically. Simulating queueing scenarios gives us a better picture about how queues operate in practice.

Controlling Nondeterminism in Model-Based Tests with Prophecy Variables

December 23, 2024

We have to constantly wrestle with nondeterminism in tests. Model-based tests present unique challenges in dealing with it, since the model must support the implementation’s nondeterministic behavior without leading to flaky failures. In traditional example-based tests, nondeterminism is often controlled by adding stubs, but it’s not immediately clear how to apply this in a model-based context where tests are generated. We’ll look to the theory of refinement mappings for a solution.

Does Your Test Suite Account For Weak Transaction Isolation?

December 31, 2023

Transaction isolation is the kind of thing that you learn about and it fills you with fear. Specifically, there are weak transaction isolation levels which allow some fairly unexpected behavior. Tools like Jepsen are used to test the general isolation guarantees of databases, but it’s pretty uncommon to check the application layer for issues related to isolation anomalies. These anomalies can impact actual domain logic, so it’s important to understand them as well as how we can test them.

Forward and Backward Reasoning in Proof Assistants

October 01, 2023

Proof assistants are really fascinating tools, but the learning curve can be extremely steep. If you’re a programmer by trade and not a mathematician, this curve can be even steeper, because it’s not like programmers are doling out proofs left and right at work. One particular sticking point that I had trouble overcoming is the difference between forward vs. backward reasoning - proofs assistants support both.

Compiling a Test Suite

August 23, 2023

When I first stumbled upon certifying compilation¹, I was absolutely awestruck. I thought a compiler was a very specific thing, a translator from source to target language. But a certifying compiler goes further: it also proves its own correctness. My motto has become “most tests should be generated”, so this immediately seemed like a promising approach to my goal of improving the generative testing of interactive applications. It wasn’t immediately clear how exactly to incorporate this into that context, but after a little experimentation I now have a prototype of what it might look like.

I first heard about certifying compilation through a talk on YouTube and a corresponding paper (by Liam O’Connor, Zilin Chen, Christine Rizkallah, Sidney Amani, Japheth Lim, Toby Murray, Yutaka Nagashima, Thomas Sewell, and Gerwin Klein). These are about the Cogent language, which compiles from itself to C, but also generates a proof of its correctness in Isabelle. ↩

Most Tests Should Be Generated

July 02, 2023

Traditional testing wisdom eventually invokes the test pyramid, which is a guide to the proportion of tests to write along the isolation / integration spectrum. There’s an eternal debate about what the best proportion should be at each level, but interestingly it’s always presented with the assumption that test cases are hand-written. We should also think about test generation as a dimension, and if I were to draw a pyramid about it I’d place generated tests on the bottom and hand-written scenarios on top, i.e. most tests should be generated.

Logical Time and Deterministic Execution

February 28, 2023

Recently, Tomorrow Corporation released this video of their in-house tech stack doing some truly awesome time-travel debugging of a production-quality game. You should watch this video, even if you don’t read this post, because the workflow that they’ve created is really inspiring. The creator kept bringing up the fact that the reason their tools can do this is that they have determinism baked into them at the very foundational levels. You simply can’t bolt this on at higher levels in the stack.

This got me thinking - not only do we rarely have this level of control in our projects, but I think it’s rare to even understand how determinism is possible in modern systems that are interactive, concurrent, and distributed. If we don’t understand this, we can’t ever move our tools toward determinism, which I think is a very good idea. It turns out that even if we can’t predict exactly how a program will execute in a specific run, we can still model and reason about it deterministically. This is a prerequisite for most formal methods, and while I understand that formal methods aren’t everyone’s cup of tea, this is the number one thing that I wish more people understood. So today, we won’t be talking about testing or verifying anything, we’ll just be looking to better understand software in general by diving into logical time and how it enables deterministic reasoning.

Efficient and Flexible Model-Based Testing

January 31, 2023

In Property-Based Testing Against a Model of a Web Application, we built a web application and tested it against an executable reference model. The model-based test in that post checks sequences of actions against a global system state, which is simple to explain and implement, but is unsuitable for testing practical applications in their entirety. To test the diverse applications that arise in practice, as well as test more surface area of a single application, we’ll need a more efficient and flexible approach.

In that post, I promised that we’d dive deeper into the theory of model-based testing. To upgrade our testing strategy, we’ll look at the theoretical concepts of refinement mappings¹ and auxiliary variables², and add in a couple of tweaks based on the specific context of testing. All of this will get applied to a real test of a full-stack application.

I recommend reading this paper to get a handle on refinement mappings. Another name for this technique is simulation, which you can see an example of in how seL4 proves that the implementation implements its functional specification. Both are the same ultimate idea - prove that one program implements another by showing that all single transitions in each implement each other. ↩
We’ll expand on what auxiliary variables are throughout the post, but you can read more about them here and here. ↩

The Case for Models

December 11, 2022

I’ve become a bit model-obsessed as of late. By “model,” I mean a simplified representation of a program. No databases. No HTTP. Just pure logic. What interests me so much about models is that they exemplify software minimalism. We often talk about essential vs. accidental complexity - well, models are the embodiment of the essential. We have an ongoing battle against complexity at the language level, with tons of new languages providing support for immutability, reference sharing control, and other features of modern semantics. But I can’t help but still ask: is it really enough? Is software really getting any simpler?

We all know the opposite is true, and I’d like to make the case for using models to combat complexity.

Extracting a Verified Interpreter from Isabelle/HOL

September 21, 2022

Sometimes building a language is the best solution to a problem. This takes on many forms, from designing DSLs to implement an OS shell to more ambitious languages for verifying filesystems. Since we’re in the Golden Age of PL Research, there are plenty of reasons to give language design a try!

Languages are foundational, though, and soundness issues in them affect all of their programs. This makes them a great candididate for formalization and verification, since the cost of errors is very high, but as usual that means we have to address the verification gap. In this post we’ll cross the gap by building and (partially) verifying an operational semantics of a small language in Isabelle/HOL, and we’ll extract this semantics into an executable interpreter in OCaml.

Domain-Driven Test Data Generation: A Category-Partition Method and Property-Based Testing Mashup

August 31, 2022

Property-based testing is a well-known testing approach where random input generation is combined with checking correctness properties expressed as predicate functions. The goal of property-based testing is to use large amounts of these randomly generated inputs as a proxy for asserting that the property is always true, also known as an invariant of the program. That’s the main difference from example-based testing: examples check the expected result of a single program execution, whereas properties are statements about all program executions.

No matter how many inputs we generate, though, anything short of exhaustive testing or proof leaves room for errors. This begs the question: are there other data generation strategies that we can use to check the same correctness properties that we’d use when property-based testing? A property predicate doesn’t care how the input data was generated, so we can decouple how data is produced from the actual property checking.

Enter the category-partition method, a testing technique that’s existed since the 1980s. It’s a hybrid human/automated approach for creating domain-driven test data. One of the biggest “downsides” to it is that it can produce lots of test cases, which often makes it prohibitively expensive for manual and example-based testing. But when testing for properties, lots of test cases is a good thing. So is the mashup of category-partition-created-inputs and property-based-testing-properties a hyphen-laden match made in heaven?

Property-Based Testing Against a Model of a Web Application

August 11, 2022

The term “lightweight formal methods” is a bit of a contradiction, but the idea is that like everything else, formal methods exists on a spectrum. We don’t have to prove each line of our entire application to use mathematical thinking to reason about our software. A great example of this is model-based testing, where we can automatically generate test cases for an implementation by first constructing a model of the system. This is a well-studied approach, but what rekindled my curiosity in it was Amazon’s recent usage of it to test parts of S3’s functionality, as well as seeing some recent activity around the Quickstrom web application testing tool.

If it hasn’t been clear, I’m mostly interested in which aspects of formal methods we can use “at work” and on real projects. So what interests me about lightweight formal methods is that they generally have a built-in solution to the verification gap, meaning we can apply them to deployable code. Does the lightweight nature of model-based testing make it a viable approach for testing web applications? Let’s first go into an example, and we’ll tackle the underlying theory in a separate post.

The Verification Gap: A Major Hurdle for the Industry Adoption of Formal Methods

July 12, 2022

Let’s assume that we’ve made the decision to formally verify a project that we’re working on. How would we even go about doing that? This simple question inevitably leads to the verification gap: the fact that software has to be executed at some point, but the tools for logic verification are often detached from the languages that we use to build executable programs. Some languages do exist¹ that reduce or eliminate this gap, but they come with their own challenges and tradeoffs. Without trying to be an alarmist, there’s really no escaping this problem. At the end of the day, there is always some delta between what can be verified and what is actually happening in a production system. Thinking about this gap is crucial in understanding the limitations of formal methods, but honing in on it also opens up the conversation for practical solutions and a deeper understanding of software correctness.

Like F*, Dafny, Why3, and Chalice. Each takes a different and interesting approach to bridging the verification gap. ↩

Quality and Paradigm: The Assembly Language of Reasoning and the Domain-Specific Language of the Machine

December 30, 2021

Does programming paradigm affect the ultimate quality of software? I generally avoid stoking the eternal flame war of functional vs. imperative programming, but it has to be discussed since there are so many repeated statements of fact about which one leads to higher-quality programs¹. The arguments primarily exist along two dimensions: correctness and performance. Summing them up, functional programs are more correct but slow, and imperative programs are buggy but fast. Of course, there are counterexamples to each of these statements, but this is a fair generalization overall.

Thinking of it as a competition isn’t helpful though, because each paradigm has a different purpose: FP is optimized for mathematical reasoning, whereas imperative programming is fundamentally about manipulating computer hardware. It’s this difference between reasoning and practical execution that suggests that the highest quality software actually requires utlizing both paradigms: no matter what level we write code at, we must be able to reason about how that code runs at the level of real-world machines. This is what allows us to achieve the holy grail of quality: fast and correct software.

We’ll leave out other paradigms from this discussion, such as logic programming, acknowledging that FP and imperative aren’t the only possible programming paradigms. ↩

Refinement: Formalizing the Simplicity Underneath Complex Programs

November 26, 2021

“Real world” software is large, messy, and full of detail. A customer might just want to store and retrieve their data, but those simple requirements can get lost in the sea of programming language semantics, libraries, frameworks, databases, Internet protocols, serialization formats, performance optimizations, security hardening, auditability, monitorability, asynchronicity, etc., etc., ad infinitum. We should always try to simplify our stack, but practical computation is optimization to an extent - how would you like to use a logically correct application where each interaction takes 10 seconds to give feedback?

To really understand this difference between functional and non-functional requirements, let’s look at the concept of refinement. Refinement is the fundamental concept behind behavior-preserving program transformation, and it allows us to separate abstraction from implementation in a formal and verifiable process.

Misspecification: The Blind Spot of Formal Verification

October 06, 2021

As people, we like to be right, and we also want to avoid the effort of proving that we’re right. So we often resort to logical fallacies to take shortcuts and strong-arm people into believing us. One such logical fallacy is an “appeal to authority” - you simply reference someone or thing that is considered to be an expert, and voilá - now you are right!

The reason it’s a logical fallacy is that even smart people make mistakes. Or, even more importantly, smart people can have ill-intent - what better way to use social capital than to convince your followers to buy your new book and pay for your next vacation? That’s why papers should be peer-reviewed. That’s why we reproduce empirical experiments. And that’s why we should always have healthy skepticism and evaluate an argument for ourselves.

Formal verification is often presented as an appeal to authority. This progam was formally verified, so please stop asking me if there are any bugs! Well, there are simply no silver bullets out there, and formal verification is certainly not one of them because of a very serious blind spot: the possibility of misspecification.

State Space Explosion: The Reason We Can Never Test Software to Perfection

January 02, 2021

Have you ever seen a test suite actually prevent 100% of bugs? With all of the time that we spend testing software, how do bugs still get through? Testing seems ostensibly simple – there are only so many branches in the code, only so many buttons in the UI, only so many edge cases to consider. So what is difficult about testing software?

This post is dedicated to Edmund Clarke, who spent a large portion of his life pioneering solutions to the state explosion problem.

Why Quality?

December 19, 2020

First thing’s first: why should we care about the quality of software?

Does it matter?¹ The answer, like all answers, is that it completely depends on the context. Software is vastly diverse, and not all applications require the same level of quality and reliability. The most obvious litmus test is “will people die if the application fails to perform its job?” That isn’t the only dimension to consider, though. For example, people are very sensitive to mistakes made with their money. Our time is precious, our money is earned, and we expect our banking software to work correctly.

I definitely think it matters, but it’s a little more complicated than that. ↩