Concerning Quality

Bug Bash 2025 Conference Experience

The inaugural Bug Bash conference was really special. I’ve been to many conferences, but this was legitimately the first that I felt “a part of,” because the subject matter greatly overlapped with what I’m interested in and what I write about here. There are various combinations of testing conferences, devops conferences, and formal methods conferences, sure, but this still felt like a new stake in the ground. Possibly because of the undeniable connection to deterministic simulation testing, or possibly because it just consisted of a bunch of people on a similar wavelength at the moment. But I’ve personally never been in a room where almost every single person raised their hand when a speaker asked: “who’s familiar with property-based testing?” So it certainly felt like something interesting was in the air.

I left feeling more than ever that generative / autonomous testing is the future, even though I don’t need much encouragement there. There was a broader message though: if we really want correctness and reliability in the presence of radical software complexity, we should be open to everything. From formal verification to testing in prod. From hand-crafted unit test cases to fault-injected end-to-end tests. From sprinkling a few asserts around our codebase to building new infrastructure components to better support deterministic testing. Every approach has different assumptions, tradeoffs, and strengths, so we had better start breaking down the walls between separate and even historically at-odds communities in order to elevate the correctness and reliability of our systems.

That’s a message I can get behind, and that’s the reason I left feeling rejuvenated, and dare I say inspired.


The keynote given by Will Wilson set the stage for the conference as a whole, and perfectly introduced this overarching message. After diving into the question of “why do we have bugs to begin with?”, he posited that testing, verification, and observability are fundamentally not at odds, but rather variations on a theme. At the extreme, each of these seems totally incompatible with the others. Testing can be worlds away from formal verification, because the very act of writing a test is an admission that we can’t model the real execution environment with enough precision. Observability distances itself even further from testing by entirely punting on input data generation and allowing that to be handled by the natural operation of the production system.

The line becomes totally blurred with small variations though, like going from hand-written to generated test inputs. This is much closer to user-generated inputs in prod because we don’t know exactly which inputs will be produced. Observability monitors are also just a generalization of test assertions, and formal properties generalize them both. In the world of generated inputs, testing, observability, and formal methods all start to blend together. Testing and observability particularly blend together when it comes to tracking down the actual cause of a generated test failure.
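To ground that blurring in code, here is a minimal sketch of my own (not from the talk, and using a hypothetical `apply_delta` function): the same invariant serves as a hand-written test assertion, as a property over generated inputs, and as a production monitor. Only the source of the inputs changes.

```python
import logging
from hypothesis import given, strategies as st


def apply_delta(balance: int, delta: int) -> int:
    """Hypothetical logic under test: apply a delta to an account balance."""
    return balance + delta


def invariant(balance: int, delta: int, result: int) -> bool:
    # The predicate we care about, independent of how inputs are produced.
    return result - balance == delta


# 1. Hand-written example test: we pick the inputs ourselves.
def test_apply_delta_example():
    assert invariant(100, -30, apply_delta(100, -30))


# 2. Property-based test: same predicate, but inputs are generated,
#    so we no longer know exactly which cases will run.
@given(st.integers(), st.integers())
def test_apply_delta_property(balance, delta):
    assert invariant(balance, delta, apply_delta(balance, delta))


# 3. Production "monitor": same predicate again, but the inputs are whatever
#    real traffic produces; a violation becomes an alert instead of a test failure.
def monitor(balance, delta, result):
    if not invariant(balance, delta, result):
        logging.error("apply_delta invariant violated: %s %s %s", balance, delta, result)
```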

Cue the Antithesis tool, which is not “just” a deterministic hypervisor, but truly a test execution and analysis platform. Will showed a demo of using it to find a bug in etcd by querying over execution histories that the tool stores. He coined this workflow “pre-observability”: the idea that we can take the same root cause analysis techniques from post-deployment observability and apply them to the massive amount of execution traces produced by simulated system actions.

On top of that, this bug was found much more regularly in the test environment thanks to fault injection, highlighting one of the tradeoffs of observability vs. testing: sometimes a failure scenario is rare enough in production that it’s inefficient to sit and wait for it to happen. Fault injection speeds up the bug-finding process by triggering rare scenarios more frequently. It also highlights the unifying nature of formal methods: fault injection has always seemed to me like a practical manifestation of prophecy variables, which were invented precisely to deal with situations like nondeterministic failures in distributed systems.

All in this one talk, we spanned testing, observability, and formal methods, tied together with a ribbon of determinism. I knew then that this was gonna be a good one.


Now, I’ll describe the overall feel and themes of the conference, since you’ll be able to watch all the talks once they’re posted (which I recommend that you do).

Challenge the Status Quo

One big theme I heard throughout the talks was that we should absolutely challenge the status quo. The obvious example of this is Antithesis itself. I, along with everyone else in the programming world, have been complaining about flaky tests for years. But what I have not done is write a deterministic hypervisor to simply avoid the problem at its root.

Generative testing is also inherently in opposition to the current mainstream state of quality techniques. In his talk about the adoption of the Hypothesis property-based testing library, Zac Hatfield-Dodds mentioned that, by their measurements, only 5% of Python users use Hypothesis. When most people think of checking for functional correctness, they think of lots and lots of hand-written example-based tests running in CI, with maybe some linting or static type checking layered on top. They typically don’t think in terms of properties and generated inputs.

Zac remains convinced that testing properties is how we should be testing (I agree), so rather than just give up, he shared his overall view on why people don’t adopt it as a practice. His message was that we should primarily focus on the human aspect of property-based testing, for example by making the tests easier to write and improving their error messages. On the easier-writing front, Hypothesis added a ‘write’ command that can bootstrap tests for you. We have a ways to go there, but I appreciated that the maintainer of such a prominent library was taking a step back and really analyzing the situation.
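For anyone who hasn’t seen one, a Hypothesis property test has roughly this shape (my illustration, not a slide from the talk); the ‘write’ ghostwriter, invoked from the command line as something like `hypothesis write gzip`, will bootstrap a test along these lines for you, though its exact output is richer than this.

```python
import gzip

from hypothesis import given, strategies as st


@given(data=st.binary())
def test_gzip_round_trip(data):
    # The property: compressing then decompressing returns the original bytes,
    # checked against whatever byte strings Hypothesis decides to generate.
    assert gzip.decompress(gzip.compress(data)) == data
```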

The most interesting manifestation of this theme though was in Kyle Kingsbury’s talk on the most recent Jepsen analyses. He was describing the analysis done on Datomic, where he encountered a surprising behavior that didn’t really fall into any of the existing terms we have for transaction anomalies. An eerie vibe came over the room when he suggested that we may have to create new terms for situations like these. If aphyr himself, one of the only people I know of who can tell the difference between G1a and G2-item phenomena, doesn’t have the words to describe a distributed system scenario, what the heck are we supposed to do?

This may be a more personal revelation, but that moment made me realize: we are the adults in the room now. It’s not enough to just rehash things that Leslie Lamport or Barbara Liskov discovered 30-50 years ago. We need to be the ones doing new research, building new tools, or thinking of how we can create infrastructure that allows us to gain more control over our software. If we’re unhappy with the state of the world, we wield the power to change it. And many people are actively working on this.

Test End-to-End

Another common theme was end-to-end testing. End-to-end testing can be a dirty word in some circles, but this group of speakers went all-in on it. Mitchell Hashimoto didn’t get to true end-to-end testing until the end of his talk about making hard-to-test code testable, but he gave a great variety of advice on applying the functional-core-imperative-shell style of design to a codebase. This enables tests of interacting components, in this case stopping only at actual GPU instruction execution. This approach implies that mocking should only be done at strategic system boundaries, which I think is fantastic advice in general.
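As a minimal sketch of that split (my own toy example, not Mitchell’s code): the functional core is pure and trivially testable, while the thin imperative shell owns the I/O and is the only place you’d mock or reach for end-to-end tests.

```python
import sys


# Functional core: pure logic, no I/O, easy to cover with example-based
# or property-based tests.
def next_cursor_position(col: int, row: int, key: str, width: int) -> tuple[int, int]:
    """Hypothetical terminal-style cursor movement logic."""
    if key == "\n" or col + 1 >= width:
        return 0, row + 1
    return col + 1, row


# Imperative shell: the thin layer that touches the real world (here, stdout).
# This boundary is where mocking, or a full end-to-end test, makes sense.
class Terminal:
    def __init__(self, width: int = 80):
        self.col, self.row, self.width = 0, 0, width

    def write(self, key: str) -> None:
        self.col, self.row = next_cursor_position(self.col, self.row, key, self.width)
        sys.stdout.write(key)
```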

But then he went on to talk about full end-to-end testing via NixOS VM testing for the final bits that you just don’t want to abstract away. This was actually the first time I heard about Nix’s VM testing, and this looks like a great tool for anyone plagued by the inconsistency of e2e test infrastructure management. I’m definitely going to give it a further look.

Stephanie Wang spoke about all of the reliability lessons she learned while building MotherDuck, and someone in the audience asked how some of these were verified. She replied that, of course, they performed lots of chaos testing, using a mock network interface for controllability. She also spoke about minimizing data movement via caching as a win for both reliability and performance; I can’t think of a unit test that would give any kind of confidence about that. That point was also the main assertion in Ben Egger’s talk about testing in prod at OpenAI, which is the most extreme form of end-to-end testing: no matter how well you model your system, prod is the concrete instantiation of it, and you shouldn’t ignore the ways that all of your components interact in the production setting.
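To make the “mock network interface for controllability” remark concrete, here is a toy sketch of my own (purely hypothetical, not MotherDuck’s harness): by routing all traffic through an injectable interface, a chaos test can deterministically and reproducibly trigger the rare failures it wants to exercise.

```python
import random


class FlakyNetwork:
    """Test double for the network layer: same interface as the real one,
    but faults are injected deliberately and reproducibly."""

    def __init__(self, real_send, failure_rate: float, seed: int = 0):
        self.real_send = real_send
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so failures replay identically

    def send(self, node: str, message: bytes) -> None:
        if self.rng.random() < self.failure_rate:
            raise TimeoutError(f"injected fault: dropped message to {node}")
        self.real_send(node, message)


# A chaos test would wire the system under test against FlakyNetwork and then
# assert that its invariants (e.g. no acknowledged write is lost) still hold
# despite the injected timeouts.
```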

End-to-end testing is the perfect example of where testing and observability have a lot in common. The further toward prod your test moves, the more you have to worry about collecting information from hard-to-reach infrastructure components, and the more the semantics of those components influence system behavior. This difficulty of reproducing real-world semantics is the whole reason Jepsen takes the end-to-end testing approach (“as God intended it,” as Kyle likes to say). Because unit tests are great and all, but will they catch anomalies caused by weak transaction isolation?

Formal Methods Push The Limits of Testing

This was my personal favorite theme, because it so perfectly puts my intuition into words. It was stated almost verbatim by Ankush Desai in his talk about formal and semi-formal methods at AWS. His point was that the value of formal techniques isn’t just limited to actual formal verification. The logical approach that formal methods take can be used to come up with testing techniques as well.

This, to me, is the sweet spot of formal methods, at least in today’s landscape. Formality gives us the framework and strategy for figuring out what exactly we should be looking for and how we should think about systems, but we can use tests in place of proofs when it comes to the checking part. We sacrifice the completeness of the checking in the name of practicality and efficiency: generative tests can’t prove a property, but they provide a much higher level of confidence than a few hand-crafted example scenarios. Others have shed light on this idea as well: the Cogent project, from the group behind seL4, published a paper about using property-based tests as an intermediate step on the way to proofs in the verification of a filesystem implementation.

In his talk, Ankush introduced PObserve, a framework for checking a production system against a specification written in P, a language he created that is in use within AWS. Instead of using the spec to prove the implementation correct, it takes logs from the real system and checks that they adhere to the specification. This is similar to the model-based testing that many property-based testing libraries support, but it takes the observability-inspired approach of checking execution traces extracted from the actual running system. It also reminds me of Tracetest.
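To make the general idea concrete, here is a toy trace checker of my own in Python (not the actual P or PObserve tooling): the specification is a small state machine, and we replay logged events from the real system through it, flagging any trace the spec would reject.

```python
# Hypothetical spec: allowed (current state, event) -> next state transitions
# for a simple request/response protocol.
SPEC_TRANSITIONS = {
    ("idle", "request"): "pending",
    ("pending", "response"): "idle",
    ("pending", "timeout"): "idle",
}


def check_trace(events):
    """Return None if the logged trace satisfies the spec, else the offending event."""
    state = "idle"
    for event in events:
        key = (state, event)
        if key not in SPEC_TRANSITIONS:
            return event  # the implementation did something the spec forbids
        state = SPEC_TRANSITIONS[key]
    return None


# A well-formed trace passes; a response with no matching request is flagged.
assert check_trace(["request", "response"]) is None
assert check_trace(["response"]) == "response"
```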

This was another case of the trifecta of testing, observability, and formal methods all working together, this time with a focus on validating implementation behavior against a model. We sometimes refer to such approaches as “lightweight formal methods,” and this is the area that I see being most likely to be implemented in a practical setting.


On top of the actual talks, there was a very social vibe to the conference in general, which is a hard thing to fake. I got the sense that there’s very much an appetite for the particular brand of autonomous testing, lightweight formal methods, and observability techniques being presented there. Overall, it was a great experience, and really galvanized and clarified ideas that I’ve been mulling over for a while now. Thank you to Antithesis for shepherding this conversation and getting everyone together under one roof to contribute to it. If there’s a Bug Bash 2026, I will certainly be first in line for a ticket.