Concerning Quality

Bug Bash 2: Attack of the Clones

I’m back from this year’s Bug Bash conference. I was a big fan of the first one. Did the sophomore effort live up to the energy and execution of last year? Or has AI automated away the correctness community, leaving us with nothing left to talk about? Tune in for the verdict.

Before we proceed, first let me say that this post is totally and entirely hand-written, with no LLM-generated content, whatsoever. This is a strange thing to have to call out, but we live in strange times now, and I keep seeing posts get accused of being slop. I hereby guarantee that all slop on this blog is simply a result of my idiosyncratic brain structure and personal shortcomings. Any em dashes I proffer are of my own volition.

If you didn’t go to or hear about last year’s Bug Bash, it’s a conference centered around all things correctness. That is: testing, reliability, performance, formal methods, astrology, the I Ching, you name it, everything’s on the table. I go because that’s exactly my jam, my fuzzy warm place where I get to enjoy philosophizing and theorizing about topics that tickle my brain for reasons totally beyond my comprehension. Topics like logical time, inductive invariants, strict serializability, concolic testing, input space coverage, specification refinement, etc. You stumble upon Specifying Systems one day, and the next you’re at Bug Bash.

At least, that was last year. This year, it seems, we were all experiencing collective whiplash. While not everyone in the correctness community is jazzed about AI, what was clear from this conference is that everyone has to learn how to operate within its new confines. Even if you’re not using AI, everyone else is, and that means we have to be prepared for its tentacles one way or another.

The opening talk, given by Antithesis’ CEO Will Wilson, addressed this very directly. The pithily titled “We Won, What Now?” dove into life after the Big Bang of AI, but from the vantage point of the correctness community. As the title might suggest, this talk presupposed that AI has already initiated a wave of interest in elevated correctness techniques, so much so that we’re about to feel like an underground band getting swept up in a viral movement. It’s easy to stay passionate when it’s our little hometown band, but how will we feel when the inevitable knockoffs come in and start selling out stadiums?

He cited multiple other similar scenarios, for example the Eternal September of the Internet, whereby the flood of new users decimated the early culture of online socialization. The pattern seems to be that when a niche all of a sudden becomes relevant to the Zeitgeist, it’s simply impossible to avoid getting absorbed. He says that this is likely what’s going to happen to us correctness folk. We’ll watch as others come in and profit off of the ideas we’ve dreamed up, shepherded, or raised out of obscurity. The farfetched research of yesteryear becomes commonplace after a tectonic reshuffling such as the one we’re in right now. And it’s going to feel bad. But this is the very definition of progress, albeit bittersweet, and we should take solace in the fact that we were the hipsters of correctness. We always knew there was a better way, and it just took an army of informal, nondeterministic clones to convince the masses of this.

Now, I do think that there’s been a welcome uptick of interest in correctness due to AI, and Will even shared Google Trends graphs showing an increase in searches for both property-based testing and formal verification. The conference itself was another bit of supporting evidence, drawing close to double the attendees of last year. You can probably sense the incoming “but” here - BUT, I think the industry is in such a soupy state of flux at the moment that it’s really hard to see past the horizon and make any sort of prediction. If it does play out this way, though, it would be quite cool, and would leave me with the feeling that the “good people” won.

Either way, the keynote set the stage for the rest of the conference. Though this is far from an AI conference, AI was no doubt a centerpiece throughout, and has some very interesting implications for correctness. Let’s now do a tour of the themes that I noticed throughout the rest of the talks.

Software Engineering Remains

Multiple people offered the same proposition here, which is that much of software engineering is founded on ideas that are universally good and still apply in the AI world. For example, Ben Eggers shocked the crowd when he advised us to still write our database schemas by hand, even if AI writes the system around it. By hand? Ahh!! As someone at OpenAI, shouldn’t he be telling us to surrender our minds to the machines? No, says he, because we still have so much hard-won intuition about fickle things like database engines and data modeling, and we can use the schema as a way to specify what we want more precisely than a “vibe.”

Similarly, Ron Minsky, the OCaml cobbler extraordinaire, devoted his entire talk to pleading that code quality, testing, and code review all still really matter. At first, I thought that micromanaging code would be a thing of the past. But after watching agents struggle to reconstitute context, read tokens from far-away files, and modify duplicated code paths, I can’t help but think that code quality benefits LLMs just as much as it benefits humans. Testing, of course, is a tremendous enabler of agentic software development, because it allows you to offload tasks with much more confidence, provided you trust the tests.

Frank McSherry, CTO of Materialize, issued the mic-drop quote of the conference for me though: “Systems that work, work for a reason.” He spoke about how Materialize is built on top of various ideas from distsys research, such as virtual time and differential dataflow. These ideas are well studied and logically sound, which we have to remember counts for so much in our frighteningly entropic world. Does anyone else sit and watch agents do their work? They are amazing little things, but you often have to guide them toward certain patterns or push back on certain choices that lead to code that can surely be called slop without any loss of generality. It’s kind of fun to have a robot that will chisel away at edge cases until you have something resembling a working system, but a working system still has to be built on a suitable foundation.

I’ll be honest, this was the theme that gave me the most hope out of the conference. It’s not that I care about being replaced, because I know that progress is inevitable and I am but a grain of sand in the vast universe. But there’s something satisfying, even moving, about acknowledging that we’ve discovered deep truths about logic and computation throughout our short history in the field. I left feeling that I should continue investing here, now and forever. Computer science and the field of software engineering are here to stay, it seems, at least for now.

AI Enables Stronger Verification

Another awesome theme at this conference was that AI could very well be a verification enabler. Everyone knows formal verification is hard. But so is property-based testing, and so is fuzzing, and so is deterministic simulation testing. Because these techniques are grounded in sound logic (a callback to the previous section already), they are highly amenable to automation.

First, there’s the fact that AI can help build more ambitious test harnesses, more quickly. Pierre Zemb gave a lightning talk about building on top of FoundationDB’s simulation framework, and when we ran into each other in the hallway track we spoke about how AI makes things like this so much easier than they were years ago. I notice this in my own work: I’m writing more property-based tests than ever, because AI is fantastic at helping build more sophisticated input generators.
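To make that concrete, here’s a minimal hand-rolled sketch of the pattern, with no particular PBT library assumed (the recursive generator, the round-trip property, and the trial count are all my own invention, not from any talk):

```python
import json
import random
import string

def gen_value(rng, depth=0):
    # Recursive input generator: random JSON-like values (ints, strings,
    # lists, string-keyed dicts), bounded in depth to keep inputs small.
    kinds = ["int", "str", "list", "dict"] if depth < 2 else ["int", "str"]
    kind = rng.choice(kinds)
    if kind == "int":
        return rng.randint(-1000, 1000)
    if kind == "str":
        return "".join(rng.choices(string.ascii_letters, k=rng.randint(0, 8)))
    if kind == "list":
        return [gen_value(rng, depth + 1) for _ in range(rng.randint(0, 3))]
    return {f"k{i}": gen_value(rng, depth + 1) for i in range(rng.randint(0, 3))}

def check_round_trip(trials=200, seed=42):
    # Property: serializing then deserializing yields an equal value.
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen_value(rng)
        assert json.loads(json.dumps(value)) == value
    return trials

print(check_round_trip())  # -> 200
```

The generator is where all the tedium lives, and that’s exactly the part AI is good at fleshing out; a real library adds shrinking and coverage-guided generation on top of this skeleton.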

Another example of this was Antithesis’ own Carl Sverre presenting the skills they’ve built out for getting set up with testing on the Antithesis platform. Having gone through this setup myself without the skills, I can definitely say that they are a huge time saver. Beyond just saving time, they also get you to start thinking in terms of property-based testing, which I think is undeniably a good thing. For example, the first artifact that the Antithesis skills output is a “property catalog,” which defines and documents all of the properties that will ultimately be checked, and why. It ends up looking like the skeleton of a larger proof, and gets you in the mode of thinking about your system holistically.

Probably the most AI-forward verification talk was given by Fernanda Graciolli, co-founder of Midspiral. She went over multiple aspects of the Midspiral toolchain, which all heavily leverage AI to make formal verification more palatable for developers of web applications. Their core technology, which generates verified “state kernels” callable from things like a React application, is a clear example of what can be done now that AI has the ability to generate proofs for us. But even more interesting was her overview of claimcheck, a system that aims to solve the specification validation problem. I’ve written about the perils of misspecification before, and this is one of the biggest issues people have with even attempting formal verification. claimcheck uses an LLM to de-compile proof-carrying Dafny code back into an informal spec, validating that the executable model actually matches the intention of the original spec. This is a blending of the formal and informal worlds, which seems to be Midspiral’s modus operandi, and while this means some formality is lost, it’s still really useful to have any amount of feedback on the validity of the generated code. I really like what they’re doing here.

Lastly, Gabriela Moreira pleaded with the crowd to use executable specifications whenever possible when dealing with AI-generated code. She works on Quint, which is the more pragmatic cousin of TLA+. It starts with the semantics of TLA+, but enforces executability, adds a static type system, and comes with a really developer-friendly CLI. I feel bad saying anything negative about the venerable TLA+, which is literally the tool that made formal verification click for me. But it has always been lacking in the pragmatic tooling department, and Quint is filling a really nice void here. By packaging familiar programming-language syntax with CLI commands for running random spec simulations and outputting traces in a standardized format, Quint is incredibly friendly to agents working in a feedback loop.
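Quint’s actual syntax looks quite different (see its docs), but the workflow it enables can be sketched in plain Python: a spec is just an init action, a step action, and an invariant, and “running the spec” means simulating random traces while checking the invariant at every step. The bank-account example below is my own illustrative invention, not from the talk:

```python
import random

def init():
    # Initial state of the spec: an account with zero balance.
    return {"balance": 0}

def step(state, rng):
    # One nondeterministic action: deposit or withdraw a random amount.
    op = rng.choice(["deposit", "withdraw"])
    amount = rng.randint(1, 100)
    if op == "withdraw" and amount > state["balance"]:
        return state  # withdrawal rejected: no state change
    delta = amount if op == "deposit" else -amount
    return {"balance": state["balance"] + delta}

def simulate(traces=100, steps=50, seed=0):
    # Random simulation: explore many traces, checking the invariant
    # ("balance never goes negative") after every step.
    rng = random.Random(seed)
    for _ in range(traces):
        state = init()
        for _ in range(steps):
            state = step(state, rng)
            assert state["balance"] >= 0, "invariant violated"
    return "ok"

print(simulate())  # -> ok
```

An agent can run exactly this kind of loop as its feedback signal: mutate the spec or the code, re-simulate, and get a crisp pass/fail instead of a vibe.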

This is a really exciting space, and these were the main talks that jumped out to me on this front.

Performance and Reliability

Again, this is not actually an AI conference, and there were plenty of talks that barely even mentioned it. These three in particular were good, old-fashioned tech talks that each zoomed in on the importance and difficulties of non-functional requirements like performance and reliability.

In “Fast and fault-tolerant: pick two,” Matt Barrett talked about how his company Adaptive built one of, if not the, most performant and correct Raft implementation libraries out there. This talk reminded me that simple implementations are rarely the most performant. For example, one of the first performance breakthroughs they had was switching from replicating full messages to streaming individual bytes. You see this idea crop up all over the place, but most germane to me at the moment is the analog of relational database engines chopping their data up into fixed-size data pages. The simple thing would be to store full rows, but this could never work in practice when hard drives are built around fixed-size data blocks. Performance is the art of understanding the machine, and you’ll always have to play to the machine’s operational model if you like being fast.

But equally impressive were the algorithmic changes they employed for reliability’s sake. During leader election, they introduced “canvassing” and “vetoing” logic to prevent unnecessary elections and agreement on bad data. The voting metaphors here are delectable, but these led to real reliability gains in practice, again reinforcing that the first pass of an algorithm doesn’t always hold up in the reliability domain either. It’s always a challenge to keep the functional properties of an algorithm acceptable while adding reliability logic like this, since it can drastically change the operational behavior.

Another mesmerizing entry on this theme was “Protocol-aware deterministic simulation testing” by Chaitanya Bhandari of TigerBeetle. Once you hear TigerBeetle, you know this is going to be a good talk, but he dove into an aspect of their testing that I hadn’t heard of up to this point: since they have a deterministic simulation engine with virtual time, they can actually do deterministic performance testing. They check speed metrics on things like their block repair algorithm to make sure that safety is reached in a reasonable amount of time. This reminds me of “Real Time is Really Simple” by Leslie Lamport, where he shows how to model real-time properties in the TLA+ formalism, and I’d love to see what other applications of this are out there. You can’t make statements like “this will finish in 50 milliseconds,” but knowing that an optimization takes an algorithm from 1,000 ticks of the virtual clock down to 100 is incredibly useful information.
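The underlying trick can be sketched in a few lines (a toy of my own, not TigerBeetle’s actual harness): a discrete-event simulator advances a virtual clock deterministically, so “how long did repair take?” becomes an exact, reproducible tick count instead of a flaky wall-clock measurement.

```python
import heapq

class Simulator:
    """A tiny deterministic discrete-event simulator with a virtual clock."""
    def __init__(self):
        self.now = 0      # virtual time, in ticks
        self.seq = 0      # tie-breaker so event order is fully deterministic
        self.events = []  # min-heap of (tick, seq, callback)

    def schedule(self, delay, callback):
        heapq.heappush(self.events, (self.now + delay, self.seq, callback))
        self.seq += 1

    def run(self):
        while self.events:
            tick, _, callback = heapq.heappop(self.events)
            self.now = tick  # jump the virtual clock to the next event
            callback()

def repair_duration(missing_blocks, ticks_per_block=3):
    # Toy "block repair": one block is repaired every ticks_per_block ticks.
    # Returns the virtual time at which the last block finishes -- an exact
    # number that is identical on every run.
    sim = Simulator()
    done_at = []
    for i in range(missing_blocks):
        sim.schedule((i + 1) * ticks_per_block, lambda: done_at.append(sim.now))
    sim.run()
    return max(done_at)

print(repair_duration(10))  # -> 30
```

A performance regression test then becomes something like `assert repair_duration(10) <= 30`: a hard assertion that can run in CI, because the measurement can never be perturbed by a noisy neighbor.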

There was one talk that took the cake here for me though, and that was Peter Alvaro’s “Where all the ladders start.” The talk overall was a retrospective on his research career, distilled into the main threads he’s dived into and the lessons that came out of them. What really caught my ear, though, was that his recent research is going in a different direction, focusing on modeling and simulating reliability characteristics. He and his colleagues are using continuous-time Markov chains along with discrete event simulation to find tipping points in algorithms that can lead to metastable failures. I absolutely love this, because we still hardly do any ahead-of-time validation for this in industry, and we rely on incidents to expose these tipping points.
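To give a flavor of what “finding a tipping point” can mean, here is a deterministic fluid toy of my own, far simpler than the continuous-time Markov chain models from the talk: timed-out requests get retried, so a transient load spike can flip the system into a self-sustaining overload that persists long after the spike ends.

```python
def simulate(base_load, capacity, timeout, spike_ticks=8, total_ticks=200):
    """Toy metastable-failure model: requests that wait past the timeout
    are retried, creating a positive feedback loop between backlog and load."""
    backlog = 0.0
    retries = 0.0
    for t in range(total_ticks):
        # A temporary 3x load spike early on; load then returns to base.
        arrivals = base_load * (3.0 if t < spike_ticks else 1.0)
        backlog += arrivals + retries
        backlog -= min(backlog, capacity)  # serve up to capacity per tick
        waiting = backlog / capacity       # rough queueing-delay estimate
        # The feedback loop: everything stuck past the timeout is retried.
        retries = backlog if waiting > timeout else 0.0
    return backlog

# Below the tipping point, the backlog fully drains once the spike passes...
print(simulate(base_load=50, capacity=100, timeout=5) == 0.0)  # -> True
# ...past it, retries keep the system overloaded at the very same base load.
print(simulate(base_load=90, capacity=100, timeout=5) > 0.0)   # -> True
```

Both runs end with identical arrival rates, yet one system is healthy and the other is pinned: that hysteresis is the signature of metastability, and sweeping `base_load` in a loop locates the tipping point ahead of time instead of during an incident.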

This is the area that I’m thinking about most these days, so I got a lot out of these talks. I’m forever interested in correctness, but I feel that the correctness space is much more mature and we’re converging on techniques. Lots of people do property-based testing now. Almost no one is doing algorithm simulation to find performance and reliability choke points. I’ve also stumbled into working on a performance and reliability monitoring tool, and after analyzing hundreds and hundreds of performance-related incidents it’s clear to me that there’s a lot of room for research and innovation in this space.

Uncertainty

Unfortunately we have to end on a more sobering note. Above all there was a tone of uncertainty in the air at the conference. I actually love this, because it’s appropriate to be uncertain at this time, and anyone telling you otherwise is probably selling you something.

Last year, things were simpler. Just a bunch of people being jazzed about making sure software doesn’t break. We’d watch a talk about an esoteric random input generation strategy and say, “hell yea.” 2026 was the year that a clone army descended upon Bug Bash, and we all processed it together.

The final talk, Steve Klabnik’s “Steel, Rust, and Truth,” echoed this tone of uncertainty. This talk was kind of all over the place, in a good way. He wove together an epistemological inquiry into what truth even is with a personal story relating his own career to his father’s and grandfather’s. His grandfather was able to retire happily at the peak of his career, but his father had much worse luck. Now that AI is here, and thinkpieces about the end of software engineering are being published daily, which one will he be? Which one will we all be? What is truth, and does the software AI has begun producing bear any resemblance to what we’ve been doing for the last 50 years?

I’ll go back to the group of speakers that all reminded us of the importance of our collective software engineering knowledge. While we may be in the middle of a generation-defining hype cycle, and software may very well never be the same, I find some things to always be true. Staying curious will always be a good thing. Understanding the essence of software will always be a good thing. I think that if we stay true to these principles, everything is going to be ok.

Catch you at next year’s Bug Bash.