Home | Concerning Quality
Proof assistants are really fascinating tools, but the learning curve can be extremely steep. If you’re a programmer by trade and not a mathematician, this curve can be even steeper, because it’s not like programmers are doling out proofs left and right at work. One particular sticking point that I had trouble overcoming is the difference between forward and backward reasoning - proof assistants support both.
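As a quick taste of the distinction, here's a sketch in Lean 4 (my own toy example, not from the post) proving the same trivial goal both ways:

```lean
-- Backward reasoning: start from the goal `q` and work toward
-- the hypotheses by applying implications in reverse.
example (p q : Prop) (hp : p) (hpq : p → q) : q := by
  apply hpq  -- the goal becomes `p`
  exact hp

-- Forward reasoning: start from the hypotheses and derive new
-- facts until the goal appears.
example (p q : Prop) (hp : p) (hpq : p → q) : q := by
  have hq : q := hpq hp
  exact hq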
When I first stumbled upon certifying compilation, I was absolutely awestruck. I thought a compiler was a very specific thing, a translator from source to target language. But a certifying compiler goes further: it also proves its own correctness. My motto has become “most tests should be generated”, so this immediately seemed like a promising approach to my goal of improving the generative testing of interactive applications. It wasn’t immediately clear how exactly to incorporate this into that context, but after a little experimentation I now have a prototype of what it might look like.
Traditional testing wisdom eventually invokes the test pyramid, which is a guide to the proportion of tests to write along the isolation / integration spectrum. There’s an eternal debate about what the best proportion should be at each level, but interestingly it’s always presented with the assumption that test cases are hand-written. We should also think about test generation as a dimension, and if I were to draw a pyramid about it I’d place generated tests on the bottom and hand-written scenarios on top, i.e. most tests should be generated.
Recently, Tomorrow Corporation released this video of their in-house tech stack doing some truly awesome time-travel debugging of a production-quality game. You should watch this video, even if you don’t read this post, because the workflow that they’ve created is really inspiring. The creator kept emphasizing that their tools can only do this because determinism is baked into them at the very foundational levels. You simply can’t bolt this on at higher levels in the stack.
This got me thinking - not only do we rarely have this level of control in our projects, but I think it’s rare to even understand how determinism is possible in modern systems that are interactive, concurrent, and distributed. If we don’t understand this, we can’t ever move our tools toward determinism, a direction I think is well worth pursuing. It turns out that even if we can’t predict exactly how a program will execute in a specific run, we can still model and reason about it deterministically. This is a prerequisite for most formal methods, and while I understand that formal methods aren’t everyone’s cup of tea, this is the number one thing that I wish more people understood. So today, we won’t be talking about testing or verifying anything, we’ll just be looking to better understand software in general by diving into logical time and how it enables deterministic reasoning.
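To give a flavor of what logical time looks like, here's a minimal sketch of Lamport-style logical clocks (my own assumption of the flavor of logical time involved, not code from the post): each process keeps a counter, and receiving a message merges counters, giving a deterministic happened-before ordering regardless of wall-clock timing.

```python
class Process:
    """A process with a Lamport logical clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock  # the timestamp travels with the message

    def receive(self, msg_timestamp):
        # Merge the sender's view of time with our own.
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock

a, b = Process("a"), Process("b")
a.local_event()          # a: 1
t = a.send()             # a: 2, message carries timestamp 2
b.local_event()          # b: 1
b.receive(t)             # b: max(1, 2) + 1 = 3
print(a.clock, b.clock)  # 2 3
```

No matter when these events physically occur, the send is always ordered before the receive - that's the deterministic structure we can reason about.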
In Property-Based Testing Against a Model of a Web Application, we built a web application and tested it against an executable reference model. The model-based test in that post checks sequences of actions against a global system state, which is simple to explain and implement, but is unsuitable for testing practical applications in their entirety. To test the diverse applications that arise in practice, as well as test more surface area of a single application, we’ll need a more efficient and flexible approach.
In that post, I promised that we’d dive deeper into the theory of model-based testing. To upgrade our testing strategy, we’ll look at the theoretical concepts of refinement mappings and auxiliary variables, and add in a couple of tweaks based on the specific context of testing. All of this will get applied to a real test of a full-stack application.
I’ve become a bit model-obsessed as of late. By “model,” I mean a simplified representation of a program. No databases. No HTTP. Just pure logic. What interests me so much about models is that they exemplify software minimalism. We often talk about essential vs. accidental complexity - well, models are the embodiment of the essential. We have an ongoing battle against complexity at the language level, with tons of new languages providing support for immutability, reference sharing control, and other features of modern semantics. But I can’t help but still ask: is it really enough? Is software really getting any simpler?
We all know the opposite is true, and I’d like to make the case for using models to combat complexity.
Sometimes building a language is the best solution to a problem. This takes on many forms, from designing DSLs to implement an OS shell to more ambitious languages for verifying filesystems. Since we’re in the Golden Age of PL Research, there are plenty of reasons to give language design a try!
Languages are foundational, though, and soundness issues in them affect all of their programs. This makes them a great candidate for formalization and verification, since the cost of errors is very high, but as usual that means we have to address the verification gap. In this post we’ll cross the gap by building and (partially) verifying an operational semantics of a small language in Isabelle/HOL, and we’ll extract this semantics into an executable interpreter in OCaml.
Property-based testing is a well-known testing approach where random input generation is combined with checking correctness properties expressed as predicate functions. The goal of property-based testing is to use large amounts of these randomly generated inputs as a proxy for asserting that the property is always true, also known as an invariant of the program. That’s the main difference from example-based testing: examples check the expected result of a single program execution, whereas properties are statements about all program executions.
No matter how many inputs we generate, though, anything short of exhaustive testing or proof leaves room for errors. This raises the question: are there other data generation strategies that we can use to check the same correctness properties that we’d use when property-based testing? A property predicate doesn’t care how the input data was generated, so we can decouple how data is produced from the actual property checking.
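This decoupling is easy to see in code. Here's a hand-rolled sketch using only the standard library (a real project would use a library like Hypothesis, and the list-reversal property is my own classic example, not one from the post):

```python
import random

# The property is just a predicate: it doesn't know or care how
# its input was produced. Reversing a list twice yields the original.
def prop_reverse_involutive(xs):
    return list(reversed(list(reversed(xs)))) == xs

# One generation strategy: random inputs, as in property-based testing.
def random_lists(n):
    rng = random.Random(0)
    for _ in range(n):
        yield [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]

# Another strategy: hand-picked edge cases. Same property, different generator.
edge_cases = [[], [0], [1, 1, 1]]

assert all(prop_reverse_involutive(xs) for xs in random_lists(1000))
assert all(prop_reverse_involutive(xs) for xs in edge_cases)
```

The same `prop_reverse_involutive` checks both streams of inputs, which is exactly the seam that lets us swap in other generation strategies.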
Enter the category-partition method, a testing technique that’s existed since the 1980s. It’s a hybrid human/automated approach for creating domain-driven test data. One of the biggest “downsides” to it is that it can produce lots of test cases, which often makes it prohibitively expensive for manual and example-based testing. But when testing for properties, lots of test cases is a good thing. So is the mashup of category-partition-created-inputs and property-based-testing-properties a hyphen-laden match made in heaven?
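The combinatorial nature of the method is where the "lots of test cases" comes from. As a sketch (the categories and choices here are hypothetical, invented for illustration), the test frames are essentially the cross product of each category's partition choices:

```python
from itertools import product

# Hypothetical categories for a "create user" operation, each
# partitioned into representative choices.
categories = {
    "username_length": ["empty", "typical", "max"],
    "email": ["valid", "malformed", "missing"],
    "role": ["admin", "member"],
}

# Each combination of choices is one test frame.
frames = [dict(zip(categories, combo)) for combo in product(*categories.values())]
print(len(frames))  # 3 * 3 * 2 = 18 test frames
```

Eighteen frames is already a lot to hand-write expected results for, but each frame can be concretized into an input and fed to a property predicate just like a randomly generated input would be.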
The term “lightweight formal methods” is a bit of a contradiction, but the idea is that like everything else, formal methods exist on a spectrum. We don’t have to prove each line of our entire application to use mathematical thinking to reason about our software. A great example of this is model-based testing, where we can automatically generate test cases for an implementation by first constructing a model of the system. This is a well-studied approach, but what rekindled my curiosity in it was Amazon’s recent usage of it to test parts of S3’s functionality, as well as seeing some recent activity around the Quickstrom web application testing tool.
If it hasn’t been clear, I’m mostly interested in which aspects of formal methods we can use “at work” and on real projects. So what interests me about lightweight formal methods is that they generally have a built-in solution to the verification gap, meaning we can apply them to deployable code. Does the lightweight nature of model-based testing make it a viable approach for testing web applications? Let’s first go into an example, and we’ll tackle the underlying theory in a separate post.
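The core loop of model-based testing fits in a few lines. Here's a toy sketch (a counter of my own invention, not the web application from the post): run random action sequences against both the implementation and a detail-free model, and check they agree at every step.

```python
import random

# A hypothetical implementation: a bounded counter carrying
# incidental state (an event log) that the model doesn't have.
class CounterImpl:
    def __init__(self):
        self._log = []

    def increment(self):
        self._log.append("inc")

    def decrement(self):
        if self._log.count("inc") > self._log.count("dec"):
            self._log.append("dec")

    def value(self):
        return self._log.count("inc") - self._log.count("dec")

# The model: the same behavior with none of the detail.
class CounterModel:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def decrement(self):
        self.n = max(0, self.n - 1)

# Model-based test: random action sequences, impl checked against model.
rng = random.Random(42)
for _ in range(100):
    impl, model = CounterImpl(), CounterModel()
    for _ in range(50):
        action = rng.choice(["increment", "decrement"])
        getattr(impl, action)()
        getattr(model, action)()
        assert impl.value() == model.n
```

The model is the oracle; the generator only has to know the action vocabulary, not the expected results.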
Let’s assume that we’ve made the decision to formally verify a project that we’re working on. How would we even go about doing that? This simple question inevitably leads to the verification gap: the fact that software has to be executed at some point, but the tools for logic verification are often detached from the languages that we use to build executable programs. Some languages do exist that reduce or eliminate this gap, but they come with their own challenges and tradeoffs. Without trying to be an alarmist, there’s really no escaping this problem. At the end of the day, there is always some delta between what can be verified and what is actually happening in a production system. Thinking about this gap is crucial in understanding the limitations of formal methods, but homing in on it also opens up the conversation for practical solutions and a deeper understanding of software correctness.
Does programming paradigm affect the ultimate quality of software? I generally avoid stoking the eternal flame war of functional vs. imperative programming, but it has to be discussed since there are so many repeated statements of fact about which one leads to higher-quality programs. The arguments primarily exist along two dimensions: correctness and performance. Summing them up, functional programs are more correct but slow, and imperative programs are buggy but fast. Of course, there are counterexamples to each of these statements, but this is a fair generalization overall.
Thinking of it as a competition isn’t helpful though, because each paradigm has a different purpose: FP is optimized for mathematical reasoning, whereas imperative programming is fundamentally about manipulating computer hardware. It’s this difference between reasoning and practical execution that suggests that the highest quality software actually requires utilizing both paradigms: no matter what level we write code at, we must be able to reason about how that code runs at the level of real-world machines. This is what allows us to achieve the holy grail of quality: fast and correct software.
“Real world” software is large, messy, and full of detail. A customer might just want to store and retrieve their data, but those simple requirements can get lost in the sea of programming language semantics, libraries, frameworks, databases, Internet protocols, serialization formats, performance optimizations, security hardening, auditability, monitorability, asynchronicity, etc., etc., ad infinitum. We should always try to simplify our stack, but practical computation is, to an extent, an exercise in optimization - how would you like to use a logically correct application where each interaction takes 10 seconds to give feedback?
To really understand this difference between functional and non-functional requirements, let’s look at the concept of refinement. Refinement is the fundamental concept behind behavior-preserving program transformation, and it allows us to separate abstraction from implementation in a formal and verifiable process.
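The essence of refinement can be sketched as a commuting square: abstracting the implementation state and then taking the abstract step should equal taking the concrete step and then abstracting. Here's a toy example of my own (a spec-level set implemented by a duplicate-free sorted list), not the formal development from the post:

```python
import random

def abstraction(impl_state):
    """Abstraction function: sorted list -> set (forgets the ordering detail)."""
    return set(impl_state)

def spec_add(s, x):
    """Abstract operation on the spec state."""
    return s | {x}

def impl_add(lst, x):
    """Concrete operation: maintains a sorted, duplicate-free list."""
    if x in lst:
        return lst
    return sorted(lst + [x])

# Refinement check: abstract-then-step equals step-then-abstract.
rng = random.Random(0)
state = []
for _ in range(1000):
    x = rng.randint(0, 10)
    assert abstraction(impl_add(state, x)) == spec_add(abstraction(state), x)
    state = impl_add(state, x)
```

The implementation carries detail (sortedness) that the spec never mentions, yet every concrete behavior maps to a spec behavior - that's what "behavior-preserving" means here.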
As people, we like to be right, and we also want to avoid the effort of proving that we’re right. So we often resort to logical fallacies to take shortcuts and strong-arm people into believing us. One such logical fallacy is an “appeal to authority” - you simply reference someone or something that is considered to be an expert, and voilà - now you are right!
The reason it’s a logical fallacy is that even smart people make mistakes. Or, even more importantly, smart people can have ill-intent - what better way to use social capital than to convince your followers to buy your new book and pay for your next vacation? That’s why papers should be peer-reviewed. That’s why we reproduce empirical experiments. And that’s why we should always have healthy skepticism and evaluate an argument for ourselves.
Formal verification is often presented as an appeal to authority. This program was formally verified, so please stop asking me if there are any bugs! Well, there are simply no silver bullets out there, and formal verification is certainly not one of them because of a very serious blind spot: the possibility of misspecification.
Have you ever seen a test suite actually prevent 100% of bugs? With all of the time that we spend testing software, how do bugs still get through? Testing seems ostensibly simple – there are only so many branches in the code, only so many buttons in the UI, only so many edge cases to consider. So what is difficult about testing software?
This post is dedicated to Edmund Clarke, who spent a large portion of his life pioneering solutions to the state explosion problem.
First things first: why should we care about the quality of software?
Does it matter? The answer, like all answers, is that it completely depends on the context. Software is vastly diverse, and not all applications require the same level of quality and reliability. The most obvious litmus test is “will people die if the application fails to perform its job?” That isn’t the only dimension to consider, though. For example, people are very sensitive to mistakes made with their money. Our time is precious, our money is earned, and we expect our banking software to work correctly.