Home | Concerning Quality
Sometimes building a language is the best solution to a problem. This takes on many forms, from designing DSLs to implement an OS shell to more ambitious languages for verifying filesystems. Since we’re in the Golden Age of PL Research, there are plenty of reasons to give language design a try!
Languages are foundational, though, and soundness issues in them affect all of their programs. This makes them a great candidate for formalization and verification, since the cost of errors is very high, but as usual that means we have to address the verification gap. In this post we’ll cross the gap by building and (partially) verifying an operational semantics of a small language in Isabelle/HOL, and we’ll extract this semantics into an executable interpreter in OCaml.
Property-based testing is a well-known testing approach where random input generation is combined with checking correctness properties expressed as predicate functions. The goal of property-based testing is to use large amounts of these randomly generated inputs as a proxy for asserting that the property is always true, also known as an invariant of the program. That’s the main difference from example-based testing: examples check the expected result of a single program execution, whereas properties are statements about all program executions.
No matter how many inputs we generate, though, anything short of exhaustive testing or proof leaves room for errors. This raises the question: are there other data generation strategies that we can use to check the same correctness properties that we’d use when property-based testing? A property predicate doesn’t care how the input data was generated, so we can decouple how data is produced from the actual property checking.
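That decoupling can be made concrete with a small sketch. The property here (sort idempotence) and the hand-rolled stdlib generator are stand-ins for what a real framework like Hypothesis or QuickCheck would provide; the point is only that the same predicate accepts inputs from any source.

```python
import random

def sort_is_idempotent(xs: list[int]) -> bool:
    """Property: sorting an already-sorted list changes nothing."""
    return sorted(sorted(xs)) == sorted(xs)

def check_property(prop, inputs) -> bool:
    """The predicate doesn't care where `inputs` came from."""
    return all(prop(xs) for xs in inputs)

# Strategy 1: random generation, as in property-based testing.
rng = random.Random(0)
random_inputs = [[rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
                 for _ in range(500)]

# Strategy 2: hand-picked edge cases. Same property, different generator.
edge_inputs = [[], [0], [1, 1, 1], list(range(10, 0, -1))]

print(check_property(sort_is_idempotent, random_inputs))  # True
print(check_property(sort_is_idempotent, edge_inputs))    # True
```

Any strategy that produces values of the right type can be plugged into `check_property`, which is exactly the seam that makes alternative generation strategies interesting.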
Enter the category-partition method, a testing technique that’s existed since the 1980s. It’s a hybrid human/automated approach for creating domain-driven test data. One of the biggest “downsides” to it is that it can produce lots of test cases, which often makes it prohibitively expensive for manual and example-based testing. But when testing for properties, lots of test cases is a good thing. So is the mashup of category-partition-created-inputs and property-based-testing-properties a hyphen-laden match made in heaven?
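A minimal sketch of the mashup, using a made-up function under test: pick representative choices for each input category, take the cross product to produce test cases, and feed every combination to the same kind of property predicate we’d use with random inputs. The `discounted_price` function and its categories are invented for illustration.

```python
from itertools import product

# Hypothetical function under test: discounted price, clamped at zero.
def discounted_price(price: float, percent: int) -> float:
    return max(0.0, price * (1 - percent / 100))

# Category-partition: choose representative values for each input category...
categories = {
    "price": [0.0, 0.01, 9.99, 1_000_000.0],
    "percent": [0, 1, 50, 99, 100],
}

# ...then the cross product yields every combination as a test case.
test_cases = list(product(categories["price"], categories["percent"]))

# The same property predicate we'd check against random inputs.
def price_in_bounds(price: float, percent: int) -> bool:
    result = discounted_price(price, percent)
    return 0.0 <= result <= price

assert all(price_in_bounds(p, pct) for p, pct in test_cases)
print(f"{len(test_cases)} combinations checked")  # 20 combinations checked
```

Even this toy example shows the combinatorial growth: four price choices and five percent choices already yield twenty cases, which is tedious to assert by hand but trivial to run through a property.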
The term “lightweight formal methods” is a bit of a contradiction, but the idea is that like everything else, formal methods exist on a spectrum. We don’t have to prove each line of our entire application to use mathematical thinking to reason about our software. A great example of this is model-based testing, where we can automatically generate test cases for an implementation by first constructing a model of the system. This is a well-studied approach, but what rekindled my curiosity in it was Amazon’s recent usage of it to test parts of S3’s functionality, as well as seeing some recent activity around the Quickstrom web application testing tool.
If it hasn’t been clear, I’m mostly interested in which aspects of formal methods we can use “at work” and on real projects. So what interests me about lightweight formal methods is that they generally have a built-in solution to the verification gap, meaning we can apply them to deployable code. Does the lightweight nature of model-based testing make it a viable approach for testing web applications? Let’s first go into an example, and we’ll tackle the underlying theory in a separate post.
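A toy version of the idea, with no relation to S3 or Quickstrom specifically: the model is a plain Python set, the “implementation” is a hypothetical shopping cart class, and a random sequence of operations drives both, checking after every step that their observable states agree.

```python
import random

class Cart:
    """Hypothetical implementation under test: a shopping cart."""
    def __init__(self):
        self._items = {}

    def add(self, item: str):
        self._items[item] = True

    def remove(self, item: str):
        self._items.pop(item, None)

    def contents(self) -> set[str]:
        return set(self._items)

def run_model_based_test(seed: int, steps: int = 200) -> bool:
    rng = random.Random(seed)
    model: set[str] = set()  # the model: just a set of item names
    impl = Cart()
    for _ in range(steps):
        item = rng.choice(["apple", "pear", "plum"])
        if rng.random() < 0.5:
            model.add(item)
            impl.add(item)
        else:
            model.discard(item)
            impl.remove(item)
        # After every step, the implementation must agree with the model.
        if impl.contents() != model:
            return False
    return True

print(all(run_model_based_test(seed) for seed in range(10)))  # True
```

The model is deliberately simpler than the implementation; its job is to be obviously correct, so that any disagreement points to a bug in the real system.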
Let’s assume that we’ve made the decision to formally verify a project that we’re working on. How would we even go about doing that? This simple question inevitably leads to the verification gap: the fact that software has to be executed at some point, but the tools for logic verification are often detached from the languages that we use to build executable programs. Some languages do exist that reduce or eliminate this gap, but they come with their own challenges and tradeoffs. Without trying to be an alarmist, there’s really no escaping this problem. At the end of the day, there is always some delta between what can be verified and what is actually happening in a production system. Thinking about this gap is crucial in understanding the limitations of formal methods, but homing in on it also opens up the conversation for practical solutions and a deeper understanding of software correctness.
Does programming paradigm affect the ultimate quality of software? I generally avoid stoking the eternal flame war of functional vs. imperative programming, but it has to be discussed since there are so many repeated statements of fact about which one leads to higher-quality programs. The arguments primarily exist along two dimensions: correctness and performance. Summing them up, functional programs are more correct but slow, and imperative programs are buggy but fast. Of course, there are counterexamples to each of these statements, but this is a fair generalization overall.
Thinking of it as a competition isn’t helpful though, because each paradigm has a different purpose: FP is optimized for mathematical reasoning, whereas imperative programming is fundamentally about manipulating computer hardware. It’s this difference between reasoning and practical execution that suggests that the highest quality software actually requires utilizing both paradigms: no matter what level we write code at, we must be able to reason about how that code runs at the level of real-world machines. This is what allows us to achieve the holy grail of quality: fast and correct software.
“Real world” software is large, messy, and full of detail. A customer might just want to store and retrieve their data, but those simple requirements can get lost in the sea of programming language semantics, libraries, frameworks, databases, Internet protocols, serialization formats, performance optimizations, security hardening, auditability, monitorability, asynchronicity, etc., etc., ad infinitum. We should always try to simplify our stack, but practical computation is optimization to an extent - how would you like to use a logically correct application where each interaction takes 10 seconds to give feedback?
To really understand this difference between functional and non-functional requirements, let’s look at the concept of refinement. Refinement is the fundamental concept behind behavior-preserving program transformation, and it allows us to separate abstraction from implementation in a formal and verifiable process.
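A concrete, if simplified, picture of refinement: an abstract specification over mathematical sets, a concrete “implementation” over sorted duplicate-free lists, and an abstraction function relating the two. The names here are invented for illustration; the check is that implementing an operation and then abstracting gives the same result as abstracting and then applying the spec.

```python
# Abstract specification: insertion into a mathematical set.
def spec_insert(s: frozenset[int], x: int) -> frozenset[int]:
    return s | {x}

# Concrete implementation: insertion into a sorted, duplicate-free list.
def impl_insert(xs: list[int], x: int) -> list[int]:
    if x in xs:
        return xs
    return sorted(xs + [x])

# Abstraction function: maps a concrete state to the abstract
# state that it represents.
def abstraction(xs: list[int]) -> frozenset[int]:
    return frozenset(xs)

def refines(xs: list[int], x: int) -> bool:
    """Commutation check: abstracting after impl_insert equals
    spec_insert after abstracting."""
    return abstraction(impl_insert(xs, x)) == spec_insert(abstraction(xs), x)

assert all(refines([1, 3, 5], x) for x in range(-2, 8))
```

In a full verification effort this commutation property is proven for all states and operations rather than checked on samples, which is what makes the transformation behavior-preserving in a formal sense.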
As people, we like to be right, and we also want to avoid the effort of proving that we’re right. So we often resort to logical fallacies to take shortcuts and strong-arm people into believing us. One such logical fallacy is an “appeal to authority” - you simply reference someone or something that is considered to be an expert, and voilà - now you are right!
The reason it’s a logical fallacy is that even smart people make mistakes. Or, even more importantly, smart people can have ill-intent - what better way to use social capital than to convince your followers to buy your new book and pay for your next vacation? That’s why papers should be peer-reviewed. That’s why we reproduce empirical experiments. And that’s why we should always have healthy skepticism and evaluate an argument for ourselves.
Formal verification is often presented as an appeal to authority. This program was formally verified, so please stop asking me if there are any bugs! Well, there are simply no silver bullets out there, and formal verification is certainly not one of them because of a very serious blind spot: the possibility of misspecification.
Have you ever seen a test suite actually prevent 100% of bugs? With all of the time that we spend testing software, how do bugs still get through? Testing seems simple on the surface – there are only so many branches in the code, only so many buttons in the UI, only so many edge cases to consider. So what is difficult about testing software?
This post is dedicated to Edmund Clarke, who spent a large portion of his life pioneering solutions to the state explosion problem.
First things first: why should we care about the quality of software?
Does it matter? The answer, like all answers, is that it completely depends on the context. Software is vastly diverse, and not all applications require the same level of quality and reliability. The most obvious litmus test is “will people die if the application fails to perform its job?” That isn’t the only dimension to consider, though. For example, people are very sensitive to mistakes made with their money. Our time is precious, our money is earned, and we expect our banking software to work correctly.