<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.2.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" hreflang="en" /><updated>2025-04-13T01:36:15+00:00</updated><id>/feed.xml</id><title type="html">Concerning Quality</title><author><name>Alex Weisberger</name></author><entry><title type="html">Bug Bash 2025 Conference Experience</title><link href="/bug-bash-2025/" rel="alternate" type="text/html" title="Bug Bash 2025 Conference Experience" /><published>2025-04-12T00:00:00+00:00</published><updated>2025-04-12T00:00:00+00:00</updated><id>/bug-bash-2025</id><content type="html" xml:base="/bug-bash-2025/">&lt;p&gt;The inaugural &lt;a href=&quot;https://bugbash.antithesis.com/#about&quot;&gt;Bug Bash conference&lt;/a&gt; was really special. I’ve been to many conferences, but this was legitimately the first that I felt “a part of,” because the subject matter greatly overlapped with what I’m interested in and what I write about here. There are various combinations of testing conferences, devops conferences, and formal methods conferences, sure, but this still felt like a new stake in the ground. Possibly because of the undeniable connection to deterministic simulation testing, or possibly because it just consisted of a bunch of people on a similar wavelength at the moment. But I’ve personally never been in a room where almost every single person raised their hand when a speaker asked: “who’s familiar with property-based testing?” So it certainly felt like something interesting was in the air.&lt;/p&gt;

&lt;p&gt;I left feeling more than ever that generative / autonomous testing is the future, &lt;a href=&quot;/generated-tests/&quot;&gt;even though I don’t need much encouragement there&lt;/a&gt;. There was a broader message though: if we really want correctness and reliability in the presence of radical software complexity, we should be open to &lt;em&gt;everything&lt;/em&gt;. From formal verification to testing in prod. From hand-crafted unit test cases to fault-injected end-to-end tests. From sprinkling a few asserts around our codebase to building new infrastructure components to better support deterministic testing. Every approach has different assumptions, tradeoffs, and strengths, so we had better start breaking down the walls between separate and even historically at-odds communities in order to elevate the correctness and reliability of our systems.&lt;/p&gt;

&lt;p&gt;That’s a message I can get behind, and that’s the reason I left feeling rejuvenated, and dare I say inspired.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The keynote given by Will Wilson set the stage for the conference as a whole, and perfectly introduced this overarching message. After diving into the question of “why do we have bugs to begin with?”, he posited that testing, verification, and observability are fundamentally not at odds, but rather variations on a theme. At the extreme, each of these seems totally incompatible with the others. Testing can be worlds away from formal verification, because the very act of writing a test is an admission that we can’t model the real execution environment with enough precision. Observability even further distances itself from testing by entirely punting on input data generation and allowing that to be handled by the natural operation of the production system.&lt;/p&gt;

&lt;p&gt;The line becomes totally blurred with small variations though, like going from hand-written to generated test inputs. This is much closer to user-generated inputs in prod because we don’t know exactly which inputs will be produced. Observability monitors are also just a generalization of test assertions, and formal properties generalize them both. In the world of generated inputs, testing, observability, and formal methods all start to blend together. Testing and observability particularly blend together when it comes to tracking down the actual cause of a generated test failure.&lt;/p&gt;

&lt;p&gt;Cue &lt;a href=&quot;https://antithesis.com/product/what_is_antithesis/&quot;&gt;the Antithesis tool&lt;/a&gt;, which is not “just” a deterministic hypervisor, but truly a test execution and analysis platform. Will showed a demo of using it to &lt;a href=&quot;https://github.com/etcd-io/etcd/issues/18667&quot;&gt;find a bug in etcd&lt;/a&gt; by querying over execution histories that the tool stores. He coined this workflow “pre-observability”: the idea that we can take the same root cause analysis techniques from post-deployment observability and apply them to the massive amount of execution traces produced by simulated system actions.&lt;/p&gt;

&lt;p&gt;On top of this, the bug was found much more regularly in the test environment due to fault injection techniques, highlighting one of the tradeoffs of observability vs. testing: sometimes a failure scenario is rare enough in production that it’s inefficient to sit and wait for it to happen. Fault injection speeds up the bug-finding process by triggering rare scenarios more frequently. It also highlights the unifying nature of formal methods: fault injection has always seemed to me &lt;a href=&quot;/prophecy-variables/&quot;&gt;a practical manifestation of prophecy variables&lt;/a&gt;, which were invented precisely to deal with situations like nondeterministic failures in distributed systems.&lt;/p&gt;

&lt;p&gt;All in this one talk, we spanned testing, observability, formal methods, and tied it together with a ribbon of determinism. I knew then that this was gonna be a good one.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Now, I’ll describe the overall &lt;em&gt;feel&lt;/em&gt; and themes of the conference, since you’ll be able to watch all the talks once they’re posted (which I recommend that you do).&lt;/p&gt;

&lt;h2 id=&quot;challenge-the-status-quo&quot;&gt;Challenge the Status Quo&lt;/h2&gt;

&lt;p&gt;One big theme I heard throughout the talks was that we should absolutely challenge the status quo. The obvious example of this is Antithesis itself. I, along with everyone else in the programming world, have been complaining about flaky tests for years. But what I have not done is &lt;strong&gt;write a deterministic hypervisor&lt;/strong&gt; to simply avoid the problem at its root.&lt;/p&gt;

&lt;p&gt;Generative testing is also inherently in opposition to the current mainstream state of quality techniques. In his talk about the adoption of the &lt;a href=&quot;https://hypothesis.readthedocs.io/en/latest/&quot;&gt;Hypothesis property-based testing library&lt;/a&gt;, Zac Hatfield-Dodds mentioned that only 5% of Python users use Hypothesis according to their measurements. When most people think of checking for functional correctness, they think of lots and lots of hand-written example-based tests running in CI, with maybe some linting or static type checking layered on top. They typically don’t think in terms of properties and generating inputs.&lt;/p&gt;
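&lt;p&gt;To make that contrast concrete, here’s roughly what the shift looks like. Hypothesis is a Python library, but the same idea is easy to sketch with fast-check, its TypeScript cousin (an illustrative sketch, not Hypothesis’s API):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as fc from 'fast-check';

// Example-based: one hand-picked input and its expected output.
expect([3, 1, 2].sort((a, b) =&amp;gt; a - b)).toEqual([1, 2, 3]);

// Property-based: a claim checked against many generated inputs.
fc.assert(
  fc.property(fc.array(fc.integer()), (xs) =&amp;gt; {
    const sorted = [...xs].sort((a, b) =&amp;gt; a - b);
    for (let i = 1; i &amp;lt; sorted.length; i++) {
      expect(sorted[i - 1]).toBeLessThanOrEqual(sorted[i]);
    }
  })
);
&lt;/code&gt;&lt;/pre&gt;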

&lt;p&gt;Zac remains convinced that testing properties is how we &lt;em&gt;should&lt;/em&gt; be testing (I agree), so rather than just give up, he shared his overall view on why people don’t adopt it as a practice. His message was that we should primarily focus on the human aspect of property-based testing, for example by making the tests easier to write and improving their error messages. On the easier writing front, &lt;a href=&quot;https://hypothesis.readthedocs.io/en/latest/reference/integrations.html#ghostwriter&quot;&gt;Hypothesis added a ‘write’ command&lt;/a&gt; that can help bootstrap tests for you. We have a ways to go there, but I appreciated that the maintainer of such a prominent library was taking a step back and really analyzing the situation.&lt;/p&gt;

&lt;p&gt;The most interesting manifestation of this theme though was in Kyle Kingsbury’s talk on the most recent Jepsen analyses. He was describing the analysis done on Datomic, where he encountered a surprising behavior that didn’t really fall into any of the existing terms we have for transaction anomalies. An eerie vibe came over the room when he suggested that we may have to create new terms for situations like these. If aphyr himself, one of the only people I know of who can &lt;a href=&quot;https://jepsen.io/consistency/phenomena&quot;&gt;tell the difference between G1a and G2-item phenomena&lt;/a&gt;, doesn’t have the words to describe a distributed system scenario, what the heck are we supposed to do?&lt;/p&gt;

&lt;p&gt;This may be a more personal revelation, but that moment made me realize: we are the adults in the room now. It’s not enough to just rehash things that Leslie Lamport or Barbara Liskov discovered 30-50 years ago. We need to be the ones doing new research, building new tools, or thinking of how we can create infrastructure that allows us to gain more control over our software. If we’re unhappy with the state of the world, we wield the power to change it. And many people are actively working on this.&lt;/p&gt;

&lt;h2 id=&quot;test-end-to-end&quot;&gt;Test End-to-End&lt;/h2&gt;

&lt;p&gt;Another common theme was end-to-end testing. End-to-end testing can be a dirty word in some circles, but this group of speakers went all-in on it. Mitchell Hashimoto didn’t get to true end-to-end testing until the end of his talk about making hard-to-test code testable, but he gave a great variety of advice on applying the &lt;a href=&quot;https://www.destroyallsoftware.com/screencasts/catalog/functional-core-imperative-shell&quot;&gt;functional-core-imperative-shell&lt;/a&gt; style of design to a codebase. This enables tests of interacting components, in this case stopping only at actual GPU instruction execution. This approach implies that mocking should only be done at strategic system boundaries, which I think is fantastic advice in general.&lt;/p&gt;

&lt;p&gt;But then he went on to talk about full end-to-end testing via &lt;a href=&quot;https://nixos.org/manual/nixos/stable/index.html#sec-nixos-tests&quot;&gt;NixOS VM testing&lt;/a&gt; for the final bits that you just don’t want to abstract away. This was actually the first time I heard about Nix’s VM testing, and this looks like a great tool for anyone plagued by the inconsistency of e2e test infrastructure management. I’m definitely going to give it a further look.&lt;/p&gt;

&lt;p&gt;Stephanie Wang spoke about all of the reliability lessons she learned &lt;a href=&quot;https://motherduck.com/&quot;&gt;while building MotherDuck&lt;/a&gt;, and someone from the audience asked how some of these were verified. She replied that they performed lots of chaos testing using a mock network interface for controllability, of course. When she spoke about minimizing data movement via caching as a win for both reliability and performance, I couldn’t think of a unit test that would give any kind of confidence about that. And this was the main assertion in Ben Egger’s talk about testing in prod at OpenAI, which is the most extreme form of end-to-end testing: no matter how well you model your system, prod is the concrete instantiation of it, and you shouldn’t ignore the ways that all of your components interact in the production setting.&lt;/p&gt;

&lt;p&gt;End-to-end testing is the perfect example of where testing and observability have a lot in common. The further toward prod your test moves, the more you have to worry about collecting information from hard-to-reach infrastructure components, and the more the semantics of these components influence system behavior. This part about hard-to-reproduce semantics is the whole reason Jepsen takes the end-to-end testing approach (“as God intended it” as Kyle likes to say). Because unit tests are great and all, but will they catch anomalies &lt;a href=&quot;https://concerningquality.com/txn-isolation-testing/&quot;&gt;caused by weak transaction isolation?&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;formal-methods-push-the-limits-of-testing&quot;&gt;Formal Methods Push The Limits of Testing&lt;/h2&gt;

&lt;p&gt;This was my personal favorite theme because it so perfectly puts my intuition into words. This was spoken almost verbatim by Ankush Desai in his talk about formal and semi-formal methods at AWS. His point was that the value of formal techniques isn’t limited to actual formal verification. The logical approach that formal methods take can be used to come up with testing techniques as well.&lt;/p&gt;

&lt;p&gt;This to me is the sweet spot of formal methods, at least in today’s landscape. Formality gives us the framework and strategy for figuring out what exactly we should be looking for and how we should think about systems, but we can use tests in place of proofs when it comes to the checking part. We sacrifice the completeness of the checking in the name of practicality and efficiency: generative tests can’t &lt;em&gt;prove&lt;/em&gt; a property, but they provide a much higher level of confidence than a few hand-crafted example scenarios. This is an idea that others have shed light on as well: the Cogent sub-project of seL4 wrote a paper about using &lt;a href=&quot;https://trustworthy.systems/publications/papers/Chen_ROSKHK_22.pdf&quot;&gt;property-based tests as an intermediary on their way to proofs&lt;/a&gt; in the verification of a filesystem implementation.&lt;/p&gt;

&lt;p&gt;In Ankush’s talk, he introduced PObserve, a framework for checking a production system against a specification written in &lt;a href=&quot;https://github.com/p-org/P&quot;&gt;P, a language that he created&lt;/a&gt; and that is in use within AWS. Instead of using the spec to prove the implementation correct, it takes logs from the real system and checks that they adhere to the specification. This is similar to the model-based tests that many property-based testing libraries support, but it instead takes the observability-inspired approach of checking execution traces extracted from the actual running system. It also reminds me of &lt;a href=&quot;https://docs.tracetest.io/concepts/what-is-trace-based-testing&quot;&gt;Tracetest&lt;/a&gt;.&lt;/p&gt;
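&lt;p&gt;To give a rough feel for the idea (an illustrative TypeScript sketch, not P or PObserve’s actual API): a specification can be reduced to a checker that replays logged events against the spec’s state machine and flags the first divergence.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Illustrative only: check production log events against an executable spec,
// here a simple mutual-exclusion invariant over a lock service.
type LockEvent = { op: 'acquire' | 'release'; client: string };

function checkTrace(events: LockEvent[]): string | null {
  let holder: string | null = null; // spec state: at most one holder
  for (let i = 0; i &amp;lt; events.length; i++) {
    const e = events[i];
    if (e.op === 'acquire') {
      if (holder !== null) return `violation at event ${i}: lock already held`;
      holder = e.client;
    } else {
      if (holder !== e.client) return `violation at event ${i}: release by non-holder`;
      holder = null;
    }
  }
  return null; // the trace conforms to the spec
}
&lt;/code&gt;&lt;/pre&gt;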

&lt;p&gt;This was another case of the trifecta of testing, observability, and formal methods all working together, this time with a focus on &lt;a href=&quot;/model-based-testing/&quot;&gt;validating implementation behavior against a model&lt;/a&gt;. We sometimes refer to such approaches as “lightweight formal methods,” and this is the area that I see being most likely to be implemented in a practical setting.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;On top of the actual talks, there was a very social vibe to the conference in general, which is a hard thing to fake. I got the sense that there’s very much an appetite for the particular brand of autonomous testing, lightweight formal methods, and observability techniques being presented there. Overall, it was a great experience, and really galvanized and clarified ideas that I’ve been mulling over for a while now. Thank you to Antithesis for shepherding this conversation and getting everyone together under one roof to contribute to it. If there’s a Bug Bash 2026, I will certainly be first in line for a ticket.&lt;/p&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="observability" /><category term="reliability" /><summary type="html">The inaugural Bug Bash conference was really special. I’ve been to many conferences, but this was legitimately the first that I felt “a part of,” because the subject matter greatly overlapped with what I’m interested in and what I write about here. There are various combinations of testing conferences, devops conferences, and formal methods conferences, sure, but this still felt like a new stake in the ground. Possibly because of the undeniable connection to deterministic simulation testing, or possibly because it just consisted of a bunch of people on a similar wavelength at the moment. But I’ve personally never been in a room where almost every single person raised their hand when a speaker asked: “who’s familiar with property-based testing?” So it certainly felt like something interesting was in the air.</summary></entry><entry><title type="html">Branch Coverage Won’t Prove The Collatz Conjecture</title><link href="/collatz-conjecture/" rel="alternate" type="text/html" title="Branch Coverage Won’t Prove The Collatz Conjecture" /><published>2025-01-26T00:00:00+00:00</published><updated>2025-01-26T00:00:00+00:00</updated><id>/collatz-conjecture</id><content type="html" xml:base="/collatz-conjecture/">&lt;p&gt;The Collatz conjecture is the prime example of the limitations of thinking in terms of branch coverage. It can be written as a recursive function in 5 lines of code with only three branches. That’s great, except we have no idea if it’s true or not, and no amount of testing can prove it either way.&lt;/p&gt;

&lt;p&gt;Here’s the code for generating the Collatz process:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;function collatz(n: number): boolean {
    if (n === 1) {
        return true;
    }

    return n % 2 === 0 ? collatz(n / 2) : collatz(3 * n + 1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It’s quite simple, by almost any metric. Just a couple of conditionals, and some plain arithmetic. The conjecture is that this &lt;em&gt;always&lt;/em&gt; returns true: no matter the starting number, all paths should end at 1, says Collatz.&lt;/p&gt;
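&lt;p&gt;Written as a recurrence, the conjecture says that iterating the following map from any positive integer eventually reaches 1:&lt;/p&gt;

\[T(n) = \begin{cases} n/2 &amp;amp; \text{if } n \text{ is even} \\ 3n + 1 &amp;amp; \text{if } n \text{ is odd} \end{cases}\]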

&lt;p&gt;There are only a few lines of code. Let’s just test all the branches. To make this a little more explicit, let’s unwind the ternary into an if-else:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;function collatz(n: number): boolean {
    if (n === 1) {
        return true;
    }

    if (n % 2 === 0) {
        return collatz(n / 2)
    } else {
        return collatz(3 * n + 1);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We only need 2 test cases to hit all 3 branches: n=2, and n=3. Here’s the sequence of &lt;code&gt;n&lt;/code&gt; values that result in each case, just to get a feel for how the state progresses:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;n=2 -&amp;gt; n=1 ==&amp;gt; true
n=3 -&amp;gt; n=10 -&amp;gt; n=5 -&amp;gt; n=16 -&amp;gt; n=8 -&amp;gt; n=4 -&amp;gt; n=2 -&amp;gt; n=1 ==&amp;gt; true
&lt;/code&gt;&lt;/pre&gt;
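
&lt;p&gt;As actual tests, those two cases might look like this (a sketch, assuming a Jest-style runner):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;describe('collatz', () =&amp;gt; {
  it('covers the base case and the even branch', () =&amp;gt; {
    expect(collatz(2)).toBe(true);
  });

  it('covers the odd branch (and revisits the others)', () =&amp;gt; {
    expect(collatz(3)).toBe(true);
  });
});
&lt;/code&gt;&lt;/pre&gt;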

&lt;p&gt;That was easy. All the branches are covered. There’s just one problem: since it was proposed in the 1930s, the entire math community has been unable to prove or disprove it. We don’t know if this is just a pattern up to some gigantic value of n, after which it breaks down, or if it’s the real deal and we can finally watch it grow up into a real theorem. We simply don’t know for sure if it’s &lt;em&gt;always&lt;/em&gt; true, or even within what bounds it is true. The issue is that the state oscillates. If we could show that every iteration of the recursion produced a smaller value, then we’d be sure that we’ll always get down to 1. But when n is odd, we go &lt;em&gt;up&lt;/em&gt;. The progress is inconsistent. It, pretty surprisingly given its apparent simplicity, completely eludes our species.&lt;/p&gt;

&lt;p&gt;Look back at the above test cases and how they create sequences of &lt;code&gt;n&lt;/code&gt; values. Sequences like this are what software behavior boils down to. A program is really two things: its code, along with the set of all behaviors that it produces. Branch coverage is a statement about the code, but it doesn’t touch the full breadth of the runtime behavior of the program. And the runtime behavior is what determines correctness.&lt;/p&gt;

&lt;p&gt;This is why a tiny little function can lead to an unknowable question. There are lots of numbers, so lots of possible sequences of &lt;code&gt;n&lt;/code&gt;, and in this case the code branches keep getting revisited until the program terminates. That is, &lt;em&gt;if&lt;/em&gt; it terminates.&lt;/p&gt;

&lt;p&gt;Branch coverage gives you a small glimpse of your code’s behavior, but it isn’t enough to prove the Collatz conjecture.&lt;/p&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><summary type="html">The Collatz conjecture is the prime example of the limitations of thinking in terms of branch coverage. It can be written as a recursive function in 5 lines of code with only three branches. That’s great, except we have no idea if it’s true or not, and no amount of testing can prove it either way.</summary></entry><entry><title type="html">Simulating Some Queues</title><link href="/queue-simulations/" rel="alternate" type="text/html" title="Simulating Some Queues" /><published>2025-01-03T00:00:00+00:00</published><updated>2025-01-03T00:00:00+00:00</updated><id>/queue-simulations</id><content type="html" xml:base="/queue-simulations/">&lt;p&gt;System performance boils down to the timing behavior of various interacting queues. Queues are one of those incredibly simple but powerful concepts, but they have some unintuitive or non-obvious behavior when only thinking about them mathematically. Simulating queueing scenarios gives us a better picture about how queues operate in practice.&lt;/p&gt;

&lt;h1 id=&quot;the-unit-queue&quot;&gt;The Unit Queue&lt;/h1&gt;

&lt;p&gt;Let’s introduce the simplest possible queue as a reference point, which we’ll call the unit queue. Requests arrive once per second, and each request takes one second to process. There’s only one processor that services requests. Here are some quick definitions about the operations of this queue:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Arrival Rate&lt;/strong&gt;: the rate that requests come into the queue. Here, it is 1 / second.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Processing Time&lt;/strong&gt;: the time it takes to process a request. Here, it’s 1 second.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Wait Time&lt;/strong&gt;: the amount of time a request waits after arrival and before processing begins. Here, the wait time for all requests is 0.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;: the total time it takes to process a request after arrival. Here, it’s 1 second.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Active Request&lt;/strong&gt;: a request that’s currently being processed.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Queued Request&lt;/strong&gt;: a request that’s waiting to be processed after arrival.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Queue Length&lt;/strong&gt;: the number of queued requests. Here, it’s always 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This queue is in an equilibrium state: as soon as a request is done being processed, a new one comes in. And before the next one comes in, the current request has enough time to complete. This means that a request never has to wait to be processed, and it begins processing as soon as it comes in. Because of this, the queue length is always 0 and never grows.&lt;/p&gt;

&lt;h1 id=&quot;discrete-event-simulation&quot;&gt;Discrete Event Simulation&lt;/h1&gt;

&lt;p&gt;This won’t be a deep dive into discrete event simulation, but it helps to know a few things about it to understand the data that we’re generating in our simulations. You can read more &lt;a href=&quot;https://simpy.readthedocs.io/en/latest/&quot;&gt;in the SimPy docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The basic idea is that we emit events for system changes, and the system is assumed to be in the same state between events. Because of this, we can “fast forward” time by only considering the events and not waiting for time to pass. It’s another manifestation of the state machine model of a system, only here we can keep track of the duration of each transition instead of only worrying about the states that changed.&lt;/p&gt;

&lt;p&gt;In the case of a queue, we’ll broadcast one event for each of:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;request arrival&lt;/li&gt;
  &lt;li&gt;processing start&lt;/li&gt;
  &lt;li&gt;processing end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With just these events, we can calculate:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;wait time (processing start - request arrival)&lt;/li&gt;
  &lt;li&gt;processing time (processing end - processing start)&lt;/li&gt;
  &lt;li&gt;latency (processing end - request arrival)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also record queue lengths whenever a request arrives. We’ll look at some code in a bit; for now, let’s focus on the behavior that the simulation gives us. Simulating 5 minutes of the unit queue leads to the following graphs:&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/unitqueue.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Visually, equilibrium is a bunch of straight lines. The lines remain straight because the queue length is always 0, so no wait time ever gets introduced. Let’s see what happens when we break this.&lt;/p&gt;

&lt;p&gt;Queue equilibrium relies on the following inequality always being true:&lt;/p&gt;

\[processing\ time \leq interarrival\ time\]
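
&lt;p&gt;Equivalently, in standard queueing notation: with arrival rate λ (requests per second) and mean processing time E[S] (seconds), the server’s utilization ρ must not exceed 1:&lt;/p&gt;

\[\rho = \lambda \cdot E[S] \leq 1\]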

&lt;p&gt;If the processing time ever exceeds the interarrival time, the queue length will begin to grow, and thus some wait time will be added to the latency of subsequent requests. Let’s simulate the same 1 / second arrival rate, but with a 2 second processing time (note that we no longer record 300 requests, because fewer requests can complete in the fixed time window once queueing is introduced):&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/unitqueue_slower.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The processing time remains constant at 2 seconds, but latency, wait time, and queue length all increase. It’s actually worse: they increase &lt;em&gt;indefinitely&lt;/em&gt;. This queue will never catch up, because the processing time exceeds the interarrival time. It is saturated.&lt;/p&gt;

&lt;p&gt;The effect is brutal. After 100 requests arrive in the queue, only 50 have been processed, so there’s a queue of 50 requests. Requests 101 and onward wait for 100 seconds before even beginning processing, and their total latency reflects this.&lt;/p&gt;

&lt;p&gt;The lesson here is: there’s no ideal processing time or arrival rate. Their relationship is what matters, so we need to know both. Even if there’s no change in the processing time of a request, an increase in arrivals will lead to queueing and increased latencies across the board.&lt;/p&gt;

&lt;h1 id=&quot;processing-distributions&quot;&gt;Processing Distributions&lt;/h1&gt;

&lt;p&gt;Here’s where simulations really become useful. We obviously won’t have a system with constant processing times. They’ll depend on any number of factors: customer size, data skew, current system load, etc. Let’s look at what happens when we set the average processing time back to 1, but this time we distribute the times exponentially. As a reminder, the exponential distribution favors smaller values, but there’s a long tail of large ones. It looks like this:&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/exp_dist.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This shows 100,000 random samplings of an exponential distribution. Thinking of it in terms of the number of requests that fall into the given processing time ranges, ~63,000 requests would be between 0 and 1 seconds, and ~86,000 would be between 0 and 2 seconds (~86% of all requests). A relatively small number of requests would take more than 2 seconds, but we get requests all the way up to 12 seconds.&lt;/p&gt;
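
&lt;p&gt;Those proportions follow directly from the exponential distribution’s CDF. With mean processing time μ, the fraction of requests completing within t seconds is:&lt;/p&gt;

\[P(X \leq t) = 1 - e^{-t/\mu}\]

&lt;p&gt;With μ = 1, that comes out to about 63.2% of requests within 1 second and 86.5% within 2 seconds, which matches the histogram.&lt;/p&gt;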

&lt;p&gt;The average of all of these is still ~1 second. Let’s see what happens when these are the processing times instead of the constant 1 second, keeping the 1 / second arrival rate constant:&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/queue_simulation_exponential.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The max processing time looks to be around 5 seconds, and there are only a few requests that high. But the latency increases to over 10 seconds at parts, because there’s a big swell in the queue length at around request 150. I calculated some more metrics for this particular simulation run:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;p99 latency: 4.26s&lt;/li&gt;
  &lt;li&gt;Average wait time: 3.34s&lt;/li&gt;
  &lt;li&gt;Average queue length: 2.93&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even with an average processing time of 1 second, and rare long requests, there is still pretty constant queueing here.&lt;/p&gt;

&lt;p&gt;The lesson here is: don’t only look at average request times, because there can be wildly different queueing characteristics for the same average value. The processing time distribution should always be considered.&lt;/p&gt;

&lt;p&gt;In this particular case, where we only have one processor servicing the queue and a processing time distribution with a long tail, we can smooth out the queueing by adding an additional processor (&lt;code&gt;NUM_PROCESSORS = 2&lt;/code&gt; in the code below):&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/queue_simulation_exponential_multiple_processors.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This is slightly surprising, because we know that we have requests up to and past 5 seconds. If multiple of those happen at the same time, even with two processors they should clog up the queue and introduce queueing. But, we know that those long requests are rare, so the odds of two of them getting processed at the same time are low. It still does happen, as we see by the queue length increasing at certain points, but the queue recovers quickly. Average wait time for this run was 0.08 seconds, and the average queue length was 0.04. So, any latency is due to the actual request processing time, which is ideal.&lt;/p&gt;

&lt;h1 id=&quot;code&quot;&gt;Code&lt;/h1&gt;

&lt;p&gt;Now for a little code, for those who are interested in running their own simulations (you’ll need to install &lt;code&gt;simpy&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, and &lt;code&gt;numpy&lt;/code&gt; via your favorite Python dependency manager):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import simpy
import itertools
import random
from dataclasses import dataclass
import matplotlib.pyplot as plt
import numpy as np

SIM_DURATION = 300

EXPONENTIAL_DIST = 'exponential'
GAUSSIAN_DIST = 'gaussian'
UNIFORM_DIST = 'uniform'
CONSTANT_DIST = 'constant'

MEAN_PROCESSING_TIME = 1
NUM_PROCESSORS = 1

@dataclass
class Monitor:
    wait_times: list[float]
    queue_lengths: list[int]
    latencies: list[float]
    processing_times: list[float]

class Queue:
    queue: simpy.Resource

    def __init__(self, env):
        self.env = env
        self.queue = simpy.Resource(env, capacity=NUM_PROCESSORS)

def request(env, n, dist, monitor, queue):
    arrival = env.now
    monitor.queue_lengths.append(len(queue.queue.queue))
    with queue.queue.request() as req:
        yield req
        wait = env.now - arrival
        monitor.wait_times.append(wait)

        processing_start = env.now
        execution_time = 1
        mean_processing_time = MEAN_PROCESSING_TIME
        if dist == EXPONENTIAL_DIST:
            execution_time = random.expovariate(1 / mean_processing_time)
        elif dist == GAUSSIAN_DIST:
            execution_time = random.gauss(mean_processing_time, mean_processing_time / 4)
        elif dist == UNIFORM_DIST:
            delta = mean_processing_time * 0.5
            execution_time = random.uniform(mean_processing_time - delta, mean_processing_time + delta)
        elif dist == CONSTANT_DIST:
            execution_time = mean_processing_time

        yield env.timeout(execution_time)
        monitor.processing_times.append(env.now - processing_start)
        monitor.latencies.append(env.now - arrival)

def generate_load(env, latency_dist, monitor, queue):
    req_count = itertools.count()
    while True:
        yield env.timeout(1)
        env.process(request(env, next(req_count), latency_dist, monitor, queue))

def simulate_queue(latency_dist, monitor):
    env = simpy.Environment()
    q = Queue(env)
    env.process(generate_load(env, latency_dist, monitor, q))

    env.run(until=SIM_DURATION)

monitors = {}
for latency_dist in [CONSTANT_DIST, UNIFORM_DIST]:
    monitor = Monitor([], [], [], [])
    simulate_queue(latency_dist, monitor)
    monitors[latency_dist] = monitor

    print(f&quot;Wait times: {monitor.wait_times}&quot;)
    print(f&quot;Queue lengths: {monitor.queue_lengths}&quot;)
    print(f&quot;Latencies: {monitor.latencies}&quot;)

    print()
    print(f&quot;Average wait time: {sum(monitor.wait_times) / len(monitor.wait_times):.2f}&quot;)
    print(f&quot;Average latency: {sum(monitor.latencies) / len(monitor.latencies):.2f}&quot;)
    print(f&quot;p99 Latency: {np.percentile(np.array(monitor.latencies), 99):.2f}&quot;)
    print(f&quot;Average queue length: {sum(monitor.queue_lengths) / len(monitor.queue_lengths):.2f}&quot;)
    print(f&quot;Average processing time: {sum(monitor.processing_times) / len(monitor.processing_times):.2f}&quot;)

    print()

plot_monitors(monitors)  # defined in the next snippet
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is set up to run multiple different processing time distributions, which you can mix and match to compare. The main thing to know about the pattern that &lt;code&gt;SimPy&lt;/code&gt; employs is that you &lt;code&gt;yield&lt;/code&gt; events, which basically means “wait for this event to occur.” In &lt;code&gt;generate_load&lt;/code&gt; we first &lt;code&gt;yield env.timeout(1)&lt;/code&gt;, which is how we say to send requests every 1 second. To be pedantic, this just sends it every one “time unit,” and we are just interpreting it as the unit being seconds.&lt;/p&gt;

&lt;p&gt;After that timeout completes, we run the &lt;code&gt;request&lt;/code&gt; function which interacts with the queue. &lt;code&gt;SimPy&lt;/code&gt; has the concept of a &lt;code&gt;Resource&lt;/code&gt; which is a thing that can only be accessed a finite number of times. A &lt;code&gt;Resource&lt;/code&gt; with a capacity set to 1 is equivalent to a queue with 1 processor. We wait for the queue to be available with:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;with queue.queue.request() as req:
    yield req
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then we pick a distribution and sample a value out of it, and wait for another timeout event  with &lt;code&gt;yield env.timeout(execution_time)&lt;/code&gt; which simulates the request processing time. We pass a &lt;code&gt;Monitor&lt;/code&gt; object throughout which keeps track of the various raw pieces of data so we can plot them later.&lt;/p&gt;

&lt;p&gt;Here’s the definition of &lt;code&gt;plot_monitors&lt;/code&gt; for completeness:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def plot_monitors(monitors):
    for dist, monitor in monitors.items():
        plot_monitor(dist, monitor,)

def plot_monitor(dist, monitor):
    min_len = min(len(monitor.wait_times), len(monitor.latencies), len(monitor.queue_lengths), len(monitor.processing_times))

    x_values = range(min_len)
    y_values_list = [
        monitor.processing_times[:min_len],
        monitor.latencies[:min_len],
        monitor.wait_times[:min_len],
        monitor.queue_lengths[:min_len],
    ]
    y_labels = [&quot;Processing Time (s)&quot;, &quot;Latency (s)&quot;, &quot;Wait Time (s)&quot;, &quot;Queue Length&quot;]
    titles = [&quot;Processing Time&quot;, &quot;Latency&quot;, &quot;Wait Time&quot;, &quot;Queue Length&quot;]

    num_subplots = len(y_values_list)
    fig, axes = plt.subplots(num_subplots, 1, figsize=(8, 6))

    for i, ax in enumerate(axes):
        ax.plot(x_values, y_values_list[i], linestyle='-', label=y_labels[i])
        ax.set_title(titles[i], fontweight='bold')
        ax.set_xlabel(&quot;Request Number&quot;)
        ax.set_ylabel(y_labels[i])
        ax.grid(True, linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    plt.show()
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Alex Weisberger</name></author><category term="performance" /><summary type="html">System performance boils down to the timing behavior of various interacting queues. Queues are one of those incredibly simple but powerful concepts, but they have some unintuitive or non-obvious behavior when only thinking about them mathematically. Simulating queueing scenarios gives us a better picture about how queues operate in practice.</summary></entry><entry><title type="html">Controlling Nondeterminism in Model-Based Tests with Prophecy Variables</title><link href="/prophecy-variables/" rel="alternate" type="text/html" title="Controlling Nondeterminism in Model-Based Tests with Prophecy Variables" /><published>2024-12-23T00:00:00+00:00</published><updated>2024-12-23T00:00:00+00:00</updated><id>/prophecy-variables</id><content type="html" xml:base="/prophecy-variables/">&lt;p&gt;We have to constantly wrestle with nondeterminism in tests. Model-based tests present unique challenges in dealing with it, since the model must support the implementation’s nondeterministic behavior without leading to flaky failures. In traditional example-based tests, nondeterminism is often controlled by adding stubs, but it’s not immediately clear how to apply this in a model-based context where tests are generated. We’ll look to the theory of refinement mappings for a solution.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;In model-based testing, we construct a model of the system and use it as an executable specification in tests. One of the main benefits of doing this is that we end up with a highly-simplified description of the system’s behavior, bereft of low-level details like network protocols, serialization, concurrency, asynchronicity, disk drives, operating system processes, etc. The implementation, however, has all of these things, and is beholden to their semantics.&lt;/p&gt;

&lt;p&gt;This generally means that model states are not equivalent to implementation states, and are thus not directly comparable. This is fine, &lt;a href=&quot;/model-based-testing-theory/&quot;&gt;because we can define a refinement mapping between them&lt;/a&gt; and carry on. Nondeterminism complicates this mapping though.&lt;/p&gt;

&lt;p&gt;Let’s look at a concrete example. Here’s a model of an authentication system, that allows for the creation of new users:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type User = {
  id: number;
  username: string;
  password: string;
}

type CreateUser = {
  username: string;
  password: string;
}

type AuthError = 'username_exists';

class Auth {
  users: User[] = [];
  error: AuthError | null = null;

  createUser(toCreate: CreateUser) {
    if (this.users.some(u =&amp;gt; u.username === toCreate.username)) {
      this.error = 'username_exists';
      return;
    }

    const user: User = {
      // User requires an id; derive a simple unique one for the model
      id: this.users.length + 1,
      username: toCreate.username,
      password: toCreate.password
    }

    this.users.push(user);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this model, we have a set of Users that we can add to, though doing so might produce an error if the username is already taken. This error is a domain error, related to the logic of authentication, so it’s essential to include in the model.&lt;/p&gt;

&lt;p&gt;Not all errors are alike. In a real implementation, we’re going to have timeouts set on the web request as well as database statements. Timeouts are unrelated to the domain of authentication, and they also happen to be non-deterministic: for the same inputs, a timeout may or may not occur based on system load. It’s not obvious what to do about this, but if we do nothing, two problems arise:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A timeout in the test could lead to a flaky test failure.&lt;/li&gt;
  &lt;li&gt;We don’t sufficiently test the timeout-handling codepath.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These need to be addressed.&lt;/p&gt;

&lt;h1 id=&quot;handling-implementation-level-errors-in-a-model&quot;&gt;Handling Implementation-Level Errors in a Model&lt;/h1&gt;

&lt;p&gt;What does a timeout in the implementation mean in terms of the model? There’s two main interpretations:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It corresponds to a no-op in the model (aka a stutter step).&lt;/li&gt;
  &lt;li&gt;It maps to some separate error value in the model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither one is more correct than the other, but be aware that allowing for stutter steps leads to potential false-positive passing tests. If a timeout occurs in the &lt;code&gt;createUser&lt;/code&gt; operation, no new users will be added to the set of all users, but the test will still pass because we chose to allow for equal initial and final states. Stutter steps are necessary in theory, but we should be careful when allowing for them in tests; otherwise our test suite will pass on a run where 100% of calls to &lt;code&gt;createUser&lt;/code&gt; time out.&lt;/p&gt;

&lt;p&gt;There are ways of mitigating the risk of vacuously passing tests. For example, we could make a statistical correctness statement: the test only passes if no more than 10% of &lt;code&gt;createUser&lt;/code&gt; operations time out. This is more of a statement about &lt;em&gt;reliability&lt;/em&gt; though, and not a statement about functional behavior. I think it’s best to keep functional behavior tests in the domain of logical time, and to instead use observability tools for collecting reliability metrics.&lt;/p&gt;

&lt;p&gt;For functional testing, there’s a better way that avoids statistical correctness statements. It just involves predicting the future.&lt;/p&gt;

&lt;h1 id=&quot;tests-oracles-and-prophecy&quot;&gt;Tests, Oracles, and Prophecy&lt;/h1&gt;

&lt;p&gt;A brief philosophical aside. Tests are almost entirely about seeing into the future. Simply writing down the expected outputs of an operation means that we know what they should be ahead of time. We are the so-called test oracle. In model-based testing, we instead delegate this prediction to the model: the model is the oracle.&lt;/p&gt;

&lt;p&gt;There’s a very well-known solution to the problem of predicting the future of a nondeterministic operation in a test: test doubles. Stubs in particular are commonly used to control things like timeouts. Say we have a client-server implementation of our &lt;code&gt;Auth&lt;/code&gt; module. We’d likely make client-side network requests through an interface and use stubs in our tests to control the code path taken:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type User = { ... }
type AuthSystemError = 'timeout';
type AuthError = 'username_exists';
type AuthServerResponse = User | AuthError | AuthSystemError;

interface AuthServer {
  createUser(toCreate: CreateUser): AuthServerResponse;
}

class AuthClient {
  users: User[] = [];
  server: AuthServer;
  error: string | null = null;

  constructor(server: AuthServer) {
    this.server = server;
  }

  createUser(toCreate: CreateUser) {
    const resp = this.server.createUser(toCreate);
    if (resp === 'timeout') {
      this.error = 'There was a problem creating the user. Please try again or contact support.';
    } else if (resp === 'username_exists') {
      this.error = 'That username is already taken. Please choose another.';
    } else {
      this.users.push(resp);
    }
  }
}

// test file:
class AuthServerTimeout implements AuthServer {
  createUser(toCreate: CreateUser): AuthServerResponse {
    return 'timeout';
  }
}

describe('Timeout behavior', () =&amp;gt; {
  it('displays a timeout message when the request times out', () =&amp;gt; {
    const auth = new AuthClient(new AuthServerTimeout());
    auth.createUser({ username: 'user', password: 'pass' });

    expect(auth.error).toEqual('There was a problem creating the user. Please try again or contact support.');
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This pattern is ingrained in our muscle memory, but it’s actually quite interesting from the perspective of oracles and predicting the future. The simple &lt;code&gt;AuthClient&lt;/code&gt; has code paths that are not ergonomic to trigger in a test (we like to avoid the use of &lt;code&gt;sleep&lt;/code&gt; anywhere in tests, and otherwise the timeout will be dependent on nondeterministic system load). So instead of triggering the scenario that leads to a timeout, we simply set up the code in a way that guarantees the timeout code path is taken. In effect, we tell the code under test what its own destiny is, and use that to create a dependable, deterministic assertion in the test.&lt;/p&gt;

&lt;p&gt;From the test-writer’s point of view, this is a simple technique, but from the code’s point of view, it’s as if we’re showing it a prophecy of its life ahead of time. We are an oracle indeed!&lt;/p&gt;

&lt;p&gt;In model-based tests, we don’t create individual test cases, so we need a way to generate different stub configurations if we want to test a timeout code path. Once we put it that way, the answer is simple: just generate a variable that we can use to dynamically configure stubs. Because this variable predicts future execution, we call it a &lt;em&gt;prophecy variable&lt;/em&gt;.  For this, we can name it &lt;code&gt;isTimeout&lt;/code&gt;, and go from there. First we extend the model to be aware of this variable:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class Auth {
  // ...
  error: AuthError | AuthSystemError | null = null;

  createUser(toCreate: CreateUser, isTimeout: boolean) {
    if (isTimeout) {
      this.error = 'timeout';
      return;
    }

    // ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This avoids the stutter-step issue from before. We elevate the system-level error to the model level, and we make it so that the timeout error only occurs when &lt;code&gt;isTimeout&lt;/code&gt; tells it to. This is how we can be sure that unintended timeouts aren’t happening in the tests. Then, the implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class AuthServerImpl implements AuthServer {
    users: User[] = [];

    createUser(toCreate: CreateUser): AuthServerResponse {
      // real networking / server impl
    }
}

class Client {
    users: User[] = [];
    error: AuthError | null = null;
    implError: AuthSystemError | null = null;
    
    server: AuthServer;

    constructor(server: AuthServer) {
      this.server = server;
    }

    createUser(toCreate: CreateUser) {
      const result = this.server.createUser(toCreate);
      if (result === 'timeout') {
        this.implError = result;
        return;
      }

      if (result === 'username_exists') {
        this.error = result;
        return;
      }

      this.users.push(result);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And here’s what the model-based test would look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as fc from 'fast-check';

const genToCreate = () =&amp;gt; fc.record({
  username: fc.string(),
  password: fc.string()
});

const genUser = () =&amp;gt; fc.record({
  id: fc.integer(),
  username: fc.string(),
  password: fc.string()
});

const genUsers = () =&amp;gt; fc.array(genUser());

const genProphecy = () =&amp;gt; fc.boolean();

const externalAuthState = (auth: Auth): AuthState =&amp;gt; {
  return {
    users: auth.users,
    error: auth.error
  }
}

const externalClientState = (client: Client): ClientState =&amp;gt; {
  return {
    users: client.users,
    error: client.error,
    implError: client.implError,
  }
}

const refinementMapping = (isTimeout: boolean, implState: ClientState): AuthState =&amp;gt; {
  return {
    users: implState.users,
    error: isTimeout ? implState.implError : implState.error,
  }
}

describe('Prophecy-aware Auth test', () =&amp;gt; {
  it('should correspond to the model', () =&amp;gt; {
    fc.assert(
      fc.property(genUsers(), genToCreate(), genProphecy(), (users, toCreate, isTimeout) =&amp;gt; {
        const auth = new Auth();
        auth.users = [...users];

        let server: AuthServer;
        if (isTimeout) {
          server = new AuthServerTimeout();
        } else {
          const realServer = new AuthServerImpl();
          server = realServer;
        }
        const client = new Client(server);
        client.users = [...users];

        auth.createUser(toCreate, isTimeout);
        client.createUser(toCreate);

        const authState = externalAuthState(auth);
        const mappedState = refinementMapping(isTimeout, externalClientState(client));
        expect(mappedState).toEqual(authState);
      }),
      { endOnFailure: true, numRuns: 10000}
    );
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We use &lt;code&gt;isTimeout&lt;/code&gt; to choose which &lt;code&gt;AuthServer&lt;/code&gt; implementation to use, and we compare the now-prophecy-aware implementation to the model. To compare the different state values, we do a little bit of bookkeeping, first by projecting each object to an “external” state which omits any implementation details. We also create a &lt;code&gt;refinementMapping&lt;/code&gt; function which maps implementation states to model states. The refinement mapping is also aware of the &lt;code&gt;isTimeout&lt;/code&gt; variable, and uses that to make sure we only elevate the implementation error to the model when it is prophesied.&lt;/p&gt;

&lt;p&gt;Now, we have a pattern for building property-based tests that can account for nondeterministic errors.&lt;/p&gt;

&lt;h1 id=&quot;a-brief-note-on-the-theory-of-prophecy-variables&quot;&gt;A Brief Note on The Theory of Prophecy Variables&lt;/h1&gt;

&lt;p&gt;Prophecy variables are much more powerful than simple stubs in example-based tests, but I can’t help but notice the practical similarity between them. They were introduced in the paper &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/030439759190224P&quot;&gt;The Existence of Refinement Mappings&lt;/a&gt; to solve the theoretical problem of proving refinement between specifications with nondeterminism. The paper showed that there are programs where proving the refinement of their specification is impossible due to nondeterminism. Not only do prophecy variables solve that problem, they also lead to a &lt;em&gt;complete&lt;/em&gt; solution. The main result in the paper is that we can find a suitable refinement mapping for &lt;em&gt;any&lt;/em&gt; program to any specification, as long as we are able to add history and prophecy variables to the refinement mapping in a way that doesn’t alter the observable behavior of either the program or the spec.&lt;/p&gt;

&lt;p&gt;That’s true of test doubles: they don’t alter the code under test, they just allow for specifying values ahead of time, which again is the key to dealing with nondeterminism in tests. Our usage of prophecy variables here differs slightly from the theoretical versions (we pass ours into the refinement mapping function rather than keeping the function as a pure mapping from implementation to model state, and we also use interfaces and stubs to modify the behavior rather than only limiting ourselves to state variables). Still, this departure is only surface-level, since we could map this all to the TLA+-style state framework if we wanted to. Using the idioms of the particular programming language we’re in makes for a more practical experience.&lt;/p&gt;

&lt;p&gt;For more info, there’s a deeper dive into the theory of refinement mapping in &lt;a href=&quot;/model-based-testing-theory/&quot;&gt;Efficient and Flexible Model-Based Testing&lt;/a&gt;. There’s a whole paper dedicated to prophecy variables in &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/simple.pdf&quot;&gt;Prophecy Made Simple&lt;/a&gt;. There’s also the fantastic blog post &lt;a href=&quot;https://surfingcomplexity.blog/2024/09/22/linearizability-refinement-prophecy/&quot;&gt;Linearizability! Refinement! Prophecy!&lt;/a&gt; that goes into a really detailed example of using prophecy variables to prove properties of nondeterministic queues.&lt;/p&gt;

&lt;h1 id=&quot;prophecy-aware-dependencies-and-modal-determinism&quot;&gt;Prophecy-Aware Dependencies and Modal Determinism&lt;/h1&gt;

&lt;p&gt;This will be the most ambitious part of the post. It begins with a statement: we should design our dependencies to be prophecy-aware.&lt;/p&gt;

&lt;p&gt;Dependencies are a double-edged sword, especially infrastructure dependencies like a database. On the one hand, we get an incredible amount of power and reliability that would be impossible to implement on our own. On the other, we lose control, and are beholden to extremely fine-grained semantics that, among other things, make holistic testing difficult. I greatly believe in integration testing, especially against something like a database, because of such semantics that our applications come to depend on. I wrote about this in &lt;a href=&quot;/txn-isolation-testing/&quot;&gt;Does Your Test Suite Account For Weak Transaction Isolation?&lt;/a&gt;. Things like transaction isolation ultimately affect the correctness of our applications, so their absence from most application test suites is an unfortunate blind spot.&lt;/p&gt;

&lt;p&gt;This absence is totally understandable though: testing for it is a pain, precisely due to the inability to control nondeterminism. To systems and infrastructure developers: please account for the testing of nondeterministic functionality in the design of your tools. All nondeterministic choices should be controllable via parameters. This allows nondeterminism to be used where necessary (and it often is necessary and not just a mistake, e.g. for performance or concurrency), while still being controllable in tests. There’s definitely an upswing in projects thinking about this up front, notable examples being FoundationDB and TigerBeetle. I don’t want to make light of it, because it can radically alter the design of a system. But, having controllable determinism will always be a good thing in my book.&lt;/p&gt;

&lt;p&gt;However, in the meantime, most of our dependencies are not prophecy-aware, so we do need an approach for handling them as-is. For this, I think our best bet is to create wrapper fakes which model a given dependency. These models will need to be nondeterministic, since the implementation is, but we can design them to also be prophecy-aware, and thus controllable in tests. Because such models have this dual behavior, I think of this as “modal determinism.”&lt;/p&gt;
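
&lt;p&gt;Here’s a sketch of what such a modally deterministic fake could look like (the names here are illustrative, not from a real library):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type Outcome = 'ok' | 'timeout';

// Nondeterministic by default, but a prophecy pins down the outcomes in tests.
class FakeNetwork {
  constructor(private prophecy: Outcome[] = []) {}

  send(_msg: string): Outcome {
    const forced = this.prophecy.shift();
    if (forced !== undefined) {
      return forced; // deterministic mode: the test prophesied this outcome
    }
    return Math.random() &amp;lt; 0.01 ? 'timeout' : 'ok'; // nondeterministic mode
  }
}

// In a test: new FakeNetwork(['timeout']) guarantees the first send times out.
&lt;/code&gt;&lt;/pre&gt;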

&lt;p&gt;Let’s continue with the example of transaction isolation in Postgres, and let’s say we’ve just discovered weak transaction isolation and the Read Committed isolation level. We start to home in on this being an issue, and we first write this test (against a real PG DB):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create test schema
create table txn_iso (ival int);
insert into txn_iso (ival) values(1);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as fc from 'fast-check';
import * as pg from 'pg';
import { Pool, PoolClient } from 'pg';

type Tuple = {column: string; value: any}[];

class Database {
    pool: pg.Pool;

    constructor() {
      this.pool = new Pool(/* connection info */);
    }

    async selectInClient(client: pg.PoolClient): Promise&amp;lt;Tuple[]&amp;gt; {
      const res = await client.query(`SELECT * FROM txn_iso`);
      return res.rows.map((row) =&amp;gt; {
        return [{ column: 'ival', value: row['ival'] }];
      });
    }

    async update(val: number)  {
      const client = await this.pool.connect();

      await this.updateInClient(client, val);

      client.release();
    }

    async updateInClient(client: pg.PoolClient, val: number) {
      return client.query(`UPDATE txn_iso SET ival = $1`, [val]);
    }
}

const genUpdateVal = () =&amp;gt; fc.integer({ min: 0, max: 10 });

const genTxnOrder = () =&amp;gt; fc.uniqueArray(fc.integer({ min: 0, max: 2 }), {minLength: 3, maxLength: 3});

const initialVal = 1;

describe('Database nondeterministic transaction reads', () =&amp;gt; {
  it('should return consistent reads', async () =&amp;gt; {
    let db: Database;
    let c1: PoolClient;
    let c2: PoolClient;
    await fc.assert(
      fc.asyncProperty(
        genUpdateVal(),
        genTxnOrder(),
        async (val, txnOrder) =&amp;gt; {
          db  = new Database();

          c1 = await db.pool.connect();
          c2 = await db.pool.connect(); 

          await c1.query('BEGIN');
          await c2.query('BEGIN');
          const prevRead = await db.selectInClient(c2);
          await db.updateInClient(c1, val);

          const operations = [c1.query('COMMIT'), c2.query('COMMIT'), db.selectInClient(c2)];
          let orderedOperations = [];
          let readIdx = txnOrder[2];
          for (let i = 0; i &amp;lt; txnOrder.length; i++) {
            orderedOperations[txnOrder[i]] = operations[i];
          }

          const results = await Promise.allSettled(orderedOperations);
          const read = results[readIdx];
          
          if (read.status === 'fulfilled') {
            expect(read.value).toEqual(prevRead);
          } else {
            fail('Read failed');
          }
      }).afterEach(async () =&amp;gt; {
        await db.update(initialVal);

        c1.release();
        c2.release();

        await db.pool.end();
      }),
      { endOnFailure: true, numRuns: 100}
    )
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This test creates two DB connections, one of which updates a value in the &lt;code&gt;txn_iso&lt;/code&gt; table, and another which reads it multiple times. We expect that the multiple reads return the same value, but they don’t. We also randomize the order of the commits of the transactions to exacerbate the issue, but even without that the test will fail nondeterministically.&lt;/p&gt;

&lt;p&gt;This is complex and surprising behavior, and we want to build a model of it so that we can deterministically control it in our application tests to get more realistic coverage. The key is recognizing that the model has to support this nondeterminism by returning &lt;em&gt;multiple&lt;/em&gt; possible values for select statements instead of just a single one. We can then create a model-based test that allows for any of the possible values to be returned in the implementation. This draws inspiration from the &lt;a href=&quot;https://trustworthy.systems/publications/nicta_full_text/3087.pdf&quot;&gt;nondeterministic seL4 specification&lt;/a&gt;, which defines nondeterminism as transitioning between multiple allowable states.&lt;/p&gt;

&lt;p&gt;We create the following model:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type Tuple = {column: string; value: any}[];
type Relation = { name: string, data: Tuple[] };
type Transaction = {id: number, isDirty: boolean, prev: Relation[], next: Relation[]};

type DBState = {
  relations: Relation[];
  transactions: Transaction[];
};

class DBModelNondet {
  state: DBState[] = [];

  select(txnId: number, relation: string): Tuple[][] {
    return this.state.map((s) =&amp;gt; {
      const dirtyTxn = s.transactions.find((txn) =&amp;gt; txn.id === txnId &amp;amp;&amp;amp; txn.isDirty);
      if (dirtyTxn) {
        return dirtyTxn.next.find((rel) =&amp;gt; rel.name === relation)?.data ?? [];
      }

      return s.relations.find((rel) =&amp;gt; rel.name === relation)?.data ?? []
    });
  }    
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We model the database (&lt;code&gt;DBState&lt;/code&gt;) as a list of &lt;code&gt;Relations&lt;/code&gt;, where each &lt;code&gt;Relation&lt;/code&gt; is itself a list of &lt;code&gt;Tuples&lt;/code&gt;. We also model transactions as having an id, a previous list of relations, a next list of relations, as well as an &lt;code&gt;isDirty&lt;/code&gt; flag which signals whether or not the transaction has written any data at this point in time. The &lt;code&gt;prev&lt;/code&gt; list of relations tracks the snapshot of the DB state when the transaction was started, and &lt;code&gt;next&lt;/code&gt; tracks the current state including any transaction-local modifications that haven’t been committed yet.&lt;/p&gt;

&lt;p&gt;We then store an &lt;em&gt;array&lt;/em&gt; of these &lt;code&gt;DBStates&lt;/code&gt;, not just a single one. Because the database hides a nondeterministic choice from us (the order in which concurrent connections are scheduled), we have to support multiple initial starting states in the model. This allows us to handle both cases of the race condition here: where connection &lt;code&gt;c1&lt;/code&gt; commits either before or after the second read in &lt;code&gt;c2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then, we write a simplified &lt;code&gt;select&lt;/code&gt; model that executes within a specified transaction and returns all rows of a particular &lt;code&gt;relation&lt;/code&gt;. For each current &lt;code&gt;state&lt;/code&gt;, the select either returns tuples that have been modified in an in-progress transaction, or falls back to the committed state if the transaction hasn’t modified anything. Because there can be multiple &lt;code&gt;states&lt;/code&gt;, &lt;code&gt;select&lt;/code&gt; is also nondeterministic and returns a list of &lt;code&gt;Tuple&lt;/code&gt; lists.&lt;/p&gt;

&lt;p&gt;This surprisingly simple model accurately captures non-repeatable reads. We can write a test to ensure that it supports the nondeterminism caught in the previous test:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;describe('Database reads nondet model', () =&amp;gt; {
  it('should return any of a set of allowable reads', async () =&amp;gt; {
    let db: Database;
    let c1: PoolClient;
    let c2: PoolClient;

    await fc.assert(
      fc.asyncProperty(
        genUpdateVal(),
        genTxnOrder(),
        async (val, txnOrder) =&amp;gt; {
          db  = new Database();
          const model = new DBModelNondet();

          c1 = await db.pool.connect();
          c2 = await db.pool.connect(); 

          await c1.query('BEGIN');
          await c2.query('BEGIN');
          await db.updateInClient(c1, val);

          model.state = [
            // State 1: write transaction has not been committed yet.
            {
              relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
              transactions: [
                { 
                  id: 1,
                  // The write transaction has uncommitted (dirty) data at this point
                  isDirty: true,
                  prev: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
                  next: [{ name: 'txn_iso', data: [[{ column: 'ival', value: val }]] }] 
                },
                {
                  id: 2,
                  isDirty: false,
                  prev: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
                  next: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }] 
                },
              ]
            },

            // State 2: write transaction has been committed
            {
              relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: val }]] }],
              transactions: [
                {
                  id: 2,
                  isDirty: false,
                  prev: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
                  next: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }]
                },
              ]
            }
          ];

          const operations = [c1.query('COMMIT'), c2.query('COMMIT'), db.selectInClient(c2)];
          let orderedOperations = [];
          let readIdx = txnOrder[2];
          for (let i = 0; i &amp;lt; txnOrder.length; i++) {
            orderedOperations[txnOrder[i]] = operations[i];
          }

          const results = await Promise.allSettled(orderedOperations);
          const modelResults = model.select(2, 'txn_iso');
          const read = results[readIdx];

          if (read.status === 'fulfilled') {
            // Check that DB state matches ANY model state
            expect(modelResults).toContainEqual(read.value);
          } else {
            fail('Read failed');
          }
      }).afterEach(async () =&amp;gt; {
        await db.update(initialVal);

        c1.release();
        c2.release();

        await db.pool.end();
      }),
      { endOnFailure: true, numRuns: 100}
    )
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The race condition is explicitly modeled here in how we initialize &lt;code&gt;model.state&lt;/code&gt;. Zooming in, the second state shows the state of the world after the write transaction has been committed: the newly written value (&lt;code&gt;val&lt;/code&gt;) appears in the committed &lt;code&gt;relations&lt;/code&gt; state, and there’s only one open transaction, which hasn’t modified any data:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;{
  relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: val }]] }],
  transactions: [
    {
      id: 2,
      isDirty: false,
      prev: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
      next: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }]
    },
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The other state has both transactions open, and the new value has not yet been written. Running this test passes against the test PG instance. We’ve accurately modeled the nondeterminism.&lt;/p&gt;

&lt;p&gt;This is great, but it doesn’t yet help us in our application tests. For those, we need to pick exactly which value the model returns. Because we know that &lt;code&gt;select&lt;/code&gt; returns one result set for each nondeterministic initial state it’s configured with, we can accept a prophecy variable that picks a single one:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class DBModelProphecy {
    modelNondet: DBModelNondet = new DBModelNondet();

    // The prophecy picks which of the allowable nondeterministic results
    // this select deterministically returns.
    select(txnId: number, relation: string, initialStateProphecy: number): Tuple[] {
        return this.modelNondet.select(txnId, relation)[initialStateProphecy];
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This allows a test to use the nondeterministic model in a deterministic “mode,” which makes sure that the application handles both cases correctly, or forces an implementation change when it doesn’t.&lt;/p&gt;
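
&lt;p&gt;For example, here’s a sketch of how an application test might drive the prophecy with fast-check (reusing &lt;code&gt;DBModelProphecy&lt;/code&gt; from above; the states and assertion are illustrative). Each individual run is fully deterministic, while the property as a whole still covers every allowable outcome of the race:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;describe('Application logic under non-repeatable reads', () =&amp;gt; {
  it('handles every allowable read result', () =&amp;gt; {
    fc.assert(
      fc.property(fc.integer({ min: 0, max: 1 }), (initialStateProphecy) =&amp;gt; {
        const db = new DBModelProphecy();
        // Configure the two allowable states from the race: the concurrent
        // write has either not committed yet (ival = 1) or committed (ival = 5).
        db.modelNondet.state = [
          { relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: 1 }]] }], transactions: [] },
          { relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: 5 }]] }], transactions: [] },
        ];

        // The prophecy deterministically picks which allowable value is read.
        const read = db.select(2, 'txn_iso', initialStateProphecy);

        // The application code under test would consume `read` here; we'd
        // assert that it behaves correctly for whichever value was chosen.
        expect([1, 5]).toContain(read[0][0].value);
      }),
      { numRuns: 20 }
    );
  });
});
&lt;/code&gt;&lt;/pre&gt;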

&lt;h1 id=&quot;in-closing&quot;&gt;In Closing&lt;/h1&gt;

&lt;p&gt;Nondeterminism has been a major thorn in my side when writing model-based tests for real applications. I think prophecy variables as presented here provide a clear pattern for dealing with it. There’s a lot more to build out to have a production-grade model of a database like Postgres, but it’s encouraging to see that the idea does work in principle. It’s also really nice that the same technique applies to everything from testing timeouts to testing transaction isolation levels.&lt;/p&gt;

&lt;p&gt;This all started from talking about the difficulty of property-based testing nondeterministic dependencies on &lt;a href=&quot;https://lobste.rs&quot;&gt;lobste.rs&lt;/a&gt; with Stevan, the author of &lt;a href=&quot;https://stevana.github.io/the_sad_state_of_property-based_testing_libraries.html&quot;&gt;The sad state of property-based testing libraries&lt;/a&gt;. I appreciate their views on the topic; you should read that post as well.&lt;/p&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="refinement" /><summary type="html">We have to constantly wrestle with nondeterminism in tests. Model-based tests present unique challenges in dealing with it, since the model must support the implementation’s nondeterministic behavior without leading to flaky failures. In traditional example-based tests, nondeterminism is often controlled by adding stubs, but it’s not immediately clear how to apply this in a model-based context where tests are generated. We’ll look to the theory of refinement mappings for a solution.</summary></entry><entry><title type="html">Does Your Test Suite Account For Weak Transaction Isolation?</title><link href="/txn-isolation-testing/" rel="alternate" type="text/html" title="Does Your Test Suite Account For Weak Transaction Isolation?" /><published>2023-12-31T00:00:00+00:00</published><updated>2023-12-31T00:00:00+00:00</updated><id>/txn-isolation-testing</id><content type="html" xml:base="/txn-isolation-testing/">&lt;p&gt;Transaction isolation is the kind of thing that you learn about and it fills you with fear. Specifically, there are &lt;em&gt;weak&lt;/em&gt; transaction isolation levels which allow some fairly unexpected behavior. Tools like Jepsen are used to test the general isolation guarantees of databases, but it’s pretty uncommon to check the application layer for issues related to isolation anomalies. These anomalies can impact actual domain logic, so it’s important to understand them as well as how we can test them.&lt;/p&gt;

&lt;h1 id=&quot;what-is-weak-transaction-isolation&quot;&gt;What is Weak Transaction Isolation?&lt;/h1&gt;

&lt;p&gt;Transaction isolation means that concurrent transactions against a database will be independent of one another. It’s the “I” in ACID. Unfortunately, “independence” in this context is a spectrum, and there are actually different isolation levels that are supported, each with subtly different behavior.&lt;/p&gt;

&lt;p&gt;Here’s a quick example script which makes concurrent queries against a database (Postgres):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;create table txn_iso (ival int);
insert into txn_iso (ival) values(1);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import { Pool, Transaction } from 'https://deno.land/x/postgres/mod.ts';

const pool = new Pool({
  user: 'postgres',
  hostname: 'localhost',
  database: 'postgres',
  port: 5433,
  password: 'test1234',
}, 10);

async function runQuery(
  txn: Transaction,
  query: string,
  args: (string | number)[],
  beforeMsg: string,
  afterMsg: (result: any) =&amp;gt; string
) {
  console.log(beforeMsg);
  const result = await txn.queryObject(query, args);
  console.log(afterMsg(result));
}

async function readTransaction() {
  const query = 'select ival from txn_iso';
  const printResult = (result: any) =&amp;gt; `Read result: ${result.rows[0].ival}`;

  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  await runQuery(txn, query, [], 'Executing first read...',  printResult);

  // Wait for concurrent write to occur
  await new Promise(resolve =&amp;gt; setTimeout(resolve, 2000));

  await runQuery(txn, query, [], 'Executing second read...', printResult);

  await txn.commit();

  await client.release();
}

async function writeTransaction() {
  await new Promise(resolve =&amp;gt; setTimeout(resolve, 1000));
  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  const updateVal = Math.floor(Math.random() * 1000);
  const updateMsgBefore = `Updating ival to ${updateVal}...`;
  const query = 'update txn_iso set ival = $1';
  await runQuery(txn, query, [updateVal], updateMsgBefore, () =&amp;gt; 'ival updated');

  await txn.commit();

  await client.release();
}

await Promise.allSettled([readTransaction(), writeTransaction()]);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This script executes two transactions concurrently: one that reads the &lt;code&gt;txn_iso.ival&lt;/code&gt; column twice, and another which modifies the value of that column. There are some sleeps sprinkled in so that the second read occurs after the write. The question is: do both reads return the same value?&lt;/p&gt;

&lt;p&gt;In Postgres, with the default transaction isolation level, the answer is surprisingly no. Here’s an example output of running the script:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;Executing first read...
Read result: 839
Updating ival to 79...
ival updated
Executing second read...
Read result: 79
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first read will return the value of the column at the time that the read transaction begins, but the second read will return the value that was updated by the concurrent write transaction. That’s because the default level is Read Committed, which allows non-repeatable reads. A non-repeatable read means that in the span of the same transaction, queries to the same column may return different results! This isn’t unique to Postgres either - Read Committed is the default isolation level in Oracle and SQL Server as well&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This is surprising to a lot of people, and rightfully so, since it seems to go against the very definition of what a transaction is. But that’s because Read Committed is a &lt;em&gt;weak&lt;/em&gt; transaction isolation level. Weak isolation means that transactions aren’t truly independent from one another, and the effects of one concurrent transaction can be seen in another. There are four isolation levels defined by the ANSI SQL standard&lt;sup id=&quot;fnref:fn2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. All but Serializable, which is the strictest, are weak and allow some kind of interference between transactions.&lt;/p&gt;
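
&lt;p&gt;As an aside, Postgres lets us raise the level per transaction. If the reading transaction in the script above were started like this, both reads would see the same snapshot:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Run the reading transaction at a stronger isolation level. Under
-- REPEATABLE READ, both selects see the same snapshot, even if a
-- concurrent write commits in between.
begin isolation level repeatable read;
select ival from txn_iso;
-- (concurrent write commits here)
select ival from txn_iso; -- returns the same value as the first read
commit;
&lt;/code&gt;&lt;/pre&gt;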

&lt;p&gt;Hopefully it’s clear why this is an issue. If you have an important column value, say a user’s account balance, you might query multiple different values in the same transaction, which will surely result in a domain logic bug. Will our test suites catch such bugs? That depends on how we set up the tests.&lt;/p&gt;

&lt;h1 id=&quot;simulating-concurrent-connections&quot;&gt;Simulating Concurrent Connections&lt;/h1&gt;

&lt;p&gt;The difficulty with coming up with tests that expose transaction isolation anomalies is that the test has to simulate multiple concurrent connections. Test cases almost always have the implicit assumption that they’re being executed by a single user, and isolation anomalies don’t show up in that scenario.&lt;/p&gt;

&lt;p&gt;As an example, here’s some oversimplified code for making outbound transfers from an account with overdraft protection:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;interface BalanceRepository {
  getBalance(txn: Transaction): Promise&amp;lt;number&amp;gt;;
  updateBalance(txn: Transaction, amount: number): Promise&amp;lt;void&amp;gt;;
}

async function checkOverdraftProtection(txn: Transaction, balanceRepo: BalanceRepository, amount: number) {
  const balance =  await balanceRepo.getBalance(txn);
  if (balance &amp;gt;= amount) {
    return;
  }

  await balanceRepo.updateBalance(txn, balance + 100);
}

async function applyFundTransfer(txn: Transaction, balanceRepo: BalanceRepository, amount: number) {
  const balance = await balanceRepo.getBalance(txn);
  if (balance &amp;lt; amount) {
    console.error(&quot;Insufficient funds&quot;);
    return;
  }

  await balanceRepo.updateBalance(txn, balance - amount);
}

async function transferFunds(balanceRepo: BalanceRepository, amount: number) {
  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  await checkOverdraftProtection(txn, balanceRepo, amount);
  await applyFundTransfer(txn, balanceRepo, amount);
  
  await txn.commit();

  await client.release();
}

const protectedBalanceRepo = {
  balance: 90,
  async getBalance(txn: Transaction) {
    return this.balance;
  },
  async updateBalance(txn: Transaction, amount: number) {
    this.balance = amount;
  }
}

await transferFunds(protectedBalanceRepo, 100);
console.assert(protectedBalanceRepo.balance === 90);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The main logic that we want to test is that overdraft protection adds additional funds when there’s not enough to cover a transfer, and that the final balance is correct. To test this, we’re placing all queries behind a &lt;code&gt;BalanceRepository&lt;/code&gt; interface and creating a &lt;code&gt;protectedBalanceRepo&lt;/code&gt; which starts out with insufficient funds but updates the balance based on overdraft protection.&lt;/p&gt;

&lt;p&gt;This is the operation from the perspective of a single user and thus a single DB connection, so the insufficient funds error won’t get hit. As we saw with the Read Committed example though, another concurrent transaction can affect a value that’s read multiple times. So one way to simulate a concurrent transaction is to simply ignore the overdraft protection and specify a different balance result directly.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;const concurrentWriteRepo = {
  currBalance: 0,
  balances: [100, 90],
  async getBalance(txn: Transaction) {
    return this.balances[this.currBalance++];
  },
  async updateBalance(txn: Transaction, amount: number) {
  }
}

...

await transferFunds(concurrentWriteRepo, 100);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This test double sets up two different balance results: it’ll first return 100, which will bypass overdraft protection, but the next balance check will return 90 which will result in an insufficient funds error. One way this would be possible in real life is if multiple people have access to the same account and initiate a transfer in close proximity to one another.&lt;/p&gt;

&lt;p&gt;There’s a simple fix for this failure: just don’t read the balance multiple times, and instead pass in the sampled balance to any function that needs it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;async function transferFunds(balanceRepo: BalanceRepository, amount: number) {
  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  const balance = await balanceRepo.getBalance(txn);
  const protectedBalance = await checkOverdraftProtection(txn, balance, balanceRepo, amount);
  await applyFundTransfer(txn, protectedBalance, balanceRepo, amount);
  
  await txn.commit();

  await client.release();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now the logic of &lt;code&gt;checkOverdraftProtection&lt;/code&gt; and &lt;code&gt;applyFundTransfer&lt;/code&gt; can be changed to take a balance value instead of querying it. This also means that &lt;code&gt;checkOverdraftProtection&lt;/code&gt; has to return the balance after protection is applied, since &lt;code&gt;applyFundTransfer&lt;/code&gt; used to get this value with the second balance query, and using the pre-protection balance would result in an insufficient funds error.&lt;/p&gt;
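
&lt;p&gt;Sketched out, the updated functions might look like this (one plausible version; the exact signatures just need to thread the sampled balance through):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Takes the already-sampled balance and returns the post-protection balance
// instead of re-querying it.
async function checkOverdraftProtection(
  txn: Transaction,
  balance: number,
  balanceRepo: BalanceRepository,
  amount: number
): Promise&amp;lt;number&amp;gt; {
  if (balance &amp;gt;= amount) {
    return balance;
  }

  const protectedBalance = balance + 100;
  await balanceRepo.updateBalance(txn, protectedBalance);
  return protectedBalance;
}

// Uses the passed-in balance rather than issuing a second read.
async function applyFundTransfer(
  txn: Transaction,
  balance: number,
  balanceRepo: BalanceRepository,
  amount: number
) {
  if (balance &amp;lt; amount) {
    console.error('Insufficient funds');
    return;
  }

  await balanceRepo.updateBalance(txn, balance - amount);
}
&lt;/code&gt;&lt;/pre&gt;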

&lt;p&gt;This solves the repeatable read anomaly by avoiding multiple reads, but there’s still a major issue: there’s a race condition between multiple concurrent transactions that can result in an incorrect balance.&lt;/p&gt;

&lt;h1 id=&quot;race-conditions-and-serializability&quot;&gt;Race Conditions and Serializability&lt;/h1&gt;

&lt;p&gt;To show the error, we can execute two fund transfers concurrently against the actual DB, and we can introduce a write delay so that we can control which one writes last (we’d see errors even without this, but this reduces the non-determinism):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;create table accounts (balance int);
insert into accounts (balance) values (100);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;async function transferFunds(balanceRepo: BalanceRepository, amount: number, writeDelay?: number) {
  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  const balance = await balanceRepo.getBalance(txn);

  if (writeDelay) {
    await new Promise(resolve =&amp;gt; setTimeout(resolve, writeDelay));
  }

  const protectedBalance = await checkOverdraftProtection(txn, balance, balanceRepo, amount);
  await applyFundTransfer(txn, protectedBalance, balanceRepo, amount);
  
  await txn.commit();

  await client.release();
}

const postgresBalanceRepo = {
  async getBalance(txn: Transaction): Promise&amp;lt;number&amp;gt; {
    const result = await txn.queryObject('select balance from accounts');
    return result.rows[0].balance;
  },
  async updateBalance(txn: Transaction, amount: number) {
    await txn.queryObject('update accounts set balance = $1', [amount]);
  }
}

async function runInTransaction&amp;lt;T&amp;gt;(f: (txn: Transaction) =&amp;gt; Promise&amp;lt;T&amp;gt;) {
  const conn = await pool.connect()
  const txn = conn.createTransaction()
  await txn.begin();

  const result = await f(txn);

  await txn.commit();
  await conn.release();

  return result
}

// Setup
await runInTransaction((txn) =&amp;gt; {
  return postgresBalanceRepo.updateBalance(txn, 100);
});

// Run two concurrent fund transfers
await Promise.allSettled([transferFunds(postgresBalanceRepo, 80, 2000), transferFunds(postgresBalanceRepo, 60)]);

const balance = await runInTransaction((txn) =&amp;gt; {
  return postgresBalanceRepo.getBalance(txn)
});

console.assert(balance === 60);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When this is run, the transfers will see the same initial balance (100), but the one with the write delay will overwrite the balance set in the other one. This also means that neither transfer will trigger overdraft protection, there will be no insufficient funds error, and the resulting balance will be an incorrect value of 20.&lt;/p&gt;

&lt;p&gt;This is a &lt;em&gt;serialization anomaly&lt;/em&gt;. The Postgres docs define this as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are only two possible orderings of the two fund transfers here:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Transfer 80, then transfer 60&lt;/li&gt;
  &lt;li&gt;Transfer 60, then transfer 80&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since the starting balance is 100, both of these cases should trigger overdraft protection on the second transfer, and the resulting balance in both cases should be 60 (100 + 100 - 80 - 60). Race conditions can exist when transactions don’t adhere to serializability, and that’s what’s going on here - two fund transfers are initiated, but only one is accounted for because of a concurrent race. This is known as the “lost update” problem.&lt;/p&gt;

&lt;p&gt;There are a few different ways to fix this, but the simplest is to lock the row for the duration of the transaction with &lt;code&gt;FOR UPDATE&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;const postgresBalanceRepo = {
  async getBalance(txn: Transaction): Promise&amp;lt;number&amp;gt; {
    const result = await txn.queryObject('select balance from accounts FOR UPDATE');
    return result.rows[0].balance;
  },
  ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Again from the Postgres docs:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;FOR UPDATE causes the rows retrieved by the SELECT statement to be locked as though for update. This prevents them from being locked, modified or deleted by other transactions until the current transaction ends.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that each transaction will grab a lock on the &lt;code&gt;accounts&lt;/code&gt; row and that will block all other transactions from modifying that row until it’s complete, i.e. the transactions will execute in a serializable fashion. It’s worth noting that this is now slower. Without the lock the transactions could truly operate concurrently, but now they have to wait in contention over balance updates to the same account. This is necessary for correct behavior, but it’s worth understanding the tradeoff.&lt;/p&gt;
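
&lt;p&gt;For completeness, another of those ways is to run the transfers at the Serializable level and let Postgres detect the conflict instead of locking up front (a sketch):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Alternative fix: let Postgres detect the conflict instead of locking.
-- One of the two concurrent transfers will fail with a serialization
-- error (SQLSTATE 40001) and must be retried by the application.
begin isolation level serializable;
-- ... fund transfer queries ...
commit;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The tradeoff flips: there’s no blocking, but the application now needs retry logic for serialization failures.&lt;/p&gt;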

&lt;p&gt;Also of note, test doubles alone won’t help with this bug because the fix is in the real Postgres repository implementation. Test doubles are useful for testing application code independent of the database in many cases, but transaction isolation is a case where the different levels are so subtly different that it’s simpler to test against the real thing.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Transaction isolation has a major impact on an application, both in terms of performance as well as its influence on domain logic. It’s important to integration test against the real database to make sure that a weak transaction isolation level isn’t the cause of concurrency bugs. To expose such bugs, we have to execute at least two concurrent transactions in a test case. Unfortunately this can require some amount of time-based coordination which is never ideal, but is often necessary when tools like databases have opaque non-deterministic behavior that’s out of our control.&lt;/p&gt;

&lt;p&gt;Still, it is something we can and should test for at the application level.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Interestingly, MySQL’s default isolation level is Repeatable Read, which avoids the bug presented here. Still, it’s very uncommon for any DB to have a default level of Serializable, so most databases are operating with weak isolation. A notable exception is &lt;a href=&quot;https://www.google.com/search?q=foundationdb&amp;amp;rlz=1C5GCEM_enUS1058US1060&amp;amp;oq=foundationdb&amp;amp;gs_lcrp=EgZjaHJvbWUqBggAEEUYOzIGCAAQRRg7MgYIARBFGDkyBggCEEUYOzIGCAMQRRg7MgYIBBBFGDwyBggFEEUYPDIGCAYQRRg8MgYIBxBFGDzSAQc2OTlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&quot;&gt;FoundationDB&lt;/a&gt;, which does default to Serializable. &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Postgres actually only implements 3 out of 4, because Read Uncommitted conflicts with Postgres’ MVCC implementation. For more detail on isolation levels as Postgres implements them, see: &lt;a href=&quot;https://www.postgresql.org/docs/current/transaction-iso.html&quot;&gt;https://www.postgresql.org/docs/current/transaction-iso.html&lt;/a&gt; &lt;a href=&quot;#fnref:fn2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="databases" /><summary type="html">Transaction isolation is the kind of thing that you learn about and it fills you with fear. Specifically, there are weak transaction isolation levels which allow some fairly unexpected behavior. Tools like Jepsen are used to test the general isolation guarantees of databases, but it’s pretty uncommon to check the application layer for issues related to isolation anomalies. These anomalies can impact actual domain logic, so it’s important to understand them as well as how we can test them.</summary></entry><entry><title type="html">Forward and Backward Reasoning in Proof Assistants</title><link href="/proof-assistants-direction/" rel="alternate" type="text/html" title="Forward and Backward Reasoning in Proof Assistants" /><published>2023-10-01T00:00:00+00:00</published><updated>2023-10-01T00:00:00+00:00</updated><id>/proof-assistants-direction</id><content type="html" xml:base="/proof-assistants-direction/">&lt;p&gt;Proof assistants are really fascinating tools, but the learning curve can be extremely steep. If you’re a programmer by trade and not a mathematician, this curve can be even steeper, because it’s not like programmers are doling out proofs left and right at work. One particular sticking point that I had trouble overcoming is the difference between forward vs. backward reasoning - proof assistants support both.&lt;/p&gt;

&lt;h1 id=&quot;forward-reasoning&quot;&gt;Forward Reasoning&lt;/h1&gt;

&lt;p&gt;When thinking about logic, we generally think about forward arguments which get built up from one statement to the next, in sequence. For example, let’s make a logical argument about monitoring. We want to get an alert when our app goes down, and one way we know that the app is down is when a test user can’t login and see the home page. The way to express that in logic is to say that the home page not loading implies that the app is down:&lt;/p&gt;

\[HomePageDoesntLoad \implies AppIsDown\]

&lt;p&gt;Implication is a useful thing to know, but it only tells us about the overall relationship and doesn’t tell us whether the app is down right now or not. We want to know the current state so we can determine if we should page someone, and for that we can use one of the oldest rules in all of logic: modus ponens.&lt;/p&gt;

&lt;p&gt;Modus ponens is also known as “implication elimination,” which more accurately describes its behavior. It allows us to infer something about an implication, but the conclusion no longer contains one - the implication gets eliminated:&lt;/p&gt;

\[\dfrac{P~~~~~~~P \implies Q}{ Q }\]

&lt;p&gt;This is written out as an inference rule, which in this case means that if we know P is true, and we know that P implies Q, then we can infer that Q is also true. On top of the bar are the premises, and on the bottom is the conclusion which we can infer if the premises are true. The reason that this rule is so old is that it’s just a formal description of common sense - if P implies Q, and we know P is true, &lt;em&gt;of course&lt;/em&gt; Q is true. That’s what implies means.&lt;/p&gt;

&lt;p&gt;In our monitoring context, we can take P to be “the home page doesn’t load” and Q to be “the app is down,” and by this rule we can conclude that the app is down if we actually observe the home page being unable to load. This is a forward argument - when an inference rule is taken from top to bottom.&lt;/p&gt;

&lt;p&gt;Proof assistants almost always support forward reasoning. One way to do this in Isabelle is with the &lt;code&gt;frule&lt;/code&gt; tactic:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes HomePageDoesntLoad 
    and &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
  shows &quot;AppIsDown&quot;
  using assms
  by (frule_tac P=HomePageDoesntLoad and Q=&quot;AppIsDown&quot; in mp)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;mp&lt;/code&gt; is the rule for modus ponens, which is defined like this&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes &quot;P ⟶ Q&quot;
    and P
  shows Q
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;frule_tac&lt;/code&gt; allows us to take a forward logical step if the premises are shown to be true. Since they’re assumed here, they are true, and we prove &lt;code&gt;AppIsDown&lt;/code&gt; in one step.&lt;/p&gt;

&lt;h1 id=&quot;backward-reasoning&quot;&gt;Backward Reasoning&lt;/h1&gt;

&lt;p&gt;Proof assistants also allow us to work backwards from a goal.&lt;/p&gt;

&lt;p&gt;Let’s take a look at a backward proof of our lemma:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes hp_load: HomePageDoesntLoad 
    and imp_appdown: &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
  shows &quot;AppIsDown&quot;
  apply(rule_tac P=HomePageDoesntLoad and Q=AppIsDown in mp)
  using imp_appdown
    apply(assumption)
  using hp_load
    apply(assumption)
  done
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, instead of &lt;code&gt;frule_tac&lt;/code&gt;, we use &lt;code&gt;rule_tac&lt;/code&gt;, which applies a rule in a backward fashion. Instead of going from top to bottom in the rule, we replace the current proof goal with the premises in the top of the rule. This allows us to prove each one separately, which is one of the main benefits of backward rule application: we can more easily divide and conquer a complicated proof.&lt;/p&gt;

&lt;p&gt;It works because an inference rule can be interpreted in two ways. As we said, the forward interpretation is: “we can conclude the bottom if the top premises are true.” The backward interpretation is: “to prove the bottom, it suffices to prove the top premises.” These are logically equivalent.&lt;/p&gt;

&lt;p&gt;To dive in a bit more, we can look at the proof state after each step in the proof above. At the beginning of the proof, the goal is simply the final conclusion we want to show:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes hp_load: HomePageDoesntLoad 
    and imp_appdown: &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
  shows &quot;AppIsDown&quot;

goal (1 subgoal):
 1. AppIsDown 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we apply modus ponens backwards:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;apply(rule_tac P=HomePageDoesntLoad and Q=AppIsDown in mp)

goal (2 subgoals):
 1. HomePageDoesntLoad ⟶ AppIsDown
 2. HomePageDoesntLoad
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Instead of having to show &lt;code&gt;AppIsDown&lt;/code&gt; directly, we now just have to show that &lt;code&gt;HomePageDoesntLoad ⟶ AppIsDown&lt;/code&gt; and &lt;code&gt;HomePageDoesntLoad&lt;/code&gt;. In a real proof, we’d have to figure out how to prove these independently, but here both of these are true by assumption so the rest of the proof just pulls in the appropriate one and applies it.&lt;/p&gt;

&lt;h1 id=&quot;which-ones-better&quot;&gt;Which One’s Better?&lt;/h1&gt;

&lt;p&gt;The unfortunate answer is that there’s no preferred direction, and we’ll often want to use both. We can also use higher-level and more powerful tactics anyway, which abstract the underlying reasoning. This monitoring example is very trivial, and can be proven in Isabelle with a variety of one liners, like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes HomePageDoesntLoad 
    and &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
  shows &quot;AppIsDown&quot;
  by (auto simp: assms)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Backward reasoning often seems more natural, but this is likely because of the history of proof assistants: they were pretty much designed around backward reasoning and interactivity from the start. The line gets blurred with more recent developments like Isar, which is an Isabelle sublanguage for defining structured proofs. In Isar, individual steps might be proven in a backwards fashion, but the proof proceeds in a structured and forward manner. Isar proofs are almost always preferred because they more closely resemble pen-and-paper proofs, and bring the very relevant intermediate proof state to the foreground.&lt;/p&gt;

&lt;p&gt;Here’s one for the monitoring example:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes imp_appdown: &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
    and hp_load: HomePageDoesntLoad
  shows &quot;AppIsDown&quot;
proof (rule mp[where P=HomePageDoesntLoad and Q=AppIsDown])
  from imp_appdown show &quot;HomePageDoesntLoad ⟶ AppIsDown&quot; by assumption
  from hp_load show HomePageDoesntLoad by assumption
qed 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This pretty closely mirrors the backward proof from before, and that’s because the structure of the proof is based on the backward application of &lt;code&gt;mp&lt;/code&gt; by choosing &lt;code&gt;rule&lt;/code&gt; and not &lt;code&gt;frule&lt;/code&gt; in the &lt;code&gt;proof&lt;/code&gt; command. But now the intermediate goals are visible, which gives the proof more structure. This is especially helpful for more complicated goals that can’t be proven in a single step because each goal can be respectively built up via intermediate steps.&lt;/p&gt;

&lt;p&gt;All this to say: the logical direction often changes throughout a proof in a proof assistant, and the same rules can be used both forwards and backwards. Knowing which direction is being used is crucial for understanding our proofs.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It’s actually defined as an axiom, which means it’s implicitly taken to be true, and it also uses the older-style Isabelle syntax which lists assumptions in brackets: &lt;code&gt;&quot;⟦P ⟶ Q; P⟧ ⟹ Q&quot;&lt;/code&gt;. But this is equivalent to the &lt;code&gt;assumes ... shows ...&lt;/code&gt; syntax being used here. &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="formal_methods" /><summary type="html">Proof assistants are really fascinating tools, but the learning curve can be extremely steep. If you’re a programmer by trade and not a mathematician, this curve can be even steeper, because it’s not like programmers are doling out proofs left and right at work. One particular sticking point that I had trouble overcoming is the difference between forward vs. backward reasoning - proof assistants support both.</summary></entry><entry><title type="html">Compiling a Test Suite</title><link href="/test-compilation/" rel="alternate" type="text/html" title="Compiling a Test Suite" /><published>2023-08-23T00:00:00+00:00</published><updated>2023-08-23T00:00:00+00:00</updated><id>/test-compilation</id><content type="html" xml:base="/test-compilation/">&lt;p&gt;When I first stumbled upon certifying compilation&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, I was absolutely awestruck. I thought a compiler was a very specific thing, a translator from source to target language. But a certifying compiler goes further: it also proves its own correctness. My motto has become &lt;a href=&quot;/generated-tests/&quot;&gt;“most tests should be generated”&lt;/a&gt;, so this immediately seemed like a promising approach to my goal of improving the generative testing of interactive applications. It wasn’t immediately clear how exactly to incorporate this into that context, but after a little experimentation I now have a prototype of what it might look like.&lt;/p&gt;

&lt;p&gt;First, rather than describe the theory, let me show you what the workflow of certifying compilation looks like. Imagine invoking a command like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;certc source.cc -o myprogram -p proof
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;certc&lt;/code&gt; compiles the source file into an executable, like every other compiler, but in addition it outputs this &lt;code&gt;proof&lt;/code&gt; file. Imagine that you can open up this file, and from its contents be convinced that the compilation run contained zero bugs, and the output &lt;code&gt;myprogram&lt;/code&gt; is a perfect translation of &lt;code&gt;source.cc&lt;/code&gt;&lt;sup id=&quot;fnref:fn2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. The compilation run is &lt;em&gt;certified&lt;/em&gt; by this proof. Such compilers are sometimes referred to as &lt;em&gt;self-certifying&lt;/em&gt; for this reason - they produce their own proof of correctness.&lt;/p&gt;

&lt;p&gt;We know that proofs are hard though, and for most of us tests are sufficient. So what if instead, we had this workflow:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;certc source.cc -o myprogram -t test
./test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Instead of generating a proof, we now generate a test suite, and instead of opening it up to inspect it, we run it. If it passes, we’re still convinced that the compilation run was correct. Visually, certifying compilation just adds one more output artifact to a compilation run, which we can call a “checker,” and looks something like this:&lt;/p&gt;

&lt;div style=&quot;display:flex;justify-content:center&quot;&gt;
&lt;script type=&quot;text/typogram&quot;&gt;
                       .--------------.
     .----------------&gt;|    Checker   |
     |                 .--------------.
     |
     |
.----------.             .-----------.
|  Source  |------------&gt;|   Target  |
.----------.             .-----------.

&lt;/script&gt; 
&lt;/div&gt;

&lt;h1 id=&quot;from-programs-to-applications&quot;&gt;From Programs to Applications&lt;/h1&gt;

&lt;p&gt;At this point, this doesn’t look very applicable to something like a web application, and I’m mostly interested in testing interactive distributed applications. The idea of compiling a source model into a full-fledged web app is farfetched to say the least. I actually tried going down that path for a bit, and I can confirm: it is hard. It’s definitely an interesting research area, but for now let me pitch an alternative workflow that’s still based on the mental model of certifying compilation.&lt;/p&gt;

&lt;p&gt;What if we assume that our target application is something that we hand-modify out of band, and we just generate the checker for it, i.e.:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;certc model -c test
./test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And visually:&lt;/p&gt;

&lt;div style=&quot;display:flex;justify-content:center&quot;&gt;
&lt;script type=&quot;text/typogram&quot;&gt;
                       .--------------.
     .----------------&gt;|    Checker   |
     |                 .--------------.
     |
     |
.----------.             .-----------.
|  Model   | - - - - - -&gt;|    App    |
.----------.             .-----------.

&lt;/script&gt; 
&lt;/div&gt;

&lt;p&gt;In this workflow, we hand-develop the implementation application as we do normally, but we still generate the checker from a model. This puts us under the umbrella of model-based testing, but we’re going to look at the proof techniques that a certifying compiler uses as inspiration for how we should generate the correctness tests. Because of this difference, I’d call this paradigm “certifying specification.”&lt;/p&gt;

&lt;p&gt;What’s nice about this is that it slots right in to existing workflows. We can even TDD with this if we’re so inclined, by first changing logic in the model and then generating the failing tests before implementing them. Workflow-wise, it’s simple enough to work.&lt;/p&gt;

&lt;h1 id=&quot;writing-a-model&quot;&gt;Writing a Model&lt;/h1&gt;

&lt;p&gt;Since the checker generation depends on the existence of a model, we should first talk about how to write one. The first question to ask is: should we use an existing language or a new language to write models in? I really try to avoid thinking about or suggesting the introduction of new languages into the ecosystem. But the question has to be asked, because using an existing language has a lot of tradeoffs with respect to specification:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Existing languages have no notion of system structure, i.e. how do we distinguish system state vs. local variables? How do we distinguish system actions vs. local mutation? How do we parse an arbitrary program and get relevant information out of it to help with test generation?&lt;/li&gt;
  &lt;li&gt;Programming languages are meant for programming. There are aspects of specification that require other language features, such as the ability to express logical properties and the ability to control aspects of test generation.&lt;/li&gt;
  &lt;li&gt;Programming languages have additional features that aren’t necessary in a modeling context. For example, a model has no need for filesystem operations or networking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These can be overcome by creating an embedded DSL within an existing language to restrict the structure of models, but embedded DSLs have their own set of tradeoffs&lt;sup id=&quot;fnref:fn3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;One other option is to use an existing specification language, like TLA+. TLA+ in particular is too powerful for us here - we really want to limit models to be &lt;em&gt;executable&lt;/em&gt; so that we can use their logic in the checker.&lt;/p&gt;

&lt;p&gt;I think these are all viable approaches, but I also think that there are enough reasons to create a language that’s purpose-built for this use case. I’ve been experimenting with one that I call &lt;a href=&quot;https://github.com/amw-zero/sligh&quot;&gt;Sligh&lt;/a&gt;. Here’s a model of a counter application in Sligh:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sligh&quot; data-lang=&quot;sligh&quot;&gt;record Counter:
  name: Id(String)
  value: Int
end

process CounterApp:
  counters: Set(Counter)
  favorites: Set(String)

  def GetCounters():
    counters
  end

  def CreateCounter(name: String):
    counters := counters.append(Counter.new(name, 0))
  end

  def Increment(name: String):
    def findCounter(counter: Counter):
      counter.name.equals(name)
    end

    def updateCounter(counter: Counter):
      Counter.new(counter.name, counter.value + 1)
    end

    counters := counters.update(findCounter, updateCounter)
  end

  def AddFavorite(name: String):
    favorites := favorites.append(name)
  end

  def DeleteFavorite(name: String):
    def findFavorite(favName: String):
      name.equals(favName)
    end

    favorites := favorites.delete(findFavorite)
  end
end&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Sligh is not meant to be revolutionary in any way at the language level (in fact it aims to be much simpler than the average general purpose language), and hopefully the functionality is clear here. The main goal is that it supports enough analysis so that we can generate our model-based tests. The notable syntactic features are the &lt;code&gt;:=&lt;/code&gt; operator and the structure of the &lt;code&gt;process&lt;/code&gt; definition. The &lt;code&gt;:=&lt;/code&gt; operator denotes updates of the &lt;em&gt;system&lt;/em&gt; state, distinguished from any modification of local variables. The &lt;code&gt;CounterApp&lt;/code&gt; process has a set of counters and a set of favorites as system state. Local variables exist, but mutations to those are implementation details and don’t matter from the perspective of testing. Having a specific operator for the system state allows simple syntactic analysis to find state changes, which is essential for generating the certification test.&lt;/p&gt;

&lt;p&gt;For example, in the &lt;code&gt;Increment&lt;/code&gt; action, we know that the &lt;code&gt;counters&lt;/code&gt; state variable is modified, and in the &lt;code&gt;AddFavorite&lt;/code&gt; action the &lt;code&gt;favorites&lt;/code&gt; state variable is modified. If no assignments occur on a state variable in the span of an action, then we know for sure that it’s not modified in that action. This becomes very important later when we can exploit this to generate the minimum amount of test data necessary for a given test iteration.&lt;/p&gt;

&lt;p&gt;Sligh processes also support nested &lt;code&gt;def&lt;/code&gt;s which define system &lt;em&gt;actions&lt;/em&gt;. System actions are the atomic ways that the system state can change, like adding or incrementing counters. For those conceptual user operations, we have corresponding &lt;code&gt;CreateCounter&lt;/code&gt; and &lt;code&gt;Increment&lt;/code&gt; actions. This is what Sligh uses to determine which operations to generate tests for.&lt;/p&gt;

&lt;p&gt;These syntactic restrictions lead to a very powerful semantic model of a system that’s also statically analyzable - they effectively form a DSL for describing state machines.&lt;/p&gt;
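
&lt;p&gt;One way to see this: each &lt;code&gt;process&lt;/code&gt; is essentially data of the following shape (a sketch in TypeScript of my framing, not Sligh’s actual internals):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// A process viewed as an analyzable state machine: named state variables
// plus named actions, where each action declares which state variables it
// writes (recoverable syntactically from the := assignments).
interface StateMachine&amp;lt;State&amp;gt; {
  initial: State;
  actions: {
    name: string;                 // e.g. 'CreateCounter'
    writes: (keyof State)[];      // state variables assigned with :=
    step: (state: State, args: unknown[]) =&amp;gt; State;
  }[];
}
&lt;/code&gt;&lt;/pre&gt;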

&lt;h1 id=&quot;compiling-the-test-suite&quot;&gt;Compiling the Test Suite&lt;/h1&gt;

&lt;p&gt;A Sligh model doesn’t get compiled into a test suite directly. To compile the above counter model, we’d run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;sligh counter.sl -w witness
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which generates a “witness” file. This is a good time to talk a bit about the compiler internals and why that is.&lt;/p&gt;

&lt;p&gt;It’s common for certifying compilers to decouple per-program generated output from a separate checker&lt;sup id=&quot;fnref:fn4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; that’s written once. This makes the code generation phase of the compiler simpler, but also allows the checker to be written and audited independently. This is extra important since the checker is our definition of correctness for the whole application, and a misstatement there affects the guarantees our certification test gives us.&lt;/p&gt;

&lt;p&gt;Here’s the current checker that’s in use:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;export function makeTest(
    actionName: string,
    stateType: &amp;quot;read&amp;quot; | &amp;quot;write&amp;quot;,
    stateGen: any,
    implSetup: any,
    dbSetup: any,
    model: any,
    modelArg: any,
    clientModelArg: any,
    runImpl: any,
    expectations: any,
  ) {
    test(`Test local action simulation: ${actionName}`, async () =&amp;gt; {
      let impl: StoreApi&amp;lt;ClientState&amp;gt;;
  
      await fc.assert(fc.asyncProperty(stateGen, async (state) =&amp;gt; {
        impl = makeStore();        
  
        const clientState = implSetup(state);

        // Initialize client state
        impl.setState(clientState);

        // Initialize DB state
        await impl.getState().setDBState(dbSetup(state));

        // Run implementation action
        await runImpl(impl.getState(), state);

        // Run model action and assert
        switch (stateType) {
          case &amp;quot;write&amp;quot;: {
            const clientModelResult = model(clientModelArg(state));
            for (const expectation of expectations) {
              const { modelExpectation, implExpectation } = expectation(clientModelResult, impl.getState());
    
              expect(implExpectation).toEqual(modelExpectation);
            }
            break;
          }
          case &amp;quot;read&amp;quot;: {
            let modelResult = model(modelArg(state));
            for (const expectation of expectations) {
              const { modelExpectation, implExpectation } = expectation(modelResult, impl.getState());
    
              expect(implExpectation).toEqual(modelExpectation);
            }
            break;
          }
        }
      }).afterEach(async () =&amp;gt; {
        // Cleanup DB state
        await impl.getState().teardownDBState();
      }), { endOnFailure: true, numRuns: 25 });
    });
  }&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This looks similar to other &lt;a href=&quot;/model-based-testing-theory/&quot;&gt;model-based tests&lt;/a&gt; we’ve built before in that it compares the output of the model and implementation for a given action at a given initial state. This test is parameterized though, and all of the input parameters for a given test come from the witness.&lt;/p&gt;

&lt;p&gt;A “witness” in the certifying compilation world refers to data that’s extracted from the source program during compilation. Here’s the witness output for the &lt;code&gt;CreateCounter&lt;/code&gt; action:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;interface Counter {
  name: string;
  value: number;
}

interface CreateCounterDBState {
  counters: Array&amp;lt;Counter&amp;gt;;
}

interface CreateCounterType {
  counters: Array&amp;lt;Counter&amp;gt;;
  name: string;
  db: CreateCounterDBState;
}

interface CreateCounterModelIn {
  name: string;
  counters: Array&amp;lt;Counter&amp;gt;;
}

let CreateCounterModel = (params: CreateCounterModelIn) =&amp;gt; {
  let name = params.name;
  let counters = params.counters;
  counters = (() =&amp;gt; {
    let a = [...counters];
    a.push({ name: name, value: 0 });
    return a;
  })();
  return { counters: counters };
};

// ...

{
  name: &amp;quot;CreateCounter&amp;quot;,
  type: &amp;quot;write&amp;quot;,
  stateGen: fc.record({
    counters: fc.uniqueArray(
      fc.record({ name: fc.string(), value: fc.integer() }),
      {
        selector: (e: any) =&amp;gt; {
          return e.name;
        },
      }
    ),
    name: fc.string(),
    db: fc.record({
      counters: fc.uniqueArray(
        fc.record({ name: fc.string(), value: fc.integer() }),
        {
          selector: (e: any) =&amp;gt; {
            return e.name;
          },
        }
      ),
    }),
  }),
  implSetup: (state: CreateCounterType) =&amp;gt; {
    return { counters: state.counters };
  },
  dbSetup: (state: CreateCounterType) =&amp;gt; {
    return { counters: state.db.counters, name: state.name };
  },
  model: CreateCounterModel,
  modelArg: (state: CreateCounterType) =&amp;gt; {
    return { counters: state.db.counters, name: state.name };
  },
  clientModelArg: (state: CreateCounterType) =&amp;gt; {
    return { counters: state.counters, name: state.name };
  },
  runImpl: (impl: ClientState, state: CreateCounterType) =&amp;gt; {
    return impl.CreateCounter(state.name);
  },
  expectations: [
    (modelResult: CreateCounterModelOut, implState: ClientState) =&amp;gt; {
      return {
        modelExpectation: { counters: modelResult.counters },
        implExpectation: { counters: implState.counters },
      };
    },
  ],
},

// ...&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The details here are likely to change over time, but the key thing to notice is that all of this information is generated from the definition of &lt;code&gt;CreateCounter&lt;/code&gt; in the model. Here’s the &lt;code&gt;CreateCounter&lt;/code&gt; definition again for reference:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sligh&quot; data-lang=&quot;sligh&quot;&gt;def CreateCounter(name: String):
  counters := counters.append(Counter.new(name, 0))
end&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This action takes a &lt;code&gt;name&lt;/code&gt; string as input, but it also modifies the &lt;code&gt;counters&lt;/code&gt; state variable (which Sligh is able to detect because of the presence of the &lt;code&gt;:=&lt;/code&gt; operator). From this, one of the things we generate is a type for all of the test’s input data, &lt;code&gt;CreateCounterType&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;interface CreateCounterType {
  counters: Array&amp;lt;Counter&amp;gt;;
  name: string;
  db: CreateCounterDBState;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And the &lt;code&gt;stateGen&lt;/code&gt; property of the &lt;code&gt;witness&lt;/code&gt; object gets a corresponding data generator for this type:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;fc.record({
  counters: fc.uniqueArray(
    fc.record({ name: fc.string(), value: fc.integer() }),
    {
      selector: (e: any) =&amp;gt; {
        return e.name;
      },
    }
  ),
  name: fc.string(),
  db: fc.record({
    counters: fc.uniqueArray(
      fc.record({ name: fc.string(), value: fc.integer() }),
      {
        selector: (e: any) =&amp;gt; {
          return e.name;
        },
      }
    ),
  }),
})&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Also, note what this excludes. The test doesn’t have to generate the &lt;code&gt;favorites&lt;/code&gt; variable since it’s not referenced or modified in the span of this particular action. The test for each action only has to generate the bare minimum amount of data it needs to function. And most importantly, this means we totally avoid creating any global system states. I think this will be the key to testing a larger application in this way.&lt;/p&gt;
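&lt;p&gt;To make this concrete, here’s a hypothetical sketch of the state type that would be generated for the &lt;code&gt;Increment&lt;/code&gt; action (the actual generated names may differ):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Hypothetical sketch: Increment only reads and writes `counters`
// and takes a `name` input, so only that state appears in its type.
interface IncrementType {
  counters: Array&amp;lt;Counter&amp;gt;;
  name: string;
  db: { counters: Array&amp;lt;Counter&amp;gt; };
  // no `favorites` - Increment never references it
}
&lt;/code&gt;&lt;/pre&gt;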

&lt;p&gt;The remaining params are similarly extracted from the &lt;code&gt;CreateCounter&lt;/code&gt; signature and body, each feeding a specific part of the checker. I expect to hone these witness definitions over time, but this works for now.&lt;/p&gt;

&lt;p&gt;At this point it should be apparent that the compiler and checker both have to know about some very important system details. They need to know what language the test is written in. They need to know the pattern for executing actions on both the implementation and the model (here the implementation interface is a &lt;a href=&quot;https://github.com/pmndrs/zustand&quot;&gt;Zustand&lt;/a&gt; store meant to be embedded in a React app). They need to know what testing libraries are being used - here we’re using &lt;a href=&quot;https://github.com/vitest-dev/vitest&quot;&gt;vitest&lt;/a&gt; and &lt;a href=&quot;https://github.com/dubzzz/fast-check&quot;&gt;fast-check&lt;/a&gt;. And they need to be able to set up the state of external dependencies like the database, done here with calls to &lt;code&gt;impl.getState().setDBState&lt;/code&gt; and &lt;code&gt;impl.getState().teardownDBState()&lt;/code&gt;, which means that the server has to be able to help out with initializing data states.&lt;/p&gt;

&lt;p&gt;Still, lots of the functionality is independent of these concerns, and my hope is to make the compiler extensible to different infrastructure and architectures via compiler backends. For now, sticking with this single architecture has supported the development of the prototype of this workflow.&lt;/p&gt;

&lt;p&gt;Finally, the test gets wired up together in a single file runnable by the test runner:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;import { makeTest } from &amp;#39;./maketest&amp;#39;;
import { witness } from &amp;#39;./witness&amp;#39;;

for (const testCase of witness) {
  makeTest(
    testCase.name,
    testCase.type as &amp;quot;read&amp;quot; | &amp;quot;write&amp;quot;,
    testCase.stateGen,
    testCase.implSetup,
    testCase.dbSetup,
    testCase.model,
    testCase.modelArg,
    testCase.clientModelArg,
    testCase.runImpl,
    testCase.expectations
  );
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h1 id=&quot;outro&quot;&gt;Outro&lt;/h1&gt;

&lt;p&gt;Ok, I went into a lot of detail about the internals of the Sligh compiler. But to reiterate, the developer workflow is just:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;sligh counter.sl -w witness
./test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I’m using this on a working Next.js application, and workflow-wise it feels great. I’m excited to see what other challenges come up as the application grows.&lt;/p&gt;

&lt;p&gt;I can’t rightfully end the post without talking about a few tradeoffs. I could probably write a whole separate post about them, since this one is already quite long, but two big ones are worth mentioning now. First, because we’re testing single state transitions, a test failure won’t tell you how to actually reproduce the failure. It might take a series of very particular action invocations to arrive at the starting state of the simulation test, and it’s not always clear whether that specific state is likely, or even reachable, in regular application usage. I have ideas there - similar to property-based testing failure minimization, it should be possible to search for action sequences that result in the failing initial state.&lt;/p&gt;
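&lt;p&gt;As a rough sketch of that idea (nothing like this exists in Sligh today), a bounded breadth-first search over model actions could look for a sequence that reaches the failing state, using the model as a cheap simulator:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Hypothetical sketch: search for an action sequence that reproduces
// a failing initial state.
type ModelState = { counters: Array&amp;lt;Counter&amp;gt; };
type Action = (s: ModelState) =&amp;gt; ModelState;

function findRepro(
  init: ModelState,
  actions: Array&amp;lt;Action&amp;gt;,
  isFailing: (s: ModelState) =&amp;gt; boolean,
  maxDepth: number,
): Array&amp;lt;Action&amp;gt; | null {
  // Each queue entry is a state plus the action path that produced it.
  let queue: Array&amp;lt;[ModelState, Array&amp;lt;Action&amp;gt;]&amp;gt; = [[init, []]];
  for (let depth = 0; depth &amp;lt; maxDepth; depth++) {
    const next: Array&amp;lt;[ModelState, Array&amp;lt;Action&amp;gt;]&amp;gt; = [];
    for (const [state, path] of queue) {
      if (isFailing(state)) return path;
      for (const action of actions) {
        next.push([action(state), [...path, action]]);
      }
    }
    queue = next;
  }
  return null;
}
&lt;/code&gt;&lt;/pre&gt;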

&lt;p&gt;The second tradeoff is that data generation for property tests of a full application is non-trivial. Sligh currently does the bare minimum here, which is to use type definitions to create data generators. I’m hoping the language can help out more, though - it may be possible to extract more intelligent generators from the model logic.&lt;/p&gt;

&lt;p&gt;And lastly, I have to call out the awesome &lt;a href=&quot;https://cogent.readthedocs.io/en/latest/&quot;&gt;Cogent&lt;/a&gt; project one last time. So many of these ideas were inspired by the many publications from that project. Specifically, check out this paper: &lt;a href=&quot;https://trustworthy.systems/publications/full_text/Chen_OKKH_17.pdf&quot;&gt;The Cogent Case for Property-Based Testing&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I first heard about certifying compilation through a &lt;a href=&quot;https://www.youtube.com/watch?v=sJwcm_worfM&quot;&gt;talk on YouTube&lt;/a&gt; and &lt;a href=&quot;https://trustworthy.systems/publications/nicta_full_text/9425.pdf&quot;&gt;a corresponding paper&lt;/a&gt; (by Liam O’Connor, Zilin Chen, Christine Rizkallah, Sidney Amani, Japheth Lim, Toby Murray, Yutaka Nagashima, Thomas Sewell, and Gerwin Klein). These are about the &lt;a href=&quot;https://cogent.readthedocs.io/en/latest/&quot;&gt;Cogent&lt;/a&gt; language, which compiles from itself to C, but also generates a proof of its correctness in Isabelle. &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Any compiler-writer will tell you, compilers are &lt;a href=&quot;https://softwareengineering.stackexchange.com/a/53069&quot;&gt;just as buggy&lt;/a&gt; as other programs. This is why certifying compilation exists in the first place - to provide higher assurance about the correctness of a compiler. &lt;a href=&quot;#fnref:fn2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I once read &lt;a href=&quot;https://matklad.github.io/2021/02/14/for-the-love-of-macros.html#Domain-Specific-Languages&quot;&gt;an interesting take about building embedded DSLs inside of an existing language&lt;/a&gt; that influenced my thinking here. The takeaway: eDSLs are often not worth it. &lt;a href=&quot;#fnref:fn3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://people.mpi-inf.mpg.de/~mehlhorn/ftp/CertifyingAlgorithms.pdf&quot;&gt;Certifying Algorithms&lt;/a&gt; by R. M. McConnell, K. Mehlhorn, S. Näher, and P. Schweitzer &lt;a href=&quot;#fnref:fn4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="formal_methods" /><category term="plt" /><summary type="html">When I first stumbled upon certifying compilation1, I was absolutely awestruck. I thought a compiler was a very specific thing, a translator from source to target language. But a certifying compiler goes further: it also proves its own correctness. My motto has become “most tests should be generated”, so this immediately seemed like a promising approach to my goal of improving the generative testing of interactive applications. It wasn’t immediately clear how exactly to incorporate this into that context, but after a little experimentation I now have a prototype of what it might look like. I first heard about certifying compilation through a talk on YouTube and a corresponding paper (by Liam O’Connor, Zilin Chen, Christine Rizkallah, Sidney Amani, Japheth Lim, Toby Murray, Yutaka Nagashima, Thomas Sewell, and Gerwin Klein). These are about the Cogent language, which compiles from itself to C, but also generates a proof of its correctness in Isabelle. &amp;#8617;</summary></entry><entry><title type="html">Most Tests Should Be Generated</title><link href="/generated-tests/" rel="alternate" type="text/html" title="Most Tests Should Be Generated" /><published>2023-07-02T00:00:00+00:00</published><updated>2023-07-02T00:00:00+00:00</updated><id>/generated-tests</id><content type="html" xml:base="/generated-tests/">&lt;p&gt;Traditional testing wisdom eventually invokes the test pyramid, which is a guide to the proportion of tests to write along the isolation / integration spectrum. There’s an eternal debate about what the best proportion should be at each level, but interestingly it’s always presented with the assumption that test cases are hand-written. We should also think about test generation as a dimension, and if I were to draw a pyramid about it I’d place generated tests on the bottom and hand-written scenarios on top, i.e. most tests should be generated.&lt;/p&gt;

&lt;h1 id=&quot;correctness-is-what-we-want&quot;&gt;Correctness is What We Want&lt;/h1&gt;

&lt;p&gt;What are we even trying to do with testing? The end goal is to show correctness. We do this for two main reasons: to show that new functionality does what’s expected before release, and to ensure that existing functionality is not broken between releases. Tests are a means to this end, nothing more. Importantly, they also can only ever show &lt;em&gt;approximate&lt;/em&gt; correctness. To understand that fully, let’s define correctness precisely. Here’s a paraphrasing of Kedar Namjoshi’s definition from &lt;a href=&quot;https://www.youtube.com/watch?v=GZXSSCF4siY&quot;&gt;Designing a Self-Certifying Compiler&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First we have to define what a program is. The simplest representation is just a function from values in X to values in Y. This may look oversimplified, but an interactive program can even be modeled this way by assuming the program function is invoked in response to each user interaction in a loop. So a program P is:&lt;/p&gt;

\[P: X \rightarrow Y\]

&lt;p&gt;Correctness requires a specification to check against. This might be surprising, since one rarely exists, but think of traditional test suites as simply defining this specification point-wise. A specification S can be a function of the same type:&lt;/p&gt;

\[S: X \rightarrow Y\]

&lt;p&gt;We can express correctness with the following property:&lt;/p&gt;

\[\forall x \in X: P(x) = S(x)\]

&lt;p&gt;In English: for every x value in X, evaluating P(x) yields the same value as evaluating S(x).&lt;/p&gt;

&lt;p&gt;Point being, we want to check that the implementation program does the same thing as the specification, always. Notice how achieving 100% branch coverage in a test suite doesn’t get us here by the way, since that doesn’t account for all inputs in &lt;em&gt;X&lt;/em&gt;.&lt;/p&gt;
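&lt;p&gt;A toy example makes this concrete (&lt;code&gt;clamp&lt;/code&gt; is invented for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Suppose the spec S says: clamp x into the range [0, 100].
function clamp(x: number): number {
  if (x &amp;gt; 100) return 100;
  return x; // bug: negative inputs should clamp to 0
}

clamp(150); // 100, correct
clamp(50);  // 50, correct
// Those two cases execute every branch - 100% branch coverage - yet
// P(x) = S(x) fails for every x below 0, e.g. clamp(-5) is -5, not 0.
&lt;/code&gt;&lt;/pre&gt;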

&lt;p&gt;Let’s look at how scenarios and generated tests differ with how they show correctness.&lt;/p&gt;

&lt;h1 id=&quot;testing-for-correctness-with-scenarios&quot;&gt;Testing for Correctness with Scenarios&lt;/h1&gt;

&lt;p&gt;As I mentioned, the traditional test pyramid is talking about hand-written test scenarios, aka examples / test cases etc. Correctness is pretty simple to express as a logical property, but it’s very difficult to test for. The first thing we run into is the test oracle problem - how do we actually get the value of &lt;em&gt;S(x)&lt;/em&gt; to check against? Executable specifications rarely exist (though I am a proponent of using them for this reason), so normally what happens is that the test writer interprets an informal specification and hard codes the expected value of &lt;em&gt;S(x)&lt;/em&gt; for a specific x as the test assertion. The informal specification is what the team talks about when deciding to build the feature, and the test writer is the test oracle. Sometimes some details are written down, sometimes not, but the burden of coming up with the expected test value is always on the test writer, and it’s a completely manual process.&lt;/p&gt;
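&lt;p&gt;In code, a scenario test bakes the oracle’s answer directly into its assertion. An illustrative example (&lt;code&gt;priceWithTax&lt;/code&gt; is made up):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// The expected value 108 is S(x) for x = 100: the test writer computed
// it from the informal spec (&quot;8% tax&quot;) and hard-coded it here.
test(&quot;applies 8% tax&quot;, () =&amp;gt; {
  expect(priceWithTax(100)).toEqual(108);
});
&lt;/code&gt;&lt;/pre&gt;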

&lt;p&gt;The next issue is the number of values in the input domain X. Each test case needs to specify a single input value from X, but testing for all values from X is not feasible in any way. This is not an exaggeration - if X is the set of single 64-bit integers, we’d have to check 18,446,744,073,709,551,616 test cases. This multiplies for each additional integer, and how many integers do you think are in the entire state of a realistic program? We said earlier that a test suite only approximates correctness, but this makes it more formal. A test suite actually represents this property:&lt;/p&gt;

\[TX \subseteq X \land \forall tx \in TX: P(tx) = S(tx)\]

&lt;p&gt;How effective a test suite is boils down to how confident we are that testing the input values we chose implies that correctness holds for all of the input values, i.e.&lt;/p&gt;

\[\forall tx \in TX: P(tx) = S(tx) \implies \forall x \in X: P(x) = S(x)\]

&lt;p&gt;This is probably true sometimes, but we have no guarantee of it in general. How can we ever know that the values that we pick out of X are “good enough”?&lt;/p&gt;

&lt;p&gt;So test scenarios have an informal and manual test oracle process, and are pretty quantitatively incomplete in terms of how much of the input domain they can possibly cover. That doesn’t mean they’re not useful! Testing via scenarios is unreasonably effective in practice. There are two main benefits to them. First, they’re easy to write. This is likely because they require very literal and linear reasoning, since we just need to assert on the actual output of the program. If we really want, we can just run the program and observe what it outputs and record that as a test assertion. People do this all the time, and there’s even a strategy that takes this to the extreme called “golden testing” or “snapshot testing.”&lt;/p&gt;

&lt;p&gt;The next benefit, somewhat obviously, is that they’re specific. If we have a test case in our head that we know is really important to check, why not just write it out? When we do this, we &lt;a href=&quot;https://buttondown.email/hillelwayne/archive/some-tests-are-stronger-than-others/#fnref:stronger-than-nitpick&quot;&gt;also get more local error messaging when the test fails&lt;/a&gt;, which can point us in a very specific direction. This is always cited as one of the main benefits of unit testing, and it really is helpful to have a specific area of the code to look at vs. trying to track down a weird error in a million lines of code.&lt;/p&gt;

&lt;p&gt;Now let’s look at generated tests.&lt;/p&gt;

&lt;h1 id=&quot;generating-tests-for-properties&quot;&gt;Generating Tests for Properties&lt;/h1&gt;

&lt;p&gt;Our correctness statement from earlier is expressed as a property: &lt;em&gt;P(x) = S(x)&lt;/em&gt; is a property that’s either true or not for all of the program inputs. Now, we know that we can’t actually check every single input in a test, but what we can do is generate lots and lots of inputs and check if the property holds. With property-based testing these inputs are usually generated randomly, but there are &lt;a href=&quot;/category-partition-properties/&quot;&gt;other data generation strategies as well&lt;/a&gt;. So here, we’re talking about property-based testing more generally, and it has a couple of subtly different problems than testing with scenarios.&lt;/p&gt;

&lt;p&gt;When checking for properties, the test oracle problem also presents itself immediately. We can always evaluate &lt;em&gt;P(x)&lt;/em&gt;, since that’s our implementation that we obviously control, but how do we know what the expected value &lt;em&gt;S(x)&lt;/em&gt; is? And, furthermore, we have a chicken-and-egg problem: how do we know what &lt;em&gt;S(x)&lt;/em&gt; is when code is generating &lt;code&gt;x&lt;/code&gt; and we don’t know what it is ahead of time?&lt;/p&gt;

&lt;p&gt;The answer is to define &lt;em&gt;S(x)&lt;/em&gt; with logic that we can actually execute in the test, i.e. an executable specification. This often sounds weird to people at first, but looking at our correctness statement this is the more natural way to test. Instead of implicitly defining the specification via a bunch of individual test cases, we just define &lt;em&gt;S(x)&lt;/em&gt; and call it during testing. This can take the form of simple functions that represent invariants of the code, all the way up to &lt;a href=&quot;/model-based-testing/&quot;&gt;entire models of the functional behavior&lt;/a&gt;.&lt;/p&gt;
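&lt;p&gt;Here’s an illustrative sketch of that with &lt;a href=&quot;https://github.com/dubzzz/fast-check&quot;&gt;fast-check&lt;/a&gt;, where a slow-but-obviously-correct reference function plays the role of &lt;em&gt;S(x)&lt;/em&gt; and &lt;code&gt;mySort&lt;/code&gt; stands in for the implementation under test:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import fc from &quot;fast-check&quot;;

// S(x): an executable specification of sorting.
function sortSpec(xs: Array&amp;lt;number&amp;gt;): Array&amp;lt;number&amp;gt; {
  return [...xs].sort((a, b) =&amp;gt; a - b);
}

// P(x): the implementation under test (imagine a hand-rolled quicksort).
declare function mySort(xs: Array&amp;lt;number&amp;gt;): Array&amp;lt;number&amp;gt;;

// Generate many values of x and check P(x) = S(x) directly.
fc.assert(
  fc.property(fc.array(fc.integer()), (xs) =&amp;gt; {
    expect(mySort(xs)).toEqual(sortSpec(xs));
  })
);
&lt;/code&gt;&lt;/pre&gt;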

&lt;p&gt;The input space issue is also still present with property-based testing, but in a different way: generating data is hard. Like, really hard. One of the main challenges is logical constraints, e.g. “this number must be less than 100”. These constraints can get very complicated in real-world domains, and sometimes that even leads to performance issues where you have to discard generated inputs until the constraint is met.&lt;/p&gt;
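&lt;p&gt;As a small illustration in a property-based testing library like fast-check, a constraint like “less than 100” can either be built into the generator or enforced by filtering, and filtering works by discarding values:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Built into the generator: every generated value satisfies the
// constraint, nothing is wasted.
const small = fc.integer({ max: 99 });

// Enforced by filtering: values are generated and then thrown away
// until one passes the predicate. Fine here, but a tight constraint
// (say, one valid value in a huge range) discards almost everything.
const alsoSmall = fc.integer().filter((n) =&amp;gt; n &amp;lt; 100);
&lt;/code&gt;&lt;/pre&gt;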

&lt;p&gt;Property-based testing has an absolute killer feature though: it discovers failure cases for you, i.e. it actually finds unknown unknowns. This is worth more than gold. With scenarios, you have to know the failure ahead of time, but isn’t every bug in production a result of a failure that you didn’t even think of before deploying? Rather than check cases that we know ahead of time, we generate tests that search for interesting failures. This simply can’t be done with ahead-of-time test scenarios.&lt;/p&gt;

&lt;h1 id=&quot;the-test-generation-pyramid&quot;&gt;The Test Generation Pyramid&lt;/h1&gt;

&lt;p&gt;We looked at some of the pros and cons of scenarios vs. generated tests, so which should we prefer? I definitely think we should write both kinds, but overall most tests should be generated. Test strategies have to be represented as a triangle, so here is this idea in triangle form:&lt;/p&gt;

&lt;div style=&quot;display: flex; justify-content: center;&quot;&gt;
  &lt;img src=&quot;/assets/generated_tests/generated-tests.png&quot; style=&quot;width:64%&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Why should we prefer generated tests? It all boils down to the fact that they find failures for us, which means that they naturally bring us closer to correctness. Unfortunately, no matter how perfect our selected test scenarios are, they leave the vast majority of the input space uncovered, and there is no way to know which uncovered inputs are important and which are redundant. By having a suite of generated tests that are constantly looking for new inputs, we put ourselves in the best position to find edge cases that we just aren’t considering at the moment.&lt;/p&gt;

&lt;p&gt;It’s like having a robot exploratory tester that we can deploy at will, which opens up a whole new mode of testing. We can run generated tests in CI before merging, sure, but we can also run them around the clock since generated tests &lt;em&gt;search&lt;/em&gt; for failures vs. checking predetermined scenarios. More testing time means more of the input domain being searched, so to check more inputs we simply run each generated test for longer and run more test processes in parallel.&lt;/p&gt;

&lt;p&gt;This doesn’t mean that we stop writing scenarios. That’s why there’s two sections in the pyramid. All of the proposed values of test scenarios are valid - we get specific error messages, free executable documentation, and a guarantee that important cases are checked. But generated tests are &lt;a href=&quot;https://buttondown.email/hillelwayne/archive/some-tests-are-stronger-than-others/#fnref:stronger-than-nitpick&quot;&gt;fundamentally stronger&lt;/a&gt; than scenarios, since the generated tests will often find the same inputs that we use in our scenarios in addition to ones we haven’t thought about.&lt;/p&gt;

&lt;p&gt;Since the ultimate goal of testing is correctness, not documentation and local error messages, it’s in our best interest to supplement our scenarios with lots and lots of generated tests.&lt;/p&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="philosophy" /><summary type="html">Traditional testing wisdom eventually invokes the test pyramid, which is a guide to the proportion of tests to write along the isolation / integration spectrum. There’s an eternal debate about what the best proportion should be at each level, but interestingly it’s always presented with the assumption that test cases are hand-written. We should also think about test generation as a dimension, and if I were to draw a pyramid about it I’d place generated tests on the bottom and hand-written scenarios on top, i.e. most tests should be generated.</summary></entry><entry><title type="html">Logical Time and Deterministic Execution</title><link href="/logical-time-determinism/" rel="alternate" type="text/html" title="Logical Time and Deterministic Execution" /><published>2023-02-28T00:00:00+00:00</published><updated>2023-02-28T00:00:00+00:00</updated><id>/logical-time-determinism</id><content type="html" xml:base="/logical-time-determinism/">&lt;p&gt;Recently, Tomorrow Corporation released &lt;a href=&quot;https://www.youtube.com/watch?v=72y2EC5fkcE&quot;&gt;this video of their in-house tech stack&lt;/a&gt; doing some truly awesome time-travel debugging of a production-quality game. You should watch this video, even if you don’t read this post, because the workflow that they’ve created is really inspiring. The creator kept bringing up the fact that the reason their tools can do this is that they have determinism baked into them at the very foundational levels. You simply can’t bolt this on at higher levels in the stack.&lt;/p&gt;

&lt;p&gt;This got me thinking - not only do we rarely have this level of control in our projects, but I think it’s rare to even understand how determinism is possible in modern systems that are interactive, concurrent, and distributed. If we don’t understand this, we can’t ever move our tools toward determinism, which I think is a very good idea. It turns out that even if we can’t predict exactly how a program will execute in a &lt;em&gt;specific&lt;/em&gt; run, we can still model and reason about it deterministically. This is a prerequisite for most formal methods, and while I understand that formal methods aren’t everyone’s cup of tea, this is the number one thing that I wish more people understood. So today, we won’t be talking about testing or verifying anything, we’ll just be looking to better understand software in general by diving into logical time and how it enables deterministic reasoning.&lt;/p&gt;

&lt;h1 id=&quot;user-interaction-and-non-deterministic-choice&quot;&gt;User Interaction and Non-Deterministic Choice&lt;/h1&gt;

&lt;p&gt;Talk of non-determinism can get very abstract very quickly, but there is a practical manifestation that we’ve all observed even if we didn’t know the term: &lt;em&gt;non-deterministic choice&lt;/em&gt;. An application with a user interface is a classic example of a system with non-deterministic choice - no one can predict the order that a user will click through the interface, and the user is free to make any choice that’s visible and enabled.&lt;/p&gt;

&lt;p&gt;We’ll introduce an example to get more specific, and it’s important to &lt;em&gt;always&lt;/em&gt; use &lt;a href=&quot;https://todomvc.com/&quot;&gt;TodoMVC&lt;/a&gt; as the interactive application example&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; (here’s &lt;a href=&quot;https://todomvc.com/examples/js_of_ocaml/&quot;&gt;one of the implementations&lt;/a&gt; if you want to click around). In TodoMVC, we can add new named to-do items and then mark them as completed. We can also remove a to-do without marking it as completed. Like all interactive applications, we can do this in any order though, and these are all valid sequences of actions:&lt;/p&gt;

&lt;p&gt;1.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Add to-do named “t1”&lt;/li&gt;
  &lt;li&gt;Mark “t1” as completed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Add to-do named “t1”&lt;/li&gt;
  &lt;li&gt;Remove “t1”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t2”&lt;/li&gt;
  &lt;li&gt;Mark “t2” as completed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Add to-do named “t1”&lt;/li&gt;
  &lt;li&gt;Mark “t1” as completed&lt;/li&gt;
  &lt;li&gt;Add to-do named “t2”&lt;/li&gt;
  &lt;li&gt;Mark “t2” as completed&lt;/li&gt;
  &lt;li&gt;Add to-do named “t3”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t4”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t5”&lt;/li&gt;
  &lt;li&gt;Remove “t3”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t6”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t7”&lt;/li&gt;
  &lt;li&gt;Remove “t4”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t8”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t9”&lt;/li&gt;
  &lt;li&gt;Mark “t6” as completed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can visualize this non-determinism with a state graph:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/TodoMVCStates2.png&quot; /&gt;&lt;/p&gt;
&lt;div style=&quot;display: flex; justify-content: center;&quot;&gt;
  &lt;img src=&quot;/assets/TodoMVCLegend2.png&quot; style=&quot;width:64%&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;A non-deterministic choice exists when more than one transition arrow flows away from a given state. It means that all of them are valid choices that can occur in separate executions, but one has to &lt;em&gt;somehow&lt;/em&gt; be chosen to proceed through the state graph. An interactive application lets the user decide via the UI, but as we’ll see later, there are other things that can make choices. Functionally, it doesn’t matter who does the choosing.&lt;/p&gt;

&lt;p&gt;A quick aside: this is the complete behavior up to a bound of 2 to-dos. Physical space constraints aside, the full state graph of TodoMVC is theoretically infinite, because you can always add a to-do with a new name. Visualizing infinite bubbles is painful for everyone involved, so we place a constraint on the model along the lines of “there are only two to-dos in the entire universe.” This is a silly constraint, but it helps us visualize the state space in a manageable way. Bounded models also help with &lt;a href=&quot;https://en.wikipedia.org/wiki/Model_checking#Techniques&quot;&gt;making properties checkable&lt;/a&gt;, but we’re not talking about that today because we’re not actually doing formal methods!&lt;/p&gt;

&lt;p&gt;Let’s look at an example run through the program by picking specific choices. We’ll start at the gray initial state, add two to-dos named “t1” and “t2”, and then we’ll complete them both. Here’s that path in red:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/TodoMVCPath1.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can get to the same final state a different way, by adding to-do “t2”, completing it, then adding to-do “t1” and completing it:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/TodoMVCPath2.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We all know how software works intuitively, but seeing these runs against the full state graph hints at a couple of precise definitions: software behavior is simply a sequence of states, and a program is a set of allowable behaviors. It also gives us our first step towards determinism. When a non-deterministic choice exists, we don’t know which path will be taken in a specific program run, but we do know what all of the possible runs are. Each of those runs is a totally deterministic behavior.&lt;/p&gt;

&lt;p&gt;Said another way, a non-deterministic choice becomes deterministic when we pick one.&lt;/p&gt;
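&lt;p&gt;Those two definitions - behavior and program - are compact enough to write down directly. As a purely illustrative sketch in TypeScript types:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// A behavior is a sequence of states; a program is its set of allowed
// behaviors. Since that set is usually infinite, we model it as a
// membership predicate rather than a literal collection.
type Behavior&amp;lt;State&amp;gt; = Array&amp;lt;State&amp;gt;;
type Program&amp;lt;State&amp;gt; = (b: Behavior&amp;lt;State&amp;gt;) =&amp;gt; boolean;
&lt;/code&gt;&lt;/pre&gt;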

&lt;p&gt;For fun, here’s the state graph of TodoMVC with 5 to-dos:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/TodoMVCBigStates.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Determinism isn’t necessarily easy.&lt;/p&gt;

&lt;h1 id=&quot;concurrency&quot;&gt;Concurrency&lt;/h1&gt;

&lt;p&gt;Concurrency is another notorious source of non-determinism, but let’s define why. Imagine we have N network requests that start in an idle state, begin fetching some data, and eventually complete. Continuing to keep our bounds small, let’s start with N = 2:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/RequestsFont.svg&quot; /&gt;&lt;/p&gt;
&lt;div style=&quot;display: flex; justify-content: center;&quot;&gt;
  &lt;img src=&quot;/assets/determinism/RequestsLegend.svg&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In every state, either an idle request can be initiated or an in-progress request can complete. It’s possible for different requests to complete in different orders too, e.g. request 0 can complete first:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/Requests-Req0.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And request 1 can also complete first, even if request 0 was initiated before it:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/Requests-Req1.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The order that requests complete is a non-deterministic choice, which we’ve already seen, but there’s a major difference from the TodoMVC example: the OS or language runtime determines the choice, not a user. This is one reason why concurrency is a constant thorn in the side, and feels much more complex than the non-determinism of user interfaces. We literally don’t have control over the order of operations.&lt;/p&gt;

&lt;p&gt;In the same way as the choices in the user interface, though, we just have to account for all of their combinations, and then we can know which orders of execution are possible. Another way to think about this is that if a race is possible, both sides of the race will always eventually occur, and we have to plan for both cases.&lt;/p&gt;

&lt;p&gt;Because N = 2 is no fun, here’s N = 5 (i.e. 5 concurrent requests) which has 639 distinct states:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/Requests-Req5.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I’m sure a mutex will make this more manageable.&lt;/p&gt;

&lt;h1 id=&quot;logical-time-time-travel-and-beyond&quot;&gt;Logical Time, Time-Travel, and Beyond&lt;/h1&gt;

&lt;p&gt;Both state graphs show the set of all behaviors for the given system, and they do this by showing &lt;em&gt;logical&lt;/em&gt; time, in contrast to physical time. A user might wait 17 years before selecting a transition in a UI, or an OS scheduler might pick one thread to execute while another waits for I/O. The real-world execution of a program runs in physical time, but our state graphs are only concerned with abstract states and transitions between them. And good thing for that - it would be awkward to have to wait 17 years to understand the possible behaviors of TodoMVC.&lt;/p&gt;

&lt;p&gt;Beyond helping us understand the complete picture of all of the different interleavings of transitions, logical time is also what enables time-travel debugging. We can’t logically move through a system until it’s been properly decomposed into states and the steps between them. This in itself is a design space - how much of the system state do we store vs. derive? How much additional state do we add to make things possible like searching for states by timestamp?&lt;/p&gt;

&lt;p&gt;All we need for logical time are states and transitions between them, i.e. logical time is inherently tied to state machines / transition systems. In fact, a time-travel debugger can pretty much be seen as a user interface for a state machine. But most importantly, this mental model allows us to have a totally deterministic view of the behavior of a complex system. That in turn enables powerful features like time-travel debugging.&lt;/p&gt;
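&lt;p&gt;Here’s a minimal sketch of that idea (illustrative, not how any particular tool is implemented): once execution is reduced to an initial state and a step function, “time travel” is just an index into the recorded history.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Record every state a state machine passes through, then move
// backward and forward through logical time by moving a cursor.
class TimeTravel&amp;lt;State, Action&amp;gt; {
  private history: Array&amp;lt;State&amp;gt;;
  private cursor = 0;

  constructor(
    init: State,
    private step: (s: State, a: Action) =&amp;gt; State,
  ) {
    this.history = [init];
  }

  dispatch(action: Action) {
    // Taking a new action from a past state discards the old future.
    this.history = this.history.slice(0, this.cursor + 1);
    this.history.push(this.step(this.current(), action));
    this.cursor++;
  }

  back() { if (this.cursor &amp;gt; 0) this.cursor--; }
  forward() { if (this.cursor &amp;lt; this.history.length - 1) this.cursor++; }
  current(): State { return this.history[this.cursor]; }
}
&lt;/code&gt;&lt;/pre&gt;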

&lt;p&gt;To take advantage of logical time, this model has to be built into an application somehow. Because our tools generally don’t have any notion of determinism, you often see this with language-layer patterns like Redux or the Elm Architecture, or architecture-level patterns like event sourcing. All of those patterns reduce nicely down to the sequential state machine model presented here, but they’re up to the application developer to implement. The question that the Tomorrow Corporation demo asks is: what do we get if our tools did this for us without any up-front effort?&lt;/p&gt;

&lt;p&gt;Imagine not needing to add sleeps / retries to tests of asynchronous behavior. Or imagine a tool that identified concurrent code and showed us the different interleavings that we might have otherwise been unaware of, and allowed us to step through and try each of them out. I’m not a Nix user (yet), but others are already imagining a world with deterministic package management. Non-determinism, it seems, is fundamentally at odds with human brains, so I for one would love to see more determinism in any tool that I use.&lt;/p&gt;

&lt;p&gt;To get there, we’ll have to understand and implement logical time.&lt;/p&gt;

&lt;h1 id=&quot;outro&quot;&gt;Outro&lt;/h1&gt;

&lt;p&gt;I have no idea how the tools at Tomorrow Corporation are implemented, but I respect their commitment to determinism. Non-determinism is a part of life, but to have full control over a system it’s essential to view it through the deterministic lens of logical time. Because of things like concurrency which often rely on OS or language features that we can’t directly interact with, this can be difficult, but that video shows that there’s tremendous value in baking determinism further down into our foundational tools.&lt;/p&gt;

&lt;p&gt;The main thing I wanted to share in this post was a specific mental model. Sequential state machines are a tried and true model with deterministic properties, and they’ve legitimately changed how I look at software. In this model, a program is a set of behaviors, where each behavior is a sequence of states. It’s hard to imagine reducing programming down to a simpler explanation than that, and that clarity is necessary for wrangling complexity.&lt;/p&gt;

&lt;p&gt;The images in this post were generated from &lt;a href=&quot;https://learntla.com/&quot;&gt;TLA+ specs&lt;/a&gt;, which I won’t really explain, but hopefully they show that it doesn’t take a ton of effort to write simple models. TLA+ is a logic and tool which has this mental model at its foundation. I can’t recommend learning and using it enough. Its companion model checker makes the act of modeling tactile, and you can get machine feedback on your models vs. getting stuck in state-machine quicksand. The state graph visualizer is also very handy sometimes, though as was shown here is more useful when the bounds of the model are small.&lt;/p&gt;

&lt;p&gt;Here’s the spec for TodoMVC:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;------------------------------ MODULE TodoMVC ------------------------------
VARIABLES todos, completedTodos

Todos == {&quot;t1&quot;, &quot;t2&quot;}

Init == todos = {} /\ completedTodos = {}

RemainingTodos == Todos \ todos

IncompleteTodos == todos \ completedTodos

AddTodo == \E t \in RemainingTodos: todos' = todos \union {t} /\ UNCHANGED completedTodos

CompleteTodo == \E t \in IncompleteTodos: completedTodos' = completedTodos \union {t} /\ UNCHANGED todos

RemoveTodo == \E t \in todos: todos' = todos \ {t} /\ completedTodos' = completedTodos \ {t}

Next == AddTodo \/ CompleteTodo \/ RemoveTodo

=============================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And here’s the spec for the concurrency example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;---------------------------- MODULE Concurrency ----------------------------
EXTENDS Integers

VARIABLES requests

Requests == 0..2

Init == requests = [r \in Requests |-&amp;gt; &quot;idle&quot;]

SendRequest(r) == requests' = [requests EXCEPT ![r] = &quot;fetching&quot;]

RecvResponse(r) == requests' = [requests EXCEPT ![r] = &quot;done&quot;]

SendReq == \E r \in Requests: requests[r] = &quot;idle&quot; /\ SendRequest(r)

RecvResp == \E r \in Requests: requests[r] = &quot;fetching&quot; /\ RecvResponse(r)

Terminate == \A r \in Requests: requests[r] = &quot;done&quot; /\ UNCHANGED requests

Next == SendReq \/ RecvResp \/ Terminate

=============================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even if you never use TLA+, the mental model presented here can help understand software at a more fundamental level. Kudos to the Tomorrow Corporation team for an inspiring set of tools that I hope pushes people to think about determinism more.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;\s, but it actually is a good learning tool and proxy for most interactive applications &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="plt" /><category term="formal_methods" /><category term="philosophy" /><summary type="html">Recently, Tomorrow Corporation released this video of their in-house tech stack doing some truly awesome time-travel debugging of a production-quality game. You should watch this video, even if you don’t read this post, because the workflow that they’ve created is really inspiring. The creator kept bringing up the fact that the reason their tools can do this is that they have determinism baked into them at the very foundational levels. You simply can’t bolt this on at higher levels in the stack. This got me thinking - not only do we rarely have this level of control in our projects, but I think it’s rare to even understand how determinism is possible in modern systems that are interactive, concurrent, and distributed. If we don’t understand this, we can’t ever move our tools toward determinism, which I think is a very good idea. It turns out that even if we can’t predict exactly how a program will execute in a specific run, we can still model and reason about it deterministically. This is a prerequisite for most formal methods, and while I understand that formal methods aren’t everyone’s cup of tea, this is the number one thing that I wish more people understood. So today, we won’t be talking about testing or verifying anything, we’ll just be looking to better understand software in general by diving into logical time and how it enables deterministic reasoning.</summary></entry><entry><title type="html">Efficient and Flexible Model-Based Testing</title><link href="/model-based-testing-theory/" rel="alternate" type="text/html" title="Efficient and Flexible Model-Based Testing" /><published>2023-01-31T00:00:00+00:00</published><updated>2023-01-31T00:00:00+00:00</updated><id>/model-based-testing-theory</id><content type="html" xml:base="/model-based-testing-theory/">&lt;p&gt;In &lt;a href=&quot;/model-based-testing/&quot;&gt;Property-Based Testing Against a Model of a Web Application&lt;/a&gt;, we built a web application and tested it against an executable reference model. The model-based test in that post checks sequences of actions against a global system state, which is simple to explain and implement, but is unsuitable for testing practical applications in their entirety. To test the diverse applications that arise in practice, as well as test more surface area of a single application, we’ll need a more efficient and flexible approach.&lt;/p&gt;

&lt;p&gt;In that post, I promised that we’d dive deeper into the theory of model-based testing. To upgrade our testing strategy, we’ll look at the theoretical concepts of &lt;em&gt;refinement mappings&lt;/em&gt;&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; and &lt;em&gt;auxiliary variables&lt;/em&gt;&lt;sup id=&quot;fnref:fn2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, and add in a couple of tweaks based on the specific context of testing. All of this will get applied to &lt;a href=&quot;https://github.com/amw-zero/personal_finance_funcorrect/blob/main/simulation.ts&quot;&gt;a real test of a full-stack application&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;a-quick-recap-of-actions&quot;&gt;A Quick Recap of Actions&lt;/h1&gt;

&lt;p&gt;Understanding the notion of “action” is essential for building our upgraded model-based testing strategy. When we say “action,” we mean something very specific: a transition in a state machine / state transition system, whichever name you prefer. It might be helpful to think of it from a code perspective:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class Counter {
  count: number = 0;

  constructor(count: number) {
    this.count = count;
  }

  increment() {
    this.count += 1;
  }

  decrement() {
    this.count -= 1;
  }
}

let counter = new Counter(0);
counter.increment();
counter.decrement();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;count&lt;/code&gt; is the state variable, and &lt;code&gt;increment&lt;/code&gt; and &lt;code&gt;decrement&lt;/code&gt; are &lt;em&gt;actions&lt;/em&gt; which transition the variable to a new state. Imagine the value of &lt;code&gt;count&lt;/code&gt; after each of these actions.&lt;/p&gt;

&lt;p&gt;The presence of a class has nothing to do with this being an object-oriented concept by the way, it’s just that classes are a convenient wrapper around a set of stateful variables and operations on them, and thus they are a good representation of a state machine. We could just as easily write:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;let count = 0;

function increment(count: number): number {
  return count + 1;
}

function decrement(count: number): number {
  return count - 1;
}

count = increment(count);
count = decrement(count);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These are behaviorally equivalent, which we can convince ourselves of by again imagining the value of the &lt;code&gt;count&lt;/code&gt; state variable after each action. The pattern that we use to talk about state machines is superficial, and has nothing to do with how to structure programs in the large. Don’t let the pattern get in the way of the underlying concepts: all we need are states and transitions between them, and we call these transitions “actions.”&lt;/p&gt;

&lt;p&gt;In an interactive application, actions are generally initiated by the user by clicking on or tapping UI elements. The system itself can trigger actions, for example via cron jobs. Even external systems can trigger actions in the system by calling web APIs.&lt;/p&gt;

&lt;p&gt;Actions are what allow an application to move through different states over time.&lt;/p&gt;

&lt;h1 id=&quot;a-preview-of-our-destination&quot;&gt;A Preview of Our Destination&lt;/h1&gt;

&lt;p&gt;The end goal is to convert our existing &lt;a href=&quot;/model-based-testing/&quot;&gt;model-based test&lt;/a&gt; into one that’s more efficient and allows us to check more interesting properties. To do that, we’re going to end up with something that looks like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;type DeleteRecurringTransactionState = {
  recurringTransactions: RecurringTransaction[];
  id: number;
  db: DBState;
}

class Impl {
  db: DBState;
  client: Client;

  aux: AuxiliaryVariables;

  constructor(db: DBState, client: Client, aux: AuxiliaryVariables) {
    this.db = db;
    this.client = client;
    this.aux = aux;
  }

  async deleteRecurringTransaction(id: number) {
    await this.client.deleteRecurringTransaction(id);
    this.aux.clientModel.deleteRecurringTransaction(id);
  }

  ...
}

type AuxiliaryVariables = {
  clientModel: Budget;
}

function refinementMapping(impl: Impl): Budget {
  let budget = new Budget();
  budget.error = impl.client.error;

  budget.recurringTransactions = [...impl.db.recurring_transactions];
  budget.scheduledTransactions = [...impl.client.scheduledTransactions];

  return budget;
}

Deno.test(&amp;quot;deleteRecurringTransaction&amp;quot;, async (t) =&amp;gt; {  
  let state = /*&amp;lt;generate test state&amp;gt;*/;

  await fc.assert(
    fc.asyncProperty(state, async (state: DeleteRecurringTransactionState) =&amp;gt; {
      let client = new Client();
      client.recurringTransactions = state.recurringTransactions;

      let clientModel = new Budget();
      clientModel.recurringTransactions = state.recurringTransactions;

      let impl = new Impl(state.db, client, { clientModel });
      let model = refinementMapping(impl);

      const cresp = await client.setup(state.db);
      await cresp.arrayBuffer();

      await impl.deleteRecurringTransaction(state.id);
      model.deleteRecurringTransaction(state.id);

      impl.db.recurring_transactions = await client.dbstate();

      let mappedModel = refinementMapping(impl);

      await checkRefinementMapping(mappedModel, model, t);
      await checkImplActionProperties(impl, t);

      await client.teardown();
    }),
    { numRuns: 10, endOnFailure: true }
  );
});&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There’s no way to evaluate if this is a good test or even what exactly it’s testing for without understanding some theory. But all of this theory is in service of testing a real, functional single-page web application.&lt;/p&gt;

&lt;h1 id=&quot;correctness-as-equivalent-behavior-of-action-sequences&quot;&gt;Correctness as Equivalent Behavior of Action Sequences&lt;/h1&gt;

&lt;p&gt;We have to start all the way at the beginning and define what it really means for an implementation to be correct with respect to a model. Action sequences are a good choice for this, because they’re simple to understand. Using our &lt;code&gt;increment&lt;/code&gt; and &lt;code&gt;decrement&lt;/code&gt; functions from above, an example action sequence would be:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type Action = &quot;increment&quot; | &quot;decrement&quot;;

// Combine individual actions into a single top-level action
function counterAction(counter: number, action: Action): number {
  switch (action) {
    case &quot;increment&quot;:
      return increment(counter);
    case &quot;decrement&quot;:
      return decrement(counter);
  }
}

type ActionFunc&amp;lt;S, A&amp;gt; = (state: S, action: A) =&amp;gt; S;

// Generic action sequence evaluation function
function execute&amp;lt;S, A&amp;gt;(actionFunc: ActionFunc&amp;lt;S, A&amp;gt;, init: S, actions: A[]): S {
  let result = init;
  for (const action of actions) {
    result = actionFunc(result, action);
  }

  return result;
}

let counter = 0;
execute(counterAction, counter, [&quot;increment&quot;, &quot;increment&quot;, &quot;decrement&quot;, &quot;increment&quot;]);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;An action sequence is one particular path through a system. Here, we incremented the counter twice, decremented once, and ended with another increment. These are some more valid action sequences:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;[“increment”]&lt;/li&gt;
  &lt;li&gt;[]&lt;/li&gt;
  &lt;li&gt;[“increment”, “decrement”, “decrement”, “decrement”]&lt;/li&gt;
  &lt;li&gt;[“decrement”]&lt;/li&gt;
  &lt;li&gt;[“decrement”, “increment”, “increment”, “decrement”, “decrement”, “increment”, “increment”]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How many possible sequences of actions are there for our simple counter system? 1,000? 500,000,000? Unfortunately, the answer is infinity, and that’s true of all interactive systems. That’s one reason why testing and verification are hard.&lt;/p&gt;

&lt;p&gt;Even though they are infinite, it’s very natural to express the correctness of a model-based system in terms of action sequences using universal quantification, aka “for all” statements:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot; data-lang=&quot;plaintext&quot;&gt;** Holistic correctness statement **:

For all initial states &amp;#39;s&amp;#39;,
  all sequences of actions &amp;#39;acts&amp;#39;,
  a top-level action function &amp;#39;impl&amp;#39;,
  and a top-level action function &amp;#39;model&amp;#39;:
  
  execute(impl, s, acts) = execute(model, s, acts)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Less formally: no matter what sequence of actions you take in the implementation, nor what state it starts in, it should always agree with the model. The key words being “no matter what” and “always” - this should be true of all actions, in any order, from any starting state, ever. In other words, this statement is &lt;em&gt;complete&lt;/em&gt;, and we’ll refer to it as “the holistic correctness statement.” It’s important to keep this statement in mind, since &lt;strong&gt;this is our definition of correctness and our end goal&lt;/strong&gt;, and any optimization that we do always has to tie back to it. (Note: this is also a classic way of expressing &lt;a href=&quot;/refinement/&quot;&gt;refinement&lt;/a&gt;).&lt;/p&gt;
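&lt;p&gt;Using the counter actions from above, the holistic correctness statement translates almost directly into a property-based test. Here’s a sketch with fast-check, assuming some &lt;code&gt;modelAction&lt;/code&gt; counterpart to &lt;code&gt;counterAction&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// A direct (if untenable) rendering of the holistic correctness
// statement: generate initial states and action sequences, and check
// that implementation and model always agree.
fc.assert(
  fc.property(
    fc.integer(),
    fc.array(fc.constantFrom&amp;lt;Action&amp;gt;(&quot;increment&quot;, &quot;decrement&quot;)),
    (s, acts) =&amp;gt; {
      expect(execute(counterAction, s, acts))
        .toEqual(execute(modelAction, s, acts));
    }
  )
);
&lt;/code&gt;&lt;/pre&gt;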

&lt;p&gt;As we hinted at in the introduction, there are some very unfortunate things about this holistic correctness statement in a practical testing context. First is the &lt;code&gt;actions&lt;/code&gt; variable. A real application accepts an infinite stream of actions. Even though we limit our test to finite sequences, combinatorics is just not on our side, with the number of k-length sequences of n actions equaling n^k - a dreadful exponential growth curve. That means that as the number of actions in the systems grows, and as we test longer sequences, the number of possible interleavings of actions grows exponentially. Whatever subset of sequences our test generates is an infinitesimal portion of them all.&lt;/p&gt;

&lt;p&gt;Next is the &lt;code&gt;s&lt;/code&gt; variable. This is the &lt;em&gt;entire&lt;/em&gt; state of the system, and unless we’re building a counter application with a single integer variable it’s way too much data to generate in a test.&lt;/p&gt;

&lt;p&gt;A third problem is that &lt;code&gt;s&lt;/code&gt; is used in both the model and implementation, which means that they both have to have the same state type. This very rarely works, because the whole point of separating the model and implementation is that the implementation is complex and will have additional state to deal with that complexity. States are often incompatible in practice.&lt;/p&gt;

&lt;p&gt;The last straw is that sometimes, you don’t even have the state variables that you need to check for correctness. This sounds weird, but it’s well known that specifications often have to be augmented with “invisible” variables so that certain properties can be shown to hold.&lt;/p&gt;

&lt;p&gt;Each of these problems eventually arises when you try to use model-based testing, and we need some extra machinery to solve them.&lt;/p&gt;

&lt;h1 id=&quot;single-transitions-and-compatible-states-with-refinement-mappings&quot;&gt;Single Transitions and Compatible States with Refinement Mappings&lt;/h1&gt;

&lt;p&gt;Refinement mappings solve problems 1 and 3, and, somewhat magically, still imply the truth of the holistic correctness statement. That means that if we test for a proper refinement mapping, then it’s also true that the implementation correctly implements the model in all possible usage scenarios.&lt;/p&gt;

&lt;p&gt;A refinement mapping is just a function with a couple of special rules, some of which are out of scope for this post. The first rule is that the function is from the implementation state to the model state, e.g. in our preview of the budget app test we can see that the refinement mapping maps the &lt;code&gt;Impl&lt;/code&gt; implementation state type to the &lt;code&gt;Budget&lt;/code&gt; model type:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;function refinementMapping(impl: Impl): Budget {
  ...
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The goal here is to be able to compare the implementation to the model, and if they have different state types we need to translate states in the implementation’s state space to ones in the model’s. On top of this, the most relevant other rule for a valid refinement mapping is that, for all implementation states and actions, the implementation action must be equivalent to the corresponding model action with the refinement mapping applied in the appropriate places. In logic pseudocode:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot; data-lang=&quot;plaintext&quot;&gt;** Correctness via Refinement Mapping ** 
For all implementation states &amp;#39;s&amp;#39;,
  all implementation actions &amp;#39;impl&amp;#39;,
  all model actions &amp;#39;model&amp;#39;
  and a refinement mapping &amp;#39;rm&amp;#39;:

  rm(impl(s)) = model(rm(s))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The intuition for why it works is that, if every single-step action in the implementation agrees with the same action taken in the model, then chaining multiple actions into sequences should preserve that equivalence. This is an example of an inductive argument. The refinement mapping function can be defined in many different ways depending on how we want to relate the two state types, which gives our new correctness statement an important caveat: we consider the system correct &lt;em&gt;under the provided refinement mapping&lt;/em&gt;. This is the price we pay for dealing with state incompatibilities.&lt;/p&gt;
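
&lt;p&gt;Here’s a toy version of that single-transition check, sketched with the fast-check library (an illustrative harness choice - the real budget app test follows below). The implementation carries extra state that the model doesn’t care about, and the refinement mapping simply forgets it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import fc from &quot;fast-check&quot;;

// Toy system: the implementation tracks a click log that the model ignores.
type ImplState = { count: number; clickLog: string[] };
type ModelState = { count: number };

const implIncrement = (s: ImplState): ImplState =&amp;gt;
  ({ count: s.count + 1, clickLog: [...s.clickLog, &quot;increment&quot;] });
const modelIncrement = (s: ModelState): ModelState =&amp;gt; ({ count: s.count + 1 });

// The refinement mapping forgets the implementation-only state.
const rm = (s: ImplState): ModelState =&amp;gt; ({ count: s.count });

// Check rm(impl(s)) = model(rm(s)) on generated implementation states.
fc.assert(
  fc.property(
    fc.record({ count: fc.integer(), clickLog: fc.array(fc.string()) }),
    (s) =&amp;gt; rm(implIncrement(s)).count === modelIncrement(rm(s)).count
  )
);
&lt;/code&gt;&lt;/pre&gt;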

&lt;p&gt;In our budget app test, the refinement mapping is defined as follows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;function refinementMapping(impl: Impl): Budget {
  let budget = new Budget();
  budget.error = impl.client.error;

  budget.recurringTransactions = [...impl.db.recurring_transactions];
  budget.scheduledTransactions = [...impl.client.scheduledTransactions];

  return budget;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;code&gt;Impl&lt;/code&gt; implementation type has both database (&lt;code&gt;impl.db&lt;/code&gt;) and client states (&lt;code&gt;impl.client&lt;/code&gt;), reflecting the independent states in a client-server application. In this system, only recurring transactions are persisted, and scheduled transactions are derived data. Because of this, the implementation’s recurring transactions in the database map to the model’s recurring transactions, whereas the implementation’s scheduled transactions in the client map to the model’s scheduled transactions. Any error in the client maps to an error in the model. Notably, this is talking about &lt;em&gt;system&lt;/em&gt; errors, i.e. errors / results in the domain logic. The model has no notion of networking, so networking errors can be stored separately, but they don’t map to any model state&lt;sup id=&quot;fnref:fn3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;The meat of the test is where we compare single actions, and in order to do this we make the states compatible by applying the refinement mapping:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;...

let impl = new Impl(state.db, client, { clientModel });
let model = refinementMapping(impl);

...

// Run the action in the implementation and the model
await impl.deleteRecurringTransaction(state.id);
model.deleteRecurringTransaction(state.id);

...

let mappedModel = refinementMapping(impl);

await checkRefinementMapping(mappedModel, model, t);
&lt;/code&gt;&lt;/pre&gt;
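
&lt;p&gt;The excerpt doesn’t show &lt;code&gt;checkRefinementMapping&lt;/code&gt; itself. Given how it’s called, a minimal version (a sketch of its likely shape, not the test’s actual code) is just a deep equality between the mapped implementation state and the model:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import { assertEquals } from &quot;https://deno.land/std/testing/asserts.ts&quot;;

// Sketch: after both actions have run, the refinement-mapped implementation
// state should deep-equal the model state.
async function checkRefinementMapping(mapped: Budget, model: Budget, t: Deno.TestContext) {
  await t.step(&quot;refinement mapping holds&quot;, () =&amp;gt; assertEquals(mapped, model));
}
&lt;/code&gt;&lt;/pre&gt;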

&lt;p&gt;The combination of comparing single transitions and converting between implementation and model state types is an efficiency and flexibility win. We’ve gone from potentially long sequences of actions to comparing simple function calls, we only need to generate a single state value per test iteration, &lt;em&gt;and&lt;/em&gt; we can compare the states of the implementation and model even if they aren’t the same type.&lt;/p&gt;

&lt;p&gt;It’s great progress, but we can do even better.&lt;/p&gt;

&lt;h1 id=&quot;from-global-to-local-state&quot;&gt;From Global to Local State&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;s&lt;/code&gt; variable in our new iteration of the correctness statement is still the global state, but an observation comes to mind: how much of the global state is necessary for each action? There’s no equation which answers this question directly, but intuitively, an action will only ever operate on a small subset of the global state, leaving the rest unchanged. We can then just ignore that superfluous state and think of the action as operating on its own, local state. This is not related to refinement mapping, or any other theory that I know of (though it might relate to one that I don’t know of), but it ends up being a very useful optimization in practice.&lt;/p&gt;

&lt;p&gt;For example, consider an oddly-specific system for point translation:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type Point = {
  x: number;
  y: number;
}

function translateX(point: Point, delta: number): Point {
  const result = { ...point };
  result.x += delta;

  return result;
}

function translateY(point: Point, delta: number): Point {
  const result = { ...point };
  result.y += delta;

  return result;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;translateX&lt;/code&gt; and &lt;code&gt;translateY&lt;/code&gt; are actions which operate on a &lt;code&gt;Point&lt;/code&gt; type, but each only modifies a single part of the state - only &lt;code&gt;x&lt;/code&gt; or &lt;code&gt;y&lt;/code&gt; of the &lt;code&gt;Point&lt;/code&gt;, but never both. Why, then, do we need to generate a full &lt;code&gt;Point&lt;/code&gt; value in our test for comparing them? We can instead construct a new action function, say &lt;code&gt;translateOnlyX&lt;/code&gt;, which only operates on the data that it actually modifies:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;function translateOnlyX(x: number, delta: number): number {
  return x + delta;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the model-based testing context, instead of comparing the functions at the global state level (&lt;code&gt;Point&lt;/code&gt; in this case), we can compare the actions at the local level:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot; data-lang=&quot;plaintext&quot;&gt;** Local Refinement Mapping Correctness Statement **

For all action functions &amp;#39;impl&amp;#39;,
  all action functions &amp;#39;model&amp;#39;,
  all local states &amp;#39;ls&amp;#39;,
  and a refinement mapping &amp;#39;rm&amp;#39;:
  
  rm(impl(ls)) = model(rm(ls))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
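
&lt;p&gt;Instantiated for the point example, the local state is a single number, and since the model and implementation share that type, the refinement mapping is just the identity. A quick property check with fast-check (again, an illustrative harness choice) looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import fc from &quot;fast-check&quot;;

// The model&#39;s version of the local action: plain addition.
const modelTranslateOnlyX = (x: number, delta: number): number =&amp;gt; x + delta;

// Local-level check: translateOnlyX (defined above) must agree with the
// model on every generated local state; rm is the identity here.
fc.assert(
  fc.property(fc.integer(), fc.integer(), (x, delta) =&amp;gt;
    translateOnlyX(x, delta) === modelTranslateOnlyX(x, delta)
  )
);
&lt;/code&gt;&lt;/pre&gt;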

&lt;p&gt;Breaking out the action implementation in this way has no behavioral effect on the global-level &lt;code&gt;translateX&lt;/code&gt; function, since &lt;code&gt;translateX&lt;/code&gt; can easily be implemented in terms of &lt;code&gt;translateOnlyX&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;function translateX(point: Point, delta: number): Point {
  const result = { ...point };
  result.x = translateOnlyX(result.x, delta);

  return result;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And this is exactly what’s going on in our upgraded budget test. In our excerpt, we’re only focusing on the &lt;code&gt;deleteRecurringTransaction&lt;/code&gt; action, and we generate a test state specific to this action:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;type DeleteRecurringTransactionState = {
  recurringTransactions: RecurringTransaction[];
  id: number;
  db: DBState;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Deleting a recurring transaction doesn’t interact in any way with the &lt;code&gt;scheduledTransactions&lt;/code&gt; state variable in that application, so we can leave that out of the test state for this particular action.&lt;/p&gt;
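
&lt;p&gt;Generating that action-local state is also cheap. Here’s a sketch of a fast-check arbitrary for it, using simplified stand-ins for the &lt;code&gt;RecurringTransaction&lt;/code&gt; and &lt;code&gt;DBState&lt;/code&gt; shapes (the real definitions live in the app’s code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import fc from &quot;fast-check&quot;;

// Simplified stand-in shape for a recurring transaction.
const recurringTransactionArb = fc.record({
  id: fc.integer({ min: 1 }),
  name: fc.string(),
  amount: fc.integer(),
});

// Only the state that deleteRecurringTransaction actually touches is generated.
const deleteRecurringTransactionStateArb = fc.record({
  recurringTransactions: fc.array(recurringTransactionArb),
  id: fc.integer({ min: 1 }),
  db: fc.record({ recurring_transactions: fc.array(recurringTransactionArb) }),
});
&lt;/code&gt;&lt;/pre&gt;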

&lt;p&gt;The end result of this is that we can get global guarantees at the cost of local checking, i.e. we can use local states and still show the holistic correctness statement.&lt;/p&gt;

&lt;h1 id=&quot;one-more-wrinkle&quot;&gt;One More Wrinkle&lt;/h1&gt;

&lt;p&gt;One last wrinkle presents itself - the notorious problem number 4. It may sound counterintuitive, but there are both refinement mappings and properties of our systems that are not expressible with the state variables of the system itself. Even if they are, they may be more naturally expressed by adding &lt;em&gt;auxiliary variables&lt;/em&gt;. Auxiliary variables are additional variables that are added to a program (usually the implementation) that don’t affect the behavior of the program, but can be used to state properties or aid in a refinement mapping to a model.&lt;/p&gt;

&lt;p&gt;Auxiliary variables provide one solution to a problem in the budget app test, and for tests for client-server applications in general. Our implementation is both the state component of a single-page application, and the corresponding server and database. One implication of that is that the client and database state can become out of sync. Consider the following action sequence:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The database starts with these recurring transactions: [rt1, rt2, rt3].&lt;/li&gt;
  &lt;li&gt;User 1 loads the home page - its client holds [rt1, rt2, rt3]&lt;/li&gt;
  &lt;li&gt;User 2 loads the home page - its client holds [rt1, rt2, rt3]&lt;/li&gt;
  &lt;li&gt;User 2 deletes rt2 - its client now holds [rt1, rt3], and the database holds [rt1, rt3]&lt;/li&gt;
  &lt;li&gt;User 1 adds a new recurring transaction, rt4 - its client holds [rt1, rt2, rt3, rt4] and the database holds [rt1, rt3, rt4].&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the end of these actions, the system has the following state:&lt;/p&gt;

&lt;p&gt;User 1’s client: [rt1, rt2, rt3, rt4]&lt;br /&gt;
User 2’s client: [rt1, rt3]&lt;br /&gt;
The database: [rt1, rt3, rt4]&lt;/p&gt;

&lt;p&gt;Again, there are a few different ways to approach either allowing or disallowing this behavior. One option is to just forbid differences in client values, but this would require something like a WebSocket connection to update all clients on each data write. While some applications actually do this (like chat applications), I would say that most don’t. Instead, we have to allow diverging client states, but we still want to do that in a controlled manner.&lt;/p&gt;

&lt;p&gt;Well, one solution is to add a separate model instance as an auxiliary variable on the implementation, one which serves as the source of truth for the client’s state alone. Then, whenever a write occurs, we double-write to the implementation and this client model. Again, there are many patterns for doing this, but I like wrapping the implementation (&lt;code&gt;Client&lt;/code&gt; here) in a new class with the same interface that forwards actions to the relevant members; this way, the structure of the test doesn’t have to change, and we keep all of the auxiliary variables in test-specific code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;class Impl {
  db: DBState;
  client: Client;

  aux: AuxiliaryVariables;

  constructor(db: DBState, client: Client, aux: AuxiliaryVariables) {
    this.db = db;
    this.client = client;
    this.aux = aux;
  }

  async deleteRecurringTransaction(id: number) {
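    // Double-write: perform the real client action, then mirror it in the client model.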
    await this.client.deleteRecurringTransaction(id);
    this.aux.clientModel.deleteRecurringTransaction(id);
  }

  ...
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
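
&lt;p&gt;The &lt;code&gt;AuxiliaryVariables&lt;/code&gt; type itself isn’t shown in the excerpt. Based on how it’s used, a minimal definition (my guess at its shape) only needs the client model:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Sketch based on usage in the excerpt: the only auxiliary variable so far
// is a model instance tracking the expected client state.
type AuxiliaryVariables = {
  clientModel: Budget;
};
&lt;/code&gt;&lt;/pre&gt;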

&lt;p&gt;In the test excerpt, we see another assertion named &lt;code&gt;checkImplActionProperties&lt;/code&gt;&lt;sup id=&quot;fnref:fn4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, and its definition will now make sense:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;async function checkImplActionProperties(impl: Impl, t: Deno.TestContext) {
  await t.step(&amp;quot;loading is complete&amp;quot;, () =&amp;gt; assertEquals(impl.client.loading, false));

  await t.step(&amp;quot;write-through cache: client state reflects client model&amp;quot;,
    () =&amp;gt; assertEquals(impl.client.recurringTransactions, impl.aux.clientModel.recurringTransactions)
  );
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;After each action has been invoked, we check that the actual state of the client matches the state of the &lt;em&gt;client&lt;/em&gt; model, not the system model, which is only aware of the database state. We also check that the loading variable in the client is false for good measure, ensuring that any spinners or other loading UI are hidden at the end of every action.&lt;/p&gt;

&lt;p&gt;The key here is that, as long as they don’t affect the behavior of the implementation, we can add any auxiliary variables we want for tracking &lt;em&gt;additional&lt;/em&gt; information. Once we have them, we can use them for test assertions, totally independent of the implementation that runs in production. They’re test-only code.&lt;/p&gt;

&lt;p&gt;I’m going to be honest - I can have too much fun with auxiliary variables, and that means that we should be careful with them. They are basically a cheat code, and can be used as an escape hatch to get out of all kinds of situations. That being said, they’re sometimes the most elegant solution to a problem, and they’re a key piece in making our test flexible enough to handle the many scenarios that arise in practice. If anything becomes difficult to assert on or express as a property, we can try to make it easier by adding new auxiliary variables.&lt;/p&gt;

&lt;h1 id=&quot;recap&quot;&gt;Recap&lt;/h1&gt;

&lt;p&gt;Alrighty. We went over four main problems and solutions to them:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Action sequences&lt;/li&gt;
  &lt;li&gt;Global state&lt;/li&gt;
  &lt;li&gt;State incompatibility&lt;/li&gt;
  &lt;li&gt;Inexpressible properties&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We introduced refinement mappings, which are functions from the implementation state to the model state, and which require that single transitions in the implementation and model must be equivalent under this mapping. This overcomes both state incompatibility and the need for action sequences. We showed that by using action-local state we can avoid ever constructing global system state in the test. And we showed that if we’re ever unable to express a property about our system, we can always add auxiliary variables which don’t affect the system behavior but track additional information that we can use in test assertions.&lt;/p&gt;

&lt;p&gt;What we ended up with is a framework for writing model-based tests that is both efficient and flexible, and applicable to real-world systems like database-backed web applications.&lt;/p&gt;

&lt;p&gt;The linked papers have plenty more theoretical background and examples for deeper dives on these topics.&lt;/p&gt;

&lt;h1 id=&quot;thanks&quot;&gt;Thanks&lt;/h1&gt;

&lt;p&gt;Big thanks to &lt;a href=&quot;https://www.hillelwayne.com&quot;&gt;Hillel Wayne&lt;/a&gt; for having an in-depth conversation about refinement with me, which influenced my thinking about how to best define the system state for a client-server application.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I recommend reading &lt;a href=&quot;https://www.microsoft.com/en-us/research/publication/the-existence-of-refinement-mappings/&quot;&gt;this paper to get a handle on refinement mappings&lt;/a&gt;. Another name for this technique is &lt;em&gt;simulation&lt;/em&gt;, which you can see an example of in &lt;a href=&quot;https://doclsf.de/papers/klein_sw_10.pdf&quot;&gt;how seL4 proves that the implementation implements its functional specification&lt;/a&gt;. Both are the same ultimate idea - prove that one program implements another by showing that all single transitions in each implement each other. &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We’ll expand on what auxiliary variables are throughout the post, but you can read more about them &lt;a href=&quot;https://lamport.azurewebsites.net/tla/hiding-and-refinement.pdf&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/auxiliary.pdf&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:fn2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Errors that can be present in the implementation but not the model are an interesting topic. For example, if a network error occurs in a request during the course of an action in the implementation, then the implementation certainly won’t complete the action in a way that implements the model. One option is to be liberal, and simply avoid comparing the model and implementation in this case. We didn’t cover stuttering here, but models are allowed to stutter (transition to the current state) during implementation steps, so an implementation error could be interpreted as a model stutter. The issue is, if the network error happens on every single action invocation, the implementation will never match the non-stuttering step of the model. The other option is to be harsh, and require that there are no network errors in tests, but still plan for them and allow them in production. The current version of this test chooses to be harsh. I’ll let you know how that goes. &lt;a href=&quot;#fnref:fn3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.hillelwayne.com/post/action-properties/&quot;&gt;Action properties&lt;/a&gt; are a subset of temporal properties. They allow you to assert things about state transitions that you couldn’t assert about individual states. They’re very useful. &lt;a href=&quot;#fnref:fn4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="formal_methods" /><summary type="html">In Property-Based Testing Against a Model of a Web Application, we built a web application and tested it against an executable reference model. The model-based test in that post checks sequences of actions against a global system state, which is simple to explain and implement, but is unsuitable for testing practical applications in their entirety. To test the diverse applications that arise in practice, as well as test more surface area of a single application, we’ll need a more efficient and flexible approach. In that post, I promised that we’d dive deeper into the theory of model-based testing. To upgrade our testing strategy, we’ll look at the theoretical concepts of refinement mappings1 and auxiliary variables2, and add in a couple of tweaks based on the specific context of testing. All of this will get applied to a real test of a full-stack application. I recommend reading this paper to get a handle on refinement mappings. Another name for this technique is simulation, which you can see an example of in how seL4 proves that the implementation implements its functional specification. Both are the same ultimate idea - prove that one program implements another by showing that all single transitions in each implement each other. &amp;#8617; We’ll expand on what auxiliary variables are throughout the post, but you can read more about them here and here. &amp;#8617;</summary></entry></feed>