<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.2.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" hreflang="en" /><updated>2025-04-13T01:36:15+00:00</updated><id>/feed.xml</id><title type="html">Concerning Quality</title><author><name>Alex Weisberger</name></author><entry><title type="html">Bug Bash 2025 Conference Experience</title><link href="/bug-bash-2025/" rel="alternate" type="text/html" title="Bug Bash 2025 Conference Experience" /><published>2025-04-12T00:00:00+00:00</published><updated>2025-04-12T00:00:00+00:00</updated><id>/bug-bash-2025</id><content type="html" xml:base="/bug-bash-2025/">&lt;p&gt;The inaugural &lt;a href=&quot;https://bugbash.antithesis.com/#about&quot;&gt;Bug Bash conference&lt;/a&gt; was really special. I’ve been to many conferences, but this was legitimately the first that I felt “a part of,” because the subject matter greatly overlapped with what I’m interested in and what I write about here. There are various combinations of testing conferences, devops conferences, and formal methods conferences, sure, but this still felt like a new stake in the ground. Possibly because of the undeniable connection to deterministic simulation testing, or possibly because it just consisted of a bunch of people on a similar wavelength at the moment. But I’ve personally never been in a room where almost every single person raised their hand when a speaker asked: “who’s familiar with property-based testing?” So it certainly felt like something interesting was in the air.&lt;/p&gt;

&lt;p&gt;I left feeling more than ever that generative / autonomous testing is the future, &lt;a href=&quot;/generated-tests/&quot;&gt;even though I don’t need much encouragement there&lt;/a&gt;. There was a broader message though: if we really want correctness and reliability in the presence of radical software complexity, we should be open to &lt;em&gt;everything&lt;/em&gt;. From formal verification to testing in prod. From hand-crafted unit test cases to fault-injected end-to-end tests. From sprinkling a few asserts around our codebase to building new infrastructure components to better support deterministic testing. Every approach has different assumptions, tradeoffs, and strengths, so we had better start breaking down the walls between separate and even historically at-odds communities in order to elevate the correctness and reliability of our systems.&lt;/p&gt;

&lt;p&gt;That’s a message I can get behind, and that’s the reason I left feeling rejuvenated, and dare I say inspired.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The keynote given by Will Wilson set the stage for the conference as a whole, and perfectly introduced this overarching message. After diving into the question of “why do we have bugs to begin with?”, he posited that testing, verification, and observability are fundamentally not at odds, but rather variations on a theme. At the extreme, each of these seems totally incompatible with the others. Testing can be worlds away from formal verification, because the very act of writing a test is an admission that we can’t model the real execution environment with enough precision. Observability even further distances itself from testing by entirely punting on input data generation and allowing that to be handled by the natural operation of the production system.&lt;/p&gt;

&lt;p&gt;The line becomes totally blurred with small variations though, like going from hand-written to generated test inputs. This is much closer to user-generated inputs in prod because we don’t know exactly which inputs will be produced. Observability monitors are also just a generalization of test assertions, and formal properties generalize them both. In the world of generated inputs, testing, observability, and formal methods all start to blend together. Testing and observability particularly blend together when it comes to tracking down the actual cause of a generated test failure.&lt;/p&gt;

&lt;p&gt;Cue &lt;a href=&quot;https://antithesis.com/product/what_is_antithesis/&quot;&gt;the Antithesis tool&lt;/a&gt;, which is not “just” a deterministic hypervisor, but truly a test execution and analysis platform. Will showed a demo of using it to &lt;a href=&quot;https://github.com/etcd-io/etcd/issues/18667&quot;&gt;find a bug in etcd&lt;/a&gt; by querying over execution histories that the tool stores. He coined this workflow “pre-observability”: the idea that we can take the same root cause analysis techniques from post-deployment observability and apply them to the massive amount of execution traces produced by simulated system actions.&lt;/p&gt;

&lt;p&gt;On top of this, the bug was found much more regularly in the test environment due to fault injection techniques, highlighting one of the tradeoffs of observability vs. testing: sometimes a failure scenario is rare enough in production that it’s inefficient to sit and wait for it to happen. Fault injection speeds up the bug-finding process by triggering rare scenarios more frequently. It also highlights the unifying nature of formal methods: fault injection has always seemed to me &lt;a href=&quot;/prophecy-variables/&quot;&gt;a practical manifestation of prophecy variables&lt;/a&gt;, which were invented precisely to deal with situations like nondeterministic failures in distributed systems.&lt;/p&gt;

&lt;p&gt;All in this one talk, we spanned testing, observability, formal methods, and tied it together with a ribbon of determinism. I knew then that this was gonna be a good one.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Now, I’ll describe the overall &lt;em&gt;feel&lt;/em&gt; and themes of the conference, since you’ll be able to watch all the talks once they’re posted (which I recommend that you do).&lt;/p&gt;

&lt;h2 id=&quot;challenge-the-status-quo&quot;&gt;Challenge the Status Quo&lt;/h2&gt;

&lt;p&gt;One big theme I heard throughout the talks was that we should absolutely challenge the status quo. The obvious example of this is Antithesis itself. I, along with everyone else in the programming world, have been complaining about flaky tests for years. But what I have not done is &lt;strong&gt;write a deterministic hypervisor&lt;/strong&gt; to simply avoid the problem at its root.&lt;/p&gt;

&lt;p&gt;Generative testing is also inherently in opposition to the current mainstream state of quality techniques. In his talk about the adoption of the &lt;a href=&quot;https://hypothesis.readthedocs.io/en/latest/&quot;&gt;Hypothesis property-based testing library&lt;/a&gt;, Zac Hatfield-Dodds mentioned that only 5% of Python users use Hypothesis according to their measurements. When most people think of checking for functional correctness, they think of lots and lots of hand-written example-based tests running in CI, with maybe some linting or static type checking layered on top. They typically don’t think in terms of properties and generating inputs.&lt;/p&gt;
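&lt;p&gt;To make that contrast concrete, here’s roughly what the shift looks like. Hypothesis is a Python library, but the same idea is easy to sketch with fast-check, its TypeScript cousin (an illustrative sketch, not Hypothesis’s API):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as fc from 'fast-check';

// Example-based: one hand-picked input and its expected output.
expect([3, 1, 2].sort((a, b) =&amp;gt; a - b)).toEqual([1, 2, 3]);

// Property-based: a claim checked against many generated inputs.
fc.assert(
  fc.property(fc.array(fc.integer()), (xs) =&amp;gt; {
    const sorted = [...xs].sort((a, b) =&amp;gt; a - b);
    for (let i = 1; i &amp;lt; sorted.length; i++) {
      expect(sorted[i - 1]).toBeLessThanOrEqual(sorted[i]);
    }
  })
);
&lt;/code&gt;&lt;/pre&gt;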

&lt;p&gt;Zac remains convinced that testing properties is how we &lt;em&gt;should&lt;/em&gt; be testing (I agree), so rather than just give up, he shared his overall view on why people don’t adopt it as a practice. His message was that we should primarily focus on the human aspect of property-based testing, for example by making the tests easier to write and improving their error messages. On the easier writing front, &lt;a href=&quot;https://hypothesis.readthedocs.io/en/latest/reference/integrations.html#ghostwriter&quot;&gt;Hypothesis added a ‘write’ command&lt;/a&gt; that can help bootstrap tests for you. We have a ways to go there, but I appreciated that the maintainer of such a prominent library was taking a step back and really analyzing the situation.&lt;/p&gt;

&lt;p&gt;The most interesting manifestation of this theme though was in Kyle Kingsbury’s talk on the most recent Jepsen analyses. He was describing the analysis done on Datomic, where he encountered a surprising behavior that didn’t really fall into any of the existing terms we have for transaction anomalies. An eerie vibe came over the room when he suggested that we may have to create new terms for situations like these. If aphyr himself, one of the only people I know of who can &lt;a href=&quot;https://jepsen.io/consistency/phenomena&quot;&gt;tell the difference between G1a and G2-item phenomena&lt;/a&gt;, doesn’t have the words to describe a distributed system scenario, what the heck are we supposed to do?&lt;/p&gt;

&lt;p&gt;This may be a more personal revelation, but that moment made me realize: we are the adults in the room now. It’s not enough to just rehash things that Leslie Lamport or Barbara Liskov discovered 30-50 years ago. We need to be the ones doing new research, building new tools, or thinking of how we can create infrastructure that allows us to gain more control over our software. If we’re unhappy with the state of the world, we wield the power to change it. And many people are actively working on this.&lt;/p&gt;

&lt;h2 id=&quot;test-end-to-end&quot;&gt;Test End-to-End&lt;/h2&gt;

&lt;p&gt;Another common theme was end-to-end testing. End-to-end testing can be a dirty word in some circles, but this group of speakers went all-in on it. Mitchell Hashimoto didn’t get to true end-to-end testing until the end of his talk about making hard-to-test code testable, but he gave a great variety of advice on applying the &lt;a href=&quot;https://www.destroyallsoftware.com/screencasts/catalog/functional-core-imperative-shell&quot;&gt;functional-core-imperative-shell&lt;/a&gt; style of design to a codebase. This enables tests of interacting components, in this case stopping only at actual GPU instruction execution. This approach implies that mocking should only be done at strategic system boundaries, which I think is fantastic advice in general.&lt;/p&gt;

&lt;p&gt;But then he went on to talk about full end-to-end testing via &lt;a href=&quot;https://nixos.org/manual/nixos/stable/index.html#sec-nixos-tests&quot;&gt;NixOS VM testing&lt;/a&gt; for the final bits that you just don’t want to abstract away. This was actually the first time I heard about Nix’s VM testing, and this looks like a great tool for anyone plagued by the inconsistency of e2e test infrastructure management. I’m definitely going to give it a further look.&lt;/p&gt;

&lt;p&gt;Stephanie Wang spoke about all of the reliability lessons she learned &lt;a href=&quot;https://motherduck.com/&quot;&gt;while building MotherDuck&lt;/a&gt;, and someone from the audience asked how some of these were verified. She replied that they performed lots of chaos testing using a mock network interface for controllability, of course. When she spoke about minimizing data movement via caching as a win for both reliability and performance, I couldn’t think of a unit test that would give any kind of confidence about that. And this was the main assertion in Ben Egger’s talk about testing in prod at OpenAI, which is the most extreme form of end-to-end testing: no matter how well you model your system, prod is the concrete instantiation of it, and you shouldn’t ignore the ways that all of your components interact in the production setting.&lt;/p&gt;

&lt;p&gt;End-to-end testing is the perfect example of where testing and observability have a lot in common. The further toward prod your test moves, the more you have to worry about collecting information from hard-to-reach infrastructure components, and the more the semantics of these components influence system behavior. This part about hard-to-reproduce semantics is the whole reason Jepsen takes the end-to-end testing approach (“as God intended it” as Kyle likes to say). Because unit tests are great and all, but will they catch anomalies &lt;a href=&quot;https://concerningquality.com/txn-isolation-testing/&quot;&gt;caused by weak transaction isolation?&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;formal-methods-push-the-limits-of-testing&quot;&gt;Formal Methods Push The Limits of Testing&lt;/h2&gt;

&lt;p&gt;This was my personal favorite theme because it so perfectly puts my intuition into words. This was spoken almost verbatim by Ankush Desai in his talk about formal and semi-formal methods at AWS. His point was that the value of formal techniques isn’t limited to actual formal verification. The logical approach that formal methods take can be used to come up with testing techniques as well.&lt;/p&gt;

&lt;p&gt;This to me is the sweet spot of formal methods, at least in today’s landscape. Formality gives us the framework and strategy for figuring out what exactly we should be looking for and how we should think about systems, but we can use tests in place of proofs when it comes to the checking part. We sacrifice the completeness of the checking in the name of practicality and efficiency: generative tests can’t &lt;em&gt;prove&lt;/em&gt; a property, but they provide a much higher level of confidence than a few hand-crafted example scenarios. This is an idea that others have shed light on as well: the Cogent sub-project of seL4 wrote a paper about using &lt;a href=&quot;https://trustworthy.systems/publications/papers/Chen_ROSKHK_22.pdf&quot;&gt;property-based tests as an intermediary on their way to proofs&lt;/a&gt; in the verification of a filesystem implementation.&lt;/p&gt;

&lt;p&gt;In Ankush’s talk, he introduced PObserve, a framework for checking a production system against a specification written in &lt;a href=&quot;https://github.com/p-org/P&quot;&gt;P, a language that he created&lt;/a&gt; and that is in use within AWS. Instead of using the spec to prove the implementation correct, it takes logs from the real system and checks that they adhere to the specification. This is similar to the model-based tests that many property-based testing libraries support, but it instead takes the observability-inspired approach of checking execution traces extracted from the actual running system. It also reminds me of &lt;a href=&quot;https://docs.tracetest.io/concepts/what-is-trace-based-testing&quot;&gt;Tracetest&lt;/a&gt;.&lt;/p&gt;
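&lt;p&gt;To give a rough feel for the idea (an illustrative TypeScript sketch, not P or PObserve’s actual API): a specification can be reduced to a checker that replays logged events against the spec’s state machine and flags the first divergence.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Illustrative only: check production log events against an executable spec,
// here a simple mutual-exclusion invariant over a lock service.
type LockEvent = { op: 'acquire' | 'release'; client: string };

function checkTrace(events: LockEvent[]): string | null {
  let holder: string | null = null; // spec state: at most one holder
  for (let i = 0; i &amp;lt; events.length; i++) {
    const e = events[i];
    if (e.op === 'acquire') {
      if (holder !== null) return `violation at event ${i}: lock already held`;
      holder = e.client;
    } else {
      if (holder !== e.client) return `violation at event ${i}: release by non-holder`;
      holder = null;
    }
  }
  return null; // the trace conforms to the spec
}
&lt;/code&gt;&lt;/pre&gt;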

&lt;p&gt;This was another case of the trifecta of testing, observability, and formal methods all working together, this time with a focus on &lt;a href=&quot;/model-based-testing/&quot;&gt;validating implementation behavior against a model&lt;/a&gt;. We sometimes refer to such approaches as “lightweight formal methods,” and this is the area that I see being most likely to be implemented in a practical setting.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;On top of the actual talks, there was a very social vibe to the conference in general, which is a hard thing to fake. I got the sense that there’s very much an appetite for the particular brand of autonomous testing, lightweight formal methods, and observability techniques being presented there. Overall, it was a great experience, and really galvanized and clarified ideas that I’ve been mulling over for a while now. Thank you to Antithesis for shepherding this conversation and getting everyone together under one roof to contribute to it. If there’s a Bug Bash 2026, I will certainly be first in line for a ticket.&lt;/p&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="observability" /><category term="reliability" /><summary type="html">The inaugural Bug Bash conference was really special. I’ve been to many conferences, but this was legitimately the first that I felt “a part of,” because the subject matter greatly overlapped with what I’m interested in and what I write about here. There are various combinations of testing conferences, devops conferences, and formal methods conferences, sure, but this still felt like a new stake in the ground. Possibly because of the undeniable connection to deterministic simulation testing, or possibly because it just consisted of a bunch of people on a similar wavelength at the moment. But I’ve personally never been in a room where almost every single person raised their hand when a speaker asked: “who’s familiar with property-based testing?” So it certainly felt like something interesting was in the air.</summary></entry><entry><title type="html">Branch Coverage Won’t Prove The Collatz Conjecture</title><link href="/collatz-conjecture/" rel="alternate" type="text/html" title="Branch Coverage Won’t Prove The Collatz Conjecture" /><published>2025-01-26T00:00:00+00:00</published><updated>2025-01-26T00:00:00+00:00</updated><id>/collatz-conjecture</id><content type="html" xml:base="/collatz-conjecture/">&lt;p&gt;The Collatz conjecture is the prime example of the limitations of thinking in terms of branch coverage. It can be written as a recursive function in 5 lines of code with only three branches. That’s great, except we have no idea if it’s true or not, and no amount of testing can prove it either way.&lt;/p&gt;

&lt;p&gt;Here’s the code for generating the Collatz process:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;function collatz(n: number): boolean {
    if (n === 1) {
        return true;
    }

    return n % 2 === 0 ? collatz(n / 2) : collatz(3 * n + 1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It’s quite simple, by almost any metric. Just a couple of conditionals, and some plain arithmetic. The conjecture is that this &lt;em&gt;always&lt;/em&gt; returns true: no matter the starting number, all paths should end at 1, says Collatz.&lt;/p&gt;
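&lt;p&gt;Written as a recurrence, the conjecture says that iterating the following map from any positive integer eventually reaches 1:&lt;/p&gt;

\[T(n) = \begin{cases} n/2 &amp;amp; \text{if } n \text{ is even} \\ 3n + 1 &amp;amp; \text{if } n \text{ is odd} \end{cases}\]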

&lt;p&gt;There are only a few lines of code. Let’s just test all the branches. To make this a little more explicit, let’s unwind the ternary into an if-else:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;function collatz(n: number): boolean {
    if (n === 1) {
        return true;
    }

    if (n % 2 === 0) {
        return collatz(n / 2)
    } else {
        return collatz(3 * n + 1);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We only need 2 test cases to hit all 3 branches: n=2, and n=3. Here’s the sequence of &lt;code&gt;n&lt;/code&gt; values that result in each case, just to get a feel for how the state progresses:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;n=2 -&amp;gt; n=1 ==&amp;gt; true
n=3 -&amp;gt; n=10 -&amp;gt; n=5 -&amp;gt; n=16 -&amp;gt; n=8 -&amp;gt; n=4 -&amp;gt; n=2 -&amp;gt; n=1 ==&amp;gt; true
&lt;/code&gt;&lt;/pre&gt;
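
&lt;p&gt;As actual tests, those two cases might look like this (a sketch, assuming a Jest-style runner):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;describe('collatz', () =&amp;gt; {
  it('covers the base case and the even branch', () =&amp;gt; {
    expect(collatz(2)).toBe(true);
  });

  it('covers the odd branch (and revisits the others)', () =&amp;gt; {
    expect(collatz(3)).toBe(true);
  });
});
&lt;/code&gt;&lt;/pre&gt;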

&lt;p&gt;That was easy. All the branches are covered. There’s just one problem: since it was proposed in the 1930s, the entire math community has been unable to prove or disprove it. We don’t know if this is just a pattern up to some gigantic value of n, after which it breaks down, or if it’s the real deal and we can finally watch it grow up into a real theorem. We simply don’t know for sure if it’s &lt;em&gt;always&lt;/em&gt; true, or even within what bounds it is true. The issue is that the state oscillates. If we could show that every iteration of the recursion produced a smaller value, then we’d be sure that we’ll always get down to 1. But when n is odd, we go &lt;em&gt;up&lt;/em&gt;. The progress is inconsistent. It, pretty surprisingly given its apparent simplicity, completely eludes our species.&lt;/p&gt;

&lt;p&gt;Look back at the above test cases and how they create sequences of &lt;code&gt;n&lt;/code&gt; values. Sequences like this are what software behavior boils down to. A program is really two things: its code, along with the set of all behaviors that it produces. Branch coverage is a statement about the code, but it doesn’t touch the full breadth of the runtime behavior of the program. And the runtime behavior is what determines correctness.&lt;/p&gt;

&lt;p&gt;This is why a tiny little function can lead to an unknowable question. There are lots of numbers, so lots of possible sequences of &lt;code&gt;n&lt;/code&gt;, and in this case the code branches keep getting revisited until the program terminates. That is, &lt;em&gt;if&lt;/em&gt; it terminates.&lt;/p&gt;

&lt;p&gt;Branch coverage gives you a small glimpse of your code’s behavior, but it isn’t enough to prove the Collatz conjecture.&lt;/p&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><summary type="html">The Collatz conjecture is the prime example of the limitations of thinking in terms of branch coverage. It can be written as a recursive function in 5 lines of code with only three branches. That’s great, except we have no idea if it’s true or not, and no amount of testing can prove it either way.</summary></entry><entry><title type="html">Simulating Some Queues</title><link href="/queue-simulations/" rel="alternate" type="text/html" title="Simulating Some Queues" /><published>2025-01-03T00:00:00+00:00</published><updated>2025-01-03T00:00:00+00:00</updated><id>/queue-simulations</id><content type="html" xml:base="/queue-simulations/">&lt;p&gt;System performance boils down to the timing behavior of various interacting queues. Queues are one of those incredibly simple but powerful concepts, but they have some unintuitive or non-obvious behavior when only thinking about them mathematically. Simulating queueing scenarios gives us a better picture about how queues operate in practice.&lt;/p&gt;

&lt;h1 id=&quot;the-unit-queue&quot;&gt;The Unit Queue&lt;/h1&gt;

&lt;p&gt;Let’s introduce the simplest possible queue as a reference point, which we’ll call the unit queue. Requests arrive once per second, and each request takes one second to process. There’s only one processor that services requests. Here are some quick definitions about the operations of this queue:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Arrival Rate&lt;/strong&gt;: the rate that requests come into the queue. Here, it is 1 / second.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Processing Time&lt;/strong&gt;: the time it takes to process a request. Here, it’s 1 second.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Wait Time&lt;/strong&gt;: the amount of time a request waits after arrival and before processing begins. Here, the wait time for all requests is 0.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;: the total time it takes to process a request after arrival. Here, it’s 1 second.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Active Request&lt;/strong&gt;: a request that’s currently being processed.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Queued Request&lt;/strong&gt;: a request that’s waiting to be processed after arrival.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Queue Length&lt;/strong&gt;: the number of queued requests. Here, it’s always 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This queue is in an equilibrium state: as soon as a request is done being processed, a new one comes in. And before the next one comes in, the current request has enough time to complete. This means that a request never has to wait to be processed, and it begins processing as soon as it comes in. Because of this, the queue length is always 0 and never grows.&lt;/p&gt;

&lt;h1 id=&quot;discrete-event-simulation&quot;&gt;Discrete Event Simulation&lt;/h1&gt;

&lt;p&gt;This won’t be a deep dive into discrete event simulation, but it helps to know a few things about it to understand the data that we’re generating in our simulations. You can read more &lt;a href=&quot;https://simpy.readthedocs.io/en/latest/&quot;&gt;in the SimPy docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The basic idea is that we emit events for system changes, and the system is assumed to be in the same state between events. Because of this, we can “fast forward” time by only considering the events and not waiting for time to pass. It’s another manifestation of the state machine model of a system, only here we can keep track of the duration of each transition instead of only worrying about the states that changed.&lt;/p&gt;

&lt;p&gt;In the case of a queue, we’ll broadcast one event for each of:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;request arrival&lt;/li&gt;
  &lt;li&gt;processing start&lt;/li&gt;
  &lt;li&gt;processing end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With just these events, we can calculate:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;wait time (processing start - request arrival)&lt;/li&gt;
  &lt;li&gt;processing time (processing end - processing start)&lt;/li&gt;
  &lt;li&gt;latency (processing end - request arrival)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also record queue lengths whenever a request arrives. We’ll look at some code in a bit; for now, let’s focus on the behavior that the simulation gives us. Simulating 5 minutes of the unit queue leads to the following graphs:&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/unitqueue.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Visually, equilibrium is a bunch of straight lines. The lines remain straight because the queue length is always 0, so no wait time ever gets introduced. Let’s see what happens when we break this.&lt;/p&gt;

&lt;p&gt;Queue equilibrium relies on the following inequality always being true:&lt;/p&gt;

\[processing\ time \leq interarrival\ time\]
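
&lt;p&gt;Equivalently, in standard queueing notation: with arrival rate λ (requests per second) and mean processing time E[S] (seconds), the server’s utilization ρ must not exceed 1:&lt;/p&gt;

\[\rho = \lambda \cdot E[S] \leq 1\]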

&lt;p&gt;If the processing time ever exceeds the interarrival time, the queue length will begin to grow, and thus some wait time will be added to the latency of subsequent requests. Let’s simulate the same 1 / second arrival rate, but with a 2 second processing time (note that we no longer record 300 requests, because fewer requests can complete in the fixed time window once queueing is introduced):&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/unitqueue_slower.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The processing time remains constant at 2 seconds, but latency, wait time, and queue length all increase. It’s actually worse: they increase &lt;em&gt;indefinitely&lt;/em&gt;. This queue will never catch up, because the processing time exceeds the interarrival time. It is saturated.&lt;/p&gt;

&lt;p&gt;The effect is brutal. After 100 requests arrive in the queue, only 50 have been processed, so there’s a queue of 50 requests. Requests 101 and onward wait for 100 seconds before even beginning processing, and their total latency reflects this.&lt;/p&gt;

&lt;p&gt;The lesson here is: there’s no ideal processing time or arrival rate. Their relationship is what matters, so we need to know both. Even if there’s no change in the processing time of a request, an increase in arrivals will lead to queueing and increased latencies across the board.&lt;/p&gt;

&lt;h1 id=&quot;processing-distributions&quot;&gt;Processing Distributions&lt;/h1&gt;

&lt;p&gt;Here’s where simulations really become useful. We obviously won’t have a system with constant processing times. They’ll depend on any number of factors: customer size, data skew, current system load, etc. Let’s look at what happens when we set the average processing time back to 1, but this time we distribute the times exponentially. As a reminder, the exponential distribution favors smaller values, but there’s a long tail of large ones. It looks like this:&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/exp_dist.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This shows 100,000 random samplings of an exponential distribution. Thinking of it in terms of the number of requests that fall into the given processing time ranges, ~63,000 requests would be between 0 and 1 seconds, and ~86,000 would be between 0 and 2 seconds (~86% of all requests). A relatively small number of requests would take more than 2 seconds, but we get requests all the way up to 12 seconds.&lt;/p&gt;
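
&lt;p&gt;Those proportions follow directly from the exponential distribution’s CDF. With mean processing time μ, the fraction of requests completing within t seconds is:&lt;/p&gt;

\[P(X \leq t) = 1 - e^{-t/\mu}\]

&lt;p&gt;With μ = 1, that comes out to about 63.2% of requests within 1 second and 86.5% within 2 seconds, which matches the histogram.&lt;/p&gt;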

&lt;p&gt;The average of all of these is still ~1 second. Let’s see what happens when these are the processing times instead of the constant 1 second, keeping the 1 / second arrival rate constant:&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/queue_simulation_exponential.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The max processing time looks to be around 5 seconds, and there are only a few requests that high. But the latency increases to over 10 seconds at parts, because there’s a big swell in the queue length at around request 150. I calculated some more metrics for this particular simulation run:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;p99 latency: 4.26s&lt;/li&gt;
  &lt;li&gt;Average wait time: 3.34s&lt;/li&gt;
  &lt;li&gt;Average queue length: 2.93&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even with an average processing time of 1 second, and rare long requests, there is still pretty constant queueing here.&lt;/p&gt;

&lt;p&gt;The lesson here is: don’t only look at average request times, because there can be wildly different queueing characteristics for the same average value. The processing time distribution should always be considered.&lt;/p&gt;

&lt;p&gt;In this particular case, where we only have one processor servicing the queue and a processing time distribution with a long tail, we can smooth out the queueing by adding an additional processor (&lt;code&gt;NUM_PROCESSORS = 2&lt;/code&gt; in the code below):&lt;/p&gt;

&lt;div style=&quot;display:flex&quot;&gt;
  &lt;img src=&quot;/assets/queue_simulations/queue_simulation_exponential_multiple_processors.svg&quot; style=&quot;margin: auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This is slightly surprising, because we know that we have requests up to and past 5 seconds. If multiple of those happen at the same time, even with two processors they should clog up the queue and introduce queueing. But, we know that those long requests are rare, so the odds of two of them getting processed at the same time are low. It still does happen, as we see by the queue length increasing at certain points, but the queue recovers quickly. Average wait time for this run was 0.08 seconds, and the average queue length was 0.04. So, any latency is due to the actual request processing time, which is ideal.&lt;/p&gt;

&lt;h1 id=&quot;code&quot;&gt;Code&lt;/h1&gt;

&lt;p&gt;Now for a little code, for those who are interested in running their own simulations (you’ll need to install &lt;code&gt;simpy&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, and &lt;code&gt;numpy&lt;/code&gt; via your favorite Python dependency manager):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import simpy
import itertools
import random
from dataclasses import dataclass
import matplotlib.pyplot as plt
import numpy as np

SIM_DURATION = 300

EXPONENTIAL_DIST = 'exponential'
GAUSSIAN_DIST = 'gaussian'
UNIFORM_DIST = 'uniform'
CONSTANT_DIST = 'constant'

MEAN_PROCESSING_TIME = 1
NUM_PROCESSORS = 1

@dataclass
class Monitor:
    wait_times: list[float]
    queue_lengths: list[int]
    latencies: list[float]
    processing_times: list[float]

class Queue:
    queue: simpy.Resource

    def __init__(self, env):
        self.env = env
        self.queue = simpy.Resource(env, capacity=NUM_PROCESSORS)

def request(env, n, dist, monitor, queue):
    arrival = env.now
    monitor.queue_lengths.append(len(queue.queue.queue))
    with queue.queue.request() as req:
        yield req
        wait = env.now - arrival
        monitor.wait_times.append(wait)

        processing_start = env.now
        execution_time = 1
        mean_processing_time = MEAN_PROCESSING_TIME
        if dist == EXPONENTIAL_DIST:
            execution_time = random.expovariate(1 / mean_processing_time)
        elif dist == GAUSSIAN_DIST:
            execution_time = random.gauss(mean_processing_time, mean_processing_time / 4)
        elif dist == UNIFORM_DIST:
            delta = mean_processing_time * 0.5
            execution_time = random.uniform(mean_processing_time - delta, mean_processing_time + delta)
        elif dist == CONSTANT_DIST:
            execution_time = mean_processing_time

        yield env.timeout(execution_time)
        monitor.processing_times.append(env.now - processing_start)
        monitor.latencies.append(env.now - arrival)

def generate_load(env, latency_dist, monitor, queue):
    req_count = itertools.count()
    while True:
        yield env.timeout(1)
        env.process(request(env, next(req_count), latency_dist, monitor, queue))

def simulate_queue(latency_dist, monitor):
    env = simpy.Environment()
    q = Queue(env)
    env.process(generate_load(env, latency_dist, monitor, q))

    env.run(until=SIM_DURATION)

monitors = {}
for latency_dist in [CONSTANT_DIST, UNIFORM_DIST]:
    monitor = Monitor([], [], [], [])
    simulate_queue(latency_dist, monitor)
    monitors[latency_dist] = monitor

    print(f&quot;Wait times: {monitor.wait_times}&quot;)
    print(f&quot;Queue lengths: {monitor.queue_lengths}&quot;)
    print(f&quot;Latencies: {monitor.latencies}&quot;)

    print()
    print(f&quot;Average wait time: {sum(monitor.wait_times) / len(monitor.wait_times):.2f}&quot;)
    print(f&quot;Average latency: {sum(monitor.latencies) / len(monitor.latencies):.2f}&quot;)
    print(f&quot;p99 Latency: {np.percentile(np.array(monitor.latencies), 99):.2f}&quot;)
    print(f&quot;Average queue length: {sum(monitor.queue_lengths) / len(monitor.queue_lengths):.2f}&quot;)
    print(f&quot;Average processing time: {sum(monitor.processing_times) / len(monitor.processing_times):.2f}&quot;)

    print()

plot_monitors(monitors)  # defined in the next snippet
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is set up to run multiple different processing time distributions, which you can mix and match to compare. The main thing to know about the pattern that &lt;code&gt;SimPy&lt;/code&gt; employs is that you &lt;code&gt;yield&lt;/code&gt; events, which basically means “wait for this event to occur.” In &lt;code&gt;generate_load&lt;/code&gt; we first &lt;code&gt;yield env.timeout(1)&lt;/code&gt;, which is how we say to send requests every 1 second. To be pedantic, this just sends it every one “time unit,” and we are just interpreting it as the unit being seconds.&lt;/p&gt;

&lt;p&gt;After that timeout completes, we run the &lt;code&gt;request&lt;/code&gt; function which interacts with the queue. &lt;code&gt;SimPy&lt;/code&gt; has the concept of a &lt;code&gt;Resource&lt;/code&gt; which is a thing that can only be accessed a finite number of times. A &lt;code&gt;Resource&lt;/code&gt; with a capacity set to 1 is equivalent to a queue with 1 processor. We wait for the queue to be available with:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;with queue.queue.request() as req:
    yield req
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then we pick a distribution and sample a value out of it, and wait for another timeout event  with &lt;code&gt;yield env.timeout(execution_time)&lt;/code&gt; which simulates the request processing time. We pass a &lt;code&gt;Monitor&lt;/code&gt; object throughout which keeps track of the various raw pieces of data so we can plot them later.&lt;/p&gt;

&lt;p&gt;Here’s the definition of &lt;code&gt;plot_monitors&lt;/code&gt; for completeness:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def plot_monitors(monitors):
    for dist, monitor in monitors.items():
        plot_monitor(dist, monitor,)

def plot_monitor(dist, monitor):
    min_len = min(len(monitor.wait_times), len(monitor.latencies), len(monitor.queue_lengths), len(monitor.processing_times))

    x_values = range(min_len)
    y_values_list = [
        monitor.processing_times[:min_len],
        monitor.latencies[:min_len],
        monitor.wait_times[:min_len],
        monitor.queue_lengths[:min_len],
    ]
    y_labels = [&quot;Processing Time (s)&quot;, &quot;Latency (s)&quot;, &quot;Wait Time (s)&quot;, &quot;Queue Length&quot;]
    titles = [&quot;Processing Time&quot;, &quot;Latency&quot;, &quot;Wait Time&quot;, &quot;Queue Length&quot;]

    num_subplots = len(y_values_list)
    fig, axes = plt.subplots(num_subplots, 1, figsize=(8, 6))

    for i, ax in enumerate(axes):
        ax.plot(x_values, y_values_list[i], linestyle='-', label=y_labels[i])
        ax.set_title(titles[i], fontweight='bold')
        ax.set_xlabel(&quot;Request Number&quot;)
        ax.set_ylabel(y_labels[i])
        ax.grid(True, linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    plt.show()
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Alex Weisberger</name></author><category term="performance" /><summary type="html">System performance boils down to the timing behavior of various interacting queues. Queues are one of those incredibly simple but powerful concepts, but they have some unintuitive or non-obvious behavior when only thinking about them mathematically. Simulating queueing scenarios gives us a better picture about how queues operate in practice.</summary></entry><entry><title type="html">Controlling Nondeterminism in Model-Based Tests with Prophecy Variables</title><link href="/prophecy-variables/" rel="alternate" type="text/html" title="Controlling Nondeterminism in Model-Based Tests with Prophecy Variables" /><published>2024-12-23T00:00:00+00:00</published><updated>2024-12-23T00:00:00+00:00</updated><id>/prophecy-variables</id><content type="html" xml:base="/prophecy-variables/">&lt;p&gt;We have to constantly wrestle with nondeterminism in tests. Model-based tests present unique challenges in dealing with it, since the model must support the implementation’s nondeterministic behavior without leading to flaky failures. In traditional example-based tests, nondeterminism is often controlled by adding stubs, but it’s not immediately clear how to apply this in a model-based context where tests are generated. We’ll look to the theory of refinement mappings for a solution.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;In model-based testing, we construct a model of the system and use it as an executable specification in tests. One of the main benefits of doing this is that we end up with a highly-simplified description of the system’s behavior, bereft of low-level details like network protocols, serialization, concurrency, asynchronicity, disk drives, operating system processes, etc. The implementation, however, has all of these things, and is beholden to their semantics.&lt;/p&gt;

&lt;p&gt;This generally means that model states are not equivalent to implementation states, and are thus not directly comparable. This is fine, &lt;a href=&quot;/model-based-testing-theory/&quot;&gt;because we can define a refinement mapping between them&lt;/a&gt; and carry on. Nondeterminism complicates this mapping though.&lt;/p&gt;

&lt;p&gt;Let’s look at a concrete example. Here’s a model of an authentication system, that allows for the creation of new users:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type User = {
  id: number;
  username: string;
  password: string;
}

type CreateUser = {
  username: string;
  password: string;
}

type AuthError = 'username_exists';

class Auth {
  users: User[] = [];
  error: AuthError | null = null;

  createUser(toCreate: CreateUser) {
    if (this.users.some(u =&amp;gt; u.username === toCreate.username)) {
      this.error = 'username_exists';
      return;
    }

    const user: User = {
      // User requires an id; derive a simple unique one for the model
      id: this.users.length + 1,
      username: toCreate.username,
      password: toCreate.password
    }

    this.users.push(user);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this model, we have a set of Users that we can add to, though doing so might produce an error if the username is already taken. This error is a domain error, related to the logic of authentication, so it’s essential to include in the model.&lt;/p&gt;

&lt;p&gt;Not all errors are alike. In a real implementation, we’re going to have timeouts set on the web request as well as database statements. Timeouts are unrelated to the domain of authentication, and they also happen to be non-deterministic: for the same inputs, a timeout may or may not occur based on system load. It’s not obvious what to do about this, but if we do nothing, two problems arise:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A timeout in the test could lead to a flaky test failure.&lt;/li&gt;
  &lt;li&gt;We don’t sufficiently test the timeout-handling codepath.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These need to be addressed.&lt;/p&gt;

&lt;h1 id=&quot;handling-implementation-level-errors-in-a-model&quot;&gt;Handling Implementation-Level Errors in a Model&lt;/h1&gt;

&lt;p&gt;What does a timeout in the implementation mean in terms of the model? There’s two main interpretations:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It corresponds to a no-op in the model (aka a stutter step).&lt;/li&gt;
  &lt;li&gt;It maps to some separate error value in the model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither one is more correct than the other, but be aware that allowing for stutter steps leads to potential false-positive passing tests. If a timeout occurs in the &lt;code&gt;createUser&lt;/code&gt; operation, no new users will be added to the set of all users, but the test will still pass because we chose to allow for equal initial and final states. Stutter steps are necessary in theory, but we should be careful when allowing for them in tests; otherwise our test suite will pass on a run where 100% of calls to &lt;code&gt;createUser&lt;/code&gt; time out.&lt;/p&gt;

&lt;p&gt;There are ways of mitigating the risk of vacuously passing tests. For example, we could make a statistical correctness statement: the test only passes if no more than 10% of &lt;code&gt;createUser&lt;/code&gt; operations time out. This is more of a statement about &lt;em&gt;reliability&lt;/em&gt; though, and not a statement about functional behavior. I think it’s best to keep functional behavior tests in the domain of logical time, and to instead use observability tools for collecting reliability metrics.&lt;/p&gt;

&lt;p&gt;For functional testing, there’s a better way that avoids statistical correctness statements. It just involves predicting the future.&lt;/p&gt;

&lt;h1 id=&quot;tests-oracles-and-prophecy&quot;&gt;Tests, Oracles, and Prophecy&lt;/h1&gt;

&lt;p&gt;A brief philosophical aside. Tests are almost entirely about seeing into the future. Simply writing down the expected outputs of an operation means that we know what they should be ahead of time. We are the so-called test oracle. In model-based testing, we instead delegate this prediction to the model: the model is the oracle.&lt;/p&gt;

&lt;p&gt;There’s a very well-known solution to the problem of predicting the future of a nondeterministic operation in a test: test doubles. Stubs in particular are commonly used to control things like timeouts. Say we have a client-server implementation of our &lt;code&gt;Auth&lt;/code&gt; module. We’d likely make client-side network requests through an interface and use stubs in our tests to control the code path taken:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type User = { ... }
type AuthSystemError = 'timeout';
type AuthError = 'username_exists';
type AuthServerResponse = User | AuthError | AuthSystemError;

interface AuthServer {
  createUser(toCreate: CreateUser): AuthServerResponse;
}

class AuthClient {
  users: User[] = [];
  server: AuthServer;
  error: string | null = null;

  constructor(server: AuthServer) {
    this.server = server;
  }

  createUser(toCreate: CreateUser) {
    const resp = this.server.createUser(toCreate);
    if (resp === 'timeout') {
      this.error = 'There was a problem creating the user. Please try again or contact support.';
    } else if (resp === 'username_exists') {
      this.error = 'That username is already taken. Please choose another.';
    } else {
      this.users.push(resp);
    }
  }
}

// test file:
class AuthServerTimeout implements AuthServer {
  createUser(toCreate: CreateUser): AuthServerResponse {
    return 'timeout';
  }
}

describe('Timeout behavior', () =&amp;gt; {
  it('displays a timeout message when the request times out', () =&amp;gt; {
    const auth = new AuthClient(new AuthServerTimeout());
    auth.createUser({ username: 'user', password: 'pass' });

    expect(auth.error).toEqual('There was a problem creating the user. Please try again or contact support.');
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This pattern is ingrained in our muscle memory, but it’s actually quite interesting from the perspective of oracles and predicting the future. The simple &lt;code&gt;AuthClient&lt;/code&gt; has code paths that are not ergonomic to trigger in a test (we like to avoid the use of &lt;code&gt;sleep&lt;/code&gt; anywhere in tests, and otherwise the timeout will be dependent on nondeterministic system load). So instead of triggering the scenario that leads to a timeout, we simply set up the code in a way that guarantees the timeout code path is taken. In effect, we tell the code under test what its own destiny is, and use that to create a dependable, deterministic assertion in the test.&lt;/p&gt;

&lt;p&gt;From the test-writer’s point of view, this is a simple technique, but from the code’s point of view, it’s as if we’re showing it a prophecy of its life ahead of time. We are an oracle indeed!&lt;/p&gt;

&lt;p&gt;In model-based tests, we don’t create individual test cases, so we need a way to generate different stub configurations if we want to test a timeout code path. Once we put it that way, the answer is simple: just generate a variable that we can use to dynamically configure stubs. Because this variable predicts future execution, we call it a &lt;em&gt;prophecy variable&lt;/em&gt;.  For this, we can name it &lt;code&gt;isTimeout&lt;/code&gt;, and go from there. First we extend the model to be aware of this variable:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class Auth {
  // ...
  error: AuthError | AuthSystemError | null = null;

  createUser(toCreate: CreateUser, isTimeout: boolean) {
    if (isTimeout) {
      this.error = 'timeout';
      return;
    }

    // ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This avoids the stutter-step issue from before. We elevate the system-level error to the model level, and we make it so that the timeout error only occurs when &lt;code&gt;isTimeout&lt;/code&gt; tells it to. This is how we can be sure that unintended timeouts aren’t happening in the tests. Then, the implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class AuthServerImpl implements AuthServer {
    users: User[] = [];

    createUser(toCreate: CreateUser): AuthServerResponse {
      // real networking / server impl
    }
}

class Client {
    users: User[] = [];
    error: AuthError | null = null;
    implError: AuthSystemError | null = null;
    
    server: AuthServer;

    constructor(server: AuthServer) {
      this.server = server;
    }

    createUser(toCreate: CreateUser) {
      const result = this.server.createUser(toCreate);
      if (result === 'timeout') {
        this.implError = result;
        return;
      }

      if (result === 'username_exists') {
        this.error = result;
        return;
      }

      this.users.push(result);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And here’s what the model-based test would look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as fc from 'fast-check';

const genToCreate = () =&amp;gt; fc.record({
  username: fc.string(),
  password: fc.string()
});

const genUser = () =&amp;gt; fc.record({
  id: fc.integer(),
  username: fc.string(),
  password: fc.string()
});

const genUsers = () =&amp;gt; fc.array(genUser());

const genProphecy = () =&amp;gt; fc.boolean();

const externalAuthState = (auth: Auth): AuthState =&amp;gt; {
  return {
    users: auth.users,
    error: auth.error
  }
}

const externalClientState = (client: Client): ClientState =&amp;gt; {
  return {
    users: client.users,
    error: client.error,
    implError: client.implError,
  }
}

const refinementMapping = (isTimeout: boolean, implState: ClientState): AuthState =&amp;gt; {
  return {
    users: implState.users,
    error: isTimeout ? implState.implError : implState.error,
  }
}

describe('Prophecy-aware Auth test', () =&amp;gt; {
  it('should correspond to the model', () =&amp;gt; {
    fc.assert(
      fc.property(genUsers(), genToCreate(), genProphecy(), (users, toCreate, isTimeout) =&amp;gt; {
        const auth = new Auth();
        auth.users = [...users];

        let server: AuthServer;
        if (isTimeout) {
          server = new AuthServerTimeout();
        } else {
          const realServer = new AuthServerImpl();
          server = realServer;
        }
        const client = new Client(server);
        client.users = [...users];

        auth.createUser(toCreate, isTimeout);
        client.createUser(toCreate);

        const authState = externalAuthState(auth);
        const mappedState = refinementMapping(isTimeout, externalClientState(client));
        expect(mappedState).toEqual(authState);
      }),
      { endOnFailure: true, numRuns: 10000}
    );
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We use &lt;code&gt;isTimeout&lt;/code&gt; to choose which &lt;code&gt;AuthServer&lt;/code&gt; implementation to use, and we compare the now-prophecy-aware implementation to the model. To compare the different state values, we do a little bit of bookkeeping, first by projecting each object to an “external” state which omits any implementation details. We also create a &lt;code&gt;refinementMapping&lt;/code&gt; function which maps implementation states to model states. The refinement mapping is also aware of the &lt;code&gt;isTimeout&lt;/code&gt; variable, and uses that to make sure we only elevate the implementation error to the model when it is prophesied.&lt;/p&gt;

&lt;p&gt;Now, we have a pattern for building property-based tests that can account for nondeterministic errors.&lt;/p&gt;

&lt;h1 id=&quot;a-brief-note-on-the-theory-of-prophecy-variables&quot;&gt;A Brief Note on The Theory of Prophecy Variables&lt;/h1&gt;

&lt;p&gt;Prophecy variables are much more powerful than simple stubs in example-based tests, but I can’t help but notice the practical similarity between them. They were introduced in the paper &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/030439759190224P&quot;&gt;The Existence of Refinement Mappings&lt;/a&gt; to solve the theoretical problem of proving refinement between specifications with nondeterminism. The paper showed that there are programs where proving the refinement of their specification is impossible due to nondeterminism. Not only do prophecy variables solve that problem, they also lead to a &lt;em&gt;complete&lt;/em&gt; solution. The main result in the paper is that we can find a suitable refinement mapping for &lt;em&gt;any&lt;/em&gt; program to any specification, as long as we are able to add history and prophecy variables to the refinement mapping in a way that doesn’t alter the observable behavior of either the program or the spec.&lt;/p&gt;

&lt;p&gt;That’s true of test doubles: they don’t alter the code under test, they just allow for specifying values ahead of time, which again is the key to dealing with nondeterminism in tests. Our usage of prophecy variables here differs slightly from the theoretical versions (we pass ours into the refinement mapping function rather than keeping the function as a pure mapping from implementation to model state, and we also use interfaces and stubs to modify the behavior rather than only limiting ourselves to state variables). Still, this departure is only surface-level, since we could map this all to the TLA+-style state framework if we wanted to. Using the idioms of the particular programming language we’re in makes for a more practical experience.&lt;/p&gt;

&lt;p&gt;For more info, there’s a deeper dive into the theory of refinement mapping in &lt;a href=&quot;/model-based-testing-theory/&quot;&gt;Efficient and Flexible Model-Based Testing&lt;/a&gt;. There’s a whole paper dedicated to prophecy variables in &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/simple.pdf&quot;&gt;Prophecy Made Simple&lt;/a&gt;. There’s also the fantastic blog post &lt;a href=&quot;https://surfingcomplexity.blog/2024/09/22/linearizability-refinement-prophecy/&quot;&gt;Linearizability! Refinement! Prophecy!&lt;/a&gt; that goes into a really detailed example of using prophecy variables to prove properties of nondeterministic queues.&lt;/p&gt;

&lt;h1 id=&quot;prophecy-aware-dependencies-and-modal-determinism&quot;&gt;Prophecy-Aware Dependencies and Modal Determinism&lt;/h1&gt;

&lt;p&gt;This will be the most ambitious part of the post. It begins with a statement: we should design our dependencies to be prophecy-aware.&lt;/p&gt;

&lt;p&gt;Dependencies are a double-edged sword, especially infrastructure dependencies like a database. On the one hand, we get an incredible amount of power and reliability that would be impossible to implement on our own. On the other, we lose control, and are beholden to extremely fine-grained semantics that, among other things, make holistic testing difficult. I greatly believe in integration testing, especially against something like a database, because of such semantics that our applications come to depend on. I wrote about this in &lt;a href=&quot;/txn-isolation-testing/&quot;&gt;Does Your Test Suite Account For Weak Transaction Isolation?&lt;/a&gt;. Things like transaction isolation ultimately affect the correctness of our applications, so their absence from most application test suites is an unfortunate blind spot.&lt;/p&gt;

&lt;p&gt;This absence is totally understandable though: testing for it is a pain, precisely due to the inability to control nondeterminism. To systems and infrastructure developers: please account for the testing of nondeterministic functionality in the design of your tools. All nondeterministic choices should be controllable via parameters. This allows nondeterminism to be used where necessary (and it often is necessary and not just a mistake, e.g. for performance or concurrency), while still being controllable in tests. There’s definitely an upswing in projects thinking about this up front, notable examples being FoundationDB and TigerBeetle. I don’t want to make light of it, because it can radically alter the design of a system. But, having controllable determinism will always be a good thing in my book.&lt;/p&gt;

&lt;p&gt;However, in the meantime, most of our dependencies are not prophecy-aware, so we do need an approach for handling them as-is. For this, I think our best bet is to create wrapper fakes which model a given dependency. These models will need to be nondeterministic, since the implementation is, but we can design them to also be prophecy-aware, and thus controllable in tests. Because such models have this dual behavior, I think of this as “modal determinism.”&lt;/p&gt;
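
&lt;p&gt;Here’s a sketch of what such a modally deterministic fake could look like (the names here are illustrative, not from a real library):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type Outcome = 'ok' | 'timeout';

// Nondeterministic by default, but a prophecy pins down the outcomes in tests.
class FakeNetwork {
  constructor(private prophecy: Outcome[] = []) {}

  send(_msg: string): Outcome {
    const forced = this.prophecy.shift();
    if (forced !== undefined) {
      return forced; // deterministic mode: the test prophesied this outcome
    }
    return Math.random() &amp;lt; 0.01 ? 'timeout' : 'ok'; // nondeterministic mode
  }
}

// In a test: new FakeNetwork(['timeout']) guarantees the first send times out.
&lt;/code&gt;&lt;/pre&gt;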

&lt;p&gt;Let’s continue with the example of transaction isolation in Postgres, and let’s say we’ve just discovered weak transaction isolation and the Read Committed isolation level. We start to home in on this being an issue, and we first write this test (against a real PG DB):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create test schema
create table txn_iso (ival int);
insert into txn_iso (ival) values(1);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as fc from 'fast-check';
import * as pg from 'pg';
import { Pool, PoolClient } from 'pg';

type Tuple = {column: string; value: any}[];

class Database {
    pool: pg.Pool;

    constructor() {
      this.pool = new Pool(/* connection info */);
    }

    async selectInClient(client: pg.PoolClient): Promise&amp;lt;Tuple[]&amp;gt; {
      const res = await client.query(`SELECT * FROM txn_iso`);
      return res.rows.map((row) =&amp;gt; {
        return [{ column: 'ival', value: row['ival'] }];
      });
    }

    async update(val: number)  {
      const client = await this.pool.connect();

      await this.updateInClient(client, val);

      client.release();
    }

    async updateInClient(client: pg.PoolClient, val: number) {
      return client.query(`UPDATE txn_iso SET ival = $1`, [val]);
    }
}

const genUpdateVal = () =&amp;gt; fc.integer({ min: 0, max: 10 });

const genTxnOrder = () =&amp;gt; fc.uniqueArray(fc.integer({ min: 0, max: 2 }), {minLength: 3, maxLength: 3});

const initialVal = 1;

describe('Database nondeterministic transaction reads', () =&amp;gt; {
  it('should return consistent reads', async () =&amp;gt; {
    let db: Database;
    let c1: PoolClient;
    let c2: PoolClient;
    await fc.assert(
      fc.asyncProperty(
        genUpdateVal(),
        genTxnOrder(),
        async (val, txnOrder) =&amp;gt; {
          db  = new Database();

          c1 = await db.pool.connect();
          c2 = await db.pool.connect(); 

          await c1.query('BEGIN');
          await c2.query('BEGIN');
          const prevRead = await db.selectInClient(c2);
          await db.updateInClient(c1, val);

          const operations = [c1.query('COMMIT'), c2.query('COMMIT'), db.selectInClient(c2)];
          let orderedOperations = [];
          let readIdx = txnOrder[2];
          for (let i = 0; i &amp;lt; txnOrder.length; i++) {
            orderedOperations[txnOrder[i]] = operations[i];
          }

          const results = await Promise.allSettled(orderedOperations);
          const read = results[readIdx];
          
          if (read.status === 'fulfilled') {
            expect(read.value).toEqual(prevRead);
          } else {
            fail('Read failed');
          }
      }).afterEach(async () =&amp;gt; {
        await db.update(initialVal);

        c1.release();
        c2.release();

        await db.pool.end();
      }),
      { endOnFailure: true, numRuns: 100}
    )
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This test creates two DB connections, one of which updates a value in the &lt;code&gt;txn_iso&lt;/code&gt; table, and another which reads it multiple times. We expect that the multiple reads return the same value, but they don’t. We also randomize the order of the commits of the transactions to exacerbate the issue, but even without that the test will fail nondeterministically.&lt;/p&gt;

&lt;p&gt;This is complex and surprising behavior, and we want to build a model of it so that we can deterministically control it in our application tests to get more realistic coverage. The key is recognizing that the model has to support this nondeterminism by returning &lt;em&gt;multiple&lt;/em&gt; possible values for select statements instead of just a single one. We can then create a model-based test that allows for any of the possible values to be returned in the implementation. This draws inspiration from the &lt;a href=&quot;https://trustworthy.systems/publications/nicta_full_text/3087.pdf&quot;&gt;nondeterministic seL4 specification&lt;/a&gt;, which defines nondeterminism as transitioning between multiple allowable states.&lt;/p&gt;

&lt;p&gt;We create the following model:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type Tuple = {column: string; value: any}[];
type Relation = { name: string, data: Tuple[] };
type Transaction = {id: number, isDirty: boolean, prev: Relation[], next: Relation[]};

type DBState = {
  relations: Relation[];
  transactions: Transaction[];
};

class DBModelNondet {
  state: DBState[] = [];

  select(txnId: number, relation: string): Tuple[][] {
    return this.state.map((s) =&amp;gt; {
      const dirtyTxn = s.transactions.find((txn) =&amp;gt; txn.id === txnId &amp;amp;&amp;amp; txn.isDirty);
      if (dirtyTxn) {
        return dirtyTxn.next.find((rel) =&amp;gt; rel.name === relation)?.data ?? [];
      }

      return s.relations.find((rel) =&amp;gt; rel.name === relation)?.data ?? []
    });
  }    
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We model the database (&lt;code&gt;DBState&lt;/code&gt;) as a list of &lt;code&gt;Relations&lt;/code&gt;, where each &lt;code&gt;Relation&lt;/code&gt; is itself a list of &lt;code&gt;Tuples&lt;/code&gt;. We also model transactions as having an id, a previous list of relations, a next list of relations, as well as an &lt;code&gt;isDirty&lt;/code&gt; flag which signals whether or not the transaction has written any data at this point in time. The &lt;code&gt;prev&lt;/code&gt; list of relations tracks the snapshot of the DB state when the transaction was started, and &lt;code&gt;next&lt;/code&gt; tracks the current state including any transaction-local modifications that haven’t been committed yet.&lt;/p&gt;

&lt;p&gt;We then store an &lt;em&gt;array&lt;/em&gt; of these &lt;code&gt;DBStates&lt;/code&gt;, not just a single one. Because the database hides a nondeterministic choice from us (the order in which concurrent connections are scheduled), we have to support multiple initial starting states in the model. This allows us to handle both cases of the race condition here: where connection &lt;code&gt;c1&lt;/code&gt; commits either before or after the second read in &lt;code&gt;c2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then, we write a simplified &lt;code&gt;select&lt;/code&gt; model that executes within a specified transaction and returns all rows of a particular &lt;code&gt;relation&lt;/code&gt;. For each current &lt;code&gt;state&lt;/code&gt;, the select either returns tuples that have been modified in an in-progress transaction, or falls back to the committed state if the transaction hasn’t modified anything. Because there can be multiple &lt;code&gt;states&lt;/code&gt;, &lt;code&gt;select&lt;/code&gt; is also nondeterministic and returns a list of &lt;code&gt;Tuple&lt;/code&gt; lists.&lt;/p&gt;

&lt;p&gt;This surprisingly simple model accurately captures non-repeatable reads. We can write a test to ensure that it supports the nondeterminism caught in the previous test:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;describe('Database reads nondet model', () =&amp;gt; {
  it('should return any of a set of allowable reads', async () =&amp;gt; {
    let db: Database;
    let c1: PoolClient;
    let c2: PoolClient;

    await fc.assert(
      fc.asyncProperty(
        genUpdateVal(),
        genTxnOrder(),
        async (val, txnOrder) =&amp;gt; {
          db  = new Database();
          const model = new DBModelNondet();

          c1 = await db.pool.connect();
          c2 = await db.pool.connect(); 

          await c1.query('BEGIN');
          await c2.query('BEGIN');
          await db.updateInClient(c1, val);

          model.state = [
            // State 1: write transaction has not been committed yet.
            {
              relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
              transactions: [
                { 
                  id: 1,
                  // The write transaction has uncommitted (dirty) data at this point
                  isDirty: true,
                  prev: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
                  next: [{ name: 'txn_iso', data: [[{ column: 'ival', value: val }]] }] 
                },
                {
                  id: 2,
                  isDirty: false,
                  prev: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
                  next: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }] 
                },
              ]
            },

            // State 2: write transaction has been committed
            {
              relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: val }]] }],
              transactions: [
                {
                  id: 2,
                  isDirty: false,
                  prev: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
                  next: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }]
                },
              ]
            }
          ];

          const operations = [c1.query('COMMIT'), c2.query('COMMIT'), db.selectInClient(c2)];
          let orderedOperations = [];
          let readIdx = txnOrder[2];
          for (let i = 0; i &amp;lt; txnOrder.length; i++) {
            orderedOperations[txnOrder[i]] = operations[i];
          }

          const results = await Promise.allSettled(orderedOperations);
          const modelResults = model.select(2, 'txn_iso');
          const read = results[readIdx];

          if (read.status === 'fulfilled') {
            // Check that DB state matches ANY model state
            expect(modelResults).toContainEqual(read.value);
          } else {
            fail('Read failed');
          }
      }).afterEach(async () =&amp;gt; {
        await db.update(initialVal);

        c1.release();
        c2.release();

        await db.pool.end();
      }),
      { endOnFailure: true, numRuns: 100}
    )
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The race condition is explicitly modeled here in how we initialize &lt;code&gt;model.state&lt;/code&gt;. Zooming in, the second state shows the state of the world after the write transaction has been committed: the newly written value (&lt;code&gt;val&lt;/code&gt;) appears in the committed &lt;code&gt;relations&lt;/code&gt; state, and there’s only one open transaction, which hasn’t modified any data:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;{
  relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: val }]] }],
  transactions: [
    {
      id: 2,
      isDirty: false,
      prev: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }],
      next: [{ name: 'txn_iso', data: [[{ column: 'ival', value: initialVal }]] }]
    },
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The other state has both transactions open, and the new value has not yet been written. Running this test passes against the test PG instance. We’ve accurately modeled the nondeterminism.&lt;/p&gt;

&lt;p&gt;This is great, but it doesn’t yet help us in our application tests. For those, we need to pick exactly which value the model returns. Because we know that &lt;code&gt;select&lt;/code&gt; returns one result set for each nondeterministic initial state it’s configured with, we can accept a prophecy variable that picks a single one:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class DBModelProphecy {
    modelNondet: DBModelNondet = new DBModelNondet();

    // The prophecy picks which of the allowable nondeterministic results
    // this select deterministically returns.
    select(txnId: number, relation: string, initialStateProphecy: number): Tuple[] {
        return this.modelNondet.select(txnId, relation)[initialStateProphecy];
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This allows a test to use the nondeterministic model in a deterministic “mode,” which makes sure that the application handles both cases correctly, or forces an implementation change when it doesn’t.&lt;/p&gt;
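
&lt;p&gt;For example, here’s a sketch of how an application test might drive the prophecy with fast-check (reusing &lt;code&gt;DBModelProphecy&lt;/code&gt; from above; the states and assertion are illustrative). Each individual run is fully deterministic, while the property as a whole still covers every allowable outcome of the race:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;describe('Application logic under non-repeatable reads', () =&amp;gt; {
  it('handles every allowable read result', () =&amp;gt; {
    fc.assert(
      fc.property(fc.integer({ min: 0, max: 1 }), (initialStateProphecy) =&amp;gt; {
        const db = new DBModelProphecy();
        // Configure the two allowable states from the race: the concurrent
        // write has either not committed yet (ival = 1) or committed (ival = 5).
        db.modelNondet.state = [
          { relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: 1 }]] }], transactions: [] },
          { relations: [{ name: 'txn_iso', data: [[{ column: 'ival', value: 5 }]] }], transactions: [] },
        ];

        // The prophecy deterministically picks which allowable value is read.
        const read = db.select(2, 'txn_iso', initialStateProphecy);

        // The application code under test would consume `read` here; we'd
        // assert that it behaves correctly for whichever value was chosen.
        expect([1, 5]).toContain(read[0][0].value);
      }),
      { numRuns: 20 }
    );
  });
});
&lt;/code&gt;&lt;/pre&gt;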

&lt;h1 id=&quot;in-closing&quot;&gt;In Closing&lt;/h1&gt;

&lt;p&gt;Nondeterminism has been a major thorn in my side when writing model-based tests for real applications. I think prophecy variables as presented here provide a clear pattern for dealing with it. There’s a lot more to build out to have a production-grade model of a database like Postgres, but it’s encouraging to see that the idea does work in principle. It’s also really nice that the same technique applies to everything from testing timeouts to testing transaction isolation levels.&lt;/p&gt;

&lt;p&gt;This all started from talking about the difficulty of property-based testing nondeterministic dependencies on &lt;a href=&quot;https://lobste.rs&quot;&gt;lobste.rs&lt;/a&gt; with Stevan, the author of &lt;a href=&quot;https://stevana.github.io/the_sad_state_of_property-based_testing_libraries.html&quot;&gt;The sad state of property-based testing libraries&lt;/a&gt;. I appreciate their views on the topic; you should read that post as well.&lt;/p&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="refinement" /><summary type="html">We have to constantly wrestle with nondeterminism in tests. Model-based tests present unique challenges in dealing with it, since the model must support the implementation’s nondeterministic behavior without leading to flaky failures. In traditional example-based tests, nondeterminism is often controlled by adding stubs, but it’s not immediately clear how to apply this in a model-based context where tests are generated. We’ll look to the theory of refinement mappings for a solution.</summary></entry><entry><title type="html">Does Your Test Suite Account For Weak Transaction Isolation?</title><link href="/txn-isolation-testing/" rel="alternate" type="text/html" title="Does Your Test Suite Account For Weak Transaction Isolation?" /><published>2023-12-31T00:00:00+00:00</published><updated>2023-12-31T00:00:00+00:00</updated><id>/txn-isolation-testing</id><content type="html" xml:base="/txn-isolation-testing/">&lt;p&gt;Transaction isolation is the kind of thing that you learn about and it fills you with fear. Specifically, there are &lt;em&gt;weak&lt;/em&gt; transaction isolation levels which allow some fairly unexpected behavior. Tools like Jepsen are used to test the general isolation guarantees of databases, but it’s pretty uncommon to check the application layer for issues related to isolation anomalies. These anomalies can impact actual domain logic, so it’s important to understand them as well as how we can test them.&lt;/p&gt;

&lt;h1 id=&quot;what-is-weak-transaction-isolation&quot;&gt;What is Weak Transaction Isolation?&lt;/h1&gt;

&lt;p&gt;Transaction isolation means that concurrent transactions against a database will be independent of one another. It’s the “I” in ACID. Unfortunately, “independence” in this context is a spectrum, and there are actually different isolation levels that are supported, each with subtly different behavior.&lt;/p&gt;

&lt;p&gt;Here’s a quick example script which makes concurrent queries against a database (Postgres):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;create table txn_iso (ival int);
insert into txn_iso (ival) values(1);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import { Pool, Transaction } from 'https://deno.land/x/postgres/mod.ts';

const pool = new Pool({
  user: 'postgres',
  hostname: 'localhost',
  database: 'postgres',
  port: 5433,
  password: 'test1234',
}, 10);

async function runQuery(
  txn: Transaction,
  query: string,
  args: (string | number)[],
  beforeMsg: string,
  afterMsg: (result: any) =&amp;gt; string
) {
  console.log(beforeMsg);
  const result = await txn.queryObject(query, args);
  console.log(afterMsg(result));
}

async function readTransaction() {
  const query = 'select ival from txn_iso';
  const printResult = (result: any) =&amp;gt; `Read result: ${result.rows[0].ival}`;

  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  await runQuery(txn, query, [], 'Executing first read...',  printResult);

  // Wait for concurrent write to occur
  await new Promise(resolve =&amp;gt; setTimeout(resolve, 2000));

  await runQuery(txn, query, [], 'Executing second read...', printResult);

  await txn.commit();

  await client.release();
}

async function writeTransaction() {
  await new Promise(resolve =&amp;gt; setTimeout(resolve, 1000));
  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  const updateVal = Math.floor(Math.random() * 1000);
  const updateMsgBefore = `Updating ival to ${updateVal}...`;
  const query = 'update txn_iso set ival = $1';
  await runQuery(txn, query, [updateVal], updateMsgBefore, () =&amp;gt; 'ival updated');

  await txn.commit();

  await client.release();
}

await Promise.allSettled([readTransaction(), writeTransaction()]);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This script executes two transactions concurrently: one that reads the &lt;code&gt;txn_iso.ival&lt;/code&gt; column twice, and another which modifies the value of that column. There are some sleeps sprinkled in so that the second read occurs after the write. The question is: do both reads return the same value?&lt;/p&gt;

&lt;p&gt;In Postgres, with the default transaction isolation level, the answer is surprisingly no. Here’s an example output of running the script:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;Executing first read...
Read result: 839
Updating ival to 79...
ival updated
Executing second read...
Read result: 79
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first read will return the value of the column at the time that the read transaction begins, but the second read will return the value that was updated by the concurrent write transaction. That’s because the default level is Read Committed, which allows non-repeatable reads. A non-repeatable read means that in the span of the same transaction, queries to the same column may return different results! This isn’t unique to Postgres either - Read Committed is the default isolation level in Oracle and SQL Server as well&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This is surprising to a lot of people, and rightfully so, since it seems to go against the very definition of what a transaction is. But that’s because Read Committed is a &lt;em&gt;weak&lt;/em&gt; transaction isolation level. Weak isolation means that transactions aren’t truly independent from one another, and the effects of one concurrent transaction can be seen in another. There are four isolation levels defined by the ANSI SQL standard&lt;sup id=&quot;fnref:fn2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. All but Serializable, which is the strictest, are weak and allow some kind of interference between transactions.&lt;/p&gt;
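
&lt;p&gt;As an aside, Postgres lets us raise the level per transaction. If the reading transaction in the script above were started like this, both reads would see the same snapshot:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Run the reading transaction at a stronger isolation level. Under
-- REPEATABLE READ, both selects see the same snapshot, even if a
-- concurrent write commits in between.
begin isolation level repeatable read;
select ival from txn_iso;
-- (concurrent write commits here)
select ival from txn_iso; -- returns the same value as the first read
commit;
&lt;/code&gt;&lt;/pre&gt;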

&lt;p&gt;Hopefully it’s clear why this is an issue. If you have an important column value, say a user’s account balance, you might query multiple different values in the same transaction, which will surely result in a domain logic bug. Will our test suites catch such bugs? That depends on how we set up the tests.&lt;/p&gt;

&lt;h1 id=&quot;simulating-concurrent-connections&quot;&gt;Simulating Concurrent Connections&lt;/h1&gt;

&lt;p&gt;The difficulty with coming up with tests that expose transaction isolation anomalies is that the test has to simulate multiple concurrent connections. Test cases almost always have the implicit assumption that they’re being executed by a single user, and isolation anomalies don’t show up in that scenario.&lt;/p&gt;

&lt;p&gt;As an example, here’s some oversimplified code for making outbound transfers from an account with overdraft protection:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;interface BalanceRepository {
  getBalance(txn: Transaction): Promise&amp;lt;number&amp;gt;;
  updateBalance(txn: Transaction, amount: number): Promise&amp;lt;void&amp;gt;;
}

async function checkOverdraftProtection(txn: Transaction, balanceRepo: BalanceRepository, amount: number) {
  const balance =  await balanceRepo.getBalance(txn);
  if (balance &amp;gt;= amount) {
    return;
  }

  await balanceRepo.updateBalance(txn, balance + 100);
}

async function applyFundTransfer(txn: Transaction, balanceRepo: BalanceRepository, amount: number) {
  const balance = await balanceRepo.getBalance(txn);
  if (balance &amp;lt; amount) {
    console.error(&quot;Insufficient funds&quot;);
    return;
  }

  await balanceRepo.updateBalance(txn, balance - amount);
}

async function transferFunds(balanceRepo: BalanceRepository, amount: number) {
  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  await checkOverdraftProtection(txn, balanceRepo, amount);
  await applyFundTransfer(txn, balanceRepo, amount);
  
  await txn.commit();

  await client.release();
}

const protectedBalanceRepo = {
  balance: 90,
  async getBalance(txn: Transaction) {
    return this.balance;
  },
  async updateBalance(txn: Transaction, amount: number) {
    this.balance = amount;
  }
}

await transferFunds(protectedBalanceRepo, 100);
console.assert(protectedBalanceRepo.balance === 90);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The main logic that we want to test is that overdraft protection adds additional funds when there’s not enough to cover a transfer, and that the final balance is correct. To test this, we’re placing all queries behind a &lt;code&gt;BalanceRepository&lt;/code&gt; interface and creating a &lt;code&gt;protectedBalanceRepo&lt;/code&gt; which starts out with insufficient funds but updates the balance based on overdraft protection.&lt;/p&gt;

&lt;p&gt;This is the operation from the perspective of a single user and thus a single DB connection, so the insufficient funds error won’t get hit. As we saw with the Read Committed example though, another concurrent transaction can affect a value that’s read multiple times. So one way to simulate a concurrent transaction is to simply ignore the overdraft protection and specify a different balance result directly.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;const concurrentWriteRepo = {
  currBalance: 0,
  balances: [100, 90],
  async getBalance(txn: Transaction) {
    return this.balances[this.currBalance++];
  },
  async updateBalance(txn: Transaction, amount: number) {
  }
}

...

await transferFunds(concurrentWriteRepo, 100);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This test double sets up two different balance results: it’ll first return 100, which will bypass overdraft protection, but the next balance check will return 90 which will result in an insufficient funds error. One way this would be possible in real life is if multiple people have access to the same account and initiate a transfer in close proximity to one another.&lt;/p&gt;

&lt;p&gt;There’s a simple fix for this failure: just don’t read the balance multiple times, and instead pass in the sampled balance to any function that needs it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;async function transferFunds(balanceRepo: BalanceRepository, amount: number) {
  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  const balance = await balanceRepo.getBalance(txn);
  const protectedBalance = await checkOverdraftProtection(txn, balance, balanceRepo, amount);
  await applyFundTransfer(txn, protectedBalance, balanceRepo, amount);
  
  await txn.commit();

  await client.release();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now the logic of &lt;code&gt;checkOverdraftProtection&lt;/code&gt; and &lt;code&gt;applyFundTransfer&lt;/code&gt; can be changed to take a balance value instead of querying it. This also means that &lt;code&gt;checkOverdraftProtection&lt;/code&gt; has to return the balance after protection is applied, since &lt;code&gt;applyFundTransfer&lt;/code&gt; used to get this value with the second balance query, and using the pre-protection balance would result in an insufficient funds error.&lt;/p&gt;
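
&lt;p&gt;Sketched out, the updated functions might look like this (one plausible version; the exact signatures just need to thread the sampled balance through):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Takes the already-sampled balance and returns the post-protection balance
// instead of re-querying it.
async function checkOverdraftProtection(
  txn: Transaction,
  balance: number,
  balanceRepo: BalanceRepository,
  amount: number
): Promise&amp;lt;number&amp;gt; {
  if (balance &amp;gt;= amount) {
    return balance;
  }

  const protectedBalance = balance + 100;
  await balanceRepo.updateBalance(txn, protectedBalance);
  return protectedBalance;
}

// Uses the passed-in balance rather than issuing a second read.
async function applyFundTransfer(
  txn: Transaction,
  balance: number,
  balanceRepo: BalanceRepository,
  amount: number
) {
  if (balance &amp;lt; amount) {
    console.error('Insufficient funds');
    return;
  }

  await balanceRepo.updateBalance(txn, balance - amount);
}
&lt;/code&gt;&lt;/pre&gt;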

&lt;p&gt;This solves the repeatable read anomaly by avoiding multiple reads, but there’s still a major issue: there’s a race condition between multiple concurrent transactions that can result in an incorrect balance.&lt;/p&gt;

&lt;h1 id=&quot;race-conditions-and-serializability&quot;&gt;Race Conditions and Serializability&lt;/h1&gt;

&lt;p&gt;To show the error, we can execute two fund transfers concurrently against the actual DB, and we can introduce a write delay so that we can control which one writes last (we’d see errors even without this, but this reduces the non-determinism):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;create table accounts (balance int);
insert into accounts (balance) values (100);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;async function transferFunds(balanceRepo: BalanceRepository, amount: number, writeDelay?: number) {
  const client = await pool.connect();
  const txn = client.createTransaction();

  await txn.begin();

  const balance = await balanceRepo.getBalance(txn);

  if (writeDelay) {
    await new Promise(resolve =&amp;gt; setTimeout(resolve, writeDelay));
  }

  const protectedBalance = await checkOverdraftProtection(txn, balance, balanceRepo, amount);
  await applyFundTransfer(txn, protectedBalance, balanceRepo, amount);
  
  await txn.commit();

  await client.release();
}

const postgresBalanceRepo = {
  async getBalance(txn: Transaction): Promise&amp;lt;number&amp;gt; {
    const result = await txn.queryObject('select balance from accounts');
    return result.rows[0].balance;
  },
  async updateBalance(txn: Transaction, amount: number) {
    await txn.queryObject('update accounts set balance = $1', [amount]);
  }
}

async function runInTransaction&amp;lt;T&amp;gt;(f: (txn: Transaction) =&amp;gt; Promise&amp;lt;T&amp;gt;) {
  const conn = await pool.connect()
  const txn = conn.createTransaction()
  await txn.begin();

  const result = await f(txn);

  await txn.commit();
  await conn.release();

  return result
}

// Setup
await runInTransaction((txn) =&amp;gt; {
  return postgresBalanceRepo.updateBalance(txn, 100);
});

// Run two concurrent fund transfers
await Promise.allSettled([transferFunds(postgresBalanceRepo, 80, 2000), transferFunds(postgresBalanceRepo, 60)]);

const balance = await runInTransaction((txn) =&amp;gt; {
  return postgresBalanceRepo.getBalance(txn)
});

console.assert(balance === 60);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When this is run, the transfers will see the same initial balance (100), but the one with the write delay will overwrite the balance set in the other one. This also means that neither transfer will trigger overdraft protection, there will be no insufficient funds error, and the resulting balance will be an incorrect value of 20.&lt;/p&gt;

&lt;p&gt;This is a &lt;em&gt;serialization anomaly&lt;/em&gt;. The Postgres docs define this as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are only two possible orderings of the two fund transfers here:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Transfer 80, then transfer 60&lt;/li&gt;
  &lt;li&gt;Transfer 60, then transfer 80&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since the starting balance is 100, both of these cases should trigger overdraft protection on the second transfer, and the resulting balance in both cases should be 60 (100 + 100 - 80 - 60). Race conditions can exist when transactions don’t adhere to serializability, and that’s what’s going on here - two fund transfers are initiated, but only one is accounted for because of a concurrent race. This is known as the “lost update” problem.&lt;/p&gt;

&lt;p&gt;There are a few different ways to fix this, but the simplest is to lock the row for the duration of the transaction with &lt;code&gt;FOR UPDATE&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;const postgresBalanceRepo = {
  async getBalance(txn: Transaction): Promise&amp;lt;number&amp;gt; {
    const result = await txn.queryObject('select balance from accounts FOR UPDATE');
    return result.rows[0].balance;
  },
  ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Again from the Postgres docs:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;FOR UPDATE causes the rows retrieved by the SELECT statement to be locked as though for update. This prevents them from being locked, modified or deleted by other transactions until the current transaction ends.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that each transaction will grab a lock on the &lt;code&gt;accounts&lt;/code&gt; row and that will block all other transactions from modifying that row until it’s complete, i.e. the transactions will execute in a serializable fashion. It’s worth noting that this is now slower. Without the lock the transactions could truly operate concurrently, but now they have to wait in contention over balance updates to the same account. This is necessary for correct behavior, but it’s worth understanding the tradeoff.&lt;/p&gt;
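
&lt;p&gt;For completeness, another of those ways is to run the transfers at the Serializable level and let Postgres detect the conflict instead of locking up front (a sketch):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Alternative fix: let Postgres detect the conflict instead of locking.
-- One of the two concurrent transfers will fail with a serialization
-- error (SQLSTATE 40001) and must be retried by the application.
begin isolation level serializable;
-- ... fund transfer queries ...
commit;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The tradeoff flips: there’s no blocking, but the application now needs retry logic for serialization failures.&lt;/p&gt;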

&lt;p&gt;Also of note, test doubles alone won’t help with this bug because the fix is in the real Postgres repository implementation. Test doubles are useful for testing application code independent of the database in many cases, but transaction isolation is a case where the different levels are so subtly different that it’s simpler to test against the real thing.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Transaction isolation has a major impact on an application, both in terms of performance as well as its influence on domain logic. It’s important to integration test against the real database to make sure that a weak transaction isolation level isn’t the cause of concurrency bugs. To expose such bugs, we have to execute at least two concurrent transactions in a test case. Unfortunately this can require some amount of time-based coordination which is never ideal, but is often necessary when tools like databases have opaque non-deterministic behavior that’s out of our control.&lt;/p&gt;

&lt;p&gt;Still, it is something we can and should test for at the application level.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Interestingly, MySQL’s default isolation level is Repeatable Read, which avoids the bug presented here. Still, it’s very uncommon for any DB to have a default level of Serializable, so most databases are operating with weak isolation. A notable exception is &lt;a href=&quot;https://www.google.com/search?q=foundationdb&amp;amp;rlz=1C5GCEM_enUS1058US1060&amp;amp;oq=foundationdb&amp;amp;gs_lcrp=EgZjaHJvbWUqBggAEEUYOzIGCAAQRRg7MgYIARBFGDkyBggCEEUYOzIGCAMQRRg7MgYIBBBFGDwyBggFEEUYPDIGCAYQRRg8MgYIBxBFGDzSAQc2OTlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&quot;&gt;FoundationDB&lt;/a&gt;, which does default to Serializable. &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Postgres actually only implements 3 out of 4, because Read Uncommitted conflicts with Postgres’ MVCC implementation. For more detail on isolation levels as Postgres implements them, see: &lt;a href=&quot;https://www.postgresql.org/docs/current/transaction-iso.html&quot;&gt;https://www.postgresql.org/docs/current/transaction-iso.html&lt;/a&gt; &lt;a href=&quot;#fnref:fn2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="databases" /><summary type="html">Transaction isolation is the kind of thing that you learn about and it fills you with fear. Specifically, there are weak transaction isolation levels which allow some fairly unexpected behavior. Tools like Jepsen are used to test the general isolation guarantees of databases, but it’s pretty uncommon to check the application layer for issues related to isolation anomalies. These anomalies can impact actual domain logic, so it’s important to understand them as well as how we can test them.</summary></entry><entry><title type="html">Forward and Backward Reasoning in Proof Assistants</title><link href="/proof-assistants-direction/" rel="alternate" type="text/html" title="Forward and Backward Reasoning in Proof Assistants" /><published>2023-10-01T00:00:00+00:00</published><updated>2023-10-01T00:00:00+00:00</updated><id>/proof-assistants-direction</id><content type="html" xml:base="/proof-assistants-direction/">&lt;p&gt;Proof assistants are really fascinating tools, but the learning curve can be extremely steep. If you’re a programmer by trade and not a mathematician, this curve can be even steeper, because it’s not like programmers are doling out proofs left and right at work. One particular sticking point that I had trouble overcoming is the difference between forward vs. backward reasoning - proof assistants support both.&lt;/p&gt;

&lt;h1 id=&quot;forward-reasoning&quot;&gt;Forward Reasoning&lt;/h1&gt;

&lt;p&gt;When thinking about logic, we generally think about forward arguments which get built up from one statement to the next, in sequence. For example, let’s make a logical argument about monitoring. We want to get an alert when our app goes down, and one way we know that the app is down is when a test user can’t login and see the home page. The way to express that in logic is to say that the home page not loading implies that the app is down:&lt;/p&gt;

\[HomePageDoesntLoad \implies AppIsDown\]

&lt;p&gt;Implication is a useful thing to know, but it only tells us about the overall relationship and doesn’t tell us whether the app is down right now or not. We want to know the current state so we can determine if we should page someone, and for that we can use one of the oldest rules in all of logic: modus ponens.&lt;/p&gt;

&lt;p&gt;Modus ponens is also known as “implication elimination,” which more accurately describes its behavior. It allows us to infer something about an implication, but the conclusion no longer contains one - the implication gets eliminated:&lt;/p&gt;

\[\dfrac{P~~~~~~~P \implies Q}{ Q }\]

&lt;p&gt;This is written out as an inference rule, which in this case means that if we know P is true, and we know that P implies Q, then we can infer that Q is also true. On top of the bar are the premises, and on the bottom is the conclusion which we can infer if the premises are true. The reason that this rule is so old is that it’s just a formal description of common sense - if P implies Q, and we know P is true, &lt;em&gt;of course&lt;/em&gt; Q is true. That’s what implies means.&lt;/p&gt;

&lt;p&gt;In our monitoring context, we can take P to be “the home page doesn’t load” and Q to be “the app is down,” and by this rule we can conclude that the app is down if we actually observe the home page being unable to load. This is a forward argument - when an inference rule is taken from top to bottom.&lt;/p&gt;

&lt;p&gt;Proof assistants almost always support forward reasoning. One way to do this in Isabelle is with the &lt;code&gt;frule&lt;/code&gt; tactic:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes HomePageDoesntLoad 
    and &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
  shows &quot;AppIsDown&quot;
  using assms
  by (frule_tac P=HomePageDoesntLoad and Q=&quot;AppIsDown&quot; in mp)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;mp&lt;/code&gt; is the rule for modus ponens, which is defined like this&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes &quot;P ⟶ Q&quot;
    and P
  shows Q
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;frule_tac&lt;/code&gt; allows us to take a forward logical step if the premises are shown to be true. Since they’re assumed here, they are true, and we prove &lt;code&gt;AppIsDown&lt;/code&gt; in one step.&lt;/p&gt;

&lt;h1 id=&quot;backward-reasoning&quot;&gt;Backward Reasoning&lt;/h1&gt;

&lt;p&gt;Proof assistants also allow us to work backwards from a goal.&lt;/p&gt;

&lt;p&gt;Let’s take a look at a backward proof of our lemma:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes hp_load: HomePageDoesntLoad 
    and imp_appdown: &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
  shows &quot;AppIsDown&quot;
  apply(rule_tac P=HomePageDoesntLoad and Q=AppIsDown in mp)
  using imp_appdown
    apply(assumption)
  using hp_load
    apply(assumption)
  done
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, instead of &lt;code&gt;frule_tac&lt;/code&gt;, we use &lt;code&gt;rule_tac&lt;/code&gt;, which applies a rule in a backward fashion. Instead of going from top to bottom in the rule, we replace the current proof goal with the premises in the top of the rule. This allows us to prove each one separately, which is one of the main benefits of backward rule application: we can more easily divide and conquer a complicated proof.&lt;/p&gt;

&lt;p&gt;It works because an inference rule can be interpreted in two ways. As we said, the forward interpretation is: “we can conclude the bottom if the top premises are true.” The backward interpretation is: “to prove the bottom, it suffices to prove the top premises.” These are logically equivalent.&lt;/p&gt;

&lt;p&gt;To dive in a bit more, we can look at the proof state after each step in the proof above. At the beginning of the proof, the goal is simply the final conclusion we want to show:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes hp_load: HomePageDoesntLoad 
    and imp_appdown: &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
  shows &quot;AppIsDown&quot;

goal (1 subgoal):
 1. AppIsDown 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we apply modus ponens backwards:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;apply(rule_tac P=HomePageDoesntLoad and Q=AppIsDown in mp)

goal (2 subgoals):
 1. HomePageDoesntLoad ⟶ AppIsDown
 2. HomePageDoesntLoad
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Instead of having to show &lt;code&gt;AppIsDown&lt;/code&gt; directly, we now just have to show that &lt;code&gt;HomePageDoesntLoad ⟶ AppIsDown&lt;/code&gt; and &lt;code&gt;HomePageDoesntLoad&lt;/code&gt;. In a real proof, we’d have to figure out how to prove these independently, but here both of these are true by assumption so the rest of the proof just pulls in the appropriate one and applies it.&lt;/p&gt;

&lt;h1 id=&quot;which-ones-better&quot;&gt;Which One’s Better?&lt;/h1&gt;

&lt;p&gt;The unfortunate answer is that there’s no preferred direction, and we’ll often want to use both. We can also use higher-level and more powerful tactics anyway, which abstract the underlying reasoning. This monitoring example is very trivial, and can be proven in Isabelle with a variety of one liners, like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes HomePageDoesntLoad 
    and &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
  shows &quot;AppIsDown&quot;
  by (auto simp: assms)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Backward reasoning often seems more natural, but this is likely because of the history of proof assistants: they were pretty much designed around backward reasoning and interactivity from the start. The line gets blurred with more recent developments like Isar, which is an Isabelle sublanguage for defining structured proofs. In Isar, individual steps might be proven in a backwards fashion, but the proof proceeds in a structured and forward manner. Isar proofs are almost always preferred because they more closely resemble pen-and-paper proofs, and bring the very relevant intermediate proof state to the foreground.&lt;/p&gt;

&lt;p&gt;Here’s one for the monitoring example:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;lemma 
  assumes imp_appdown: &quot;HomePageDoesntLoad ⟶ AppIsDown&quot;
    and hp_load: HomePageDoesntLoad
  shows &quot;AppIsDown&quot;
proof (rule mp[where P=HomePageDoesntLoad and Q=AppIsDown])
  from imp_appdown show &quot;HomePageDoesntLoad ⟶ AppIsDown&quot; by assumption
  from hp_load show HomePageDoesntLoad by assumption
qed 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This pretty closely mirrors the backward proof from before, and that’s because the structure of the proof is based on the backward application of &lt;code&gt;mp&lt;/code&gt; by choosing &lt;code&gt;rule&lt;/code&gt; and not &lt;code&gt;frule&lt;/code&gt; in the &lt;code&gt;proof&lt;/code&gt; command. But now the intermediate goals are visible, which gives the proof more structure. This is especially helpful for more complicated goals that can’t be proven in a single step because each goal can be respectively built up via intermediate steps.&lt;/p&gt;

&lt;p&gt;All this to say: the logical direction often changes throughout a proof in a proof assistant, and the same rules can be used both forwards and backwards. Knowing which direction is being used is crucial for understanding our proofs.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It’s actually defined as an axiom, which means it’s implicitly taken to be true, and it also uses the older-style Isabelle syntax which lists assumptions in brackets: &lt;code&gt;&quot;⟦P ⟶ Q; P⟧ ⟹ Q&quot;&lt;/code&gt;. But this is equivalent to the &lt;code&gt;assumes ... shows ...&lt;/code&gt; syntax being used here. &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="formal_methods" /><summary type="html">Proof assistants are really fascinating tools, but the learning curve can be extremely steep. If you’re a programmer by trade and not a mathematician, this curve can be even steeper, because it’s not like programmers are doling out proofs left and right at work. One particular sticking point that I had trouble overcoming is the difference between forward vs. backward reasoning - proof assistants support both.</summary></entry><entry><title type="html">Compiling a Test Suite</title><link href="/test-compilation/" rel="alternate" type="text/html" title="Compiling a Test Suite" /><published>2023-08-23T00:00:00+00:00</published><updated>2023-08-23T00:00:00+00:00</updated><id>/test-compilation</id><content type="html" xml:base="/test-compilation/">&lt;p&gt;When I first stumbled upon certifying compilation&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, I was absolutely awestruck. I thought a compiler was a very specific thing, a translator from source to target language. But a certifying compiler goes further: it also proves its own correctness. My motto has become &lt;a href=&quot;/generated-tests/&quot;&gt;“most tests should be generated”&lt;/a&gt;, so this immediately seemed like a promising approach to my goal of improving the generative testing of interactive applications. It wasn’t immediately clear how exactly to incorporate this into that context, but after a little experimentation I now have a prototype of what it might look like.&lt;/p&gt;

&lt;p&gt;First, rather than describe the theory, let me show you what the workflow of certifying compilation looks like. Imagine invoking a command like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;certc source.cc -o myprogram -p proof
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;certc&lt;/code&gt; compiles the source file into an executable, like every other compiler, but in addition it outputs this &lt;code&gt;proof&lt;/code&gt; file. Imagine that you can open up this file, and from its contents be convinced that the compilation run contained zero bugs, and the output &lt;code&gt;myprogram&lt;/code&gt; is a perfect translation of &lt;code&gt;source.cc&lt;/code&gt;&lt;sup id=&quot;fnref:fn2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. The compilation run is &lt;em&gt;certified&lt;/em&gt; by this proof. Such compilers are sometimes referred to as &lt;em&gt;self-certifying&lt;/em&gt; for this reason - they produce their own proof of correctness.&lt;/p&gt;

&lt;p&gt;We know that proofs are hard though, and for most of us tests are sufficient. So what if instead, we had this workflow:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;certc source.cc -o myprogram -t test
./test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Instead of generating a proof, we now generate a test suite, and instead of opening it up to inspect it, we run it. If it passes, we’re still convinced that the compilation run was correct. Visually, certifying compilation just adds one more output artifact to a compilation run, which we can call a “checker,” and looks something like this:&lt;/p&gt;

&lt;div style=&quot;display:flex;justify-content:center&quot;&gt;
&lt;script type=&quot;text/typogram&quot;&gt;
                       .--------------.
     .----------------&gt;|    Checker   |
     |                 .--------------.
     |
     |
.----------.             .-----------.
|  Source  |------------&gt;|   Target  |
.----------.             .-----------.

&lt;/script&gt; 
&lt;/div&gt;

&lt;h1 id=&quot;from-programs-to-applications&quot;&gt;From Programs to Applications&lt;/h1&gt;

&lt;p&gt;At this point, this doesn’t look very applicable to something like a web application, and I’m mostly interested in testing interactive distributed applications. The idea of compiling a source model into a full-fledged web app is farfetched to say the least. I actually tried going down that path for a bit, and I can confirm: it is hard. It’s definitely an interesting research area, but for now let me pitch an alternative workflow that’s still based on the mental model of certifying compilation.&lt;/p&gt;

&lt;p&gt;What if we assume that our target application is something that we hand-modify out of band, and we just generate the checker for it, i.e.:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;certc model -c test
./test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And visually:&lt;/p&gt;

&lt;div style=&quot;display:flex;justify-content:center&quot;&gt;
&lt;script type=&quot;text/typogram&quot;&gt;
                       .--------------.
     .----------------&gt;|    Checker   |
     |                 .--------------.
     |
     |
.----------.             .-----------.
|  Model   | - - - - - -&gt;|    App    |
.----------.             .-----------.

&lt;/script&gt; 
&lt;/div&gt;

&lt;p&gt;In this workflow, we hand-develop the implementation application as we do normally, but we still generate the checker from a model. This puts us under the umbrella of model-based testing, but we’re going to look at the proof techniques that a certifying compiler uses as inspiration for how we should generate the correctness tests. Because of this difference, I’d call this paradigm “certifying specification.”&lt;/p&gt;

&lt;p&gt;What’s nice about this is that it slots right in to existing workflows. We can even TDD with this if we’re so inclined, by first changing logic in the model and then generating the failing tests before implementing them. Workflow-wise, it’s simple enough to work.&lt;/p&gt;

&lt;h1 id=&quot;writing-a-model&quot;&gt;Writing a Model&lt;/h1&gt;

&lt;p&gt;Since the checker generation depends on the existence of a model, we should first talk about how to write one. The first question to ask is: should we use an existing language or a new language to write models in? I really try to avoid thinking about or suggesting the introduction of new languages into the ecosystem. But the question has to be asked, because using an existing language has a lot of tradeoffs with respect to specification:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Existing languages have no notion of system structure, i.e. how do we distinguish system state vs. local variables? How do we distinguish system actions vs. local mutation? How do we parse an arbitrary program and get relevant information out of it to help with test generation?&lt;/li&gt;
  &lt;li&gt;Programming languages are meant for programming. There are aspects of specification that require other language features, such as the ability to express logical properties and the ability to control aspects of test generation.&lt;/li&gt;
  &lt;li&gt;Programming languages have additional features that aren’t necessary in a modeling context. For example, a model has no need for filesystem operations or networking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These can be overcome by creating an embedded DSL within an existing language to restrict the structure of models, but embedded DSLs have their own set of tradeoffs&lt;sup id=&quot;fnref:fn3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;One other option is to use an existing specification language, like TLA+. TLA+ in particular is too powerful for us here - we really want to limit models to be &lt;em&gt;executable&lt;/em&gt; so that we can use their logic in the checker.&lt;/p&gt;

&lt;p&gt;I think these are all viable approaches, but I also think that there are enough reasons to create a language that’s purpose-built for this use case. I’ve been experimenting with one that I call &lt;a href=&quot;https://github.com/amw-zero/sligh&quot;&gt;Sligh&lt;/a&gt;. Here’s a model of a counter application in Sligh:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sligh&quot; data-lang=&quot;sligh&quot;&gt;record Counter:
  name: Id(String)
  value: Int
end

process CounterApp:
  counters: Set(Counter)
  favorites: Set(String)

  def GetCounters():
    counters
  end

  def CreateCounter(name: String):
    counters := counters.append(Counter.new(name, 0))
  end

  def Increment(name: String):
    def findCounter(counter: Counter):
      counter.name.equals(name)
    end

    def updateCounter(counter: Counter):
      Counter.new(counter.name, counter.value + 1)
    end

    counters := counters.update(findCounter, updateCounter)
  end

  def AddFavorite(name: String):
    favorites := favorites.append(name)
  end

  def DeleteFavorite(name: String):
    def findFavorite(favName: String):
      name.equals(favName)
    end

    favorites := favorites.delete(findFavorite)
  end
end&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Sligh is not meant to be revolutionary in any way at the language level (in fact it aims to be much simpler than the average general purpose language), and hopefully the functionality is clear here. The main goal is that it supports enough analysis so that we can generate our model-based tests. The notable syntactic features are the &lt;code&gt;:=&lt;/code&gt; operator and the structure of the &lt;code&gt;process&lt;/code&gt; definition. The &lt;code&gt;:=&lt;/code&gt; operator denotes updates of the &lt;em&gt;system&lt;/em&gt; state, distinguished from any modification of local variables. The &lt;code&gt;CounterApp&lt;/code&gt; process has a set of counters and a set of favorites as system state. Local variables exist, but mutations to those are implementation details and don’t matter from the perspective of testing. Having a specific operator for the system state allows simple syntactic analysis to find state changes, which is essential for generating the certification test.&lt;/p&gt;

&lt;p&gt;For example, in the &lt;code&gt;Increment&lt;/code&gt; action, we know that the &lt;code&gt;counters&lt;/code&gt; state variable is modified, and in the &lt;code&gt;AddFavorite&lt;/code&gt; action the &lt;code&gt;favorites&lt;/code&gt; state variable is modified. If no assignments occur on a state variable in the span of an action, then we know for sure that it’s not modified in that action. This becomes very important later when we can exploit this to generate the minimum amount of test data necessary for a given test iteration.&lt;/p&gt;

&lt;p&gt;Sligh processes also support nested &lt;code&gt;def&lt;/code&gt;s which define system &lt;em&gt;actions&lt;/em&gt;. System actions are the atomic ways that the system state can change, like adding or incrementing counters. For those conceptual user operations, we have corresponding &lt;code&gt;CreateCounter&lt;/code&gt; and &lt;code&gt;Increment&lt;/code&gt; actions. This is what Sligh uses to determine which operations to generate tests for.&lt;/p&gt;

&lt;p&gt;These syntactic restrictions lead to a very powerful semantic model of a system that’s also statically analyzable - they effectively form a DSL for describing state machines.&lt;/p&gt;
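
&lt;p&gt;One way to see this: each &lt;code&gt;process&lt;/code&gt; is essentially data of the following shape (a sketch in TypeScript of my framing, not Sligh’s actual internals):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// A process viewed as an analyzable state machine: named state variables
// plus named actions, where each action declares which state variables it
// writes (recoverable syntactically from the := assignments).
interface StateMachine&amp;lt;State&amp;gt; {
  initial: State;
  actions: {
    name: string;                 // e.g. 'CreateCounter'
    writes: (keyof State)[];      // state variables assigned with :=
    step: (state: State, args: unknown[]) =&amp;gt; State;
  }[];
}
&lt;/code&gt;&lt;/pre&gt;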

&lt;h1 id=&quot;compiling-the-test-suite&quot;&gt;Compiling the Test Suite&lt;/h1&gt;

&lt;p&gt;A Sligh model doesn’t get compiled into a test suite directly. To compile the above counter model, we’d run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;sligh counter.sl -w witness
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which generates a “witness” file. This is a good time to talk a bit about the compiler internals and why that is.&lt;/p&gt;

&lt;p&gt;It’s common for certifying compilers to decouple per-program generated output from a separate checker&lt;sup id=&quot;fnref:fn4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; that’s written once. This makes the code generation phase of the compiler simpler, but also allows the checker to be written and audited independently. This is extra important since the checker is our definition of correctness for the whole application, and a misstatement there affects the guarantees our certification test gives us.&lt;/p&gt;

&lt;p&gt;Here’s the current checker that’s in use:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;export function makeTest(
    actionName: string,
    stateType: &amp;quot;read&amp;quot; | &amp;quot;write&amp;quot;,
    stateGen: any,
    implSetup: any,
    dbSetup: any,
    model: any,
    modelArg: any,
    clientModelArg: any,
    runImpl: any,
    expectations: any,
  ) {
    test(`Test local action simulation: ${actionName}`, async () =&amp;gt; {
      let impl: StoreApi&amp;lt;ClientState&amp;gt;;
  
      await fc.assert(fc.asyncProperty(stateGen, async (state) =&amp;gt; {
        impl = makeStore();        
  
        const clientState = implSetup(state);

        // Initialize client state
        impl.setState(clientState);

        // Initialize DB state
        await impl.getState().setDBState(dbSetup(state));

        // Run implementation action
        await runImpl(impl.getState(), state);

        // Run model action and assert
        switch (stateType) {
          case &amp;quot;write&amp;quot;: {
            const clientModelResult = model(clientModelArg(state));
            for (const expectation of expectations) {
              const { modelExpectation, implExpectation } = expectation(clientModelResult, impl.getState());
    
              expect(implExpectation).toEqual(modelExpectation);
            }
            break;
          }
          case &amp;quot;read&amp;quot;: {
            let modelResult = model(modelArg(state));
            for (const expectation of expectations) {
              const { modelExpectation, implExpectation } = expectation(modelResult, impl.getState());
    
              expect(implExpectation).toEqual(modelExpectation);
            }
            break;
          }
        }
      }).afterEach(async () =&amp;gt; {
        // Cleanup DB state
        await impl.getState().teardownDBState();
      }), { endOnFailure: true, numRuns: 25 });
    });
  }&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This looks similar to other &lt;a href=&quot;/model-based-testing-theory/&quot;&gt;model-based tests&lt;/a&gt; we’ve built before in that it compares the output of the model and implementation for a given action at a given initial state. This test is parameterized though, and all of the input parameters for a given test come from the witness.&lt;/p&gt;

&lt;p&gt;A “witness” in the certifying compilation world refers to data that’s extracted from the source program during compilation. Here’s the witness output for the &lt;code&gt;CreateCounter&lt;/code&gt; action:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;interface Counter {
  name: string;
  value: number;
}

interface CreateCounterDBState {
  counters: Array&amp;lt;Counter&amp;gt;;
}

interface CreateCounterType {
  counters: Array&amp;lt;Counter&amp;gt;;
  name: string;
  db: CreateCounterDBState;
}

interface CreateCounterModelIn {
  name: string;
  counters: Array&amp;lt;Counter&amp;gt;;
}

let CreateCounterModel = (params: CreateCounterModelIn) =&amp;gt; {
  let name = params.name;
  let counters = params.counters;
  counters = (() =&amp;gt; {
    let a = [...counters];
    a.push({ name: name, value: 0 });
    return a;
  })();
  return { counters: counters };
};

// ...

{
  name: &amp;quot;CreateCounter&amp;quot;,
  type: &amp;quot;write&amp;quot;,
  stateGen: fc.record({
    counters: fc.uniqueArray(
      fc.record({ name: fc.string(), value: fc.integer() }),
      {
        selector: (e: any) =&amp;gt; {
          return e.name;
        },
      }
    ),
    name: fc.string(),
    db: fc.record({
      counters: fc.uniqueArray(
        fc.record({ name: fc.string(), value: fc.integer() }),
        {
          selector: (e: any) =&amp;gt; {
            return e.name;
          },
        }
      ),
    }),
  }),
  implSetup: (state: CreateCounterType) =&amp;gt; {
    return { counters: state.counters };
  },
  dbSetup: (state: CreateCounterType) =&amp;gt; {
    return { counters: state.db.counters, name: state.name };
  },
  model: CreateCounterModel,
  modelArg: (state: CreateCounterType) =&amp;gt; {
    return { counters: state.db.counters, name: state.name };
  },
  clientModelArg: (state: CreateCounterType) =&amp;gt; {
    return { counters: state.counters, name: state.name };
  },
  runImpl: (impl: ClientState, state: CreateCounterType) =&amp;gt; {
    return impl.CreateCounter(state.name);
  },
  expectations: [
    (modelResult: CreateCounterModelOut, implState: ClientState) =&amp;gt; {
      return {
        modelExpectation: { counters: modelResult.counters },
        implExpectation: { counters: implState.counters },
      };
    },
  ],
},

// ...&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The details here are likely to change over time, but the key thing to notice is that all of this information is generated from the definition of &lt;code&gt;CreateCounter&lt;/code&gt; in the model. Here’s the &lt;code&gt;CreateCounter&lt;/code&gt; definition again for reference:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sligh&quot; data-lang=&quot;sligh&quot;&gt;def CreateCounter(name: String):
  counters := counters.append(Counter.new(name, 0))
end&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This action takes a &lt;code&gt;name&lt;/code&gt; string as input, but it also modifies the &lt;code&gt;counters&lt;/code&gt; state variable (which Sligh is able to detect because of the presence of the &lt;code&gt;:=&lt;/code&gt; operator). From this, one of the things we generate is a type for all of the test’s input data, &lt;code&gt;CreateCounterType&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;interface CreateCounterType {
  counters: Array&amp;lt;Counter&amp;gt;;
  name: string;
  db: CreateCounterDBState;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And the &lt;code&gt;stateGen&lt;/code&gt; property of the &lt;code&gt;witness&lt;/code&gt; object gets a corresponding data generator for this type:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;fc.record({
  counters: fc.uniqueArray(
    fc.record({ name: fc.string(), value: fc.integer() }),
    {
      selector: (e: any) =&amp;gt; {
        return e.name;
      },
    }
  ),
  name: fc.string(),
  db: fc.record({
    counters: fc.uniqueArray(
      fc.record({ name: fc.string(), value: fc.integer() }),
      {
        selector: (e: any) =&amp;gt; {
          return e.name;
        },
      }
    ),
  }),
})&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Also, note what this excludes. The test doesn’t have to generate the &lt;code&gt;favorites&lt;/code&gt; variable since it’s not referenced or modified in the span of this particular action. The test for each action only has to generate the bare minimum amount of data it needs to function. And most importantly, this means we totally avoid creating any global system states. I think this will be the key to testing a larger application in this way.&lt;/p&gt;
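&lt;p&gt;To make this concrete, here’s a hypothetical sketch of the state type that would be generated for the &lt;code&gt;Increment&lt;/code&gt; action (the actual generated names may differ):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Hypothetical sketch: Increment only reads and writes `counters`
// and takes a `name` input, so only that state appears in its type.
interface IncrementType {
  counters: Array&amp;lt;Counter&amp;gt;;
  name: string;
  db: { counters: Array&amp;lt;Counter&amp;gt; };
  // no `favorites` - Increment never references it
}
&lt;/code&gt;&lt;/pre&gt;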

&lt;p&gt;The remaining params are similarly extracted from the &lt;code&gt;CreateCounter&lt;/code&gt; signature and body, each feeding a specific part of the checker. I expect to hone these witness definitions over time, but this works for now.&lt;/p&gt;

&lt;p&gt;At this point it should be apparent that the compiler and checker both have to know about some very important system details. They need to know what language the test is written in. They need to know the pattern for executing actions on both the implementation and the model (here the implementation interface is a &lt;a href=&quot;https://github.com/pmndrs/zustand&quot;&gt;Zustand&lt;/a&gt; store meant to be embedded in a React app). They need to know what testing libraries are being used - here we’re using &lt;a href=&quot;https://github.com/vitest-dev/vitest&quot;&gt;vitest&lt;/a&gt; and &lt;a href=&quot;https://github.com/dubzzz/fast-check&quot;&gt;fast-check&lt;/a&gt;. And they need to be able to set up the state of external dependencies like the database, done here with calls to &lt;code&gt;impl.getState().setDBState&lt;/code&gt; and &lt;code&gt;impl.getState().teardownDBState()&lt;/code&gt;, which means that the server has to be able to help out with initializing data states.&lt;/p&gt;

&lt;p&gt;Still, lots of the functionality is independent of these concerns, and my hope is to make the compiler extensible to different infrastructure and architectures via compiler backends. For now, sticking with this single architecture has supported the development of the prototype of this workflow.&lt;/p&gt;

&lt;p&gt;Finally, the test gets wired up together in a single file runnable by the test runner:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;import { makeTest } from &amp;#39;./maketest&amp;#39;;
import { witness } from &amp;#39;./witness&amp;#39;;

for (const testCase of witness) {
  makeTest(
    testCase.name,
    testCase.type as &amp;quot;read&amp;quot; | &amp;quot;write&amp;quot;,
    testCase.stateGen,
    testCase.implSetup,
    testCase.dbSetup,
    testCase.model,
    testCase.modelArg,
    testCase.clientModelArg,
    testCase.runImpl,
    testCase.expectations
  );
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h1 id=&quot;outro&quot;&gt;Outro&lt;/h1&gt;

&lt;p&gt;Ok, I went into a lot of detail about the internals of the Sligh compiler. But to reiterate, the developer workflow is just:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;sligh counter.sl -w witness
./test
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I’m using this on a working Next.js application, and workflow-wise it feels great. I’m excited to see what other challenges come up as the application grows.&lt;/p&gt;

&lt;p&gt;I can’t rightfully end the post without talking about a few tradeoffs. I could probably write a whole separate post about them, since this one is already quite long, but two big ones are worth mentioning now. First, because we’re testing single state transitions, a test failure won’t tell you how to actually reproduce the failure. It might take a series of very particular action invocations to arrive at the starting state of the simulation test, and it’s not always clear whether that specific state is likely, or even reachable, in regular application usage. I have ideas there - similar to property-based testing failure minimization, it should be possible to search for action sequences that result in the failing initial state.&lt;/p&gt;
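&lt;p&gt;As a rough sketch of that idea (nothing like this exists in Sligh today), a bounded breadth-first search over model actions could look for a sequence that reaches the failing state, using the model as a cheap simulator:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Hypothetical sketch: search for an action sequence that reproduces
// a failing initial state.
type ModelState = { counters: Array&amp;lt;Counter&amp;gt; };
type Action = (s: ModelState) =&amp;gt; ModelState;

function findRepro(
  init: ModelState,
  actions: Array&amp;lt;Action&amp;gt;,
  isFailing: (s: ModelState) =&amp;gt; boolean,
  maxDepth: number,
): Array&amp;lt;Action&amp;gt; | null {
  // Each queue entry is a state plus the action path that produced it.
  let queue: Array&amp;lt;[ModelState, Array&amp;lt;Action&amp;gt;]&amp;gt; = [[init, []]];
  for (let depth = 0; depth &amp;lt; maxDepth; depth++) {
    const next: Array&amp;lt;[ModelState, Array&amp;lt;Action&amp;gt;]&amp;gt; = [];
    for (const [state, path] of queue) {
      if (isFailing(state)) return path;
      for (const action of actions) {
        next.push([action(state), [...path, action]]);
      }
    }
    queue = next;
  }
  return null;
}
&lt;/code&gt;&lt;/pre&gt;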

&lt;p&gt;The second tradeoff is that data generation for property tests of a full application is non-trivial. Sligh currently does the bare minimum here, which is to use type definitions to create data generators. I’m hoping the language can help out more, though - it may be possible to extract more intelligent generators from the model logic.&lt;/p&gt;

&lt;p&gt;And lastly, I have to call out the awesome &lt;a href=&quot;https://cogent.readthedocs.io/en/latest/&quot;&gt;Cogent&lt;/a&gt; project one last time. So many of these ideas were inspired by the many publications from that project. Specifically, check out this paper: &lt;a href=&quot;https://trustworthy.systems/publications/full_text/Chen_OKKH_17.pdf&quot;&gt;The Cogent Case for Property-Based Testing&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I first heard about certifying compilation through a &lt;a href=&quot;https://www.youtube.com/watch?v=sJwcm_worfM&quot;&gt;talk on YouTube&lt;/a&gt; and &lt;a href=&quot;https://trustworthy.systems/publications/nicta_full_text/9425.pdf&quot;&gt;a corresponding paper&lt;/a&gt; (by Liam O’Connor, Zilin Chen, Christine Rizkallah, Sidney Amani, Japheth Lim, Toby Murray, Yutaka Nagashima, Thomas Sewell, and Gerwin Klein). These are about the &lt;a href=&quot;https://cogent.readthedocs.io/en/latest/&quot;&gt;Cogent&lt;/a&gt; language, which compiles from itself to C, but also generates a proof of its correctness in Isabelle. &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Any compiler-writer will tell you, compilers are &lt;a href=&quot;https://softwareengineering.stackexchange.com/a/53069&quot;&gt;just as buggy&lt;/a&gt; as other programs. This is why certifying compilation exists in the first place - to provide higher assurance about the correctness of a compiler. &lt;a href=&quot;#fnref:fn2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I once read &lt;a href=&quot;https://matklad.github.io/2021/02/14/for-the-love-of-macros.html#Domain-Specific-Languages&quot;&gt;an interesting take about building embedded DSLs inside of an existing language&lt;/a&gt; that influenced my thinking here. The takeaway: eDSLs are often not worth it. &lt;a href=&quot;#fnref:fn3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://people.mpi-inf.mpg.de/~mehlhorn/ftp/CertifyingAlgorithms.pdf&quot;&gt;Certifying Algorithms&lt;/a&gt; by R. M. McConnell, K. Mehlhorn, S. Näher, and P. Schweitzer &lt;a href=&quot;#fnref:fn4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="formal_methods" /><category term="plt" /><summary type="html">When I first stumbled upon certifying compilation1, I was absolutely awestruck. I thought a compiler was a very specific thing, a translator from source to target language. But a certifying compiler goes further: it also proves its own correctness. My motto has become “most tests should be generated”, so this immediately seemed like a promising approach to my goal of improving the generative testing of interactive applications. It wasn’t immediately clear how exactly to incorporate this into that context, but after a little experimentation I now have a prototype of what it might look like. I first heard about certifying compilation through a talk on YouTube and a corresponding paper (by Liam O’Connor, Zilin Chen, Christine Rizkallah, Sidney Amani, Japheth Lim, Toby Murray, Yutaka Nagashima, Thomas Sewell, and Gerwin Klein). These are about the Cogent language, which compiles from itself to C, but also generates a proof of its correctness in Isabelle. &amp;#8617;</summary></entry><entry><title type="html">Most Tests Should Be Generated</title><link href="/generated-tests/" rel="alternate" type="text/html" title="Most Tests Should Be Generated" /><published>2023-07-02T00:00:00+00:00</published><updated>2023-07-02T00:00:00+00:00</updated><id>/generated-tests</id><content type="html" xml:base="/generated-tests/">&lt;p&gt;Traditional testing wisdom eventually invokes the test pyramid, which is a guide to the proportion of tests to write along the isolation / integration spectrum. There’s an eternal debate about what the best proportion should be at each level, but interestingly it’s always presented with the assumption that test cases are hand-written. We should also think about test generation as a dimension, and if I were to draw a pyramid about it I’d place generated tests on the bottom and hand-written scenarios on top, i.e. most tests should be generated.&lt;/p&gt;

&lt;h1 id=&quot;correctness-is-what-we-want&quot;&gt;Correctness is What We Want&lt;/h1&gt;

&lt;p&gt;What are we even trying to do with testing? The end goal is to show correctness. We do this for two main reasons: to show that new functionality does what’s expected before release, and to ensure that existing functionality is not broken between releases. Tests are a means to this end, nothing more. Importantly, they also can only ever show &lt;em&gt;approximate&lt;/em&gt; correctness. To understand that fully, let’s define correctness precisely. Here’s a paraphrasing of Kedar Namjoshi’s definition from &lt;a href=&quot;https://www.youtube.com/watch?v=GZXSSCF4siY&quot;&gt;Designing a Self-Certifying Compiler&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First we have to define what a program is. The simplest representation is just a function from values in X to values in Y. This may look oversimplified, but an interactive program can even be modeled this way by assuming the program function is invoked in response to each user interaction in a loop. So a program P is:&lt;/p&gt;

\[P: X \rightarrow Y\]

&lt;p&gt;Correctness requires a specification to check against. This might be surprising, since one rarely exists, but think of traditional test suites as simply defining this specification point-wise. A specification S can be a function of the same type:&lt;/p&gt;

\[S: X \rightarrow Y\]

&lt;p&gt;We can express correctness with the following property:&lt;/p&gt;

\[\forall x \in X: P(x) = S(x)\]

&lt;p&gt;In English: for every x value in X, evaluating P(x) yields the same value as evaluating S(x).&lt;/p&gt;

&lt;p&gt;Point being, we want to check that the implementation program does the same thing as the specification, always. Notice how achieving 100% branch coverage in a test suite doesn’t get us here by the way, since that doesn’t account for all inputs in &lt;em&gt;X&lt;/em&gt;.&lt;/p&gt;
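&lt;p&gt;A toy example makes this concrete (&lt;code&gt;clamp&lt;/code&gt; is invented for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Suppose the spec S says: clamp x into the range [0, 100].
function clamp(x: number): number {
  if (x &amp;gt; 100) return 100;
  return x; // bug: negative inputs should clamp to 0
}

clamp(150); // 100, correct
clamp(50);  // 50, correct
// Those two cases execute every branch - 100% branch coverage - yet
// P(x) = S(x) fails for every x below 0, e.g. clamp(-5) is -5, not 0.
&lt;/code&gt;&lt;/pre&gt;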

&lt;p&gt;Let’s look at how scenarios and generated tests differ with how they show correctness.&lt;/p&gt;

&lt;h1 id=&quot;testing-for-correctness-with-scenarios&quot;&gt;Testing for Correctness with Scenarios&lt;/h1&gt;

&lt;p&gt;As I mentioned, the traditional test pyramid is talking about hand-written test scenarios, aka examples / test cases etc. Correctness is pretty simple to express as a logical property, but it’s very difficult to test for. The first thing we run into is the test oracle problem - how do we actually get the value of &lt;em&gt;S(x)&lt;/em&gt; to check against? Executable specifications rarely exist (though I am a proponent of using them for this reason), so normally what happens is that the test writer interprets an informal specification and hard codes the expected value of &lt;em&gt;S(x)&lt;/em&gt; for a specific x as the test assertion. The informal specification is what the team talks about when deciding to build the feature, and the test writer is the test oracle. Sometimes some details are written down, sometimes not, but the burden of coming up with the expected test value is always on the test writer, and it’s a completely manual process.&lt;/p&gt;
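&lt;p&gt;In code, a scenario test bakes the oracle’s answer directly into its assertion. An illustrative example (&lt;code&gt;priceWithTax&lt;/code&gt; is made up):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// The expected value 108 is S(x) for x = 100: the test writer computed
// it from the informal spec (&quot;8% tax&quot;) and hard-coded it here.
test(&quot;applies 8% tax&quot;, () =&amp;gt; {
  expect(priceWithTax(100)).toEqual(108);
});
&lt;/code&gt;&lt;/pre&gt;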

&lt;p&gt;The next issue is the number of values in the input domain X. Each test case needs to specify a single input value from X, but testing for all values from X is not feasible in any way. This is not an exaggeration - if X is the set of single 64-bit integers, we’d have to check 18,446,744,073,709,551,616 test cases. This multiplies for each additional integer, and how many integers do you think are in the entire state of a realistic program? We said earlier that a test suite only approximates correctness, but this makes it more formal. A test suite actually represents this property:&lt;/p&gt;

\[TX \subseteq X \land \forall tx \in TX: P(tx) = S(tx)\]

&lt;p&gt;How effective a test suite is boils down to how confident we are that testing the input values we chose implies that correctness holds for all of the input values, i.e.&lt;/p&gt;

\[\forall tx \in TX: P(tx) = S(tx) \implies \forall x \in X: P(x) = S(x)\]

&lt;p&gt;This is probably true sometimes, but we have no guarantee of it in general. How can we ever know that the values that we pick out of X are “good enough”?&lt;/p&gt;

&lt;p&gt;So test scenarios have an informal and manual test oracle process, and are pretty quantitatively incomplete in terms of how much of the input domain they can possibly cover. That doesn’t mean they’re not useful! Testing via scenarios is unreasonably effective in practice. There are two main benefits to them. First, they’re easy to write. This is likely because they require very literal and linear reasoning, since we just need to assert on the actual output of the program. If we really want, we can just run the program and observe what it outputs and record that as a test assertion. People do this all the time, and there’s even a strategy that takes this to the extreme called “golden testing” or “snapshot testing.”&lt;/p&gt;

&lt;p&gt;The next benefit, somewhat obviously, is that they’re specific. If we have a test case in our head that we know is really important to check, why not just write it out? When we do this, we &lt;a href=&quot;https://buttondown.email/hillelwayne/archive/some-tests-are-stronger-than-others/#fnref:stronger-than-nitpick&quot;&gt;also get more local error messaging when the test fails&lt;/a&gt;, which can point us in a very specific direction. This is always cited as one of the main benefits of unit testing, and it really is helpful to have a specific area of the code to look at vs. trying to track down a weird error in a million lines of code.&lt;/p&gt;

&lt;p&gt;Now let’s look at generated tests.&lt;/p&gt;

&lt;h1 id=&quot;generating-tests-for-properties&quot;&gt;Generating Tests for Properties&lt;/h1&gt;

&lt;p&gt;Our correctness statement from earlier is expressed as a property: &lt;em&gt;P(x) = S(x)&lt;/em&gt; is a property that’s either true or not for all of the program inputs. Now, we know that we can’t actually check every single input in a test, but what we can do is generate lots and lots of inputs and check if the property holds. With property-based testing these inputs are usually generated randomly, but there are &lt;a href=&quot;/category-partition-properties/&quot;&gt;other data generation strategies as well&lt;/a&gt;. So here, we’re talking about property-based testing more generally, and it has a couple of subtly different problems than testing with scenarios.&lt;/p&gt;

&lt;p&gt;When checking for properties, the test oracle problem also presents itself immediately. We can always evaluate &lt;em&gt;P(x)&lt;/em&gt;, since that’s our implementation that we obviously control, but how do we know what the expected value &lt;em&gt;S(x)&lt;/em&gt; is? And, furthermore, we have a chicken-and-egg problem: how do we know what &lt;em&gt;S(x)&lt;/em&gt; is when code is generating &lt;code&gt;x&lt;/code&gt; and we don’t know what it is ahead of time?&lt;/p&gt;

&lt;p&gt;The answer is to define &lt;em&gt;S(x)&lt;/em&gt; with logic that we can actually execute in the test, i.e. an executable specification. This often sounds weird to people at first, but looking at our correctness statement this is the more natural way to test. Instead of implicitly defining the specification via a bunch of individual test cases, we just define &lt;em&gt;S(x)&lt;/em&gt; and call it during testing. This can take the form of simple functions that represent invariants of the code, all the way up to &lt;a href=&quot;/model-based-testing/&quot;&gt;entire models of the functional behavior&lt;/a&gt;.&lt;/p&gt;
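&lt;p&gt;Here’s an illustrative sketch of that with &lt;a href=&quot;https://github.com/dubzzz/fast-check&quot;&gt;fast-check&lt;/a&gt;, where a slow-but-obviously-correct reference function plays the role of &lt;em&gt;S(x)&lt;/em&gt; and &lt;code&gt;mySort&lt;/code&gt; stands in for the implementation under test:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import fc from &quot;fast-check&quot;;

// S(x): an executable specification of sorting.
function sortSpec(xs: Array&amp;lt;number&amp;gt;): Array&amp;lt;number&amp;gt; {
  return [...xs].sort((a, b) =&amp;gt; a - b);
}

// P(x): the implementation under test (imagine a hand-rolled quicksort).
declare function mySort(xs: Array&amp;lt;number&amp;gt;): Array&amp;lt;number&amp;gt;;

// Generate many values of x and check P(x) = S(x) directly.
fc.assert(
  fc.property(fc.array(fc.integer()), (xs) =&amp;gt; {
    expect(mySort(xs)).toEqual(sortSpec(xs));
  })
);
&lt;/code&gt;&lt;/pre&gt;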

&lt;p&gt;The input space issue is also still present with property-based testing, but in a different way: generating data is hard. Like, really hard. One of the main challenges is logical constraints, e.g. “this number must be less than 100”. These constraints can get very complicated in real-world domains, and sometimes that even leads to performance issues where you have to discard generated inputs until the constraint is met.&lt;/p&gt;
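&lt;p&gt;As a small illustration in a property-based testing library like fast-check, a constraint like “less than 100” can either be built into the generator or enforced by filtering, and filtering works by discarding values:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Built into the generator: every generated value satisfies the
// constraint, nothing is wasted.
const small = fc.integer({ max: 99 });

// Enforced by filtering: values are generated and then thrown away
// until one passes the predicate. Fine here, but a tight constraint
// (say, one valid value in a huge range) discards almost everything.
const alsoSmall = fc.integer().filter((n) =&amp;gt; n &amp;lt; 100);
&lt;/code&gt;&lt;/pre&gt;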

&lt;p&gt;Property-based testing has an absolute killer feature though: it discovers failure cases for you, i.e. it actually finds unknown unknowns. This is worth more than gold. With scenarios, you have to know the failure ahead of time, but isn’t every bug in production a result of a failure that you didn’t even think of before deploying? Rather than check cases that we know ahead of time, we generate tests that search for interesting failures. This simply can’t be done with ahead-of-time test scenarios.&lt;/p&gt;

&lt;h1 id=&quot;the-test-generation-pyramid&quot;&gt;The Test Generation Pyramid&lt;/h1&gt;

&lt;p&gt;We looked at some of the pros and cons of scenarios vs. generated tests, so which should we prefer? I definitely think we should write both kinds, but overall most tests should be generated. Test strategies have to be represented as a triangle, so here is this idea in triangle form:&lt;/p&gt;

&lt;div style=&quot;display: flex; justify-content: center;&quot;&gt;
  &lt;img src=&quot;/assets/generated_tests/generated-tests.png&quot; style=&quot;width:64%&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Why should we prefer generated tests? It all boils down to the fact that they find failures for us, which means that they naturally bring us closer to correctness. Unfortunately, no matter how perfect our selected test scenarios are, they leave the vast majority of the input space uncovered, and there is no way to know which uncovered inputs are important and which are redundant. By having a suite of generated tests that are constantly looking for new inputs, we put ourselves in the best position to find edge cases that we just aren’t considering at the moment.&lt;/p&gt;

&lt;p&gt;It’s like having a robot exploratory tester that we can deploy at will, which opens up a whole new mode of testing. We can run generated tests in CI before merging, sure, but we can also run them around the clock since generated tests &lt;em&gt;search&lt;/em&gt; for failures vs. checking predetermined scenarios. More testing time means more of the input domain being searched, so to check more inputs we simply run each generated test for longer and run more test processes in parallel.&lt;/p&gt;

&lt;p&gt;This doesn’t mean that we stop writing scenarios. That’s why there’s two sections in the pyramid. All of the proposed values of test scenarios are valid - we get specific error messages, free executable documentation, and a guarantee that important cases are checked. But generated tests are &lt;a href=&quot;https://buttondown.email/hillelwayne/archive/some-tests-are-stronger-than-others/#fnref:stronger-than-nitpick&quot;&gt;fundamentally stronger&lt;/a&gt; than scenarios, since the generated tests will often find the same inputs that we use in our scenarios in addition to ones we haven’t thought about.&lt;/p&gt;

&lt;p&gt;Since the ultimate goal of testing is correctness, not documentation and local error messages, it’s in our best interest to supplement our scenarios with lots and lots of generated tests.&lt;/p&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="philosophy" /><summary type="html">Traditional testing wisdom eventually invokes the test pyramid, which is a guide to the proportion of tests to write along the isolation / integration spectrum. There’s an eternal debate about what the best proportion should be at each level, but interestingly it’s always presented with the assumption that test cases are hand-written. We should also think about test generation as a dimension, and if I were to draw a pyramid about it I’d place generated tests on the bottom and hand-written scenarios on top, i.e. most tests should be generated.</summary></entry><entry><title type="html">Logical Time and Deterministic Execution</title><link href="/logical-time-determinism/" rel="alternate" type="text/html" title="Logical Time and Deterministic Execution" /><published>2023-02-28T00:00:00+00:00</published><updated>2023-02-28T00:00:00+00:00</updated><id>/logical-time-determinism</id><content type="html" xml:base="/logical-time-determinism/">&lt;p&gt;Recently, Tomorrow Corporation released &lt;a href=&quot;https://www.youtube.com/watch?v=72y2EC5fkcE&quot;&gt;this video of their in-house tech stack&lt;/a&gt; doing some truly awesome time-travel debugging of a production-quality game. You should watch this video, even if you don’t read this post, because the workflow that they’ve created is really inspiring. The creator kept bringing up the fact that the reason their tools can do this is that they have determinism baked into them at the very foundational levels. You simply can’t bolt this on at higher levels in the stack.&lt;/p&gt;

&lt;p&gt;This got me thinking - not only do we rarely have this level of control in our projects, but I think it’s rare to even understand how determinism is possible in modern systems that are interactive, concurrent, and distributed. If we don’t understand this, we can’t ever move our tools toward determinism, which I think is a very good idea. It turns out that even if we can’t predict exactly how a program will execute in a &lt;em&gt;specific&lt;/em&gt; run, we can still model and reason about it deterministically. This is a prerequisite for most formal methods, and while I understand that formal methods aren’t everyone’s cup of tea, this is the number one thing that I wish more people understood. So today, we won’t be talking about testing or verifying anything, we’ll just be looking to better understand software in general by diving into logical time and how it enables deterministic reasoning.&lt;/p&gt;

&lt;h1 id=&quot;user-interaction-and-non-deterministic-choice&quot;&gt;User Interaction and Non-Deterministic Choice&lt;/h1&gt;

&lt;p&gt;Talk of non-determinism can get very abstract very quickly, but there is a practical manifestation that we’ve all observed even if we didn’t know the term: &lt;em&gt;non-deterministic choice&lt;/em&gt;. An application with a user interface is a classic example of a system with non-deterministic choice - no one can predict the order that a user will click through the interface, and the user is free to make any choice that’s visible and enabled.&lt;/p&gt;

&lt;p&gt;We’ll introduce an example to get more specific, and it’s important to &lt;em&gt;always&lt;/em&gt; use &lt;a href=&quot;https://todomvc.com/&quot;&gt;TodoMVC&lt;/a&gt; as the interactive application example&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; (here’s &lt;a href=&quot;https://todomvc.com/examples/js_of_ocaml/&quot;&gt;one of the implementations&lt;/a&gt; if you want to click around). In TodoMVC, we can add new named to-do items and then mark them as completed. We can also remove a to-do without marking it as completed. Like all interactive applications, we can do this in any order though, and these are all valid sequences of actions:&lt;/p&gt;

&lt;p&gt;1.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Add to-do named “t1”&lt;/li&gt;
  &lt;li&gt;Mark “t1” as completed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Add to-do named “t1”&lt;/li&gt;
  &lt;li&gt;Remove “t1”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t2”&lt;/li&gt;
  &lt;li&gt;Mark “t2” as completed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Add to-do named “t1”&lt;/li&gt;
  &lt;li&gt;Mark “t1” as completed&lt;/li&gt;
  &lt;li&gt;Add to-do named “t2”&lt;/li&gt;
  &lt;li&gt;Mark “t2” as completed&lt;/li&gt;
  &lt;li&gt;Add to-do named “t3”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t4”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t5”&lt;/li&gt;
  &lt;li&gt;Remove “t3”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t6”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t7”&lt;/li&gt;
  &lt;li&gt;Remove “t4”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t8”&lt;/li&gt;
  &lt;li&gt;Add to-do named “t9”&lt;/li&gt;
  &lt;li&gt;Mark “t6” as completed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can visualize this non-determinism with a state graph:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/TodoMVCStates2.png&quot; /&gt;&lt;/p&gt;
&lt;div style=&quot;display: flex; justify-content: center;&quot;&gt;
  &lt;img src=&quot;/assets/TodoMVCLegend2.png&quot; style=&quot;width:64%&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;A non-deterministic choice exists when more than one transition arrow flows away from a given state. It means that all of them are valid choices that can occur in separate executions, but one has to &lt;em&gt;somehow&lt;/em&gt; be chosen to proceed through the state graph. An interactive application lets the user decide via the UI, but as we’ll see later, there are other things that can make choices. Functionally, it doesn’t matter who does the choosing.&lt;/p&gt;

&lt;p&gt;A quick aside: this is the complete behavior up to a bound of 2 to-dos. Physical space constraints aside, the full state graph of TodoMVC is theoretically infinite, because you can always add a to-do with a new name. Visualizing infinite bubbles is painful for everyone involved, so we place a constraint on the model along the lines of “there are only two to-dos in the entire universe.” This is a silly constraint, but it helps us visualize the state space in a manageable way. Bounded models also help with &lt;a href=&quot;https://en.wikipedia.org/wiki/Model_checking#Techniques&quot;&gt;making properties checkable&lt;/a&gt;, but we’re not talking about that today because we’re not actually doing formal methods!&lt;/p&gt;

&lt;p&gt;Let’s look at an example run through the program by picking specific choices. We’ll start at the gray initial state, add two to-dos named “t1” and “t2”, and then we’ll complete them both. Here’s that path in red:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/TodoMVCPath1.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can get to the same final state a different way, by adding to-do “t2”, completing it, then adding to-do “t1” and completing it:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/TodoMVCPath2.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We all know how software works intuitively, but seeing these runs against the full state graph hints at a couple of precise definitions: software behavior is simply a sequence of states, and a program is a set of allowable behaviors. It also gives us our first step towards determinism. When a non-deterministic choice exists, we don’t know which path will be taken in a specific program run, but we do know what all of the possible runs are. Each of those runs is a totally deterministic behavior.&lt;/p&gt;

&lt;p&gt;Said another way, a non-deterministic choice becomes deterministic when we pick one.&lt;/p&gt;
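&lt;p&gt;Those two definitions - behavior and program - are compact enough to write down directly. As a purely illustrative sketch in TypeScript types:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// A behavior is a sequence of states; a program is its set of allowed
// behaviors. Since that set is usually infinite, we model it as a
// membership predicate rather than a literal collection.
type Behavior&amp;lt;State&amp;gt; = Array&amp;lt;State&amp;gt;;
type Program&amp;lt;State&amp;gt; = (b: Behavior&amp;lt;State&amp;gt;) =&amp;gt; boolean;
&lt;/code&gt;&lt;/pre&gt;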

&lt;p&gt;For fun, here’s the state graph of TodoMVC with 5 to-dos:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/TodoMVCBigStates.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Determinism isn’t necessarily easy.&lt;/p&gt;

&lt;h1 id=&quot;concurrency&quot;&gt;Concurrency&lt;/h1&gt;

&lt;p&gt;Concurrency is another notorious source of non-determinism, but let’s define why. Imagine we have N network requests that start in an idle state, begin fetching some data, and eventually complete. Continuing to keep our bounds small, let’s start with N = 2:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/RequestsFont.svg&quot; /&gt;&lt;/p&gt;
&lt;div style=&quot;display: flex; justify-content: center;&quot;&gt;
  &lt;img src=&quot;/assets/determinism/RequestsLegend.svg&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In every state, either an idle request can be initiated or an in-progress request can complete. It’s possible for different requests to complete in different orders too, e.g. request 0 can complete first:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/Requests-Req0.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And request 1 can also complete first, even if request 0 was initiated before it:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/Requests-Req1.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The order that requests complete is a non-deterministic choice, which we’ve already seen, but there’s a major difference from the TodoMVC example: the OS or language runtime determines the choice, not a user. This is one reason why concurrency is a constant thorn in the side, and feels much more complex than the non-determinism of user interfaces. We literally don’t have control over the order of operations.&lt;/p&gt;

&lt;p&gt;In the same way as the choices in the user interface, though, we just have to account for all of their combinations, and then we can know which orders of execution are possible. Another way to think about this is that if a race is possible, both sides of the race will always eventually occur, and we have to plan for both cases.&lt;/p&gt;

&lt;p&gt;Because N = 2 is no fun, here’s N = 5 (i.e. 5 concurrent requests) which has 639 distinct states:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/determinism/Requests-Req5.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I’m sure a mutex will make this more manageable.&lt;/p&gt;

&lt;h1 id=&quot;logical-time-time-travel-and-beyond&quot;&gt;Logical Time, Time-Travel, and Beyond&lt;/h1&gt;

&lt;p&gt;Both state graphs show the set of all behaviors for the given system, and they do this by showing &lt;em&gt;logical&lt;/em&gt; time, in contrast to physical time. A user might wait 17 years before selecting a transition in a UI, or an OS scheduler might pick one thread to execute while another waits for I/O. The real-world execution of a program runs in physical time, but our state graphs are only concerned with abstract states and transitions between them. And good thing for that - it would be awkward to have to wait 17 years to understand the possible behaviors of TodoMVC.&lt;/p&gt;

&lt;p&gt;Beyond helping us understand the complete picture of all of the different interleavings of transitions, logical time is also what enables time-travel debugging. We can’t logically move through a system until it’s been properly decomposed into states and the steps between them. This in itself is a design space - how much of the system state do we store vs. derive? How much additional state do we add to make things possible like searching for states by timestamp?&lt;/p&gt;

&lt;p&gt;All we need for logical time are states and transitions between them, i.e. logical time is inherently tied to state machines / transition systems. In fact, a time-travel debugger can pretty much be seen as a user interface for a state machine. But most importantly, this mental model allows us to have a totally deterministic view of the behavior of a complex system. That in turn enables powerful features like time-travel debugging.&lt;/p&gt;
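&lt;p&gt;Here’s a minimal sketch of that idea (illustrative, not how any particular tool is implemented): once execution is reduced to an initial state and a step function, “time travel” is just an index into the recorded history.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Record every state a state machine passes through, then move
// backward and forward through logical time by moving a cursor.
class TimeTravel&amp;lt;State, Action&amp;gt; {
  private history: Array&amp;lt;State&amp;gt;;
  private cursor = 0;

  constructor(
    init: State,
    private step: (s: State, a: Action) =&amp;gt; State,
  ) {
    this.history = [init];
  }

  dispatch(action: Action) {
    // Taking a new action from a past state discards the old future.
    this.history = this.history.slice(0, this.cursor + 1);
    this.history.push(this.step(this.current(), action));
    this.cursor++;
  }

  back() { if (this.cursor &amp;gt; 0) this.cursor--; }
  forward() { if (this.cursor &amp;lt; this.history.length - 1) this.cursor++; }
  current(): State { return this.history[this.cursor]; }
}
&lt;/code&gt;&lt;/pre&gt;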

&lt;p&gt;To take advantage of logical time, this model has to be built into an application somehow. Because our tools generally don’t have any notion of determinism, you often see this with language-layer patterns like Redux or the Elm Architecture, or architecture-level patterns like event sourcing. All of those patterns reduce nicely down to the sequential state machine model presented here, but they’re up to the application developer to implement. The question that the Tomorrow Corporation demo asks is: what do we get if our tools did this for us without any up-front effort?&lt;/p&gt;

&lt;p&gt;Imagine not needing to add sleeps / retries to tests of asynchronous behavior. Or imagine a tool that identified concurrent code and showed us the different interleavings that we might have otherwise been unaware of, and allowed us to step through and try each of them out. I’m not a Nix user (yet), but others are already imagining a world with deterministic package management. Non-determinism, it seems, is fundamentally at odds with human brains, so I for one would love to see more determinism in any tool that I use.&lt;/p&gt;

&lt;p&gt;To get there, we’ll have to understand and implement logical time.&lt;/p&gt;

&lt;h1 id=&quot;outro&quot;&gt;Outro&lt;/h1&gt;

&lt;p&gt;I have no idea how the tools at Tomorrow Corporation are implemented, but I respect their commitment to determinism. Non-determinism is a part of life, but to have full control over a system it’s essential to view it through the deterministic lens of logical time. Because of things like concurrency which often rely on OS or language features that we can’t directly interact with, this can be difficult, but that video shows that there’s tremendous value in baking determinism further down into our foundational tools.&lt;/p&gt;

&lt;p&gt;The main thing I wanted to share in this post was a specific mental model. Sequential state machines are a tried and true model with deterministic properties, and they’ve legitimately changed how I look at software. In this model, a program is a set of behaviors, where each behavior is a sequence of states. It’s hard to imagine reducing programming down to a simpler explanation than that, and that clarity is necessary for wrangling complexity.&lt;/p&gt;

&lt;p&gt;The images in this post were generated from &lt;a href=&quot;https://learntla.com/&quot;&gt;TLA+ specs&lt;/a&gt;, which I won’t really explain, but hopefully they show that it doesn’t take a ton of effort to write simple models. TLA+ is a logic and tool which has this mental model at its foundation. I can’t recommend learning and using it enough. Its companion model checker makes the act of modeling tactile, and you can get machine feedback on your models vs. getting stuck in state-machine quicksand. The state graph visualizer is also very handy sometimes, though as was shown here is more useful when the bounds of the model are small.&lt;/p&gt;

&lt;p&gt;Here’s the spec for TodoMVC:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;------------------------------ MODULE TodoMVC ------------------------------
VARIABLES todos, completedTodos

Todos == {&quot;t1&quot;, &quot;t2&quot;}

Init == todos = {} /\ completedTodos = {}

RemainingTodos == Todos \ todos

IncompleteTodos == todos \ completedTodos

AddTodo == \E t \in RemainingTodos: todos' = todos \union {t} /\ UNCHANGED completedTodos

CompleteTodo == \E t \in IncompleteTodos: completedTodos' = completedTodos \union {t} /\ UNCHANGED todos

RemoveTodo == \E t \in todos: todos' = todos \ {t} /\ completedTodos' = completedTodos \ {t}

Next == AddTodo \/ CompleteTodo \/ RemoveTodo

=============================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And here’s the spec for the concurrency example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;---------------------------- MODULE Concurrency ----------------------------
EXTENDS Integers

VARIABLES requests

Requests == 0..2

Init == requests = [r \in Requests |-&amp;gt; &quot;idle&quot;]

SendRequest(r) == requests' = [requests EXCEPT ![r] = &quot;fetching&quot;]

RecvResponse(r) == requests' = [requests EXCEPT ![r] = &quot;done&quot;]

SendReq == \E r \in Requests: requests[r] = &quot;idle&quot; /\ SendRequest(r)

RecvResp == \E r \in Requests: requests[r] = &quot;fetching&quot; /\ RecvResponse(r)

Terminate == \A r \in Requests: requests[r] = &quot;done&quot; /\ UNCHANGED requests

Next == SendReq \/ RecvResp \/ Terminate

=============================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even if you never use TLA+, the mental model presented here can help understand software at a more fundamental level. Kudos to the Tomorrow Corporation team for an inspiring set of tools that I hope pushes people to think about determinism more.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;\s, but it actually is a good learning tool and proxy for most interactive applications &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="plt" /><category term="formal_methods" /><category term="philosophy" /><summary type="html">Recently, Tomorrow Corporation released this video of their in-house tech stack doing some truly awesome time-travel debugging of a production-quality game. You should watch this video, even if you don’t read this post, because the workflow that they’ve created is really inspiring. The creator kept bringing up the fact that the reason their tools can do this is that they have determinism baked into them at the very foundational levels. You simply can’t bolt this on at higher levels in the stack. This got me thinking - not only do we rarely have this level of control in our projects, but I think it’s rare to even understand how determinism is possible in modern systems that are interactive, concurrent, and distributed. If we don’t understand this, we can’t ever move our tools toward determinism, which I think is a very good idea. It turns out that even if we can’t predict exactly how a program will execute in a specific run, we can still model and reason about it deterministically. This is a prerequisite for most formal methods, and while I understand that formal methods aren’t everyone’s cup of tea, this is the number one thing that I wish more people understood. So today, we won’t be talking about testing or verifying anything, we’ll just be looking to better understand software in general by diving into logical time and how it enables deterministic reasoning.</summary></entry><entry><title type="html">Efficient and Flexible Model-Based Testing</title><link href="/model-based-testing-theory/" rel="alternate" type="text/html" title="Efficient and Flexible Model-Based Testing" /><published>2023-01-31T00:00:00+00:00</published><updated>2023-01-31T00:00:00+00:00</updated><id>/model-based-testing-theory</id><content type="html" xml:base="/model-based-testing-theory/">&lt;p&gt;In &lt;a href=&quot;/model-based-testing/&quot;&gt;Property-Based Testing Against a Model of a Web Application&lt;/a&gt;, we built a web application and tested it against an executable reference model. The model-based test in that post checks sequences of actions against a global system state, which is simple to explain and implement, but is unsuitable for testing practical applications in their entirety. To test the diverse applications that arise in practice, as well as test more surface area of a single application, we’ll need a more efficient and flexible approach.&lt;/p&gt;

&lt;p&gt;In that post, I promised that we’d dive deeper into the theory of model-based testing. To upgrade our testing strategy, we’ll look at the theoretical concepts of &lt;em&gt;refinement mappings&lt;/em&gt;&lt;sup id=&quot;fnref:fn1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; and &lt;em&gt;auxiliary variables&lt;/em&gt;&lt;sup id=&quot;fnref:fn2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, and add in a couple of tweaks based on the specific context of testing. All of this will get applied to &lt;a href=&quot;https://github.com/amw-zero/personal_finance_funcorrect/blob/main/simulation.ts&quot;&gt;a real test of a full-stack application&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;a-quick-recap-of-actions&quot;&gt;A Quick Recap of Actions&lt;/h1&gt;

&lt;p&gt;Understanding the notion of “action” is essential for building our upgraded model-based testing strategy. When we say “action,” we mean something very specific: a transition in a state machine / state transition system, whichever name you prefer. It might be helpful to think of it from a code perspective:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class Counter {
  count: number = 0;

  constructor(count: number) {
    this.count = count;
  }

  increment() {
    this.count += 1;
  }

  decrement() {
    this.count -= 1;
  }
}

let counter = new Counter(0);
counter.increment();
counter.decrement();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;count&lt;/code&gt; is the state variable, and &lt;code&gt;increment&lt;/code&gt; and &lt;code&gt;decrement&lt;/code&gt; are &lt;em&gt;actions&lt;/em&gt; which transition the variable to a new state. Imagine the value of &lt;code&gt;count&lt;/code&gt; after each of these actions.&lt;/p&gt;

&lt;p&gt;The presence of a class has nothing to do with this being an object-oriented concept by the way, it’s just that classes are a convenient wrapper around a set of stateful variables and operations on them, and thus they are a good representation of a state machine. We could just as easily write:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;let count = 0;

function increment(count: number): number {
  return count + 1;
}

function decrement(count: number): number {
  return count - 1;
}

count = increment(count);
count = decrement(count);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These are behaviorally equivalent, which we can convince ourselves of by again imagining the value of the &lt;code&gt;count&lt;/code&gt; state variable after each action. The pattern that we use to talk about state machines is superficial, and has nothing to do with how to structure programs in the large. Don’t let the pattern get in the way of the underlying concepts: all we need are states and transitions between them, and we call these transitions “actions.”&lt;/p&gt;

&lt;p&gt;In an interactive application, actions are generally initiated by the user by clicking on or tapping UI elements. The system itself can trigger actions, for example via cron jobs. Even external systems can trigger actions in the system by calling web APIs.&lt;/p&gt;

&lt;p&gt;Actions are what allow an application to move through different states over time.&lt;/p&gt;

&lt;h1 id=&quot;a-preview-of-our-destination&quot;&gt;A Preview of Our Destination&lt;/h1&gt;

&lt;p&gt;The end goal is to convert our existing &lt;a href=&quot;/model-based-testing/&quot;&gt;model-based test&lt;/a&gt; into one that’s more efficient and allows us to check more interesting properties. To do that, we’re going to end up with something that looks like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;type DeleteRecurringTransactionState = {
  recurringTransactions: RecurringTransaction[];
  id: number;
  db: DBState;
}

class Impl {
  db: DBState;
  client: Client;

  aux: AuxiliaryVariables;

  constructor(db: DBState, client: Client, aux: AuxiliaryVariables) {
    this.db = db;
    this.client = client;
    this.aux = aux;
  }

  async deleteRecurringTransaction(id: number) {
    await this.client.deleteRecurringTransaction(id);
    this.aux.clientModel.deleteRecurringTransaction(id);
  }

  ...
}

type AuxiliaryVariables = {
  clientModel: Budget;
}

function refinementMapping(impl: Impl): Budget {
  let budget = new Budget();
  budget.error = impl.client.error;

  budget.recurringTransactions = [...impl.db.recurring_transactions];
  budget.scheduledTransactions = [...impl.client.scheduledTransactions];

  return budget;
}

Deno.test(&amp;quot;deleteRecurringTransaction&amp;quot;, async (t) =&amp;gt; {  
  let state = /*&amp;lt;generate test state&amp;gt;*/;

  await fc.assert(
    fc.asyncProperty(state, async (state: DeleteRecurringTransactionState) =&amp;gt; {
      let client = new Client();
      client.recurringTransactions = state.recurringTransactions;

      let clientModel = new Budget();
      clientModel.recurringTransactions = state.recurringTransactions;

      let impl = new Impl(state.db, client, { clientModel });
      let model = refinementMapping(impl);

      const cresp = await client.setup(state.db);
      await cresp.arrayBuffer();

      await impl.deleteRecurringTransaction(state.id);
      model.deleteRecurringTransaction(state.id);

      impl.db.recurring_transactions = await client.dbstate();

      let mappedModel = refinementMapping(impl);

      await checkRefinementMapping(mappedModel, model, t);
      await checkImplActionProperties(impl, t);

      await client.teardown();
    }),
    { numRuns: 10, endOnFailure: true }
  );
});&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There’s no way to evaluate if this is a good test or even what exactly it’s testing for without understanding some theory. But all of this theory is in service of testing a real, functional single-page web application.&lt;/p&gt;

&lt;h1 id=&quot;correctness-as-equivalent-behavior-of-action-sequences&quot;&gt;Correctness as Equivalent Behavior of Action Sequences&lt;/h1&gt;

&lt;p&gt;We have to start all the way at the beginning and define what it really means for an implementation to be correct with respect to a model. Action sequences are a good choice for this, because they’re simple to understand. Using our &lt;code&gt;increment&lt;/code&gt; and &lt;code&gt;decrement&lt;/code&gt; functions from above, an example action sequence would be:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type Action = &quot;increment&quot; | &quot;decrement&quot;;

// Combine individual actions into a single top-level action
function counterAction(counter: number, action: Action): number {
  switch (action) {
    case &quot;increment&quot;:
      return increment(counter);
    case &quot;decrement&quot;:
      return decrement(counter);
  }
}

type ActionFunc&amp;lt;S, A&amp;gt; = (state: S, action: A) =&amp;gt; S;

// Generic action sequence evaluation function
function execute&amp;lt;S, A&amp;gt;(actionFunc: ActionFunc&amp;lt;S, A&amp;gt;, init: S, actions: A[]): S {
  let result = init;
  for (const action of actions) {
    result = actionFunc(result, action);
  }

  return result;
}

let counter = 0;
execute(counterAction, counter, [&quot;increment&quot;, &quot;increment&quot;, &quot;decrement&quot;, &quot;increment&quot;]);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;An action sequence is one particular path through a system. Here, we incremented the counter twice, decremented once, and ended with another increment. These are some more valid action sequences:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;[“increment”]&lt;/li&gt;
  &lt;li&gt;[]&lt;/li&gt;
  &lt;li&gt;[“increment”, “decrement”, “decrement”, “decrement”]&lt;/li&gt;
  &lt;li&gt;[“decrement”]&lt;/li&gt;
  &lt;li&gt;[“decrement”, “increment”, “increment”, “decrement”, “decrement”, “increment”, “increment”]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How many possible sequences of actions are there for our simple counter system? 1,000? 500,000,000? Unfortunately, the answer is infinity, and that’s true of all interactive systems. That’s one reason why testing and verification are hard.&lt;/p&gt;

&lt;p&gt;Even though they are infinite, it’s very natural to express the correctness of a model-based system in terms of action sequences using universal quantification, aka “for all” statements:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot; data-lang=&quot;plaintext&quot;&gt;** Holistic correctness statement **:

For all initial states &amp;#39;s&amp;#39;,
  all sequences of actions &amp;#39;acts&amp;#39;,
  a top-level action function &amp;#39;impl&amp;#39;,
  and a top-level action function &amp;#39;model&amp;#39;:
  
  execute(impl, s, acts) = execute(model, s, acts)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Less formally: no matter what sequence of actions you take in the implementation, nor what state it starts in, it should always agree with the model. The key words being “no matter what” and “always” - this should be true of all actions, in any order, from any starting state, ever. In other words, this statement is &lt;em&gt;complete&lt;/em&gt;, and we’ll refer to it as “the holistic correctness statement.” It’s important to keep this statement in mind, since &lt;strong&gt;this is our definition of correctness and our end goal&lt;/strong&gt;, and any optimization that we do always has to tie back to it. (Note: this is also a classic way of expressing &lt;a href=&quot;/refinement/&quot;&gt;refinement&lt;/a&gt;).&lt;/p&gt;
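&lt;p&gt;Using the counter actions from above, the holistic correctness statement translates almost directly into a property-based test. Here’s a sketch with fast-check, assuming some &lt;code&gt;modelAction&lt;/code&gt; counterpart to &lt;code&gt;counterAction&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// A direct (if untenable) rendering of the holistic correctness
// statement: generate initial states and action sequences, and check
// that implementation and model always agree.
fc.assert(
  fc.property(
    fc.integer(),
    fc.array(fc.constantFrom&amp;lt;Action&amp;gt;(&quot;increment&quot;, &quot;decrement&quot;)),
    (s, acts) =&amp;gt; {
      expect(execute(counterAction, s, acts))
        .toEqual(execute(modelAction, s, acts));
    }
  )
);
&lt;/code&gt;&lt;/pre&gt;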

&lt;p&gt;As we hinted at in the introduction, there are some very unfortunate things about this holistic correctness statement in a practical testing context. First is the &lt;code&gt;actions&lt;/code&gt; variable. A real application accepts an infinite stream of actions. Even though we limit our test to finite sequences, combinatorics is just not on our side, with the number of k-length sequences of n actions equaling n^k - a dreadful exponential growth curve. That means that as the number of actions in the systems grows, and as we test longer sequences, the number of possible interleavings of actions grows exponentially. Whatever subset of sequences our test generates is an infinitesimal portion of them all.&lt;/p&gt;

&lt;p&gt;Next is the &lt;code&gt;s&lt;/code&gt; variable. This is the &lt;em&gt;entire&lt;/em&gt; state of the system, and unless we’re building a counter application with a single integer variable it’s way too much data to generate in a test.&lt;/p&gt;

&lt;p&gt;A third problem is that &lt;code&gt;s&lt;/code&gt; is used in both the model and implementation, which means that they both have to have the same state type. This very rarely works, because the whole point of separating the model and implementation is that the implementation is complex and will have additional state to deal with that complexity. States are often incompatible in practice.&lt;/p&gt;

&lt;p&gt;The last straw is that sometimes, you don’t even have the state variables that you need to check for correctness. This sounds weird, but it’s well known that specifications often have to be augmented with “invisible” variables so that certain properties can be shown to hold.&lt;/p&gt;

&lt;p&gt;Each of these problems eventually arises when you try to use model-based testing, and we need some extra machinery to solve them.&lt;/p&gt;

&lt;h1 id=&quot;single-transitions-and-compatible-states-with-refinement-mappings&quot;&gt;Single Transitions and Compatible States with Refinement Mappings&lt;/h1&gt;

&lt;p&gt;Refinement mappings solve problems 1 and 3, and, somewhat magically, still imply the truth of the holistic correctness statement. That means that if we test for a proper refinement mapping, then it’s also true that the implementation correctly implements the model in all possible usage scenarios.&lt;/p&gt;

&lt;p&gt;A refinement mapping is just a function with a couple of special rules, some of which are out of scope for this post. The first rule is that the function is from the implementation state to the model state, e.g. in our preview of the budget app test we can see that the refinement mapping maps the &lt;code&gt;Impl&lt;/code&gt; implementation state type to the &lt;code&gt;Budget&lt;/code&gt; model type:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;function refinementMapping(impl: Impl): Budget {
  ...
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The goal here is to be able to compare the implementation to the model, and if they have different state types we need to translate states in the implementation’s state space to ones in the model’s. On top of this, the most relevant other rule for a valid refinement mapping is that, for all implementation states and actions, the implementation action must be equivalent to the corresponding model action with the refinement mapping applied in the appropriate places. In logic pseudocode:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot; data-lang=&quot;plaintext&quot;&gt;** Correctness via Refinement Mapping ** 
For all implementation states &amp;#39;s&amp;#39;,
  all implementation actions &amp;#39;impl&amp;#39;,
  all model actions &amp;#39;model&amp;#39;
  and a refinement mapping &amp;#39;rm&amp;#39;:

  rm(impl(s)) = model(rm(s))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The intuition for why it works is that, if every single-step action in the implementation agrees with the same action taken in the model, then chaining multiple actions into sequences should preserve that equivalence. This is an example of an inductive argument. The refinement mapping function can be defined in many different ways depending on how we want to relate the two state types, which gives our new correctness statement an important caveat: we consider the system correct &lt;em&gt;under the provided refinement mapping&lt;/em&gt;. This is the price we pay for dealing with state incompatibilities.&lt;/p&gt;
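
&lt;p&gt;Here’s a toy version of that single-transition check, sketched with the fast-check library (an illustrative harness choice - the real budget app test follows below). The implementation carries extra state that the model doesn’t care about, and the refinement mapping simply forgets it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import fc from &quot;fast-check&quot;;

// Toy system: the implementation tracks a click log that the model ignores.
type ImplState = { count: number; clickLog: string[] };
type ModelState = { count: number };

const implIncrement = (s: ImplState): ImplState =&amp;gt;
  ({ count: s.count + 1, clickLog: [...s.clickLog, &quot;increment&quot;] });
const modelIncrement = (s: ModelState): ModelState =&amp;gt; ({ count: s.count + 1 });

// The refinement mapping forgets the implementation-only state.
const rm = (s: ImplState): ModelState =&amp;gt; ({ count: s.count });

// Check rm(impl(s)) = model(rm(s)) on generated implementation states.
fc.assert(
  fc.property(
    fc.record({ count: fc.integer(), clickLog: fc.array(fc.string()) }),
    (s) =&amp;gt; rm(implIncrement(s)).count === modelIncrement(rm(s)).count
  )
);
&lt;/code&gt;&lt;/pre&gt;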

&lt;p&gt;In our budget app test, the refinement mapping is defined as follows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;function refinementMapping(impl: Impl): Budget {
  let budget = new Budget();
  budget.error = impl.client.error;

  budget.recurringTransactions = [...impl.db.recurring_transactions];
  budget.scheduledTransactions = [...impl.client.scheduledTransactions];

  return budget;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;code&gt;Impl&lt;/code&gt; implementation type has both database (&lt;code&gt;impl.db&lt;/code&gt;) and client states (&lt;code&gt;impl.client&lt;/code&gt;), reflecting the independent states in a client-server application. In this system, only recurring transactions are persisted, and scheduled transactions are derived data. Because of this, the implementation’s recurring transactions in the database map to the model’s recurring transactions, whereas the implementation’s scheduled transactions in the client map to the model’s scheduled transactions. Any error in the client maps to an error in the model. Notably, this is talking about &lt;em&gt;system&lt;/em&gt; errors, i.e. errors / results in the domain logic. The model has no notion of networking, so networking errors can be stored separately, but they don’t map to any model state&lt;sup id=&quot;fnref:fn3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;The meat of the test is where we compare single actions, and in order to do this we make the states compatible by applying the refinement mapping:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;...

let impl = new Impl(state.db, client, { clientModel });
let model = refinementMapping(impl);

...

// Run the action in the implementation and the model
await impl.deleteRecurringTransaction(state.id);
model.deleteRecurringTransaction(state.id);

...

let mappedModel = refinementMapping(impl);

await checkRefinementMapping(mappedModel, model, t);
&lt;/code&gt;&lt;/pre&gt;
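
&lt;p&gt;The excerpt doesn’t show &lt;code&gt;checkRefinementMapping&lt;/code&gt; itself. Given how it’s called, a minimal version (a sketch of its likely shape, not the test’s actual code) is just a deep equality between the mapped implementation state and the model:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import { assertEquals } from &quot;https://deno.land/std/testing/asserts.ts&quot;;

// Sketch: after both actions have run, the refinement-mapped implementation
// state should deep-equal the model state.
async function checkRefinementMapping(mapped: Budget, model: Budget, t: Deno.TestContext) {
  await t.step(&quot;refinement mapping holds&quot;, () =&amp;gt; assertEquals(mapped, model));
}
&lt;/code&gt;&lt;/pre&gt;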

&lt;p&gt;The combination of comparing single transitions and converting between implementation and model state types is an efficiency and flexibility win. We’ve gone from potentially long sequences of actions to comparing simple function calls, we only need to generate a single state value per test iteration, &lt;em&gt;and&lt;/em&gt; we can compare the states of the implementation and model even if they aren’t the same type.&lt;/p&gt;

&lt;p&gt;It’s great progress, but we can do even better.&lt;/p&gt;

&lt;h1 id=&quot;from-global-to-local-state&quot;&gt;From Global to Local State&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;s&lt;/code&gt; variable in our new iteration of the correctness statement is still the global state, but an observation comes to mind: how much of the global state is necessary for each action? There’s no equation which answers this question directly, but intuitively, an action will only ever operate on a small subset of the global state, leaving the rest unchanged. We can then just ignore that superfluous state and think of the action as operating on its own, local state. This is not related to refinement mapping, or any other theory that I know of (though it might relate to one that I don’t know of), but it ends up being a very useful optimization in practice.&lt;/p&gt;

&lt;p&gt;For example, consider an oddly-specific system for point translation:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;type Point = {
  x: number;
  y: number;
}

function translateX(point: Point, delta: number): Point {
  const result = { ...point };
  result.x += delta;

  return result;
}

function translateY(point: Point, delta: number): Point {
  const result = { ...point };
  result.y += delta;

  return result;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;translateX&lt;/code&gt; and &lt;code&gt;translateY&lt;/code&gt; are actions which operate on a &lt;code&gt;Point&lt;/code&gt; type, but each only modifies a single part of the state - only &lt;code&gt;x&lt;/code&gt; or &lt;code&gt;y&lt;/code&gt; of the &lt;code&gt;Point&lt;/code&gt;, but never both. Why, then, do we need to generate a full &lt;code&gt;Point&lt;/code&gt; value in our test for comparing them? We can instead construct a new action function, say &lt;code&gt;translateOnlyX&lt;/code&gt;, which only operates on the data that it actually modifies:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;function translateOnlyX(x: number, delta: number): number {
  return x + delta;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the model-based testing context, instead of comparing the functions at the global state level (&lt;code&gt;Point&lt;/code&gt; in this case), we can compare the actions at the local level:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot; data-lang=&quot;plaintext&quot;&gt;** Local Refinement Mapping Correctness Statement **

For all action functions &amp;#39;impl&amp;#39;,
  all action functions &amp;#39;model&amp;#39;,
  all local states &amp;#39;ls&amp;#39;,
  and a refinement mapping &amp;#39;rm&amp;#39;:
  
  rm(impl(ls)) = model(rm(ls))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
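
&lt;p&gt;Instantiated for the point example, the local state is a single number, and since the model and implementation share that type, the refinement mapping is just the identity. A quick property check with fast-check (again, an illustrative harness choice) looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import fc from &quot;fast-check&quot;;

// The model&#39;s version of the local action: plain addition.
const modelTranslateOnlyX = (x: number, delta: number): number =&amp;gt; x + delta;

// Local-level check: translateOnlyX (defined above) must agree with the
// model on every generated local state; rm is the identity here.
fc.assert(
  fc.property(fc.integer(), fc.integer(), (x, delta) =&amp;gt;
    translateOnlyX(x, delta) === modelTranslateOnlyX(x, delta)
  )
);
&lt;/code&gt;&lt;/pre&gt;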

&lt;p&gt;Breaking out the action implementation in this way has no behavioral effect on the global-level &lt;code&gt;translateX&lt;/code&gt; function, since &lt;code&gt;translateX&lt;/code&gt; can easily be implemented in terms of &lt;code&gt;translateOnlyX&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;function translateX(point: Point, delta: number): Point {
  const result = { ...point };
  result.x = translateOnlyX(result.x, delta);

  return result;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And this is exactly what’s going on in our upgraded budget test. In our excerpt, we’re only focusing on the &lt;code&gt;deleteRecurringTransaction&lt;/code&gt; action, and we generate a test state specific to this action:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;type DeleteRecurringTransactionState = {
  recurringTransactions: RecurringTransaction[];
  id: number;
  db: DBState;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Deleting a recurring transaction doesn’t interact in any way with the &lt;code&gt;scheduledTransactions&lt;/code&gt; state variable in that application, so we can leave that out of the test state for this particular action.&lt;/p&gt;
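
&lt;p&gt;Generating that action-local state is also cheap. Here’s a sketch of a fast-check arbitrary for it, using simplified stand-ins for the &lt;code&gt;RecurringTransaction&lt;/code&gt; and &lt;code&gt;DBState&lt;/code&gt; shapes (the real definitions live in the app’s code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import fc from &quot;fast-check&quot;;

// Simplified stand-in shape for a recurring transaction.
const recurringTransactionArb = fc.record({
  id: fc.integer({ min: 1 }),
  name: fc.string(),
  amount: fc.integer(),
});

// Only the state that deleteRecurringTransaction actually touches is generated.
const deleteRecurringTransactionStateArb = fc.record({
  recurringTransactions: fc.array(recurringTransactionArb),
  id: fc.integer({ min: 1 }),
  db: fc.record({ recurring_transactions: fc.array(recurringTransactionArb) }),
});
&lt;/code&gt;&lt;/pre&gt;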

&lt;p&gt;The end result of this is that we can get global guarantees at the cost of local checking, i.e. we can use local states and still show the holistic correctness statement.&lt;/p&gt;

&lt;h1 id=&quot;one-more-wrinkle&quot;&gt;One More Wrinkle&lt;/h1&gt;

&lt;p&gt;One last wrinkle presents itself - the notorious problem number 4. It may sound counterintuitive, but there are both refinement mappings and properties of our systems that are not expressible with the state variables of the system itself. Even if they are, they may be more naturally expressed by adding &lt;em&gt;auxiliary variables&lt;/em&gt;. Auxiliary variables are additional variables that are added to a program (usually the implementation) that don’t affect the behavior of the program, but can be used to state properties or aid in a refinement mapping to a model.&lt;/p&gt;

&lt;p&gt;Auxiliary variables provide one solution to a problem in the budget app test, and for tests for client-server applications in general. Our implementation is both the state component of a single-page application, and the corresponding server and database. One implication of that is that the client and database state can become out of sync. Consider the following action sequence:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The database starts with these recurring transactions: [rt1, rt2, rt3].&lt;/li&gt;
  &lt;li&gt;User 1 loads the home page - its client holds [rt1, rt2, rt3]&lt;/li&gt;
  &lt;li&gt;User 2 loads the home page - its client holds [rt1, rt2, rt3]&lt;/li&gt;
  &lt;li&gt;User 2 deletes rt2 - its client now holds [rt1, rt3], and the database holds [rt1, rt3]&lt;/li&gt;
  &lt;li&gt;User 1 adds a new recurring transaction, rt4 - its client holds [rt1, rt2, rt3, rt4] and the database holds [rt1, rt3, rt4].&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the end of these actions, the system has the following state:&lt;/p&gt;

&lt;p&gt;User 1’s client: [rt1, rt2, rt3, rt4]&lt;br /&gt;
User 2’s client: [rt1, rt3]&lt;br /&gt;
The database: [rt1, rt3, rt4]&lt;/p&gt;

&lt;p&gt;Again, there are a few different ways to approach either allowing or disallowing this behavior. One option is to just forbid differences in client values, but this would require something like a WebSocket connection to update all clients on each data write. While some applications actually do this (like chat applications), I would say that most don’t. Instead, we have to allow diverging client states, but we still want to do that in a controlled manner.&lt;/p&gt;

&lt;p&gt;Well, one solution is to add a separate model instance as an auxiliary variable on the implementation, one which serves as the source of truth for the client’s state alone. Then, whenever a write occurs, we double-write to the implementation and this client model. Again, there are many patterns for doing this, but I like wrapping the implementation (&lt;code&gt;Client&lt;/code&gt; here) in a new class with the same interface that forwards actions to the relevant members; this way, the structure of the test doesn’t have to change, and we keep all of the auxiliary variables in test-specific code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;class Impl {
  db: DBState;
  client: Client;

  aux: AuxiliaryVariables;

  constructor(db: DBState, client: Client, aux: AuxiliaryVariables) {
    this.db = db;
    this.client = client;
    this.aux = aux;
  }

  async deleteRecurringTransaction(id: number) {
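    // Double-write: perform the real client action, then mirror it in the client model.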
    await this.client.deleteRecurringTransaction(id);
    this.aux.clientModel.deleteRecurringTransaction(id);
  }

  ...
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
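
&lt;p&gt;The &lt;code&gt;AuxiliaryVariables&lt;/code&gt; type itself isn’t shown in the excerpt. Based on how it’s used, a minimal definition (my guess at its shape) only needs the client model:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;// Sketch based on usage in the excerpt: the only auxiliary variable so far
// is a model instance tracking the expected client state.
type AuxiliaryVariables = {
  clientModel: Budget;
};
&lt;/code&gt;&lt;/pre&gt;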

&lt;p&gt;In the test excerpt, we see another assertion named &lt;code&gt;checkImplActionProperties&lt;/code&gt;&lt;sup id=&quot;fnref:fn4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fn4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, and its definition will now make sense:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-typescript&quot; data-lang=&quot;typescript&quot;&gt;async function checkImplActionProperties(impl: Impl, t: Deno.TestContext) {
  await t.step(&amp;quot;loading is complete&amp;quot;, () =&amp;gt; assertEquals(impl.client.loading, false));

  await t.step(&amp;quot;write-through cache: client state reflects client model&amp;quot;,
    () =&amp;gt; assertEquals(impl.client.recurringTransactions, impl.aux.clientModel.recurringTransactions)
  );
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;After each action has been invoked, we check that the actual state of the client matches the state of the &lt;em&gt;client&lt;/em&gt; model, not the system model, which is only aware of the database state. We also check that the loading variable in the client is false for good measure, ensuring that any spinners or other loading UI are hidden at the end of every action.&lt;/p&gt;

&lt;p&gt;The key here is that, as long as they don’t affect the behavior of the implementation, we can add any auxiliary variables we want for tracking &lt;em&gt;additional&lt;/em&gt; information. Once we have them, we can use them for test assertions, totally independent of the implementation that runs in production. They’re test-only code.&lt;/p&gt;

&lt;p&gt;I’m going to be honest - I can have too much fun with auxiliary variables, and that means that we should be careful with them. They are basically a cheat code, and can be used as an escape hatch to get out of all kinds of situations. That being said, they’re sometimes the most elegant solution to a problem, and they’re a key piece in making our test flexible enough to handle the many scenarios that arise in practice. If anything becomes difficult to assert on or express as a property, we can try to make it easier by adding new auxiliary variables.&lt;/p&gt;

&lt;h1 id=&quot;recap&quot;&gt;Recap&lt;/h1&gt;

&lt;p&gt;Alrighty. We went over four main problems and solutions to them:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Action sequences&lt;/li&gt;
  &lt;li&gt;Global state&lt;/li&gt;
  &lt;li&gt;State incompatibility&lt;/li&gt;
  &lt;li&gt;Inexpressible properties&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We introduced refinement mappings, which are functions from the implementation state to the model state, and which require that single transitions in the implementation and model must be equivalent under this mapping. This overcomes both state incompatibility and the need for action sequences. We showed that by using action-local state we can avoid ever constructing global system state in the test. And we showed that if we’re ever unable to express a property about our system, we can always add auxiliary variables which don’t affect the system behavior but track additional information that we can use in test assertions.&lt;/p&gt;

&lt;p&gt;What we ended up with is a framework for writing model-based tests that is both efficient and flexible, and applicable to real-world systems like database-backed web applications.&lt;/p&gt;

&lt;p&gt;The linked papers have plenty more theoretical background and examples for deeper dives on these topics.&lt;/p&gt;

&lt;h1 id=&quot;thanks&quot;&gt;Thanks&lt;/h1&gt;

&lt;p&gt;Big thanks to &lt;a href=&quot;https://www.hillelwayne.com&quot;&gt;Hillel Wayne&lt;/a&gt; for having an in-depth conversation about refinement with me, which influenced my thinking about how to best define the system state for a client-server application.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:fn1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I recommend reading &lt;a href=&quot;https://www.microsoft.com/en-us/research/publication/the-existence-of-refinement-mappings/&quot;&gt;this paper to get a handle on refinement mappings&lt;/a&gt;. Another name for this technique is &lt;em&gt;simulation&lt;/em&gt;, which you can see an example of in &lt;a href=&quot;https://doclsf.de/papers/klein_sw_10.pdf&quot;&gt;how seL4 proves that the implementation implements its functional specification&lt;/a&gt;. Both are the same ultimate idea - prove that one program implements another by showing that all single transitions in each implement each other. &lt;a href=&quot;#fnref:fn1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We’ll expand on what auxiliary variables are throughout the post, but you can read more about them &lt;a href=&quot;https://lamport.azurewebsites.net/tla/hiding-and-refinement.pdf&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/auxiliary.pdf&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:fn2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Errors that can be present in the implementation but not the model are an interesting topic. For example, if a network error occurs in a request during the course of an action in the implementation, then the implementation certainly won’t complete the action in a way that implements the model. One option is to be liberal, and simply avoid comparing the model and implementation in this case. We didn’t cover stuttering here, but models are allowed to stutter (transition to the current state) during implementation steps, so an implementation error could be interpreted as a model stutter. The issue is, if the network error happens on every single action invocation, the implementation will never match the non-stuttering step of the model. The other option is to be harsh, and require that there are no network errors in tests, but still plan for them and allow them in production. The current version of this test chooses to be harsh. I’ll let you know how that goes. &lt;a href=&quot;#fnref:fn3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fn4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.hillelwayne.com/post/action-properties/&quot;&gt;Action properties&lt;/a&gt; are a subset of temporal properties. They allow you to assert things about state transitions that you couldn’t assert about individual states. They’re very useful. &lt;a href=&quot;#fnref:fn4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Alex Weisberger</name></author><category term="testing" /><category term="formal_methods" /><summary type="html">In Property-Based Testing Against a Model of a Web Application, we built a web application and tested it against an executable reference model. The model-based test in that post checks sequences of actions against a global system state, which is simple to explain and implement, but is unsuitable for testing practical applications in their entirety. To test the diverse applications that arise in practice, as well as test more surface area of a single application, we’ll need a more efficient and flexible approach. In that post, I promised that we’d dive deeper into the theory of model-based testing. To upgrade our testing strategy, we’ll look at the theoretical concepts of refinement mappings1 and auxiliary variables2, and add in a couple of tweaks based on the specific context of testing. All of this will get applied to a real test of a full-stack application. I recommend reading this paper to get a handle on refinement mappings. Another name for this technique is simulation, which you can see an example of in how seL4 proves that the implementation implements its functional specification. Both are the same ultimate idea - prove that one program implements another by showing that all single transitions in each implement each other. &amp;#8617; We’ll expand on what auxiliary variables are throughout the post, but you can read more about them here and here. &amp;#8617;</summary></entry></feed>