Concerning Quality

Extracting a Verified Interpreter from Isabelle/HOL

Sometimes building a language is the best solution to a problem. This takes on many forms, from designing DSLs to implement an OS shell to more ambitious languages for verifying filesystems. Since we’re in the Golden Age of PL Research, there are plenty of reasons to give language design a try!

Languages are foundational, though, and soundness issues in them affect all of their programs. This makes them a great candididate for formalization and verification, since the cost of errors is very high, but as usual that means we have to address the verification gap. In this post we’ll cross the gap by building and (partially) verifying an operational semantics of a small language in Isabelle/HOL, and we’ll extract this semantics into an executable interpreter in OCaml.

The Language: Boolean Expressions

Let’s keep the language simple to focus on the end-to-end process of extraction. Our language will allow expressing and evaluating boolean expressions and conditionals, like “if true then true else false” (this is just the untyped boolean expression language from Chapter 3 of Types and Programming Languages).

First, we define the (abstract) syntax, in Isabelle:

datatype boolexp =
  BTrue | 
  BFalse | 
  BIf boolexp boolexp boolexp

This is a simple AST definition allowing terms like BIf BTrue BTrue BFalse and BIf (BIf BTrue BFalse BTrue) BTrue (BIf BTrue BFalse BTrue).

One way to prodce an executable interpreter is with a big-step operational semantics. Here’s one for this language:

definition is_value :: "boolexp  bool" where
"is_value t = (case t of BTrue  True | BFalse  True | _  False)"

inductive bigstep :: "boolexp  boolexp  bool" where
bval: "is_value t  bigstep t t" |
bif_true: "bigstep t1 BTrue; bigstep t2 v  bigstep (BIf t1 t2 t3) v" |
bif_false: "bigstep t1 BFalse; bigstep t3 v  bigstep (BIf t1 t2 t3) v"

An inductive definition like this describes legal evaluations as predicates on the start and end terms. For example, the bif_true rule says that if we know that t1 evaluates to BTrue, and t2 evaluates to v, then (BIf t1 t2 t3) evaluates to v. If we think of this in concrete syntax, it just means that to evaluate if true then t2 else t3, we first recursively evaluate t2 because the conditional is true. Then the whole expression evaluates to the result, which is v. The bval rule just makes sure that the operation terminates once a value of BTrue or BFalse is reached, where they just evaluate to themselves.

It’s common to define semantics like this in proof assistants, (though it’s not the only way to do it), but a couple of incantations are necessary to make it actually executable:

code_pred (modes: i => o => bool as bigstep') bigstep . 

definition "bigstep_ex t = Predicate.the (bigstep' t)"

There’s a whole section devoted to inductive predicates and code generation in the Isabelle codegen documentation, but long story short this just turns the set of evaluation transitions into a function that instead computes the result given a starting term. These are really just two different ways of expressing the same concept, but the function version is the one that we can actually use to evaluate terms as we’re used to with executable interpreters.

Now we can evaluate the example terms from before to see that our semantics evaluates an input term vs. just describing legal transitions:

text "evalulates to: BTrue"
value "bigstep_ex (BIf BTrue BTrue BFalse)"

text "evalutes to: BFalse"
value "bigstep_ex (BIf (BIf BTrue BFalse BTrue) BTrue (BIf BTrue BFalse BTrue))"

Metatheoretical Determinism

Just to show that this semantics can be used in a real proof, here’s a proof that the semantics is deterministic, i.e. any term always evaluates to the same value. This isn’t always the case, particularly when concurrency gets introduced to the language, so knowing that our language has this property or not is useful:

theorem "bigstep t t'; bigstep t t''  t' = t''"
proof (induction t t' arbitrary: t'' rule: bigstep.induct)
  case (bval t)
  then show ?case by (auto simp: is_value_def intro: bigstep.cases)
  case (bif_true t1 t2 v t3)
  then show ?case
    by (smt (verit, best) bigstep.cases boolexp.distinct(1) boolexp.inject boolexp.simps(10) is_value_def)
  case (bif_false t1 t3 v t2)
  then show ?case
    by (smt (verit) bigstep.cases boolexp.distinct(1) boolexp.inject boolexp.simps(10) is_value_def)

The point of this post isn’t really the proof itself, and in full disclosure the last 2 cases of the proof were found with Isabelle’s famous sledgehammer :). This “language” is silly and small, so we won’t go any further with proving any metatheoretical properties about it, but it is important to show that the same definition that can be used in proof can be executed as a real program. Which brings us to…

Crossing the Verification Gap: From Isabelle to OCaml

If we hand-build a compiler or interpreter for our language, it might diverge from our semantics in subtle ways. Isabelle offers code extraction functionality though, and we can just extract the logic we just verified into an equivalent OCaml program:

export_code bigstep_ex BTrue in OCaml file_prefix "core"

Exporting the bigstep_ex function is obviously necessary, but exporting the BTrue constructor of the boolexp datatype might look surprising. It ends up being necessary because Isabelle generates all code inside of an OCaml module, and without this the datatype constructors would be private to the module. Since we’re going to parse and execute real code, we need to create values of type boolexp from outside of this module, and exporting a single constructor is enough to have Isabelle export all of the constructors.

That brings us to the final step, where we’ll parse source code and evaluate it using our extracted semantics.

Parsing and Evaluation

Here’s a quick lexer (using ocamllex):

  open Lexing
  open Parser

  exception SyntaxError of string

let whitespace = [' ' '\t' '\r' '\n']

rule read = parse
  | whitespace+       { read lexbuf }
  | "true"            { TRUE }
  | "false"           { FALSE }
  | "if"              { IF }
  | "then"            { THEN }
  | "else"            { ELSE }
  | '('               { LPAREN }
  | ')'               { RPAREN }
  | _                 { raise (SyntaxError ("Unexpected char: " ^ Lexing.lexeme lexbuf)) }
  | eof               { EOF }

And the corresponding parser (using menhir):

open Core.BoolExp

// Values
%token TRUE
%token FALSE

%token IF
%token THEN
%token ELSE

%token LPAREN
%token RPAREN
%token EOF

%start prog
%type <boolexp option> prog 


  | e = expression EOF { Some e }
  | EOF                { None };

  | TRUE                          { BTrue }
  | FALSE                         { BFalse}
  | IF e1 = expression THEN e2 = expression ELSE e3 = expression 
                                  { BIf(e1, e2, e3) }
  | LPAREN e = expression RPAREN  { e }

Note that in between curly braces we’re creating OCaml values that correspond to our boolexp AST defined in Isabelle. This is the step where we go from a piece of source code to a piece of structured syntax that our executable semantics can actually execute.

Top this all off with a little bit of glue code to lex and parse a source code string, and we can execute the more complicated of our previous syntax examples:

open Bexp_ocaml
open Bexp_ocaml.Lexer
open Lexing

let evaluate expr =
  let lexbuf = Lexing.from_string expr in
  match Parser.prog lexbuf with
  | Some value ->
    let parsed = Util.string_of_boolexp value in
    Printf.printf "Parsed term: %s\n" parsed;

    let result = Core.BoolExp.bigstep_ex value |> Util.string_of_boolexp in
    Printf.printf "Result: %s\n" result;
  | None -> ()

let () = evaluate "if (if true then false else true) then true else (if true then false else true)"; ()

This evalutes to BFalse, just as it did in Isabelle, and we now have a verified and executable interpreter for terms in our language.

Behind the scenes there’s a little more OCaml setup with dune to get all this to build, and the full example can be found here. The full exported OCaml code is also there, which we’ve omitted here for brevity.

Now What?

Now that we have this interpreter what can what can we do with it? We can of course just use it directly if we want our language to stay interpreted and that’s it. But it can also be used as a source of truth for other language components.

For example, Conrad Watt formalized the semantics of WebAssembly and used it to check wasm implementations against an extracted reference interpreter using fuzz testing. This is a hybrid approach, somewhere between program extraction and model-based test case generation, and it allows for hand-writing a language implementation while still being “connected” to the verified semantics.

Having an executable version of our semantics enables other similar workflows - any place where we want to check for what the definition of correctness is, we now have an oracle.

Wrapping Up

Extraction is one way to handle the verification gap, and Isabelle’s extraction engine is both very powerful and customizable - documentation here. Once extracted, the language semantics is no longer just a mathematical model, but can be used to execute real programs. It has to be said that code generation isn’t guaranteed to be correct in terms of preserving the semantics of the Isabelle version. The reason it’s not guaranteed is because OCaml doesn’t have a formal semantics, so there can be no proof about its programs. In practice, however, this isn’t a huge problem, because HOL maps very directly to the informal semantics of a functional language like OCaml.

Proof assistants can seem like intimidating tools, but hopefully this showed that a verified interpreter can be extracted with a surprisingly small amount of effort. Of course larger languages will have more complicated semantics and associated proofs, but the end-to-end idea can be carried out pretty easily.