Relational compilation for end-to-end verification
Date: December 2, 2021
Abstract
Purely functional programs verified using interactive theorem provers typically need to be translated to run: either by extracting them to a similar language (like Coq to OCaml) or by proving them equivalent to deeply embedded implementations (like C programs). Traditionally, the first approach was automated but produced unverified programs with average performance, and the second approach was manual but produced verified, high-performance programs.
This thesis shows how to recast program extraction as a proof-search problem to automatically derive correct-by-construction, high-performance code from shallowly embedded functional programs. First, it introduces a unifying framework, relational compilation, to capture and extend recent developments in program extraction, with a focus on modularity and sound extensibility. Then, it presents Rupicola, a relational compiler-construction toolkit designed to extract fast, verified, idiomatic low-level code from annotated functional models.
The originality of this approach lies in its combination of foundational proofs, extensibility, and performance, backed by an unconventional take on compiler extensions: unlike traditional compilers, Rupicola generates good code not because of clever built-in optimizations, but because it allows users to plug in domain- and sometimes program-specific extensions soundly. This thesis demonstrates the benefits of this approach through case studies and performance benchmarks that highlight how easy Rupicola makes it to create domain-specific compilers that generate code with performance comparable to that of handwritten C programs.
1 Introduction
Vulnerabilities in critical systems fall into roughly two categories: logic mistakes (incorrect business logic) and programming mistakes (use-after-free, out-of-bounds accesses, etc.). High-level languages attempt to eliminate logic mistakes by promoting higher levels of abstraction that facilitate reasoning about program behavior, and they protect against low-level issues using static and dynamic checks and safer programming paradigms (garbage collection to rule out use-after-free errors, stream- and result-oriented APIs for out-of-bounds accesses, etc.).
At one extreme, purely functional languages offer very strong protections against low-level mistakes and readily lend themselves to mathematical reasoning. By eliminating arrays, exceptions, state, and other low-level concerns and encouraging higher-order programming, languages like Coq [Coq+Zenodo2021], Lean [Lean+deMoura+CADE2015], Idris [Idris+Brady+JFP2013], or the pure subsets of Haskell and F* [FStar+Swamy+POPL2016] [Haskell+Hudak+HOPL2007] offer programming models much less susceptible to the low-level issues that plague the vast majority of today's critical systems.
Unfortunately, this combination of flexibility and safety comes at a significant performance cost: it is an unsolved problem to program a compiler for any of these purely functional languages that verifiably preserves all of their high-level guarantees while offering performance competitive with the usual low-level suspects, especially C.
In fact, to ground this discussion of high-level inefficiencies, let us consider the simple task of converting an ASCII string str to uppercase, maybe as part of a network program that receives a request and normalizes its contents to use them as a key in a table of records.
In a purely functional language like Gallina (part of the Coq proof assistant), this task can be implemented succinctly as follows (with strings being linked lists of characters, characters an inductive type with 256 cases, and toupper a switch with one case per lowercase letter):
String.map Char.toupper str
This program accurately captures the intent of the task, but how fast does it run? When extracted to OCaml, it will pointer-chase through a linked list to traverse the original string (creating data dependencies and cache pressure), create a fresh string (costing allocations, cache misses, and an extra traversal for garbage collection), and either stack-overflow on long strings (due to a non-tail-recursive map, though there have been recent developments in that space), or traverse the string twice (doubling allocation and pointer-chasing costs), or accumulate continuations (even more allocations).
In contrast, the low-level implementation below performs a single pass, occupies constant stack space, does not allocate, is cache-friendly, can be unrolled, and is trivially vectorizable (toupper on ASCII chars is just a comparison and a bitmask): it assumes that the original string is never reused, represents str as a contiguous array of characters, and mutates it with a simple for loop.
for (int i = 0; i < len; i++)
str[i] = toupper(str[i]);
It is possible to see this low-level program as a transformation of the high-level one above, but only if we broadly generalize the way we think about compiler extensions.
Rethinking compiler extensions — The traditional extension point in a compiler is a single-language rewrite rule (e.g., in GHC Haskell), and much past research has studied compiler optimizations through the lens of term-rewriting systems. In fact, many compilers are implemented as sequences of lowering passes interleaved with optimization passes that operate on single languages. Unfortunately, single-language rewrite rules are poorly suited to the kind of cross-language transformations with potentially complex side conditions that would be needed to automatically generate the C program above from the Coq one preceding it.
Expressing the translation of String.map into a for loop with mutation as a rewrite rule, for example, would require us either to encode mutation and for loops explicitly in the source language (so that the transformation could be performed within Gallina), or to extend C to support higher-order functions and folds (so that the transformation could be performed within C), or to use a generic compiler to lower folds into C and subsequently eliminate all the resulting cruft (including closures and a GC) using more rewrite rules. This expressivity issue, in addition to the fact that such transformations are often conditional on relatively complex side conditions that are best solved by user-provided partial decision procedures, makes rewrite rules poorly suited to our problem.
The aim of this thesis is to make it possible to build custom compilers that support complex cross-language optimizations, allowing users to translate the functional String.map program above into an efficient in-place loop. Such transformations are crucial for good performance and easy to express for specific programs or domains, but they are either too narrow in scope or too complex to generalize to be a good fit for a general-purpose compiler.
A different kind of compiler — To realize this vision, this thesis describes a different approach to the compilation of functional programs. Its defining characteristic is that it trades completeness for performance: in my approach, users assemble custom compilers for specific programs or collections of programs, and as a result these compilers do not need to support the full complexity of the source language. By abandoning completeness, code patterns that would be complex and costly to compile in full generality (e.g. requiring garbage collection, closures, a runtime system, etc.) can be mapped to simple low-level constructs like loops and mutations by exploiting domain- or program-specific assumptions.
This vision is realized as a compiler-construction toolkit, Rupicola, implemented in the Coq proof assistant. In Rupicola, users assemble compilers by combining individual theorems that connect high-level patterns in Gallina (Coq's functional programming language) to low-level code fragments in Bedrock2 (a low-level imperative language developed at MIT [Lightbulb+Erbsen+PLDI2021]). So, for example, a Rupicola user may prove a theorem that translates all maps on lists into for loops that mutate arrays in place. Such an implementation choice is not applicable to all uses of maps and lists: it works only if mutation is acceptable (i.e. if the original list is not needed anymore after the map), and if the original code neither adds to nor removes from the list (these operations have no direct equivalent on a fixed-length array).
Rupicola is not intended to replace all program extraction. Instead, Rupicola is restricted, out of the box, to a minimal set of constructs (essentially arithmetic, simple data structures, and some control flow), yielding a predictable and transparent compilation process. Users are expected (and enabled) to extend it as needed for each new domain, plugging in domain- or program-specific compilation hints that capture the insight that humans would normally apply when manually implementing high-level specifications in a low-level language: details of memory layout and memory management, implementation strategies for data-structure traversals, etc.
The figure below shows where Rupicola fits in a complete pipeline from high-level specifications to assembly code.
Where Rupicola fits in the bigger picture. Rupicola is not a universal compiler: it bridges the gap between shallowly embedded purely functional programs and deeply embedded imperative code by accepting a restricted (but extensible) input that can reliably and predictably be translated to fast low-level code. Right: Rupicola's namesake, the Guianan cock-of-the-rock, Rupicola rupicola.
Rupicola's target audience and use cases — Rupicola works best for small, performance-critical programs, where precise control over implementation choices and optimizations is crucial — the kind of programs that experts write directly in C, to avoid the overheads introduced by traditional functional-programming compilers. In other words, Rupicola's users are expected to know what kind of low-level code they want, and Rupicola's task is to allow them to generate that code from functional models instead of writing it directly. Rupicola's value proposition, in this use case, is a combination of automation and ease of reasoning:
Transformations that would be repeatedly applied by hand when manually implementing a low-level program from a high-level description are instead encoded a single time as compiler extensions, and subsequently applied many times across a range of related programs. Users pay a bit more upfront (they have to encode the transformation into a Rupicola plug-in) but reap the benefits down the line.
Reasoning about the source programs becomes much simpler: since the source programs are valid Gallina code, proofs about these programs do not have to deal with the subtleties of low-level languages like mutation, complex control flow, or memory allocation. This makes verifying code from end to end much easier, because all program-specific reasoning and proofs happen on shallowly embedded purely functional programs, which are only then translated into efficient imperative code.
In a sense, Rupicola codifies and automates away the most unpleasant part of traditional end-to-end verification pipelines. In the traditional world, authors not willing to rely on Coq's extraction (for performance or trust reasons) will manually relate handwritten, deeply embedded low-level programs to functional models, and then they will separately relate each functional model to a high-level specification. In that world, authors must repeatedly deal with the complexities of the low-level language's semantics and with details such as when to allocate or free memory or how to relate low-level memory layouts to high-level functional models. Rupicola, in contrast, completely automates the first phase of this process, generating the deeply embedded low-level program from its functional model by leveraging user-provided hints and program annotations. In Rupicola, programmers only supply shallowly embedded programs written in a subset of Gallina that naturally maps to low-level constructs, and the tooling produces low-level, deeply embedded code. Unchanged is the second phase that relates these functional models to abstract specifications: that part is still the programmer's responsibility. [1] But because Rupicola's inputs are shallowly embedded, this phase is disconnected from the details of the low-level language's semantics, and the traditional reasoning patterns best supported by Coq — especially structural induction — are fully applicable.
1.1 Dissertation outline
This dissertation starts with an in-depth presentation of relational compilation, the theoretical foundation that Rupicola is built on, with special emphasis on composability and extensibility. Relational compilation is a unifying framework that I developed to capture and extend recent developments in program extraction: a collection of program-derivation techniques that soundly bridges the gap from shallowly embedded programs to deeply embedded executable code. Some of the ideas that I present in this section were previously known, but the presentation itself is new, as are many of the advanced use cases that I demonstrate. By leveraging this technique, I show how to construct modular, extensible domain-specific compilers that perform advanced domain- or even program-specific transformations in a safe manner.
I then provide a real-world perspective on relational compilation, using it to derive, in the Coq proof assistant, high-performance implementations of various small yet bug-prone low-level programs. The resulting framework, Rupicola, is a compiler-construction toolkit, not a standard compiler. Because our focus is on low-level programming, we [2] made no attempt to compile all or even most functional programs. Instead, we focused on relatively small, loop-oriented programs such as those often found in binary parsers, text-manipulation libraries, cryptographic routines, system libraries, and other high-risk, high-performance code. These programs are not traditionally implemented in purely functional languages, but this thesis shows that relational compilation can in fact be used to write performance-critical programs in that style, directly within the native, pure logic of an interactive theorem prover, combining straightforward reasoning and verification with excellent performance — on par with handwritten code. In other words, Rupicola's inputs are pure and written with maps and folds, but they compile to C code that manages its own memory, mutates its inputs, and runs at the speed of vectorizable for loops.
To support these claims, this dissertation then presents a collection of case studies and benchmarks. The benchmarks measure the performance of code generated using Rupicola for a variety of example programs and show that it generally matches that of handwritten code. Case studies quantify the effort needed to plug in new translations and to expose low-level features to source programs; demonstrate the benefits of switching from reification to relational compilation in Rupicola's expression compiler (a 10× reduction in Ltac code size); show how Rupicola fits within an end-to-end pipeline by measuring the effort needed to connect high-level specifications to a functional model suitable for compilation with Rupicola; and describe third-party uses of Rupicola.
1.2 Thesis contributions
This thesis claims two main contributions:
A novel, systematic presentation of relational compilation, in Coq, with a focus on composability and extensibility;
The description and evaluation of Rupicola, a relational compiler from Gallina, Coq's programming language, to Bedrock2 [Lightbulb+Erbsen+PLDI2021], a low-level imperative language. Rupicola advances the state of the art through its composable support for arbitrary monadic programs, novel treatment of loops, and output-code performance.
In other words, Rupicola demonstrates that, at least for some simple loop-oriented programs, users do not have to choose between slow but reliable high-level code and fast but bug-prone low-level programs.
While the techniques that this thesis develops are presented in the context of the Coq proof assistant, they are more broadly applicable. In particular, Rupicola innovates in the way it treats loops and effects, and these innovations could carry over to other systems:
Most verification systems handle loops in a unified way: for example, the semantics of Bedrock2, the language that Rupicola targets, has a single deduction rule for while loops. This means that reasoning about loops is often done in terms of the lowest-level loop primitive (high-level loops are mapped to a low-level iteration primitive, and proofs about the loop's body mention symbolic state also phrased in terms of that primitive). In Rupicola, in contrast, each type of loop is compiled with a custom lemma: there are distinct lemmas for maps, folds, iteration on ranges of numbers, etc., and sometimes more than one lemma for a single type of loop. These custom loop lemmas differ in the way they encode intermediate loop states: each one of them exclusively mentions the corresponding high-level iterator (map, fold), independently of the way the loop is eventually compiled. Thanks to this, invariant inference for Rupicola loops can be entirely automated (the computation of the loop's strongest postcondition is trivial), and all reasoning about loops and loop states is done at the level of purely functional code.
Control over low-level features such as memory allocation and over effects such as mutation is critical for performance, but rewriting functional programs to make uses of these features explicit (e.g. using monadic encodings) is costly. Rupicola instead introduces most effects and low-level features as part of the compilation process, using lightweight annotations that are semantically transparent — that is, they do not make reasoning about the program any more complicated, and they are not reflected in the program's type. As a bonus, most of Rupicola is parametric on a choice of monad, so even code best expressed using monads can be mapped to low-level code with minimal effort.
There have been many previous efforts in this space, foremost among them developments on Imperative/HOL [ImperativeHOL+Lammich+ITP2015] [LLVMHOL+Lammich+ITP2019], CakeML [CakeMLExtraction+Myreen+JFP2014] [HOLCakeML+Hupel+ESOP2018], Œuf [OEuf+Mullen+CPP2018] [BinaryCodeExtraction+Kumar+ITP2019], HOL compilation to Verilog [HOLVerilog+Loow+FormaliSE2019], Fiat-to-Facade [FiatToFacade+PitClaudel+IJCAR2020] (my own previous work), CertiCoq [CertiCoq+Anand+CoqPL2017], and Low* [Kremlin+Protzenko+ICFP2017]. Rupicola's novelty is its combination of performance, foundational proofs, and extensibility:
All projects above except CertiCoq and Low* use relational compilation pipelines. Among these, only Fiat-to-Facade focuses on generating high-performance low-level code from functional programs, but it did not come close to achieving Rupicola's performance (other relational compilers target either garbage-collected languages or other types of languages like Verilog; LLVM/HOL for example compiles from a one-to-one shallow embedding of LLVM).
KreMLin, Low*'s compiler, does produce code with performance matching handwritten programs, but it is not formally verified (Rupicola's output is certified by a proof of total correctness). CertiCoq has proofs (though not for the initial reification step), but it is a standard compiler forcing use of a runtime system.
Among the above, only Fiat-to-Facade strove for straightforward user extensions, but unlike in Rupicola, users were limited by the linearity of the target language, and support for loops and effects was ad hoc (a loop could mutate only one object, and the nondeterminism monad was hardcoded).
2 Prelude: Interactive theorem proving and formal verification
The software artifacts that support this thesis are built within the Coq proof assistant. Coq is a venerable piece of software: its development started in 1984 [3], and it has over time grown into a powerful and versatile programming and verification environment.
At the core of Coq is a programming language called Gallina. It resembles traditional functional languages like SML, OCaml, F#; here, for example, is how one can define a Coq function that filters a list according to a predicate:
From Coq Require Import List. Import ListNotations.

Fixpoint filter {α} (p: α -> bool) (l: list α) {struct l} :=
  match l with
  | [] => []
  | h :: t => let t' := filter p t in
              if p h then h :: t' else t'
  end.
A function defined in this way can be executed interactively using the Compute command:
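For instance, we can evaluate a call to filter directly (a minimal example; Nat.even is the standard parity test on natural numbers):

Compute filter Nat.even [1; 2; 3; 4].
(* = [2; 4] : list nat *)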
The same function can then be extracted — that is, translated — to other languages, including OCaml (this is the way most Coq programs are converted into executable binaries today):
From Coq Require Import Extraction ExtrOcamlBasic.
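For instance, with ExtrOcamlBasic mapping Coq lists onto native OCaml lists, extracting filter should produce OCaml code along these lines:

Recursive Extraction filter.
(* let rec filter p = function
   | [] -> []
   | h :: t -> let t' = filter p t in if p h then h :: t' else t' *)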
Beyond these superficial similarities, Coq diverges from traditional functional languages in multiple important ways — some theoretical and some practical. For the purposes of the discussion to follow, the following are the most relevant differences:
First, there are no effects in Coq: no mutable variables, no exceptions, no nondeterminism, no I/O and no control flow except for recursion and pattern matching. As a result, computations in Coq are very similar to mathematical computations: calling the same function twice with the same arguments always returns the same values.
Second, all programs written in Coq terminate (there are no unbounded loops, only well-founded recursion); hence Coq rejects the following incorrect program, for example:
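A representative example of such a rejected definition (the Fail prefix asserts that Coq indeed refuses the command):

Fail Fixpoint loop (n: nat) : nat := loop n.
(* Rejected: the recursive call is not on a structurally smaller argument. *)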
As a consequence, with a few exceptions irrelevant to this discussion, reduction strategies do not matter in Coq. This is important because computation is at the core of everything in Coq, and unlike most languages Coq supports partial evaluation, reduction under binders, and reduction with holes:
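For instance, Coq will happily normalize under a lambda (a minimal illustration):

Compute (fun n : nat => 1 + n).
(* = fun n : nat => S n : nat -> nat *)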
Third, Coq has a particularly powerful type system, capable of capturing not just the simple types of OCaml, but also relations between types and values — in fact, Coq does not distinguish between types and values: the types of natural numbers and lists are defined by induction, but so are the types of logical disjunctions, existentially quantified propositions, or even equalities:
This allows Coq's language, Gallina, to be used not just for writing programs, but also for stating properties and proving them. For example, we can define predicates and prove theorems about the filter function above. The predicate contains below returns a mathematical proposition capturing whether a value is contained in a list:
Fixpoint contains {α} (l: list α) (x: α) : Prop :=
match l with
| [] => False
| h :: t => h = x \/ contains t x
end.
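For instance, applying contains to a concrete list unfolds it into a nested disjunction:

Compute contains [1; 2] 1.
(* = 1 = 1 \/ 2 = 1 \/ False : Prop *)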
Proofs in Coq are simply values that inhabit certain types: just like we can say that 5 has type nat (for natural numbers), we can say that Nat.add_comm has type forall n m: nat, n + m = m + n — and really what this means is that Nat.add_comm is a function that, given two numbers n and m, returns a proof (a value) of type n + m = m + n:
From Coq Require Import Arith.
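We can ask Coq to confirm this type:

Check Nat.add_comm.
(* Nat.add_comm : forall n m : nat, n + m = m + n *)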
Here is a theorem about the filter function defined above. Its statement is a type: that of a function which, given a function f, a list l, a value x, and a proof of type contains (filter f l) x, returns a proof that f x equals true:
(A small auxiliary lemma, proved by destruct b; reflexivity, states that for any f: A -> B, b: bool, and a, a': A, f (if b then a else a') = (if b then f a else f a').)

The statement to prove, with α and f fixed in the context:

α: Type
f: α -> bool
============================
forall (l : list α) (x : α), contains (filter f l) x -> f x = true
Proofs in Coq are typically written using a meta-language that generates terms under the hood; this is called the tactic language. Coq is an interactive system: proofs are written step by step, and after each step the system displays the current state of the proof, a collection of open “goals”, which are secondary theorems that need to be proved to complete the main proof. Let us see a concrete example; in what follows the gray boxes indicate Coq's output, and the text without background is user input. [4]
Proof.
The Proof command begins the proof. Above the bar are the hypotheses that hold at this point in the proof. Below the bar is the goal, the theorem that we are trying to prove. Since the contains predicate is defined by induction on l, we follow that structure in the proof, using the induction tactic; this corresponds to distinguishing two cases in the proof: empty and non-empty lists.
α: Type
f: α -> bool
============================ (1/2)
forall x : α, contains (filter f []) x -> f x = true

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
============================ (2/2)
forall x : α, contains (filter f (a :: l)) x -> f x = true
We are now presented with two “subgoals” — two cases. In the first, the list l has been replaced by the empty list []; in the second, the list is assumed to be non-empty, and l has been replaced by a cons of a newly introduced element a and a new list l.
The all: combinator below applies a tactic to all goals, and intros moves forall-quantified variables and premises of implications into the context as hypotheses:
α: Type
f: α -> bool
x: α
H: contains (filter f []) x
============================ (1/2)
f x = true

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
H: contains (filter f (a :: l)) x
============================ (2/2)
f x = true
The - bullet below focuses the proof on the first subgoal. In that goal, hypothesis H: contains (filter f []) x is contradictory, as filter f [] reduces to [], and contains [] x reduces to False; and indeed, simplifying with the simpl tactic performs that evaluation:
α: Type
f: α -> bool
x: α
H: contains (filter f []) x
============================
f x = true

α: Type
f: α -> bool
x: α
H: False
============================
f x = true
What H: False means is that hypothesis H has type False. In Coq, False is defined as an inductive type with no constructors, so H inhabiting the type False is inconsistent: case analysis on H completes this branch of the proof:
destruct H.
The second goal is less simple; this time, simplification suggests two cases: f a = true and f a = false, so we can perform a case analysis on that value:
α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
H: contains (filter f (a :: l)) x
============================
f x = true

Simplification in H exposes the conditional:

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
H: contains (if f a then a :: filter f l else filter f l) x
============================
f x = true

Case analysis on f a then yields two subgoals:

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
Hf: f a = true
H: contains (a :: filter f l) x
============================ (1/2)
f x = true

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
Hf: f a = false
H: contains (filter f l) x
============================ (2/2)
f x = true
We find ourselves with two new subgoals: in the first, f a is true (as indicated by Hf: f a = true) and filter f (a :: l) reduced to a :: filter f l; in the second, f a is false (Hf: f a = false) and filter f (a :: l) reduced to filter f l. Further simplification will suggest one more case split, as contains (a :: filter f l) itself reduces to a disjunction: either a = x, or x is in the result of filter f l.
α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
Hf: f a = true
H: a = x \/ contains (filter f l) x
============================
f x = true

Destructing the disjunction H yields:

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
Hf: f a = true
H: a = x
============================ (1/2)
f x = true

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
Hf: f a = true
H: contains (filter f l) x
============================ (2/2)
f x = true
In the first case, a = x, and we are in the case in which f a is true; hence the goal holds:
Rewriting with H: a = x turns Hf into exactly the goal:

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
Hf: f x = true
H: a = x
============================
f x = true

assumption.
In the second case, we know by assumption that x is in filter f l, but also by induction that all x in filter f l satisfy f:
Applying the induction hypothesis IHl turns the goal into exactly H:

α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
Hf: f a = true
H: contains (filter f l) x
============================
contains (filter f l) x

assumption.
Finally we reach the case in which f a = false, and the induction hypothesis applies immediately:
α: Type
f: α -> bool
a: α
l: list α
IHl: forall x : α, contains (filter f l) x -> f x = true
x: α
Hf: f a = false
H: contains (filter f l) x
============================
contains (filter f l) x

assumption.
A satisfying Qed closes the proof:
Qed.
Under the hood, tactics generate proof terms, and in fact it is possible to write proofs directly as plain Gallina programs. The result is seldom readable, however:
Fixpoint filter_complete {α} (f: α -> bool) l {struct l}:
forall x, contains l x -> f x = true -> contains (filter f l) x :=
match l return (forall x, contains l x -> f x = true ->
contains (filter f l) x) with
| [] => fun x (H: False) _ => H
| a :: l => fun x (Hc: a = x \/ contains l x) Hf =>
match Hc with
| or_introl Heq =>
eq_rect_r
(fun a => contains (if f a then a :: _ else _) x)
(eq_rect_r (x := true)
(fun b => contains (if b then x :: _ else _) x)
(or_introl eq_refl) Hf)
Heq
| or_intror Hc =>
if f a as b return (contains (if b then a :: _ else _) x)
then or_intror (filter_complete f l x Hc Hf)
else filter_complete f l x Hc Hf
end
end.
A final distinguishing characteristic of Coq is its support for advanced notations: unlike traditional languages, almost all the syntax of Coq is defined through extensions of its parser; later in this document we will use this to define special syntax for dictionaries, Hoare triples, function specifications, etc.
Coq proofs are usually written using specialized IDEs [ProofGeneral+Aspinall+ETAPS2000] [CompanyCoq+PitClaudel+CoqPL2016] that support showing Coq code side by side with the corresponding proof state.
For more information on the Coq proof assistant, readers can consult any of the books and tutorials listed at https://coq.inria.fr/documentation. The rest of this chapter should be accessible to readers with limited Coq experience; the following chapters assume some Coq proficiency.
3 On relational compilation
The traditional process for developing a verified compiler is to define types that model the source (\(S\)) and target (\(T\)) languages, and to write a function \(f: S \rightarrow T\) that transforms an instance \(s\) of the source type into an instance \(t = f(s)\) of the target type, such that all behaviors of \(t\) match existing behaviors of \(s\) (“refinement”), and sometimes additionally such that all behaviors of \(s\) can be achieved by \(t\) (“equivalence”, or “correctness”).
Naturally, proving correctness for such a compiler requires a formal understanding of the semantics of languages \(S\) and \(T\) (that is, a way to give meaning to programs \(s \in S\) and \(t \in T\), so that it is possible to speak of the behaviors of a program: return values, I/O, resource usage, etc.). Then the refinement criterion above translates to \(\sigma_T(t) \subseteq \sigma_S(s)\) (where \(\sigma_S(s)\) denotes the behaviors of \(s\) and \(\sigma_T(t)\) those of \(t\)), and the correctness criterion defines a relation \(\sim\) between source and target programs such that \(t \sim s\) iff \(\sigma_T(t) = \sigma_S(s)\). With that, a compiler \(f\) is correct iff \(f(s) \sim s\) for all \(s\).
Relational compilation is a twist on that approach: it turns out that instead of writing the compiler as a monolithic program and separately verifying it, we can break up the compiler's correctness proof into a collection of orthogonal correctness theorems, and use these theorems to drive a code-generating proof search process. It is a Prolog-style “compilers as relations” approach, but taken one step further to get “compilers as (constructive) decision procedures”.
Instead of writing our compiler as a function \(f: S \rightarrow T\), we will write the compiler as a (partial) decision procedure: an automated proof-search process for proving theorems of the form \(\exists t, \sigma _T(t) = \sigma _S(s)\). In a constructive setting, any proof of that statement must exhibit a witness \(t\), which will be the (correct) compiled version of \(s\). (Note that the theorem does not \(\forall \)-quantify \(s\), as we want to generate one distinct proof per input program — otherwise, with a \(\forall s\) quantification, the theorem would be equivalent by skolemization to \(\exists f, \forall s, \sigma _T(f(s)) = \sigma _S(s)\), which is the same as defining a single compilation function \(f\)… and precisely what we're trying to avoid.)
The two main benefits of this approach are flexibility and trustworthiness: it provides a very natural and modular way to think of compiler extensions, and it makes it possible to extract shallowly embedded programs without trusting an extraction routine (in contrast, extraction in Coq is trusted). The main cost? Completeness: a (total) function always terminates and produces a compiled output; a (partial) proof search process may loop or fail to produce an output. [5]
This is not an entirely new idea: variants of this trick have been referred to in the literature as proof-producing compilation, certifying compilation, and, when the source language is shallowly embedded (we will get to that a bit later), proof-producing extraction, certifying extraction, or binary extraction. I have not seen a systematic explanation of it yet, so here is my attempt. I like to call this style of compilation “relational compilation”, and to explain it I like to start from a traditional verified compiler and progressively derive a relational compiler from it.
3.1 A step by step example
Here is a concrete pair of languages that we will use as a demonstration. Language \(S\) is a simple arithmetic-expressions language. Language \(T\) is a trivial stack machine. The Coq code of this example is available for download.
3.1.1 Language definitions
On the left is the Coq definition of S, with only three constructors: constants, negation, and addition. On the right is the definition of T: a program in T is a list of stack operations T_Op, which may push a constant, pop the two values on the top of the stack and push their difference, or pop the two values on the top of the stack and push their sum.
Inductive S :=
| SInt z
| SOpp (s : S)
| SAdd (s1 s2 : S).
Inductive T_Op :=
| TPush z
| TPopSub
| TPopAdd.

Definition T := list T_Op.
3.1.2 Semantics
The semantics of these languages are easy to define using interpreters. On the left, operations on terms in S are mapped to corresponding operations on \(\mathbb{Z}\), producing an integer. On the right, stack operations are interpreted one by one, starting from a stack and producing a new stack.
Fixpoint σS s : Z :=
match s with
| SInt z => z
| SOpp s => - σS s
| SAdd s1 s2 => σS s1 + σS s2
end.
Notation Stack := (list Z).

Definition σOp (ts: Stack) op : Stack :=
  match op, ts with
  | TPush n, ts => n :: ts
  | TPopAdd, n2 :: n1 :: ts => n1 + n2 :: ts
  | TPopSub, n2 :: n1 :: ts => n1 - n2 :: ts
  | _, ts => ts (* Invalid: no-op *)
  end.

Definition σT t (ts: Stack) : Stack := List.fold_left σOp t ts.
With these definitions, program equivalence is straightforward to define (the definition is contextual, in the sense that it talks about equivalence in the context of a non-empty stack):
Notation "t ∼ s" := (forall ts, σT t ts = σS s :: ts).
3.1.3 Compilation
In the simplest case, a compiler is a single recursive function; more typically, compilers are engineered as a sequence (composition) of passes, each responsible for a well-defined task: typically, either an optimization (within one language, intended to improve performance of the output) or lowering (from one intermediate language to another, intended to bring the program closer to its final form), though these boundaries are porous. Here is a simple one-step compiler for our pair of languages:
Fixpoint StoT s := match s with
| SInt z => [TPush z]
| SOpp s => [TPush 0] ++ StoT s ++ [TPopSub]
| SAdd s1 s2 => StoT s1 ++ StoT s2 ++ [TPopAdd]
end.
The SInt case maps to a stack machine program that simply pushes the constant z on the stack; the SOpp case returns a program that first puts a 0 on the stack, then computes the value corresponding to the operand s, and finally computes the subtraction of these two using the TPopSub opcode; and the SAdd case produces a program that pushes both operands in succession before computing their sum using the TPopAdd opcode.
The Coq command Compute lets us run this compiler and confirm that it seems to operate correctly:
Example s7 :=
SAdd (SAdd (SInt 3) (SInt 6))
(SOpp (SInt 2)).
Running the example program s7 directly returns 7; compiling the program and then running it produces the same result (a stack with a single element, 7).
3.1.4 Compiler correctness
Of course, one example is not enough to establish that the compiler above works; instead, here is a proof of its correctness, which proceeds by induction with three cases:
The statement and its proof by induction, with one subgoal per constructor:

forall s, StoT s ∼ s

z: Z
ts: Stack
============================ (1/3)
z :: ts = z :: ts

s: S
IHs: StoT s ∼ s
ts: Stack
============================ (2/3)
fold_left σOp (StoT s ++ [TPopSub]) (0 :: ts) = - σS s :: ts

s1, s2: S
IHs1: StoT s1 ∼ s1
IHs2: StoT s2 ∼ s2
ts: Stack
============================ (3/3)
fold_left σOp (StoT s1 ++ StoT s2 ++ [TPopAdd]) ts = σS s1 + σS s2 :: ts

The first goal is closed by reflexivity. In the second, splitting fold_left over the concatenation and rewriting with IHs leaves fold_left σOp [TPopSub] (σS s :: 0 :: ts) = - σS s :: ts, which reflexivity closes. The third goal reduces the same way, with IHs1 and IHs2, to fold_left σOp [TPopAdd] (σS s2 :: σS s1 :: ts) = σS s1 + σS s2 :: ts, and a final reflexivity concludes. Qed.
This compiler operates in a single pass, but arguably even a small compiler like this could benefit from a multi-pass approach: for example, we might prefer to separate lowering into two phases, translating all unary SOpp operations to a new binary operator SSub (SOpp x → SSub (SInt 0) x) in a first pass, and dealing with stack operations in a second pass.
3.1.5 Compiling with relations
We will observe two things about StoT and its proof.

First, StoT, like any function, can be rewritten as a relation (any function \(f: x \mapsto f(x)\) defines a relation \(\sim_f\) such that \(t \sim_f s\) iff \(t = f(s)\); this is sometimes called the graph of the function). Here is one natural way to rephrase StoT as a relation ℜ; notice how each branch of the recursion maps to a case in the inductive definition of the relation (each constructor defines an introduction rule for ℜ, which corresponds to a branch in the original recursion, and each x ℜ y premise of each case corresponds to a recursive call to the function):
Inductive StoT_rel : T -> S -> Prop :=
| StoT_RNat : forall z,
[TPush z] ℜ SInt z
| StoT_ROpp : forall t s,
t ℜ s ->
[TPush 0] ++ t ++ [TPopSub] ℜ SOpp s
| StoT_RAdd : forall t1 s1 t2 s2,
t1 ℜ s1 ->
t2 ℜ s2 ->
t1 ++ t2 ++ [TPopAdd] ℜ SAdd s1 s2
where "t 'ℜ' s" := (StoT_rel t s).
Now, what does correctness mean for ℜ? Correctness for this compilation relation is… just a subset relation:

The relation ℜ is a correct compilation relation for languages \(S\) and \(T\) if its graph is a subset of the graph of \(\sim\).
And indeed, the relation above is a correct compilation relation:
The statement and its proof, by induction on the derivation t ℜ s, with one case per constructor:

forall t s, t ℜ s -> t ∼ s

z: Z
============================ (1/3)
[TPush z] ∼ SInt z

t: T
s: S
H: t ℜ s
IHStoT_rel: t ∼ s
============================ (2/3)
[TPush 0] ++ t ++ [TPopSub] ∼ SOpp s

t1: T
s1: S
t2: T
s2: S
H: t1 ℜ s1
H0: t2 ℜ s2
IHStoT_rel1: t1 ∼ s1
IHStoT_rel2: t2 ∼ s2
============================ (3/3)
t1 ++ t2 ++ [TPopAdd] ∼ SAdd s1 s2

Each case unfolds exactly as in the functional proof: the first is closed by reflexivity; in the second, splitting fold_left over the concatenation and rewriting with IHStoT_rel leaves fold_left σOp [TPopSub] (σS s :: 0 :: ts) = - σS s :: ts; in the third, IHStoT_rel1 and IHStoT_rel2 leave fold_left σOp [TPopAdd] (σS s2 :: σS s1 :: ts) = σS s1 + σS s2 :: ts; reflexivity closes each. Qed.
Now that we have ℜ, we can use it to prove specific program equivalences: for example, we can write a proof to show specifically that the compiled version of our example program s7 matches the original s7, by applying each of the constructors of the relation ℜ one by one:
([TPush 3] ++ [TPush 6] ++ [TPopAdd]) ++ ([TPush 0] ++ [TPush 2] ++ [TPopSub]) ++ [TPopAdd]
  ℜ SAdd (SAdd (SInt 3) (SInt 6)) (SOpp (SInt 2))

Proof.
  apply StoT_RAdd.
  - (* [TPush 3] ++ [TPush 6] ++ [TPopAdd] ℜ SAdd (SInt 3) (SInt 6) *)
    apply StoT_RAdd.
    + apply StoT_RNat. (* [TPush 3] ℜ SInt 3 *)
    + apply StoT_RNat. (* [TPush 6] ℜ SInt 6 *)
  - (* [TPush 0] ++ [TPush 2] ++ [TPopSub] ℜ SOpp (SInt 2) *)
    apply StoT_ROpp.
    apply StoT_RNat. (* [TPush 2] ℜ SInt 2 *)
Qed.
Now, how can we use this relation to run the compiler instead? By using proof search! This is standard practice in the world of logic programming. To compile our earlier program s7, for example, we can simply search for a program t7 such that t7 ℜ s7, which in Coq terms looks like this:
{t7 : T | t7 ℜ s7}

Proof.

?t7 ℜ SAdd (SAdd (SInt 3) (SInt 6)) (SOpp (SInt 2))
Now the goal includes an indeterminate value ?t7, called an existential variable (evar), corresponding to the program that we are attempting to derive, and each application of a lemma refines that evar by plugging in a partial program:
?t1 ℜ SAdd (SInt 3) (SInt 6)
?t2 ℜ SOpp (SInt 2)
After applying the lemma StoT_RAdd, we are asked to provide two subprograms, each corresponding to one operand of the addition:
?t1 ℜ SAdd (SInt 3) (SInt 6)

Applying StoT_RAdd again leaves the two operands:

?t1 ℜ SInt 3
?t20 ℜ SInt 6

apply StoT_RNat. (* closes ?t1 ℜ SInt 3 *)
apply StoT_RNat. (* closes ?t20 ℜ SInt 6 *)

?t2 ℜ SOpp (SInt 2)

Applying StoT_ROpp leaves one operand:

?t ℜ SInt 2

apply StoT_RNat. Defined.
We get the exact same program, but this time instead of validating a previous compilation pass, we have generated the program from scratch:
We can also use Coq's inspection facilities to see the proof term as it is being generated (this time the interstitial boxes show the internal proof term, not the goals):
apply StoT_RAdd.
- apply StoT_RAdd.
  + apply StoT_RNat.
  + apply StoT_RNat.
- apply StoT_ROpp.
  + apply StoT_RNat.
This shows how the program gets built: each lemma application is equivalent to one recursive call in a run of the compilation function StoT.
Coq has facilities to automatically perform proof search using a set of lemmas, which we can use to automate the derivation of t7: it suffices to register all constructors of ℜ in a “hint database”, as follows: [6]
Create HintDb cc.
Hint Constructors StoT_rel : cc.

{t7 : T | t7 ℜ s7}

Proof. eauto with cc. Defined.
And of course, the result is correct-by-construction, in the sense that it carries its own proof of correctness:
This is traditional logic programming, applied to compilers. We have now learned one fact:
Correctly compiling a program \(s\) is the same as proving \(\exists \: t, t \sim s\).
3.1.6 Open-ended compilation
The proofs of correctness for the functional version of the compiler (StoT) and for the relational version (StoT_rel) have the exact same structure. They are both composed of three orthogonal lemmas:
[TPush z] ∼ SInt z
[TPush 0] ++ t ++ [TPopSub] ∼ SOpp s
t1 ++ t2 ++ [TPopAdd] ∼ SAdd s1 s2
Each of these is really a standalone fact, and they each correspond to a partial relation between \(S\) and \(T\), each connecting some programs in \(S\) to some programs in \(T\). In other words:
A relational compiler is really just a collection of facts connecting programs in the target language to programs in the source language.
This means that we don't even need to define a relation. Instead, we can have three lemmas that directly refer to the original equivalence ∼:
Lemma StoT_Int : forall (z: Z),
  [TPush z] ∼ SInt z.

Lemma StoT_Opp : forall (t: T) (s: S),
  t ∼ s -> [TPush 0] ++ t ++ [TPopSub] ∼ SOpp s.

Lemma StoT_Plus : forall (t1: T) (s1: S) (t2: T) (s2: S),
  t1 ∼ s1 -> t2 ∼ s2 -> t1 ++ t2 ++ [TPopAdd] ∼ SAdd s1 s2.
And from these, we can build a compiler! We just need to place all these facts into a new database of lemmas, which Coq will use as part of its proof search:
Create HintDb c. Opaque σS σT.
Hint Resolve StoT_Int : c.
Hint Resolve StoT_Opp : c.
Hint Resolve StoT_Plus : c.
And then we can derive compiled programs and their proofs:
{t7 : T | t7 ∼ s7}

Proof. eauto with c. Defined.
That is the core idea of relational compilation. On this simple example it looks mostly like an odd curiosity, but it is actually very useful for compiling (shallowly) embedded domain-specific languages (EDSLs), especially when the compiler needs to be extensible.
3.2 Use case 1: Compiling shallowly embedded DSLs
The original setup of the problem (compiling from language \(S\) to language \(T\)) required us to exhibit a function \(f: S \rightarrow T\). Not so with the new setup, which instead requires us to prove instances of the ∼ relation (one per program). What this means is that we can apply this compilation technique to compile shallowly embedded programs, including shallowly embedded DSLs [7].
Here is how we would change our previous example to compile arithmetic expressions written directly in Gallina (the functional programming language of the Coq proof assistant):
Start by redefining the relation to use Gallina expressions on the right side of the equivalence (there are no more references to \(S\) or σS):
Notation "t ≈ s" := (forall ts, σT t ts = s :: ts).
Add compilation lemmas (the proofs are exactly the same as before, so they are omitted). Note that on the right side we have plain Gallina + and -, not SAdd and SOpp, so each lemma now relates a shallow program to an equivalent deeply embedded one:

Lemma GallinatoT_Z : forall (z: Z),
  [TPush z] ≈ z.

Lemma GallinatoT_Zopp : forall (t: T) (z: Z),
  t ≈ z -> [TPush 0] ++ t ++ [TPopSub] ≈ - z.

Lemma GallinatoT_Zadd : forall (t1: T) (z1: Z) (t2: T) (z2: Z),
  t1 ≈ z1 -> t2 ≈ z2 -> t1 ++ t2 ++ [TPopAdd] ≈ z1 + z2.
These lemmas are sufficient to create a small compiler: as before we populate a hint database with our compilation lemmas:
Create HintDb stack.
Hint Resolve GallinatoT_Z | 10 : stack.
Hint Resolve GallinatoT_Zopp : stack.
Hint Resolve GallinatoT_Zadd : stack.
And then we run our relational compiler on shallowly embedded input programs:
Example g7 := 3 + 6 + Z.opp 2.

{t7 : T | t7 ≈ g7}

Proof. eauto with stack. Defined.
Of course, it is easy to package this in a convenient notation (the pattern match Set return T with _ => X end is a roundabout way to force the type of the value X):

Notation compile gallina_term :=
  (match Set return { t | t ≈ gallina_term } with
   | _ => ltac:(eauto with stack)
   end) (only parsing).
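With this notation, compiling a new program is a one-liner (an illustrative use, reusing g7 from above; t7' is a fresh name):

Example t7' := compile g7.
(* t7' : {t : T | t ≈ g7} — the compiled program packaged with its correctness proof. *)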
There is something slightly magical happening here. By rephrasing compilation as a proof-search problem, we have been able to build a compiler that would not even be expressible (let alone provable!) as a regular Gallina function. Reasoning on shallowly embedded programs is often much nicer than reasoning on deeply embedded programs, and this technique offers a convenient way to bridge the gap.
3.3 Use case 2: Extensible compilation
Up to this point we assumed that the input language was fixed, but now that we are compiling shallowly embedded Gallina programs we can trivially extend the source language with additional constructs. Fortunately, the relational compilation technique above readily supports extending the compiler to handle new source expressions.
In fact, extensible languages are one place where relational compilation shines. As an example, suppose we are modeling combinational hardware circuits in Coq. Our target type (deeply embedded Boolean expressions) is very simple:
Inductive circuit :=
| Const (z: Z)
| Read (reg_name: string)
| Mux (cond l r: circuit)
| Op (op_name: string) (args: list circuit).
Notice how the Op and Read constructors (used to call built-in operators and to read registers) take names as strings: this means that to define an interpreter for the language we need an environment Σ of functions defining the semantics of the built-in operators of the language, and a context R giving the value of each register. The code below uses the notation c.[k] to look up key k in context c (it defaults to an arbitrary value if the key k cannot be found):
Section Interp.
  Variable Σ: string -> (list Z -> Z).
  Variable R: list (string * Z).

  Fixpoint cinterp (c: circuit) : Z :=
    match c with
    | Const z => z
    | Read r => R.[r]
    | Mux cond t f => if cinterp cond =? 0 then cinterp f else cinterp t
    | Op op args => Σ op (List.map cinterp args)
    end.
End Interp.
Here is an example.
First, we define an environment of built-in functions:
Definition testbit z n := if Z.testbit z n then 1 else 0.

Example Σ0 fn args :=
  match fn, args with
  | "add", [z1; z2] => Z.add z1 z2
  | "xor", [z1; z2] => Z.lxor z1 z2
  | "nth", [z; n] => testbit z n
  | _, _ => 0 (* Default to 0 for simplicity *)
  end.
Then, we define an environment of registers:
Example R0 := [("pc", 16); ("ppc", 14); ("r1", 5); ("r2", 7)].
And finally we can run the interpreter on an example circuit c1:

Example c1 := Mux (Op "xor" [Read "pc"; Read "ppc"]) (Read "r1") (Read "r2").

Reducing everything but Σ0 shows the interpretation of Mux in this term, and reducing everything gives us the value of this example circuit:
Relational compilation is a simple and convenient way to generate such circuits from Gallina expressions. First, we need a compilation relation:
Notation "c ≋ g @ Σ // R" := (cinterp Σ R c = g).
Then, we need compilation lemmas relating source programs in Gallina and their circuit equivalents.
First, constants:
Σ: string -> list Z -> Z
z: Z
Const z ≋ z @ Σ // R
Then variable accesses, compiled to register reads:
Σ: string -> list Z -> Z
r: string
z: Z
z = R.[r] -> Read r ≋ z @ Σ // R
Then conditionals:
Σ: string -> list Z -> Z
ccond: circuit
gcond: Z
ct: circuit
gt: Z
cf: circuit
gf: Z
ccond ≋ gcond @ Σ // R ->
ct ≋ gt @ Σ // R ->
cf ≋ gf @ Σ // R -> Mux ccond ct cf ≋ if gcond =? 0 then gf else gt @ Σ // R
And finally operators. Note that each compilation lemma is parametric on the environment of functions, only requiring that the one function it uses be found in the environment:
Σ: string -> list Z -> Z
c1: circuit
g1: Z
c2: circuit
g2: Z
(forall x y : Z, Σ "add" [x; y] = x + y) ->
c1 ≋ g1 @ Σ // R -> c2 ≋ g2 @ Σ // R -> Op "add" [c1; c2] ≋ g1 + g2 @ Σ // R
Σ: string -> list Z -> Z
c1: circuit
g1: Z
c2: circuit
g2: Z
(forall x y : Z, Σ "xor" [x; y] = Z.lxor x y) ->
c1 ≋ g1 @ Σ // R -> c2 ≋ g2 @ Σ // R -> Op "xor" [c1; c2] ≋ Z.lxor g1 g2 @ Σ // R
Σ: string -> list Z -> Z
cz: circuit
gz: Z
cn: circuit
gn: Z
(forall z n : Z, Σ "nth" [z; n] = testbit z n) ->
cz ≋ gz @ Σ // R ->
cn ≋ gn @ Σ // R -> Op "nth" [cz; cn] ≋ testbit gz gn @ Σ // R
That is enough to compile a simple EDSL for circuits. There are a few things worthy of note here: first, we now have a mechanism to refer to bound variables, through the R environment; second, each compilation lemma is parametric on the environment of functions, so the whole compiler is extensible. Let us see these two points in action with a more complex program:
Definition gc1 pc ppc (r1 r2: Z) :=
let correct := pc =? ppc in
let v := if correct then r1 else r2 in
testbit v 4.
This program checks whether its first two arguments are equal, and returns a different value in each case. The relational compilation goal will look a bit different from previous ones, because we now need to account for an environment of variables:
{cc1 : circuit
| forall pc ppc r1 r2 : Z,
cc1 ≋ gc1 pc ppc r1 r2 @ Σ0 //
[("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]}
In other words: there exists a circuit cc1 equivalent to the Gallina program gc1 for all inputs pc, ppc, r1, and r2, assuming that these inputs are available in registers and that the circuit runs with the environment of built-ins Σ0.
The proof, too, looks a bit different, because there are now side conditions for certain lemmas:
Proof.

?cc1 ≋ testbit (if pc =? ppc then r1 else r2) 4 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
The program starts with a call to testbit, so we plug in the circuit primitive "nth":
forall z n : Z, Σ0 "nth" [z; n] = testbit z n
?cz ≋ if pc =? ppc then r1 else r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?cn ≋ 4 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
We now get three subgoals: one asserting that we can call the function "nth" given the current function environment Σ0, and two corresponding to the arguments of testbit in Gallina:
?cz ≋ if pc =? ppc then r1 else r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?cn ≋ 4 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
Of the two argument subgoals, the first one is a conditional to which none of our compilation lemmas apply: as we defined it, compile_Mux requires a specific Gallina pattern, _ =? 0, which is not present here:
?cz ≋ if pc =? ppc then r1 else r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
This is the first instance where the extensibility of relational compilation comes into play. For this example, we can extend the compiler by plugging in a rewrite rule that transforms the program to match a shape supported by the compiler, after which compile_Mux applies:
?cz ≋ if Z.lxor pc ppc =? 0 then r1 else r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]

apply compile_Mux.
Compiling the conditional leaves us with three more subgoals: one for the test, one for the right branch, and one for the left branch:
?ccond ≋ Z.lxor pc ppc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?ct ≋ r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?cf ≋ r1 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
The first part can be handled using the compile_xor lemma:
?c1 ≋ pc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?c2 ≋ ppc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
And this is where the second interesting part of this proof comes about: handling variables. First, let us try something that looks right, but will not work:
It is very important that this should not work, but it is not immediately obvious why it wouldn't. The value pc is indeed not a constant, but if you look just at the types, things do line up:
?c1 ≋ pc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
The reason it doesn't work is captured in the error message above. When Coq creates the evars denoted by ?…, it associates with each of them a context that records which variables they may refer to. Here is, for example, the internal goal that Coq generated for ?c1:
============================
circuit
In other words, the evar ?c1 cannot refer to any of the local variables — which is good, since we're trying to build a closed term! What we want, of course, is compile_Read, which reads from the environment of registers, not compile_Const:
apply compile_Read with (r := "pc"); reflexivity.
The same lemma applies to the next three goals, in fact:
apply compile_Read with (r := "ppc"); reflexivity.?c2 ≋ ppc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]apply compile_Read with (r := "r2"); reflexivity.?ct ≋ r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]apply compile_Read with (r := "r1"); reflexivity.?cf ≋ r1 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
And finally we have the second argument to the original testbit, the index of the bit that we want to extract:
?cn ≋ 4 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
apply compile_Const. Defined.
The resulting generated program is as expected, and correct by construction:
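(The rendering of the derived term is not reproduced here. Given the lemmas we applied, it necessarily has the following shape, written as a sketch with Read standing for the register-reading constructor targeted by compile_Read.)

Op "nth" [Mux (Op "xor" [Read "pc"; Read "ppc"]) (Read "r2") (Read "r1");
          Const 4]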
3.3.1 Automating the derivation
The process above is closer to tool-assisted program derivation than to “compilation”. Automating it is not hard, but it requires tricks beyond what we have seen above. We will start by creating a hint database to hold our custom compilation lemmas:
Create HintDb circuits.
For readability, we will use Coq's Derive feature to state our compilation goal. Derive defines a dependent pair (a term and a proof of a property about it) as two separate names:
Proof.
cc1' := ?Goal : circuit
forall pc ppc r1 r2 : Z, cc1' ≋ gc1 pc ppc r1 r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
forall pc ppc r1 r2 : Z, ?Goal ≋ testbit (if pc =? ppc then r1 else r2) 4 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
The first trick we will introduce is forward reasoning. Until now, we have used eauto to pull from a database of lemmas to try to derive a complete proof of a goal. In this mode, eauto is all-or-nothing: it will not make any progress on this goal until we introduce all relevant lemmas:
forall pc ppc r1 r2 : Z, ?Goal ≋ testbit (if pc =? ppc then r1 else r2) 4 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
This isn't sustainable: we would need to guess exactly the set of required lemmas, or nothing happens. Instead we will use a partial-progress style popularized by [CoqProlog+Zimmermann+TTT2017], in which we program eauto to take a single step and then return to its caller:
Hint Extern 2 => simple apply compile_Const; shelve : circuits.
Hint Extern 2 => simple apply compile_add; shelve : circuits.
Hint Extern 2 => simple apply compile_xor; shelve : circuits.
Hint Extern 2 => simple apply compile_nth; shelve : circuits.
Each of these hints allows eauto to shelve the current goal after applying the corresponding lemma, which removes the subgoals generated by that lemma from the pool of goals that eauto is required to solve and places them on Coq's shelf.
These hints are enough to get us started with the derivation of our program. forward with … is a tactic similar to eauto, but it supports partial progress:
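(The accompanying development defines forward for us; a minimal sketch of such a tactic, assuming the shelving hints above, could be the following. The exact definition may differ.)

Tactic Notation "forward" "with" ident(db) :=
  (* Take one eauto step (hints shelve their subgoals), then unshelve
     them and recurse until no registered hint applies anywhere. *)
  repeat (unshelve (progress eauto with db)).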
forall z n : Z, Σ0 "nth" [z; n] = testbit z n?cz ≋ if pc =? ppc then r1 else r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
The first goal we get asks us to prove that we have the right function under the name "nth" in our function context, as before:
forall z n : Z, Σ0 "nth" [z; n] = testbit z n
forward with circuits.
The second goal is where this new approach of partial compilation begins to be useful: we have partially compiled our program, and we can now add more hints to continue making progress. As before, compile_Mux doesn't apply:
?cz ≋ if pc =? ppc then r1 else r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
We could use Z_lxor_eqb as before to rewrite the equality test pc =? ppc into an exclusive-or test Z.lxor pc ppc =? 0, but for consistency we will prove a new compilation lemma instead (this gives us a unified way to handle all extensions):
Σ: string -> list Z -> Z
c1: circuit
g1: Z
c2: circuit
g2: Z
ct: circuit
gt: Z
cf: circuit
gf: Z
(forall x y : Z, Σ "xor" [x; y] = Z.lxor x y) ->
c1 ≋ g1 @ Σ // R ->
c2 ≋ g2 @ Σ // R ->
ct ≋ gt @ Σ // R ->
cf ≋ gf @ Σ // R ->
Mux (Op "xor" [c1; c2]) ct cf ≋ if g1 =? g2 then gf else gt @ Σ // R
This is enough to step further in the compilation process, leaving us with four very similar goals:
forall pc ppc r1 r2 : Z, ?Goal ≋ testbit (if pc =? ppc then r1 else r2) 4 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
forward with circuits.
?c1 ≋ pc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?c2 ≋ ppc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?ct ≋ r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?cf ≋ r1 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
This time, we get stuck because we did not register compile_Read:
?c1 ≋ pc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?c2 ≋ ppc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?ct ≋ r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?cf ≋ r1 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
Why not? Because as written, compile_Read applies to all goals, often leaving an unsolvable goal: when applied to a goal _ ≋ g @ _ // R, compile_Read will simply generate a goal asking to find g in R, regardless of whether such a binding in fact exists in R. For example:
?c ≋ 1 + 1 @ Σ0 // [("r0", 14)]
apply compile_Read.
1 + 1 = [("r0", 14)].[?r]
Abort.
There are two ways to proceed in such cases: add logic to apply compile_Read eagerly, and backtrack if no corresponding binding can be found; or use a more clever strategy to apply compile_Read more discerningly. Up to this point our derivations have all been deterministic, and we want the compiler to be as predictable as possible, so we will do the latter by restricting the use of the compile_Read lemma to cases where the Gallina term g is a single variable. Additionally, we will give compile_Read a low priority (the backtracking approach is discussed in a note at the end of this section):
?c1 ≋ pc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?c2 ≋ ppc @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?ct ≋ r2 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
?cf ≋ r1 @ Σ0 // [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)]
forward with circuits.
pc = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r]
ppc = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r0]
r2 = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r1]
r1 = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r2]
The remaining goals are the preconditions of compile_Read: the variables should in fact be in context:
pc = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r]
ppc = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r0]
r2 = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r1]
r1 = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r2]
To solve these goals automatically, we will use two simple lemmas.
assoc_hd, which handles the case in which the value we're looking for is the first in the list:
V: Type
H: HasDefault V
v: V
k: string
tl: list (string * V)
v = ((k, v) :: tl).[k]
assoc_tl, which handles the case in which the value we're looking for is in the tail of the list:
V: Type
H: HasDefault V
v, v': V
k, k': string
tl: list (string * V)
v = tl.[k'] -> k <> k' -> v = ((k, v') :: tl).[k']
Here is how these come into play, on a standalone example. We start with a goal asking us to locate the value 3 in a context. Applying assoc_tl discards the first binding ("x", 1) and asks us to find 3 among the remaining bindings; then, applying assoc_hd selects the first of the remaining bindings ("y", 3):
3 = [("x", 1); ("y", 3)].[?k]3 = [("y", 3)].[?k]"x" <> ?kapply assoc_hd.3 = [("y", 3)].[?k]congruence."x" <> "y"
Plugging in these two lemmas completes the derivation:
pc = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r]ppc = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r0]r2 = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r1]r1 = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r2]all: forward with circuits. Qed.pc = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r]ppc = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r0]r2 = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r1]r1 = [("pc", pc); ("ppc", ppc); ("r1", r1); ("r2", r2)].[?r2]
3.3.1.1 Automating the initial setup
This handles the derivation itself; the initial unfolding phase of the derivation can also be handled automatically, using a simple Ltac script to reveal the evar created by Derive and unfold toplevel program definitions, like gc1 above; we do all this in a new tactic compile with <database>:
Tactic Notation "setup" "with" ident(db) := intros; autounfold with db; lazymatch goal with c := _ |- _ => subst c end. Tactic Notation "compile" "with" ident(db) := setup with db; forward with db.
And with this, we have our first, tiny, extensible compiler! Of course, this new compiler is applicable to a wide range of programs, not just the one we just compiled:
c := ?Goal : circuit
forall z, c ≋ z + z @ Σ0 // [("z", z)]
Proof. compile with circuits. Qed.
3.3.2 Extending the compiler
There are all sorts of ways we can extend this compiler; the following are just a few examples:
3.3.2.1 Compiling open terms (macros)
Because the compilation process does not have to start in an empty environment, we can compile macros, not just functions: all that is needed is to compile the function with an indeterminate function environment and indeterminate registers. We will allow ourselves to call the add function (through Hadd), as well as an arbitrary precompiled program c (through Hc):
Context Σ R c z
        (Hadd: forall x y, Σ "add" [x; y] = x + y)
        (Hc: c ≋ z @ Σ // R).

Σ: string -> list Z -> Z
c: circuit
z: Z
Hadd: forall x y : Z, Σ "add" [x; y] = x + y
Hc: c ≋ z @ Σ // R
c3 := ?Goal : circuit
c3 ≋ z + z + z @ Σ // R
Proof. compile with circuits. Qed.
After that, the c3 macro can be called automatically where appropriate by adding a compilation hint:
Hint Extern 2 => simple apply c3ok : circuits.

Σ: string -> list Z -> Z
c: circuit
z: Z
Hadd: forall x y : Z, Σ "add" [x; y] = x + y
Hc: c ≋ z @ Σ // R
c6 := ?Goal : circuit
c6 ≋ z + z + z + (z + z + z) @ Σ // R
Proof. compile with circuits. Qed.
3.3.2.2 Plugging in new builtins
We can make use of additional functions by extending the function environment Σ, which defines the semantics of builtins, and adding appropriate compilation lemmas:
Example Σ1 fn args :=
  match fn, args with
  | "lsl", [z1; z2] => z1 << z2
  | _, _ => Σ0 fn args
  end.

Σ: string -> list Z -> Z
c1: circuit
g1: Z
c2: circuit
g2: Z
(forall x y : Z, Σ "lsl" [x; y] = x << y) ->
c1 ≋ g1 @ Σ // R ->
c2 ≋ g2 @ Σ // R ->
Op "lsl" [c1; c2] ≋ g1 << g2 @ Σ // R
Proof. cbn; repeat intros ->; reflexivity. Qed.
Hint Extern 2 => simple apply compile_lsl; shelve : circuits.

c4 := ?Goal : circuit
forall z, c4 ≋ z << 2 @ Σ1 // [("z", z)]
Proof. compile with circuits. Qed.
3.3.2.3 Using custom logic to prove side-conditions
Instead of the two lemmas that we proved earlier for compile_Read side-conditions (assoc_hd and assoc_tl), we can use a custom metaprogram (a tactic) to figure out the right variable name and plug it right in. For that, we need a tactic that performs a reverse lookup in an association list, defined below by recursion:
Ltac assocv v ctx :=
lazymatch ctx with
| [] => fail
| (?k, v) :: _ => k
| _ :: ?ctx => assocv v ctx
| _ => let ctx := eval red in ctx in assocv v ctx
end.
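For instance, called on the context from the earlier examples, assocv computes the key "y" (a standalone usage sketch):

Goal True.
  let k := assocv constr:(3) constr:([("x", 1); ("y", 3)]) in
  idtac k. (* prints "y" *)
Abort.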
And using assocv we can define a tactic that guesses the right variable name ?k in goals like v = ctx.[?k]. In the example below, instantiate_assoc_eq guesses "y" for the value 3:
Ltac instantiate_assoc_eq :=
  match goal with
  | |- ?v = ?ctx.[?k] =>
    is_evar k;
    let k0 := assocv v ctx in
    unify k k0
  end.

3 = [("x", 1); ("y", 3)].[?k]
instantiate_assoc_eq.
3 = [("x", 1); ("y", 3)].["y"]
And finally we add a hook in the compiler to use that tactic as appropriate:
Hint Extern 1 => instantiate_assoc_eq : circuits.

cc2 := ?Goal : circuit
forall x y : Z, cc2 ≋ x + y @ Σ0 // [("x", x); ("y", y)]
compile with circuits. Qed.
3.3.3 Leveraging contextual information
The last extension is important enough that it deserves its own section. It is a common pattern in functional programming languages to use a monad to describe an effect that is not supported by the language. Part of compiling the program down to a lower-level language is mapping the monadic structure to low-level primitives, and we can do that using relational compilation.
Even better, we can support native compilation of arbitrary monads (!): unlike traditional compilers that hard-code a list of built-in monads that get special compiler support (think IO in Haskell), we can plug in native compilation support for arbitrary user-defined monads. This means that we never have to define an interpreter for a free monad (that is the usual approach for extracting effectful programs in Coq: define a free monad specialized to a functor capturing effects, extract to OCaml, and define an unverified interpreter to map the effects to native OCaml effects; here, we can instead directly map the monad to effects of the target language).
We will start by exploring the example of the Writer monad with a compiler specialized for that monad, and then we will see an extra twist that allows us to define the pure parts of the compiler in a monad-agnostic way, so that they can be shared between different EDSL compilers specialized to different monads, reducing duplication. Rupicola itself has many more examples of relational compilation for monadic programs.
3.3.3.1 The writer monad
In this example, programs will return not just values, but also a list of strings, the "output" of the program:
Definition Trace := list string.
Record S {α: Type} := { val: α; trace: Trace }.
Example puts str := {| val := tt; trace := [str] |}.
The usual monadic operators are readily defined: a pure computation is like a monadic computation with an empty trace, and two effectful computations running in sequence produce the result of the second and the concatenation of the two traces:
Definition ret (a: α) : S α := {| val := a; trace := [] |}.
Definition tr_bind (a: S α) (b: S β) :=
  {| val := b.(val); trace := a.(trace) ++ b.(trace) |}.
Definition bind (a: S α) (k: α -> S β) : S β := tr_bind a (k a.(val)).
Notation "v ← a ; body" := (bind a%string (fun v => body)).
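For instance, a two-line program in this monad computes a value together with its output trace (hello2 is a name made up for this illustration):

Example hello2 : S unit :=
  _ ← puts "hello, "; puts "world".
Compute hello2.(trace). (* = ["hello, "; "world"] *)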
It is straightforward to prove that the usual monad properties hold:
α, β, γ: Type
ca: S α
v ← ca; ret v = ca

α, β, γ: Type
a: α
k: α -> S β
v ← ret a; k v = k a

α, β, γ: Type
ca: S α
ka: α -> S β
kb: β -> S γ
v ← v ← ca; ka v; kb v = a ← ca; v ← ka a; kb v
We will compile this to a simple imperative string language with traces. Here is the definition of expressions and statements in that language; this will also give us the opportunity to start discussing assignments:
Inductive Expr :=
| Ref var
| Const str
| Concat (e1 e2: Expr).
Inductive T :=
| Seq (t1 t2: T)
| Assign var (e: Expr)
| Puts (e: Expr)
| Skip.
The semantics of expressions is given by an interpreter:
Definition Ctx := list (Var * string).
Fixpoint interp ctx e : string :=
  match e with
  | Ref var => ctx.[var]
  | Const str => str
  | Concat e1 e2 => interp ctx e1 ++ interp ctx e2
  end.
… and, to spice things up, the semantics of statements is given by a big-step evaluation relation:
Inductive RunsTo : Ctx -> T -> Ctx -> Trace -> Prop :=
| RunsToSeq ctx0 t1 ctx1 tr1 t2 ctx2 tr2:
⟨ctx0, t1⟩ ⇓ ⟨ctx1, tr1⟩ ->
⟨ctx1, t2⟩ ⇓ ⟨ctx2, tr2⟩ ->
⟨ctx0, Seq t1 t2⟩ ⇓ ⟨ctx2, tr1 ++ tr2⟩
| RunsToAssign ctx var e:
⟨ctx, Assign var e⟩ ⇓ ⟨(var, interp ctx e) :: ctx, []⟩
| RunsToPuts ctx e:
⟨ctx, Puts e⟩ ⇓ ⟨ctx, [interp ctx e]⟩
| RunsToSkip ctx:
⟨ctx, Skip⟩ ⇓ ⟨ctx, []⟩
where "⟨ ctx , p ⟩ ⇓ ⟨ ctx' , tr ⟩" :=
(RunsTo ctx p ctx' tr).
The relational compiler for expressions is routine at this point, so we omit it for brevity. We simply define a relation ~ₑ for expressions, and prove lemmas relating outputs and inputs of interp. For very simple cases like this one, what we are really building is in fact a reification procedure, implemented by programming a decision procedure to invert the function interp:
Notation "e ~ₑ g // ctx" := (interp ctx e = g%string).
eHello := ?Goal : Expr
forall name : string, eHello ~ₑ "hello, " ++ name // [("name", name)]
compile with str. Qed.
Conversely, there are a few new things in the relational compiler for statements. Specifically, we want to convert uses of monadic bind into a sequence, translating pure expressions using the expression compiler and using primitives to implement stateful computations like puts.
There is one significant difference from previous examples, however: our new target language has assignments, and these assignments are not directly reflected in the source language. As a result the final state of the program may contain arbitrary bindings, and it would be quite inconvenient to have to declare exactly which temporary variables may be used when starting the compilation process. Instead, we will use a slightly more complicated compilation relation. Unlike the equalities used previously, we now state that a low-level program is related to a high-level one if they produce the same traces, and if the result of the high-level program can be found in the final context, under a name specified as part of the compilation relation. Beyond this, the final context is allowed to contain arbitrary bindings.
Definition related t g var ctx :=
  forall ctx' tr, ⟨ ctx, t ⟩ ⇓ ⟨ ctx', tr ⟩ ->
             g.(trace) = tr /\ g.(val) = ctx'.[var].
Notation "t ~ₜ g @ var // ctx" := (related t g%string var ctx).
Each lemma about the new compilation relation ~ₜ matches a corresponding constructor of RunsTo closely, but not exactly, because we need to phrase things in terms of monadic operations. First we show how to compile Puts, which writes a value out:
ctx: Ctx
e: Expr
g: string
t: T
k: unit -> S string
var: Var
e ~ₑ g // ctx ->
t ~ₜ k tt @ var // ctx ->
Seq (Puts e) t ~ₜ (v ← puts g; k v) @ var // ctx
Then Assign (quantifying over the value v prevents the expression g from getting inlined into the continuation k):
ctx: Ctx
e: Expr
g: string
t: T
k: string -> S string
var, tmp: Var
e ~ₑ g // ctx ->
(forall v : string, v = g -> t ~ₜ k v @ var // (tmp, v) :: ctx) ->
Seq (Assign tmp e) t ~ₜ (v ← ret g; k v) @ var // ctx
And finally a lemma to conclude the compilation, which uses an assignment because of the way our compilation relation is phrased (see sidebar):
ctx: Ctx
e: Expr
str: string
var: Var
e ~ₑ str // ctx -> Assign var e ~ₜ ret str @ var // ctx
These compilation rules are enough to compile full programs (below, the tactic binder_name translates a Coq-level binder name into a string):
Hint Extern 1 => simple apply compile_Puts; shelve : str.
Hint Extern 1 => simple apply compile_Skip; shelve : str.
Hint Extern 1 (_ ~ₜ bind ?s ?k @ _ // _) =>
  simple apply compile_Assign with (tmp := ltac:(binder_name k)); shelve : str.

Definition greet (name: string) :=
  greeting ← ret ("hello, " ++ name);
  _ ← puts greeting;
  _ ← puts "!";
  ret greeting.
Hint Unfold greet : str.

tHello := ?Goal : T
forall name : string, tHello ~ₜ greet name @ "out" // [("name", name)]
compile with str. Qed.
3.3.3.2 Monad-agnostic extraction
In the above, we defined a new extraction procedure for each monad, but we can reduce the required effort by generalizing further. We can define things in such a way that lemmas about non-monadic code work for all monads, which means that different domain-specific languages, using different monads, can all use the same code for compiling pure values. The key here is to completely generalize over the pre- and postconditions that the compiler uses, leaving only a distinguished argument to the postcondition indicating which program we are compiling. For this we define a Hoare triple on top of our program semantics (for brevity we define it on top of our big-step semantics instead of defining a new relation, and we omit the precondition, which will live in the ambient proof context instead):
Definition triple {α} (ctx: Ctx) (prog: T)
           (spec: α) (post: α -> Trace -> Ctx -> Prop) :=
  forall ctx' tr, ⟨ctx, prog⟩ ⇓ ⟨ctx', tr⟩ -> post spec tr ctx'.
Notation "<{ ctx }> prog <{ spec | post }>" := (triple ctx prog spec post).
Note how the postcondition post takes a special argument spec, which is where we will plug in our Gallina programs (this trick makes it easy to spot the program that we're compiling when writing tactics that inspect the goal). Crucially, spec is not required to be monadic: any Gallina program is fair game to plug into this spot that drives the compiler. Our previous relation is a special case of this one, and in fact all lemmas carry over naturally. Specifically, we have:
forall t (g : S string) var ctx,
<{ ctx }> t <{ g
| fun (g0 : S string) (tr' : Trace) ctx' => val g0 = ctx'.[var] /\ trace g0 = tr'
}> -> t ~ₜ g @ var // ctx
Here is compile_Assign written in this new style. We quantify the hypothesis about expressions over all states allowed by the precondition, and as the postcondition we plug in the strongest postcondition for the assignment statement (which looks simpler than the usual SP of an assignment because of the choice to move preconditions to the surrounding logic).
e: Expr
g: string
ctx: Ctx
tmp: Var
e ~ₑ g // ctx ->
<{ ctx }> Assign tmp e <{ g
| fun (v : string) (tr : Trace) ctx' => tr = [] /\ ctx' = (tmp, v) :: ctx }>
And with these generalized pre-post pairs, we can now define a compilation lemma for bind, as well as a proper Skip lemma, both of which vexed us previously. This time the first program is compiled with an arbitrary postcondition, which will be resolved through unification as part of the compilation process; and the second program assumes this intermediate postcondition as its starting point and completes its run with a modified postcondition that appends the corresponding trace. Because this lemma does mention low-level effects, of course, we do need to mention the monad that we use to implement trace-modifying effects.
α, β: Type
ctx: Ctx
post: S β -> list string -> Ctx -> Prop
middle: S α -> Trace -> Ctx -> Prop
p1, p2: T
s1: S α
s2: α -> S β
<{ ctx }> p1 <{ s1 | middle }> ->
(forall (g1 : S α) (tr1 : Trace) ctx1,
g1 = s1 ->
middle g1 tr1 ctx1 ->
let post' :=
fun (g2 : S β) (tr2 : list string) => post (tr_bind g1 g2) (tr1 ++ tr2) in
<{ ctx1 }> p2 <{ s2 (val g1) | post' }>) ->
<{ ctx }> Seq p1 p2 <{ v ← s1;
s2 v | post }>
α: Type
g: α
ctx: Ctx
post: α -> Trace -> Ctx -> Prop
post g [] ctx -> <{ ctx }> Skip <{ g | post }>
Conditionals and loops can be handled similarly to sequences. For straightline code, we do not need to instantiate this “middle” clause explicitly: instead we can simply let it be derived by unification as part of compiling the first part of the sequence. For conditionals and loops there is no free lunch, however, so we need to infer a predicate that captures the effect of both branches (for conditionals) or arbitrary repetitions (for loops). Luckily this inference problem is easier than it seems at first: specifically, we can pick the strongest postcondition, with carefully chosen heuristics and manipulations to ensure that we choose a postcondition that is readable and workable for the rest of the compilation process. The exact choice of heuristics is out of scope for this section, but we detail it later when we dive into the specifics of Rupicola ().
At this point, if we consider the case of a source program made of a sequence of let-bindings, we realize that the compilation process will now be an alternation of compile_Seq and specialized compilation lemmas, the former introducing cuts in the derivation and the latter refining the precondition of the program. Once we're done with all let-bindings (or monadic binds), we unify the final precondition that the compiler has derived with the postcondition that we were hoping to achieve.
It is because of this last step that we usually prefer to add an explicit continuation to each lemma, even though our new representation allows for a general cut lemma compile_Seq: making continuations explicit allows us to craft our intermediate postconditions very precisely as part of writing each lemma, instead of relying on unification. (Careful readers will have noticed the difficulty already popping up in a small way in compile_Assign above, which referred to a variable name tmp that we could not infer from the goal, since we had already eliminated the corresponding let binding.) This makes the last unification step trivial in almost all cases. Here is what compile_Assign looks like in this style (using a wrapper blet — for “blocked let” — around let-bindings, which prevents Coq from unfolding too aggressively without depending on a specific monad's bind and ret):
α: Type
ctx: Ctx
e: Expr
g: string
t: T
k: string -> α
tmp: Var
post: α -> Trace -> Ctx -> Prop
e ~ₑ g // ctx ->
(forall v : string, v = g -> <{ (tmp, v) :: ctx }> t <{ k v | post }>) ->
<{ ctx }> Seq (Assign tmp e) t <{ blet g k | post }>
Proof. unfold triple; hammer. Qed.
This is the last step of our journey, and since our final compiler looks so similar to the previous iteration, we omit it for brevity; curious readers can consult the complete development for details.
4 Relational compilation for performance-critical applications
The first part of this thesis presented the key ideas behind relational compilation, keeping implementation details to a minimum. In this part I discuss how these ideas come together to implement a realistic compiler-construction toolkit. Specifically, I present the design and implementation of Rupicola, a Gallina-to-Bedrock2 compiler-construction framework with a focus on simple, low-level performance-critical programs.
Rupicola's core is very small (hundreds of lines), but thanks to a variety of extensions the whole distribution ends up with a reasonably expressive input language. With all extensions loaded, Rupicola supports arithmetic over many types (Booleans, bounded and unbounded natural numbers, bytes, integers, machine words); various control-flow patterns (conditionals, as well as iteration patterns like maps and folds, with and without early exits); various flat data structures such as mutable cells and arrays; plain and monadic binds; various monadic extensions, including the nondeterminism, writer, and I/O monads and a generic free monad; and various low-level effects and features such as stack allocation, inline tables, intrinsics, and external function calls (details about these features are given in ).
For the programs that fit within Rupicola's existing input language, using our compiler is not very different from using any other (research-quality) compiler: plug in a function body and a signature and get compiled output back. For programs that Rupicola does not support out of the box, or for programs that require custom reasoning to jump the functional-to-imperative gap (typically array-bounds side conditions), users are able to plug in new compilation lemmas or new logic at a reasonable cost.
I designed the framework so that the default reaction to unexpected input would be to stop and ask for user guidance rather than default to a slower generic implementation. As a result, Rupicola makes few guesses and consequently very few incorrect guesses: programs compiled using Rupicola achieve performance comparable to that of equivalent handwritten C programs, because Rupicola produces C programs that are (semantically [8]) very close to handwritten C programs.
I start with a description of Rupicola's input and output languages: subsets of Gallina, the functional language of the Coq proof assistant, for inputs; and Bedrock2, a low-level programming and verification language developed at MIT, for outputs. I then review the high-level architecture of Rupicola, combining a minimal compiler core and various compilation domains as orthogonal extensions. Then, I dive deep into two new ideas that make Rupicola possible, showcasing the unique challenges each of them poses and explaining how I tackled them.
Before all that, however, let us get a taste of programming in Rupicola by revisiting our introductory example (uppercasing a string).
4.1 Compiling with Rupicola
Recall the setup: we are programming some sort of web service, and part of the work of resolving a query is to change an input string to uppercase. Gallina represents strings using a datatype similar to a linked list of characters, and ASCII characters as 8-tuples of Booleans (bits). On top of these definitions we implement a recursive String.map to apply a transformation to each individual character, and we pass it a new function toupper implemented using a match mapping each lowercase ASCII character to its uppercase counterpart. There are four main ways in which we want the final, compiled implementation of this program to differ from the execution strategy suggested by the high-level program, and accordingly we will need to combine four compiler extensions to translate this program to Bedrock2, the C-like language that Rupicola targets.
First, we want to change the way strings and characters are represented: we want strings to be contiguous arrays of characters, not linked lists, and we want characters to be machine bytes, not structs with eight single-bit fields.
Second, we want to get rid of higher-order iteration: we would like to translate the string map to a loop, thereby reducing pressure on the stack and memory.
Third, we want to introduce mutation: the source program is pure, but the result should modify the string in-place.
Fourth, we want to optimize the calculation of the per-character uppercasing operation, using specific properties of the ASCII encoding that we use to represent characters.
Some of these transformations are expressible at source level, but not all, or at least not equally easily. For example, Gallina, the language that we start with, defines strings as linked lists of 8-Boolean records, but we can define our own pure array, chars, and string types; Gallina has no loop built-ins, but we can define higher-order iterators that simulate low-level loops; Gallina does not natively support mutation, but we could rewrite the program using the state monad; and the bit and number tricks that we want to perform on ASCII characters to uppercase them are not easy to express on 8-Boolean records, but we can cast to and from a better representation. And yet, even if we performed all that encoding work, we would still have no way using Coq's native extraction to generate code that reliably matches handwritten-C performance.
Rupicola solves the problem nicely. The four implementation choices are domain- or program-specific design choices, and Rupicola enables the user to leverage a mix of source-level annotations, transformations, and compiler extensions to communicate them to the compiler.
In the simplest cases, including this one, source-level annotations are not necessary: judiciously chosen compiler hints suffice to derive low-level code directly from an unmodified high-level representation. In more complex cases, or in cases requiring more precise control of the output, users instead provide a “low-level Gallina” version of the program directly suitable for compilation to Bedrock2, and separately relate that functional model to a high-level specification. This second approach is the most common, so that is the one that I demonstrate here.
Let us start from a high-level specification of the problem:
Definition toupper (c: ascii) :=
  match c with
  | "a" => "A" | "b" => "B" | "c" => "C" | "d" => "D" | "e" => "E"
  | "f" => "F" | "g" => "G" | "h" => "H" | "i" => "I" | "j" => "J"
  | "k" => "K" | "l" => "L" | "m" => "M" | "n" => "N" | "o" => "O"
  | "p" => "P" | "q" => "Q" | "r" => "R" | "s" => "S" | "t" => "T"
  | "u" => "U" | "v" => "V" | "w" => "W" | "x" => "X" | "y" => "Y"
  | "z" => "Z" | c => c
  end%char.

Definition upstr (s: string) := String.map toupper s.
We then define a lowered version of the same code. Because of the simplicity of this example, the lowering from the natural high-level version is almost trivial: it suffices to define a variant upstr' of upstr on the type list byte instead of string, using the ListArray.map iterator instead of String.map:
Definition upstr' (s: list byte) :=
let/n s := ListArray.map
(fun b => byte_of_ascii (toupper (ascii_of_byte b)))
s in
s.
The equivalence proof is a matter of just a few lines:
f: ascii -> ascii
s: string
String.map f s = string_of_list_ascii (List.map f (list_ascii_of_string s))
Proof. induction s; simpl; congruence. Qed.

Hint Unfold nlet upstr upstr' list_byte_of_string string_of_list_byte : lowering.
Hint Rewrite string_map_is_map map_map list_ascii_of_string_of_list_ascii : lowering.

bs: list Init.Byte.byte
list_byte_of_string (upstr (string_of_list_byte bs)) = upstr' bs
Proof. autounfold with lowering; autorewrite with lowering; reflexivity. Qed.
And with that, we are ready to start assembling a compiler.
The first transformation (strings as arrays, chars as bytes) we encode as part of the precondition of our low-level program: we state that we have a pointer p to a buffer containing the same data as the string. The postcondition is written by translating the string into a list of bytes and invoking the array predicate. The corresponding specification is shown below; it helps to see it as a signature (or a calling convention) for the low-level program that we intend to generate. The function takes two arguments (machine words) p (a pointer to a string) and wlen (its length as a number) and a ghost argument s (a list of bytes), and it returns nothing; the requires clause specifies how the function is called (with a condition on argument wlen and a separation-logic predicate, plus a condition to ensure that all elements of the array are addressable [9]); and the ensures clause states that the program does not produce observable I/O (tr' = tr) and that it overwrites the original string with the updated one (upstr' s).

Notation nbytes := (sizedlistarray_value AccessByte).
Instance spec_of_upstr : spec_of "upstr" :=
  fnspec! "upstr" p wlen / (s: list byte) r,
  { requires tr mem :=
      wlen = of_nat (length s) /\
      Z.of_nat (length s) < 2 ^ 64 /\
      (nbytes (length s) p s ⋆ r) mem;
    ensures tr' mem' :=
      tr' = tr /\
      (nbytes (length s) p (upstr' s) ⋆ r) mem' }.
The second transformation (map as a loop) is done using a lemma to translate ListArray.map into a for loop. This sort of translation is a common pattern, so Rupicola's standard library has built-in support for it: it suffices to import the corresponding compilation module, and to register a hint to solve indexing-related side-conditions.
Import LoopCompiler.
Hint Extern 1 => lia : compiler_side_conditions.
The third transformation (mutation) comes as a side effect of using an in-place map-to-loop lemma; in Rupicola I call it an intensional effect, since it is introduced automatically by analyzing the source code (and not explicitly encoded using a monad). All we need is to give guidance to the compiler on how to compile array operations, which we do by importing a standard library module:
Import SizedListArrayCompiler.
The last transformation, efficient uppercasing, can be plugged into the compiler as a rewrite: we prove a program equivalence between our toupper function on 8-tuples of Booleans and an efficient byte computation:
Definition toupper' (b: byte) := if byte.wrap (b - "a") <? 26 then byte.and b x5f else b.
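The trick relies on the ASCII layout: each lowercase letter is its uppercase counterpart with bit 0x20 set, and and-ing with x5f (0101 1111) clears that bit (along with the always-zero top bit of ASCII); the guard byte.wrap (b - "a") <? 26 singles out 'a' through 'z'. A quick sanity check ('a' is x61, 'A' is x41):

Compute byte.and x61 x5f. (* = x41 *)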
Once again the proof is trivial, and we can plug the resulting lemma into the compiler as a rewrite rule, followed by an unfolding rule to allow it to inline toupper':
b: Init.Byte.byte
byte_of_ascii (toupper (ascii_of_byte b)) = toupper' b
Proof. destruct b; reflexivity. Qed.

Hint Rewrite toupper'_ok : compiler_cleanup.
Hint Unfold toupper' : compiler_cleanup.
With these four pieces in place we can invoke the compiler, and we get the expected low-level program out with no further manual intervention:
upstr_br2fn := ("upstr", (["s"; "len"], [], ?Goal)) : bedrock_func
let body := ?Goal in
__rupicola_program_marker upstr' ->
forall functions : list bedrock_func, spec_of_upstr (upstr_br2fn :: functions)
Proof. compile. Qed.
The result is a Bedrock2 program upstr_br2fn and its proof of correctness upstr_br2fn_ok; the program can then be compiled using Bedrock2's verified compiler (with support for linking against separately verified fragments of RISC-V machine code as needed, though this example is self-contained), or it can be pretty-printed to C and fed to a traditional C compiler (the details of Bedrock2's pretty-printer are discussed in ).
4.2 Rupicola's target language: Bedrock2
Rupicola compiles to Bedrock2 [Lightbulb+Erbsen+PLDI2021], an untyped version of the C programming language. It has a verified compiler to RISC-V with a complete correctness proof, as well as a minimal program logic. The semantics divide the program state into three parts: the heap (a flat array of bytes indexed by natural numbers, with an optional layer of separation logic in the program logic), the current function context (a map of names to machine words), and an event trace capturing externally observable events.
Bedrock2's structured control flow includes function calls, conditionals, and loops; the semantics only give meaning to terminating loops, so proofs about Bedrock2 programs are total-correctness proofs. Additionally, stack usage is measured and restricted, so there is no general recursion. Memory allocation is handled by client code, except for allocation on the stack, which is available through a language primitive that gives client code access to temporary scratch space that is lexically scoped within a function's body. Finally, Bedrock2 has no global constants except through the heap, but it has a language primitive for static arrays of byte constants (inline tables).
The choice of Bedrock2 was mostly one of convenience, along with consideration for the long-term goal of end-to-end verification. Bedrock2's flat-array-of-bytes model of memory lends itself nicely to writing efficient programs, too, since it makes it easy to program patterns that are hard to write in C (for example, C makes it particularly difficult to reinterpret pointers, e.g. to iterate over an array of one type as an array of another type, or to reinterpret a structure as a sequence of bytes). Conversely, its support for a local context of variables and structured programming constructs saves Rupicola from having to grapple with issues such as register allocation or instruction selection.
4.3 Rupicola's source language(s): Coq EDSLs
There is no single, well-defined “input language” for Rupicola. Out of the box the compiler supports a number of patterns, and Rupicola's standard library provides support for many more, but extensibility is a key feature: we want users to be able to plug in new transformations, and to support both new constructs and new ways to compile already-supported constructs.
In other words, Rupicola's input language is flexible, and how wide a semantic gap to cross as part of the translation done with Rupicola is left to the user's discretion: users may start from relatively high-level specifications and use a complex set of compilation lemmas, or start from lower-level (yet still purely functional) code and use simpler compilation lemmas. In the upstr example above, for example, users can compile directly from the original upstr with no changes, leaving all complexity to compilation lemmas and rewrite rules; or they may write upstr' as we did but leave the translation from toupper to toupper' to a rewrite rule; or they may plug toupper' directly into upstr', at the cost of making the proof of upstr' against upstr a touch more complex; or they may write an even lower-level version of the code by appealing to a lower-level loop primitive (nd_ranged_for_all stands for nondependent for loop on all elements of a numeric range — a typical for loop):
Definition upstr' (s: list byte) :=
let/n s := nd_ranged_for_all
0 (Z.of_nat (length s))
(fun s idx =>
let/n b := ListArray.get s idx in
let/n b := toupper' b in
let/n s := ListArray.put s idx b in
s) s in
s.
In our experience, starting from lower-level code (not necessarily as low-level as nd_ranged_for_all above) is almost always the better approach: it works best to use Rupicola to introduce effects that are unpleasant to encode in the source, such as mutation, and to handle low-level concerns such as manual memory management, while relying on other high-level techniques to lower abstract specifications into inputs suitable for compilation with Rupicola (the point of this division is to completely separate the technicalities of the low-level imperative programming language from proofs of performance optimizations and implementation tricks).
With this layered methodology, which Rupicola shares with languages and frameworks like Low* [Kremlin+Protzenko+ICFP2017], Fiat [Fiat+Delaware+POPL2015] [Fiat+Chlipala+SNAPL2017], and VST [VST+Appel+ESOP2011], users have complete flexibility on how to develop the low-level Gallina code and connect it to high-level specs, all in shallowly embedded style and without worrying about Bedrock2; then, as a completely separate step, they can use Rupicola to jump the verification gap from shallowly embedded Gallina to deeply embedded Bedrock2, with end-to-end proofs.
An example may help make things more concrete, so let us borrow one from Fiat-Crypto [FiatCrypto+Erbsen+IEEESP2019], where some of my colleagues have been using Rupicola to derive implementations of cryptographic protocols. The highest-level specifications for that code fit in just a few lines (this code was written by Andres Erbsen):
Definition poly1305 (p:=2^130-5)
(k: list byte) (m: list byte): list byte :=
let r := Z.land (le_combine (firstn 16 k))
0x0ffffffc0ffffffc0ffffffc0fffffff in
let t := fold_left (fun a n =>
(a + le_combine(n ++ [x01])) * r mod p)
(chunk 16 m) 0 in
le_split 16 (t + le_combine (skipn 16 k)).
Compiled as-is, this code would be very slow, and there are many transformations that we may wish to perform as part of compiling it to Bedrock; a small sample follows:
We want to represent lists of bytes as arrays; this is easy to do using a compilation lemma.
The code creates new lists from the first and last 16 bytes of k; instead, we would like to take slices of the original array.
The computation of r is phrased by converting k into a number and bitwise-anding it with a large constant. A direct implementation of this would require unbounded integers, which we do not want to introduce here; instead, the constant can be chunked up into bytes, and the and operation can be performed over the bytes of the original k.
The loop over chunk 16 m performs modular arithmetic on large integers, which we may want to convert to use an optimized implementation.
The call to le_combine to construct a number from n and \x01 implicitly allocates memory; ideally, this should be done on the stack.
Trying to cram all these transformations into compilation lemmas has two disadvantages. First, the compiler becomes hard to reason about (too much happens for users to comfortably predict the characteristics of the output code; this is one mistake that I made in the design of the Fiat-to-Facade compiler). Second, the whole compilation process becomes more brittle, because of the amount of reasoning that is performed inside the compiler.
Instead, a different process is much preferable: first write a lower-level version of this code, still in Gallina [10], that is much more explicit about how execution is expected to proceed, and then compile that lowered program using Rupicola. In the example below, the functions used are chosen to all have an unambiguous translation to low-level programs.
Definition poly1305_impl (k: array_t byte) (msg: array_t byte)
(output: Z): array_t byte :=
let/n (f16, l16) := array_split_at 16 k in
let/n scratch := buf_make byte felem_size in
let/n scratch := buf_append scratch f16 in
let/n (scratch, padding) := buf_split scratch in
let/n scratch := List.map (fun '(w1, w2) => byte.and w1 w2)
(combine scratch [xff;xff;xff;x0f;xfc;xff;xff;x0f;
xfc;xff;xff;x0f;xfc;xff;xff;x0f]) in
let/n scratch := buf_unsplit scratch padding in
let/n scratch := buf_push scratch x00 in
let/n scratch := bytes_as_felem_inplace scratch in
let/n output := felem_init_zero in
let/n output := array_fold_chunked msg 16
(fun idx output ck =>
let/n nscratch := buf_make byte felem_size in
let/n nscratch := buf_append nscratch ck in
let/n nscratch := buf_push nscratch x01 in
let/n nscratch := bytes_as_felem_inplace nscratch in
let/n output := felem_add output nscratch in
let/n output := felem_mul output scratch in
output)
output in
let/n output :=
uint128_add (felem_as_uint128 output) (bytes_as_uint128 l16) in
let/n output := uint128_as_bytes output in
let/n k := array_unsplit f16 l16 in
output.
Thankfully, connecting these two versions of the code is not hard; this is what makes Rupicola's approach viable. Because the code is pure, all the bindings can be unfolded, all the custom functions reduced to primitives that manipulate lists, etc. — all in all, the Coq proof that relates these two versions is a matter of 5 to 15 lines, depending on how much one cares to automate it.
4.4 The Anatomy of a Rupicola Lemma
Rupicola has two relational compilers: one for expressions and one for statements, each based on the corresponding Bedrock2 judgment. Accordingly, Rupicola extension lemmas are mainly of two kinds.
4.4.1 Expression lemmas
Bedrock2's judgment for expressions, which I write DEXPR in Rupicola, is defined such that DEXPR m l e w means “expression e reduces to machine word w when run with memory m and locals l.”
Rupicola lemmas for expressions relate DEXPR judgments, optionally with preconditions. A typical lemma is thus:
m: mem
l: locals
z1, z2: Z
e1, e2: expr
0 <= z1 < 2 ^ 64 ->
0 <= z2 < 64 ->
DEXPR m l e1 (of_Z z1) ->
DEXPR m l e2 (of_Z z2) -> DEXPR m l (expr.op sru e1 e2) (of_Z (Z.shiftr z1 z2))
The function of_Z truncates an unbounded integer. The first two premises restrict the application of the lemma to certain inputs; in this case a right shift on Z (unbounded integers) is expressible as a word operation only if its operand is representable as a machine word and if the shift amount does not exceed the word width (shifting to the left commutes with truncating, but shifting to the right does not, so we must ensure that the upper bits of the original operand are zero). The next two premises are two compilation subgoals for the operand of the shift and the shift amount.
Of course, not every operation maps directly to a Bedrock2 operator; for example:
m: mem
l: locals
z1: Z
e1: expr
DEXPR m l e1 (of_Z z1) -> DEXPR m l (expr.op xor e1 (-1)) (of_Z (Z.lnot z1))
In these two examples m and l are not used; for l there is really a single lemma (reading a variable from the context), and for m there are many lemmas, with separation-logic preconditions asserting the presence of a specific structure at a given address (Bedrock2 expressions include pointer dereferences).
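That single locals lemma looks roughly like this (a sketch in the display style used above; the statement in the development may differ):

m: mem
l: locals
x: string
w: word
map.get l x = Some w -> DEXPR m l (expr.var x) w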
Maps of locals l use a custom notation to improve readability: #{ … m; k₁ => v₁; k₂ => v₂ }# is short for map.put (map.put m k₁ v₁) k₂ v₂.
4.4.2 Statement lemmas
Bedrock2's statement judgment is what Rupicola spends most of its time wrangling with. In Rupicola we write it as a Hoare triple \(\texttt{\{\!\{}\; t; m; l; \sigma \;\texttt{\}\!\}} \; c \; \texttt{\{\!\{}\; P\:p \;\texttt{\}\!\}}\). In the precondition \(t\) is the trace accumulated up to this point in the program, \(m\) the memory, \(l\) the locals, and \(\sigma \) the environment of functions that the program may call; in the postcondition \(P\) is a predicate and \(p\) is a Gallina value (the source program); and \(c\) is the Bedrock2 program being derived (always an evar). The judgment states that running \(c\) with the given starting state (precondition) leads to a final state verifying \(P\:p\) (here \(P\) is partially applied; the result is a predicate on trace, memory, and locals). As an example, here is a statement lemma about translating a replacement in a dependently typed (length-indexed) vector of bytes into a pointer assignment:
width: Z
BW: Bitwidth width
word: Interface.word width
mem: map.map word byte
locals: map.map string word
env: map.map string (list string * list string * Syntax.cmd)
ext_spec: Semantics.ExtSpec
word_ok: word.ok word
mem_ok: map.ok mem
locals_ok: map.ok locals
env_ok: map.ok env
ext_spec_ok: Semantics.ext_spec.ok ext_spec
n: nat
t: Semantics.trace
m: mem
l: locals
σ: list (string * (list string * list string * Syntax.cmd))
a: Vector.t byte n
i: Fin.t n
b: byte
T: Type
a_var: string
a_ptr: word
k: Vector.t Init.Byte.byte n -> T
I, B: expr
K: Syntax.cmd
pred: T -> Semantics.trace -> mem -> locals -> Prop
r: mem -> Prop
map.get l a_var = Some a_ptr ->
(bytes_vector a_ptr a ⋆ r) m ->
DEXPR m l I (of_Fin i) ->
DEXPR m l B (of_byte b) ->
(forall m' : mem,
let a' := Vector.replace a i b in
(bytes_vector a_ptr a' ⋆ r) m' -> {{ t; m'; l; σ }} K {{ pred (k a') }}) ->
{{ t; m; l; σ }}
seq (store access_size.one (expr.op add a_var I) B) K
{{ pred (let/n x as a_var := Vector.replace a i b in k x) }}
The lemma has five premises. Variables a, i, and b are the arguments to the operation being compiled (seen in a' := Vector.replace a i b and at the bottom as the argument to nlet). The first premise indicates that a pointer a_ptr may be found in the local variable a_var. The second premise states that memory m contains the vector a at address a_ptr, alongside some separate memory r. The third and fourth premises are expression-compilation subgoals for the expressions computing the values of i and b (by convention, in this lemma I write deeply embedded terms in uppercase). The final premise is a statement-compilation subgoal: it asserts that a program K can be found to implement the remainder k a' of the original computation, assuming a modified memory containing the updated vector and the untouched separate memory r.
The curious pattern of including a continuation in the compilation lemma gives us precise control over the shape of the next statement-compilation goal. In our experience this is more convenient than using a generic sequencing lemma, in part because Bedrock2's predicates are asymmetric (the precondition is not a predicate but a collection of values that are universally quantified over in the Coq context of the proof).
Also worthy of note is the way the continuation is invoked: the nlet form includes not only a value and a continuation (the usual encoding of (let var := val in body) as ((fun var ⇒ body) val)) but also a list of names ([a_var] in this case) that are provided by the user to describe which objects are intended to be mutated [11].
Definition nlet {A T} (vars: list string) (a: A) (body: A -> T) : T :=
let x := a in body x.
This wrapper has three purposes:
It helps direct the compiler: users introduce nlet bindings to direct the compiler to overwrite or mutate existing bindings, or to create new ones.
It prevents overly aggressive reduction: many Coq tactics tend to inline let bindings (ζ-reduction) as a side-effect of performing some other useful task.
It materializes the body of the let binding as a function — there is no other way to write a Coq lemma that matches a let binding parametrically on its body without relying on higher-order unification.
The vars part of nlet is automatically captured by a Coq notation [DSLs+PitClaudel+CoqPL2021], so the experience of authors of source programs is the same as writing regular let bindings, only with additional meaning given to the choice of names [12]:
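A simplified sketch of the explicit "as" form of this notation follows; the bare let/n form additionally computes the string from the binder itself, using the technique cited above:

Notation "'let/n' x 'as' nm := val 'in' body" :=
  (nlet [nm] val (fun x => body))
    (at level 200, x ident, body at level 200).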
In fact, for maximal generality, Rupicola compilation lemmas are usually written using a dependently typed version of nlet, named nlet_eq:
Definition nlet_eq {A} {P: forall a: A, Type}
(vars: list string) (a: A)
(body: forall a' (Heq: a' = a), P a') : P a :=
let x := a in body a eq_refl.
The reason this variant is needed is that, in Coq, the forms let x: T := a in body and (fun x: T => body) a are not equivalent [13]. The former is transparent, in the sense that body may depend on the value of a for typechecking purposes, whereas the latter is opaque, in that it may only depend on the type T of x:
The first program computes the head of a constant vector containing 3 elements; this is a well-typed operation, because 3 is S 2, and the type of Vector.hd requires a vector whose length matches S _:
The second program fails, because the body of the function does not have access to the value of n: the function itself must typecheck in all circumstances. Trying to write the program with nlet fails in the same way, but it succeeds with nlet_eq, at the cost of an explicit cast:
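Here are the failing variants, followed by the nlet_eq version with its cast (a sketch; rew is the cast notation from Coq's EqNotations):

Import EqNotations.
(* Opaque: inside the function, n is an arbitrary nat, so Vector.hd's
   S _ requirement on the length cannot be satisfied. *)
Fail Check ((fun n => Vector.hd (Vector.const true n)) 3).
Fail Check (nlet ["n"] 3 (fun n => Vector.hd (Vector.const true n))).
(* nlet_eq provides Heq : n = 3, which lets us cast the vector's length. *)
Check (nlet_eq ["n"] 3 (fun n Heq =>
         Vector.hd (rew [Vector.t bool] Heq in Vector.const true n))).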
Part of Rupicola's default compiler automation transforms nlet into nlet_eq before looking for lemmas to apply — this allows authors to phrase compilation lemmas in terms of nlet_eq and have them apply to both nlet_eq and plain nlet goals (without this, there would have to be two lemmas per code pattern, one for nlet bindings and one for nlet_eq bindings).
4.5 Relational compilation in Rupicola
Rupicola mostly follows the structure laid out in the later parts of , but two aspects are particularly interesting and worthy of discussion: effects and control flow.
4.5.1 Compiling effectful programs
A key tool for achieving excellent performance for extracted programs is leveraging the target language's native effects. In Rupicola, effects are classified into two categories: intensional and extensional.
4.5.1.1 Intensional effects
Intensional effects are not explicitly encoded in the source (they do not appear in type signatures). Instead, they are introduced by special-casing certain code patterns through compiler extensions.
State and certain aspects of allocation are handled this way in Rupicola. For state, in particular, we do not typically use an explicit encoding: instead, we add lemmas to map e.g. list accesses to pointer dereferences, or pure replacements in a list to pointer assignments. Allocation of short-lived objects on the stack is handled similarly; I discuss it in a case study in .
In general, intensional effects are either inferred or introduced explicitly using semantically transparent annotations on source programs. Every let-binding in functional models fed to Rupicola, for example, is annotated with the name of the variable it binds, allowing the compiler to make decisions about mutating existing variables and objects or creating new ones based on the user's choice of names (in general, Rupicola expects input programs to be sequences of let-bindings, one per desired assignment in the target language). Similarly, to indicate that a let-binding should result in a copy instead of a mutation, a user might wrap the value being bound in a call to a copy function of type \(\forall\: \alpha. \alpha \to \alpha\). Finally, while in simple cases data-structure mappings can be inferred automatically, in complex cases the user can control memory layout explicitly by using modules that transparently wrap underlying functional types (for example, the ListArray module reexposes list operations in a semantically transparent way that Rupicola can still use to drive compilation).
With this lightweight approach to intensional effects, and especially mutation, compiled programs can make full use of low-level state while source programs remain easy to reason about, with no explicit heap at the source level. This is a key advantage of Rupicola's intensional encoding of effects: it essentially does not impede verification efforts. When proving a functional model against a higher-level specification, annotations can simply be unfolded away: Rupicola's name-carrying let-bindings unfold to regular let-bindings, functions like copy above simply disappear, and modules wrapping standard types unfold to reveal them.
4.5.1.2 Extensional effects
Extensional effects, in contrast, are introduced using explicit monadic encodings. This is how Rupicola handles nondeterminism and I/O, but the methodology generalizes.
Specifically, Rupicola's encoding of postconditions is designed to be monad-agnostic, in the sense that lemmas about nonmonadic terms apply regardless of the monad that the program is using. This consolidation is possible because the postcondition of a Rupicola compilation goal is split between a term being compiled and a predicate on that term. The choice of monad changes the type of the predicate but not the (head of the) term, when that term starts with a pure computation (such a term would look like let x := a in k x, where k might have a monadic type).
Writing lemmas about nonmonadic computations parametrically on the predicate and the current continuation guarantees that they are applicable regardless of the ambient monad.
Lemmas about monadic computations, on the other hand, are by definition not monad-agnostic: they recognize terms of the form bind ma k, refine the current low-level program with an implementation of ma, and produce a goal for the continuation with a term of the form k a for some a. For this last step to be possible, the predicate being compiled against needs to obey certain properties: in short, given P: M A → state → Prop and a term bind ma k, we need to find a relation between P (bind ma k) st and P (k a) st for all st and for some (potentially universally quantified) value a. We guarantee alignment by using a monad-specific lift when compiling monadic programs, so that the postcondition always has shape lift P (bind ma k).
The following two examples illustrate this pattern:
For the nondeterminism monad, we can encode nondeterministic computations returning a value of type A as A → Prop (for example, a list of n unspecified bytes is represented as (fun bs: list byte ⇒ length bs = n)). Then, we just need to require predicates to be lifted using the function \(P \mapsto \lambda \: \texttt{ma} \, \mathit{st}. \exists\: a, \texttt{ma}\: a \wedge P \: a \: \mathit{st}\), which is such that \(\texttt{\{\!\{}\; t; m; l; \sigma \;\texttt{\}\!\}} \; c \; \texttt{\{\!\{}\; \texttt{lift}\: P\: (\mathsf{bind}\: \texttt{ma}\: k) \;\texttt{\}\!\}}\) is implied for all \(a\) by \(\texttt{ma}\: a \wedge \texttt{\{\!\{}\; t; m; l; \sigma \;\texttt{\}\!\}} \; c \; \texttt{\{\!\{}\; \texttt{lift}\: P\: (k\: a) \;\texttt{\}\!\}}\) (this is similar to what happens with nonmonadic bindings presented in , but here the value is constrained by the computation ma).
As an example, the following lemma exposes stack allocation as a nondeterministic computation. The lemma assumes that an arbitrary predicate r holds about the memory m, and runs the continuation k with that same memory augmented with an array of undetermined values. While the initial goal (bottom) contains a monadic bind, the continuation goal does not; instead, bs is a universally quantified list of bytes, subject only to the constraint length bs = sz. Note also how the predicate pred changes, to allow the stack-allocated memory to be released after the execution of k completes.
From Rupicola Require Import NonDeterminism.

Definition stack_alloc (nbytes: nat) : ND.M (list byte) :=
  (fun bs => length bs = nbytes).
Notation lift := ndspec_k.

(* Context: width, word, mem, locals, env, ext_spec (and their _ok laws);
   t: Semantics.trace; m: mem; l: locals; σ: function environment; sz: nat *)
∀ (B : Type) (pred : B → predicate) (k : list byte → ND.M B)
  (K : cmd) (r : mem → Prop) (var : string),
  r m
  → sz mod Memory.bytes_per_word width = 0
  → (∀ (ptr : word) (bs : list byte) (m' : mem),
       length bs = sz
       → (nbytes sz ptr bs ⋆ r) m'
       → let pred0 := fun (g : B) (t' : Semantics.trace) (m'' : mem) (l' : locals) =>
           ∃ r' (bs' : list byte),
             (nbytes sz ptr bs' ⋆ r') m''
             ∧ (∀ m0 : mem, r' m0 → pred g t' m0 l') in
         {{ t; m'; #{ … l; var => ptr }#; σ }} K {{ lift pred0 (k bs) }})
  → {{ t; m; l; σ }}
      cmd.stackalloc var sz K
    {{ lift pred (let/+ x as var := stack_alloc sz in k x) }}
For the writer monad, we can encode a computation as a pair of a value and some accumulated output, and require predicates to be lifted using the function \((o, P) \mapsto \lambda \: \texttt{ma}\, \mathit{st}. P\: (\mathsf{fst}\: \texttt{ma})\: (o \mathbin{\mathtt{+\!\!+}}\mathsf{snd}\: \texttt{ma})\: \mathit{st}\). Parameter \(o\) of the lift accumulates previous output, allowing us to compile monadic binds by accumulating their output into that parameter while reducing the source term. Here the relation is that \(\texttt{\{\!\{}\; t; m; l; \sigma \;\texttt{\}\!\}} \; c \; \texttt{\{\!\{}\; \texttt{lift}\: o\: P\: (\mathsf{bind}\: \texttt{ma}\: k) \;\texttt{\}\!\}}\) is implied by \(\texttt{\{\!\{}\; t; m; l; \sigma \;\texttt{\}\!\}} \; c \; \texttt{\{\!\{}\; \texttt{lift}\: (o \mathbin{\mathtt{+\!\!+}} \mathsf{snd}\: \texttt{ma})\: P\: (k\: (\mathsf{fst}\: \texttt{ma})) \;\texttt{\}\!\}}\).
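A minimal sketch of that encoding, under the assumption that output is a list of bytes (Rupicola's writer monad may differ in details):

From Coq Require Import Init.Byte Lists.List.

Module WriterSketch.
  (* A computation pairs its result with the output accumulated so far. *)
  Definition M (A: Type) : Type := A * list byte.
  Definition ret {A} (a: A) : M A := (a, nil).
  Definition bind {A B} (ma: M A) (k: A -> M B) : M B :=
    let '(a, o) := ma in
    let '(b, o') := k a in
    (b, o ++ o').

  (* The lift described above: parameter o carries previously emitted output. *)
  Definition lift {A State: Type} (o: list byte)
             (P: A -> list byte -> State -> Prop) : M A -> State -> Prop :=
    fun ma st => P (fst ma) (o ++ snd ma) st.
End WriterSketch.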
Thanks to this approach, only control-flow lemmas and lemmas that implement monadic computations need to be programmed.
4.5.2 Predicate inference for conditionals and loops
Compiling loops and conditionals poses specific challenges in Rupicola. To understand why, recall that the compilation process often needs to consult the current precondition, which captures the state reached after symbolically executing the already-derived prefix of the output program. This means that Rupicola needs careful control over the shape of the preconditions that get derived as compilation progresses; in other words, Rupicola needs to infer invariants and predicates at control-flow join points, a well-known challenge in automated verification (for readers familiar with predicate-transformer semantics, Box 1 gives a different intuition about the difficulty of compiling loops and conditionals in Rupicola).
Thankfully, the specifications for the loops and conditionals that Rupicola compiles are pure functional programs, and hence may appear directly in invariants. As a result, loop invariants can be inferred automatically and predictably, by capturing strongest postconditions in terms of partial executions of the functional model that defines a loop.
I start by giving the outline of one loop-compilation lemma, before explaining how loop predicates are computed.
4.5.2.1 The issue with branches
For straightline programs, the postcondition used to drive the compiler is given by the user in the form of a distinguished value (the source program) and a predicate relating that value to the final state of the low-level program (in terms of predicates about local variables and about the heap). The predicate is not inspected until the very last step, while the source program is progressively consumed, moving bindings into the context of the derivation. Each new binding yields exactly one new compilation goal, for the continuation of the program. For programs with conditionals, loops, and other forms of control flow, we have multiple compilation goals corresponding to multiple program paths (one per branch of a conditional, or one for the body of a loop, plus one for the continuation of the program). These goals do not have the same postcondition as the original program [14], and depending on how we state these postconditions, we may not be able to continue compiling.
This concern applies to all forms of branching: introducing a new join point in the control-flow graph of the output program requires stating or inferring a predicate that characterizes its locals and memory at the join point, and that predicate needs to be in a shape that can be exploited by the rest of the compilation process.
To illustrate the issue, let us start with the simpler case of finding an invariant that holds after a conditional.
Suppose that we are compiling code that writes value x to a memory cell at address p conditionally on a test t and returns a Boolean indicating whether a write has happened (in code: let (r, c) := (if t then (true, put c x) else (false, c)) in k c, a trivial compare-and-swap with k standing for the program's continuation), with locals {"c": p} and a memory predicate cell p c (stating that cell c is in a block of memory at address p). To compute the next symbolic state (the precondition used to compile k), we might naively attempt to compile both branches and then merge their strongest postconditions. The result, unfortunately, is a new predicate (t ∧ cell p (put c x)) ∨ (¬t ∧ cell p c) that is incomprehensible to later compilation steps: code in k will be looking for a single cell predicate mentioning the new value of c (if t then … else …), not a disjunction.
The situation is even worse with loops: compiling the body of a loop requires a concrete precondition (a symbolic state that we can inspect), and since that precondition needs to have the same shape as the postcondition of the body (both of these are instances of the loop invariant), we cannot even consider delaying the computation of the postcondition.
4.5.2.2 Rupicola's solution
The solution I implemented in Rupicola is based on a relatively simple heuristic. It takes advantage of Rupicola's freedom to impose rules on exactly which functional programs it accepts: a prepass of proved program transformations may be needed to put a program in an acceptable shape, but in return compilation is very predictable. The algorithm is as follows:
1. Identify the targets of the control-flow construct (loop or conditional) based on the names in the corresponding bindings. In the compare-and-swap example above, this would be two variables, "r" and "c".
2. For each target, determine whether it is a scalar or a pointer by inspecting the current locals and memory predicate. In the CAS example, we would determine that "r" is a scalar and "c" is a pointer: "r" because we do not find a binding for it in the map of locals, and "c" because the binding we find for it ("c": p) is to a pointer (p appears in the separation-logic predicate cell p c).
3. For each scalar, generalize over the corresponding binding in the locals; for each pointer, generalize over the corresponding entry in the predicate describing the memory. For CAS, we build a new map of locals {"c": p, "r": _} and a new memory predicate cell p _.
4. Close over the results. For CAS, we obtain the predicate (fun '(r, c) l m ⇒ l = {"c": p, "r": r} ∧ (cell p c) m).
The resulting predicate is parameterized over the source-level values of the variables being created or mutated: to obtain a plain predicate, we need to plug those values in. For forward edges (conditionals), they are exactly the source program being compiled, so for CAS we obtain (fun l m ⇒ let (r, c) := (if t then (true, put c x) else (false, c)) in (l = {"c": p, "r": r} ∧ (cell p c) m)).
Loops are trickier to deal with: for backwards edges (loop invariants), we need a source-level characterization of partial progress through the loop to close the parameterized predicate derived using the algorithm above. In other words, when stating a loop-compilation lemma, we need to give not just the precondition of the continuation, which refers to the results of iteration, but also the pre- and postconditions of the loop body, which refer to partially processed inputs.
In traditional verification, this would be done by asking users to supply manually crafted invariants. In Rupicola, however, we have an easier way: since we are only concerned with compiling functional programs built from constructs from a curated menu (we have one compilation lemma per type of loop), we can build the “symbolic” variable values corresponding to partially processed inputs simply by running the functional program for a reduced number of iterations!
In general, these partial-progress characterizations are very easy to express: for iterators on lists, they correspond to iterating on a prefix of the list while leaving the rest untouched; for iterators on ranges of numbers, they correspond to iterating on the first part of the range.
For example, suppose we are compiling the loop let c := Nat.iter 10 incr c in k c (where incr increments the content of a cell, and Nat.iter n composes a function with itself n times). We obtain a general invariant (fun i l m ⇒ let c := Nat.iter i incr c in l = {"c": p} ∧ (cell p c) m), where the value of c is derived from the number of already-completed iterations i.
This process works without extensions for all examples presented in this thesis. There are a few cases that it does not handle, but fewer than it may seem (for example, arithmetic on pointers may appear not to be inferred correctly, but arithmetic on pointers is in fact not expressible as such in the source language).
Most of them stem from simplifications in step 2: first, as written, it will always attempt to swap the contents of objects and never swap pointers (for example, let (x, y) := if … then (y, x) else (x, y) in … will swap the contents of x and y word by word, rather than reassigning the corresponding pointers); second, it will always assume that new bindings refer to new scalars (no conditional allocations). The first issue can be solved by adding semantically transparent annotations in the source; so can the second, but in practice it is seldom a concern for the programs we consider.
4.5.2.3 A note on loop invariants
Conveniently, this approach solves a second concern, specific to loops: while there may be properties that we want to prove across loop iterations, it would be a significant issue if users had to write invariants mentioning low-level states to supplement their purely functional code (it would break Rupicola's promise that translation to deeply embedded imperative programs can be automated and decoupled from writing the purely functional models).
For example, we may need to prove that a particular value is in bounds (in the incr example above, maybe the contents of the cell are then used to index into an array). Usually, these sorts of properties are expected to be plugged in by users, but we do not want users to have to state properties in terms of the memory or the locals of the program.
In practice properties of interest fall into two categories, and these two are handled differently:
4.5.2.3.1 Structural properties
Properties inherent to the choice of representation of a value (we call them structural) are encoded in separation-logic predicates. This is the case, for example, for the property that the length of an object does not change when it is mutated (a concern when running a loop over an array: how do we know whether a later array access is valid, if an earlier iteration of the loop might have grown or shrunk the array?). Concretely, in our original uppercasing example, we chose a separation-logic predicate that captured the length of the string in addition to its contents. Structural properties are automatically captured by our loop-invariant inference, so encoding properties structurally is almost always beneficial.
4.5.2.3.2 Incidental properties
Properties specific to a particular algorithm or program (we call them incidental) are proven at the source level and recovered during compilation using hints. For example, if in addition to incrementing a cell our loop also accessed an array at the index corresponding to the value of the cell (arr[*p]), we would want to prove that after each iteration, the value in the cell is still within the bounds of the array.
With our approach, rather than encoding them as low-level loop invariants, users prove incidental properties directly at the source level, by proving theorems about partial executions of their loops (iterations over part of a list, or over part of a range of numbers). For example, a user may prove that for all i, get (iter i incr c) equals get c + i. Plugging this in as a compilation hint would then allow a linear solver to prove side conditions like 0 <= get (iter i incr c) <= length arr from preconditions about c and length arr.
Early versions of Rupicola did not perform loop-invariant inference, treated all properties as incidental, and pushed all reasoning to compilation time. As a result, applying loop lemmas required stating complex invariants and proving them by reasoning directly on low-level Bedrock2 state. This led to slow, brittle derivations that performed complex reasoning and tended to be hard to automate. In contrast, by eliminating all ghost state and enforcing a specific shape of invariants, the new loop lemmas lend themselves nicely to automated derivation.
A good example of the benefit of this approach is given in the discussion of the Montgomery ladder function in .
4.6 Rupicola's Architecture
Rupicola is divided into a minimal core and a collection of orthogonal extensions that enable users to grow its input language. Not all programs use all of these extensions: users can mix and match, and in fact throughout Rupicola's development a common pattern has been to implement support for new constructs as part of developing an application and later to move those parts into Rupicola's standard library.
4.6.1 The core
Rupicola's core is composed of definitions, notations, forward-reasoning tactics, and supporting architecture; together, they are enough to state program-compilation goals and run the compiler but not to derive any concrete programs: almost all features of Rupicola are implemented as extensions of this minimal core.
4.6.1.1 Core definitions and notations
Rupicola's compilation relations are exactly the semantic judgments of Bedrock2 for expressions and statements, applied to postconditions with a distinguished argument indicating which program is being compiled. We write them as DEXPR m l e w and <{ Trace := t; Memory := m; Locals := l; Functions := σ }> c <{ P p }> respectively, with m a memory, l a map of locals, e an expression, w a word, σ an environment of functions, c a command, P a predicate, and p a Gallina value (the {{ … }} notation that we encountered previously for statements is an abbreviation of the statement-judgment notation when there is no ambiguity).
Rupicola has notations for program specifications and compilation goals, which we have briefly encountered previously; specifically, the notation for specifications attaches a pre- and post-condition pair to a name:
Instance spec : spec_of name :=
fnspec! name args / ghost_args, {
requires tr mem := precondition;
ensures tr' mem' rets := postcondition
}.
It desugars to a forall-quantified statement (here […args] is shorthand indicating that we construct a list from the names in args):
Instance spec : spec_of name := fun σ =>
forall args ghost_args, forall tr mem,
precondition ->
WeakestPrecondition.call
σ name tr mem […args] (fun tr' mem' rets =>
postcondition).
The notation for compilation goals is intended to be used with Coq's Derive command, which provides convenient syntactic sugar to generate a term and its proof in tandem:
impl := (name, ([args], [rets], ?Goal)) : bedrock_func

let body := ?Goal in
__rupicola_program_marker spec
→ ∀ functions : list bedrock_func, fnspec functions → spec (impl :: functions)
It desugars to a theorem stating that a piece of syntax impl verifies the specification above (it finds that specification through typeclass resolution, based on the function's name), roughly equivalent to the following. The notation …fnspec → stands for a sequence of implications, one per function in σ, each asserting the specification of that function. The premise __rupicola_program_marker spec is vacuous; it simply helps Rupicola pick out the program being compiled from the postcondition.
body := ?Goal : cmd

__rupicola_program_marker spec
→ ∀ σ : list (string * (list string * list string * cmd)),
    fnspec σ → spec ((name, ([args], [rets], body)) :: σ)
Finally, Rupicola's core definitions include a type class that identifies which binding constructs are Rupicola binders; this is used when inspecting a program to determine whether it is fully compiled (Rupicola assumes that input programs are written as sequences of let-bindings, each of which gets compiled to an individual assignment or mutation in the target language [15]):
Inductive RupicolaBindingInfo :=
| RupicolaBinding (rb_type: Type) (rb_names: list string)
| NotARupicolaBinding.

Class IsRupicolaBinding {T} (t: T) :=
  is_rupicola_binding: RupicolaBindingInfo.
Instances of this typeclass are derived dynamically, using a tactic and defaulting to NotARupicolaBinding; the sketch below illustrates the flavor of hint involved.
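(* A hypothetical hint in the flavor of Rupicola's standard library (not
   verbatim): recognize name-carrying bindings (nlet) and report the type
   and names they bind; everything else falls through to the
   NotARupicolaBinding default. *)
Hint Extern 2 (IsRupicolaBinding (@nlet ?A ?B ?vars ?v ?k)) =>
  exact (RupicolaBinding A vars) : typeclass_instances.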
This allows users to register hints to extend the set of patterns that are considered compilation targets. The compiler uses these hints to decide whether it should attempt to invoke additional compilation lemmas or whether it should try to unify the pre- and post-condition of the program.
The first implementation of Rupicola did not have this typeclass: Rupicola simply tried to apply lemmas as long as they were available, and then tried to unify the pre- and post-conditions of the program. The problem with that approach is that it makes debugging and incremental compiler construction significantly more difficult, since the user is presented with a unification failure instead of a helpful message highlighting which construct is unsupported by the compiler. Incremental compiler construction is discussed in more detail in .
4.6.1.2 Compilation tactics
The rest of Rupicola's core defines tactics for driving the proof-search process. Rupicola's compiler is a single Coq tactic that runs the following steps in a loop until either the goal is solved or no more progress is possible:
Goal clean-up and context management: introduce quantified variables, prune stale hypotheses, substitute variable equalities, and perform other user-specified cleanups.
Compilation: find a lemma that applies to the current goal from the current database of lemmas and apply it. This is done in forward-reasoning style; the details are described in the following section.
Automatic side-condition resolution: use generic and user-specified solvers to discharge side conditions arising from the use of compilation lemmas.
Additionally, a set-up phase prepares the goal for compilation, translating the defn! goal into a compilation goal stated as a Hoare triple. This translation phase also applies a few compilation passes to the generated Bedrock2 AST, to make Rupicola's output more readable: a transformation that removes Bedrock2 skips (no-ops), and another that removes self-assignments (a = a). It may be surprising that these transformations are applied as part of the setup phase of the compiler, before code is actually generated, but this is justified by the structure of Rupicola's compilation goals. Indeed, starting with a goal of the form {{ t; m; l; σ }} ?c {{ P p }}, applying a transformation T to the resulting program ?c can be done by ensuring that ?c has the shape T ?c', for some other program ?c'. The correctness of T is captured by a compilation lemma stating that if {{ t; m; l; σ }} ?c' {{ post }} holds, then so does {{ t; m; l; σ }} T ?c' {{ post }}. Applying such a lemma as the very first step of compilation unifies ?c with T ?c', and the result after compilation finishes is a program wrapped in the transformation T.
4.6.1.3 Extension points
Rupicola's strength is its extensibility, so users need to be able to plug in new lemmas and new reasoning straightforwardly. Rupicola exposes the following extension points:
- compiler_setup (fw)
  Used to preprocess compilation goals without bypassing the default setup automation.
- compiler_setup_post (fw)
  Used to preprocess compilation goals while bypassing part of the default setup automation. This is used by monadic programs: the default setup automation refactors the postcondition of the compilation goal as (P p), where P is a predicate and p a Gallina term, but for monadic programs we need the postcondition to have the specific shape lift P p (). Part of implementing support for a new monad is to add a lemma to this database that refactors the postcondition into this lifted shape (the default automation only runs if none of the lemmas in this database apply).
- compiler_cleanup (fw, rw, u)
  Used to perform miscellaneous cleanups at each step of the compilation process, before applying compilation lemmas. This is used to simplify expressions (e.g. repeated casts like Z.to_nat (Z.of_nat …)), to inline functions, to plug in optimizations expressible as rewrites, etc.
- compiler_pull_binding (fw)
  Used just before converting nlets into nlet_eqs. This is used to apply transformations that are significantly more complex to state in terms of the dependently typed nlet_eq: typically transformations that reorder bindings, for example to flatten a nested tree of bindings such as let/n x := (let/n y := 1 in y) in x.
- compiler (fw)
  The first of two main compilation entry points. This hook is used to plug in statement-compilation lemmas.
- expr_compiler (fw)
  The second of two main compilation entry points. This hook is used to plug in expression-compilation lemmas.
- compiler_side_conditions (fw, rw, u)
  Used to solve side conditions that arise from applying lemmas from compiler and expr_compiler.
- compiler_cleanup_post (fw, rw, u)
  Used for aggressive cleanups that precede unification when compilation is complete.
All of these hooks are Coq hint databases; the ones marked fw (for forward) are only used with typeclasses eauto: lemmas added to them with a call to shelve make forward progress, while other lemmas added to them are only used if they form a complete path to a solution of the current goal. The ones marked rw and u are also used for rewriting and unfolding, respectively.
4.6.2 Rupicola's standard library
Beyond the core compilation support, Rupicola's standard distribution includes support for the following features:
4.6.2.1 Arithmetic
Rupicola's expression compiler supports arithmetic on nat, N, Z, byte, bool, Fin.t, and machine words. Not all types support all operations, but new operations and new types are very easy to add (details on extending the expression compiler are given in ).
The expression compiler is mostly straightforward, but one pattern is tricky and is in fact the only place where Rupicola currently uses backtracking [16]: comparisons on unbounded integers. There are two ways to map a comparison on integers to a Bedrock2 operation: the ltu and lts binary operators, which compare their operands as unsigned or signed integers, respectively. Both of them require proving that the original integers are bounded, but the bounds are different [17]: ltu is valid if the integers are in the range \(0 \leq \ldots < 2^w\), while lts is a valid implementation of Z.lt if the integers are in the range \(-2^{w-1} \leq \ldots < 2^{w-1}\).
This issue is handled by not shelving the side conditions generated by the expr_compile_Z_ltb_u and expr_compile_Z_ltb_s lemmas: this way, unless we can immediately prove the required bounds, compilation stops without making potentially incorrect assumptions that would later lead to unsolvable goals. Of course, if a user wants to apply either lemma unconditionally, they are free to do so by registering hints for these lemmas with an additional call to shelve.
4.6.2.2 Arrays
Rupicola has five encodings of arrays: unsized list arrays, sized list arrays, vector arrays, buffers, and inline tables.
The first three share a common API (get, put, and map); they are all mapped to flat arrays of immediates (bytes or native ints), but they are parametric in the type of data being stored in the array, so in addition to arrays of bytes and native ints it is possible to use byte arrays to store values of any type that is representable within a byte, and word arrays for any type whose values fit in a machine word. Concretely, that means that it is possible to have, say, arrays of tagged values (e.g. encoded values with three bits used to store a tag and the rest used to store an immediate value), or arrays of subsets of integers, e.g. an array of numbers less than 7 (this helps when indexing repeatedly, e.g. when the result of one array lookup is used to index into another array: if the first array has values of type {x | x < length arr2}, then the bound checks in the second array are trivial).
Buffers are also flat arrays of words under the hood, but they are partially uninitialized: the Gallina model of these values only captures the initialized part, but the separation logic predicate also includes additional unknown (existentially quantified) values:
Definition buffer_value (ptr: word)
(data: list word) (capacity: nat) (m: mem) :=
∃ padding: list word,
sizedlistarray_value AccessWord capacity ptr (data ++ padding) m.
The buffer API allows pushing words while some uninitialized space remains; popping words while some initialized data remains; and converting into a regular sized array (using the sized-array predicate) once the buffer is full.
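At the Gallina level, these operations might look as follows (a sketch with assumed names and pop-from-the-end semantics; Rupicola's actual buffer API may differ):

From Coq Require Import Lists.List. Import ListNotations.

(* Push extends the initialized prefix (valid while length data < capacity). *)
Definition buf_push {A} (data: list A) (w: A) : list A := data ++ [w].

(* Pop removes the most recently pushed element, if any. *)
Definition buf_pop {A} (data: list A) : option (list A * A) :=
  match rev data with
  | [] => None
  | w :: rest => Some (rev rest, w)
  end.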
The inline tables API is similar to the sized array API, but read-only and statically allocated; I discuss it in detail in .
4.6.2.3 Control flow
Out of the box, Rupicola has support for conditionals (if expressions) and for a variety of loops. For conditionals there are two lemmas: one for conditionals in tail position, which do not require predicate inference since the control-flow join point after such a conditional is the function exit, and one for conditionals in expression position, which do require predicate inference (I discussed predicate inference in detail in ).
For loops, Rupicola has a family of generic lemmas, with support for higher-order iteration patterns built on top of them. Specifically, the library provides the following definitions:
- foldl_dep, foldl
  A dependently typed (resp. simply typed) left fold (the body of the fold gets access to a proof that the element is part of the original list), with support for stopping part-way through the traversal (encoded as a stopping condition stop: A -> bool).
- ranged_for_break, nd_ranged_for_break
  Like foldl_dep, but specialized to a range of numbers (nd stands for nondependent).
- ranged_for, nd_ranged_for
  Like ranged_for_break, but with a flag set by the body of the loop to indicate early exits, instead of a separate stop condition.
- ranged_for_all, nd_ranged_for_all
  Like ranged_for, but without early exits.
- ranged_for_u, nd_ranged_for_u, ranged_for_all_u, nd_ranged_for_all_u, ranged_for_s, nd_ranged_for_s, ranged_for_all_s, nd_ranged_for_all_s
  Like ranged_for and ranged_for_all, but specialized to ranges of machine words (unsigned and signed) instead of unbounded integers.
On top of these definitions we build a core loop compilation lemma for the dependently typed version of ranged_for, and then derive a collection of more convenient lemmas from it, starting with lemmas that initialize loop variables. Since Bedrock2 does not have a break statement, early exits are encoded by having the body set the loop counter to the maximum value.
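As a concrete (if simplified) picture of these iterators, a simply typed range fold without early exits, in the spirit of nd_ranged_for_all, can be written as follows (a sketch, not Rupicola's exact definition):

From Coq Require Import ZArith. Open Scope Z_scope.

(* Iterate f over [from, from + len), threading an accumulator. *)
Fixpoint ranged_fold {A} (from: Z) (len: nat) (f: Z -> A -> A) (a: A) : A :=
  match len with
  | O => a
  | S len' => ranged_fold (from + 1) len' f (f from a)
  end.

(* Example: summing the range [0, 5). *)
Example sum_0_4 : ranged_fold 0 5%nat Z.add 0 = 10.
Proof. reflexivity. Qed.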
All definitions past ranged_for, as well as maps and folds, are compiled either by rewriting them in terms of ranged_for and then applying the ranged_for compilation lemma, or by using specialized lemmas derived from that lemma and customized to apply to these higher-level iterators. The latter requires a bit of additional development (new compilation lemmas), but it is in general much preferable, since it allows the intermediate state exposed in the body of the loop to be expressed in terms of the higher-level iterator: for example, compiling a map by rewriting it to a ranged_for gives an intermediate state with a mutated array l' = ranged_for 0 k (fun k l Heq => put k (f (get k l))) l, whereas using a specialized loop lemma gives l' = map f (firstn k l) ++ (skipn k l). Both are equivalent, but the latter is much easier to work with.
This last point is a special case of the encoding difficulties that come up with loop lemmas: loop lemmas are particularly tricky because they tend to have more complex side conditions, and these side conditions need to be phrased in a way that is amenable to automated resolution. As a standalone case study, let us look in more detail at the simplest loop-compilation lemma: the one for Nat.iter.
t: Semantics.trace
m: mem
l: locals
σ: list (string * (list string * list string * cmd))
n: nat
A: Type
f: A → A
a: A
∀ (v := Nat.iter n f a) (B : Type) (pred : B → predicate)
(loop_pred : nat → A → predicate) (k : A → B) (K F : cmd)
(I : expr) (i_var : string) (vars : list string),
n < 2 ^ 64
→ DEXPR m l I (of_nat n)
→ loop_pred n a t m #{ … l; i_var => of_nat n }#
→ (∀ (i : nat) (st : A) (t : Semantics.trace) (m : mem) (l : locals),
loop_pred i st t m l → map.get l i_var = Some (of_nat i))
→ (∀ (loop_pred0 := loop_pred) (t : Semantics.trace)
(l : locals) (m : mem) (i : nat),
i < n
→ let st := Nat.iter (n - S i) f a in
loop_pred0 (S i) st t m l
→ {{ t; m; #{ … l; i_var => of_nat i }#; σ }}
F
{{ loop_pred0 i (f st) }})
→ (∀ (v0 := v) (t : Semantics.trace) (l : locals) (m : mem),
loop_pred 0 v0 t m l → {{ t; m; l; σ }} K {{ pred (k v0) }})
→ {{ t; m; l; σ }}
cmd.set i_var I;;
cmd.while (expr.op ltu 0 i_var)
(cmd.set i_var (expr.op sub i_var 1);; F);; K
{{ pred (nlet vars v k) }}
This lemma has six premises:
- n < 2 ^ width guarantees that the number of iterations we are performing fits in a machine integer.
- DEXPR m l I (of_nat n) is a compilation subgoal: it requires us to compile an expression that reduces to the number of iterations that we want to perform.
- loop_pred n a t m #{ … l; i_var => of_nat n }# requires the loop predicate to hold when the loop starts. It is not invoked with the original locals l, but with the result of adding i_var to l, because the generated code initializes a fresh loop counter i_var (cmd.set i_var I).
- The predicate-stability premise ensures that the loop body does not overwrite the loop counter.
- The loop-body premise is a compilation goal: it requires us to compile the function f. There are a few interesting aspects to highlight. First, we have a seemingly redundant binding loop_pred0 := loop_pred: adding this binding prevents Coq from inlining loop_pred into the postcondition of our compilation goal, which maintains the expected shape of the postcondition (P p, with P a predicate and p a Gallina program). Second, we have the assumption i < n and the state st; note how st is fully concrete and defined in terms of the high-level iterator Nat.iter (not in terms of the Bedrock2 implementation of the body of the loop). Third, the loop predicate is applied to S i, not i: this is because the loop body starts by subtracting 1 from the loop counter, so at the beginning of the loop the subtraction has not happened yet. But, fourth, the body of the function itself is invoked with a map of locals that contains i for i_var, not S i, since by the time the Bedrock2 implementation F of f starts running, the loop-counter decrement has happened (hence the i also in the postcondition of that same compilation goal).
- The continuation premise is another compilation goal, this time for the code that follows the loop; the binding for v ensures that the rest of the derivation treats v opaquely (instead of inlining Nat.iter n f a into the rest of the code).
These design choices are the result of a long series of iterations on loops. Fundamentally, a loop lemma in Rupicola connects an iteration function (which repeats a computation a certain number of times) to a while loop in Bedrock2. This loop has a stop condition, typically index < bound or index > 0, as well as an invariant (which is used to constrain the derivation of the loop body), typically a formula giving exactly the state of the memory and of the locals after a certain number of iterations. [18]
For the equivalence between the Gallina and Bedrock2 versions to hold, we need to know that the loop body preserves the invariant and that the loop runs the right number of times. In particular, we need the loop body to update the loop counter (in the following, assume that the update is an increment). There are a number of ways to design this, the difficulty being to decide which part of the code is responsible for that update, and hence how the update is reflected in the invariant:
1. Include the counter increment in the postcondition that the body needs to satisfy, and derive the corresponding Bedrock2 code as part of the derivation of the loop body. This is not convenient, because the Gallina code does not increment a counter, so there is no Gallina code to drive the derivation of the increment in Bedrock2.
2. Change the Gallina code so that each iteration of the loop returns the next value of the counter. This solves (1.), but it is hard to do in a way that guarantees termination.
3. Put all loops in the nondeterminism monad, so that loops are really an arbitrary repeat() of a given Gallina function, run until a given predicate holds. This solves (2.), since the Gallina code does not need to check for termination anymore, but since Bedrock2 requires termination these loops could not be straightforwardly compiled.
4. Force the inclusion of an increment into the Bedrock2 loop body. That is, initialize the evar corresponding to the Bedrock2 version of the Gallina loop body to include an increment: {{ t; m; l; σ }} ?body; index++ {{ invariant (S i) f }}. Unfortunately, the compiler is not prepared to deal with already-compiled code (the index++ part following the ?body evar), and besides there is still no Gallina code matching that increment.
5. Append the increment to the synthesized Bedrock2 loop body; use the invariant specialized to a map containing an updated index as the postcondition of the synthesized body. Remember that the issue is knowing which postcondition to use to compile the loop body. Here, the idea is to say that the loop body's postcondition is exactly the loop invariant, but applied to a map of locals in which the loop index is incremented. In other words, the code will still be ?body; index++, but we only reason about ?body (so we do not run into the issues of (4.)): we give ?body the postcondition {{ invariant (S i) (map.put locals "i" (S i)) }}. Morally, this computes the weakest precondition of the increment and specializes the loop predicate to it. If we know that this postcondition holds after executing the synthesized body, then in particular invariant will still hold after executing the increment. The problem here is the final step of unification, after we finish compiling the loop body. The compiled code will have performed all sorts of manipulations on the locals, and having to assert invariant with an additional layer of map.put throws a wrench into the automation. For example, if the body sets then unsets some local variables, then we find ourselves having to prove invariant (map.put (map.remove ?keys l) …), with the additional difficulty that instantiating ?keys is in fact part of the final unification process.
6. Do the same as (5.), but do not specialize the postcondition; instead, add constraints on the invariant. Specifically, ensure that the only thing the invariant does with the counter is to store it in the right place in the map of locals. This guarantees that invariant i l implies invariant i' (map.put locals "i" i'); it is what we called predicate stability in the discussion of Nat.iter above (note how it allows us to have an uncluttered postcondition in the loop-body premise above). [19]
In Rupicola, I used solution (1.) to phrase and prove the most basic loop lemma, and then derived specialized variants of it in the shape of solution (6.).
4.6.2.4 Extensional effects (monads)
Rupicola's standard library implements a writer monad, an I/O monad, a non-determinism monad, and a generic free monad. For each of these we define a monad instance, from which we derive a bindn constructor that captures binder names as strings in addition to the usual value and continuation.
We then define two lifts: one to state postconditions in function specifications, and one to state postconditions in compilation goals. For the nondeterminism monad, the first one is ndspec, and the second ndspec_k:
Definition ndspec {A} (c: ND.M A) (P: A -> Prop) :=
  ∃ a, c a ∧ P a.

Definition ndspec_k {A} (P: A -> predicate) (c: ND.M A) : predicate :=
  fun tr mem locals => ndspec c (fun a => P a tr mem locals).
The lift criterion described in is a consequence of the same property of ndspec:
∀ (A B: Type) (pred: B → Prop) (vars: list string)
  (nd: ND.M A) (a: A) (k: A → ND.M B),
  nd a → ndspec (k a) pred → ndspec (mbindn vars nd k) pred

∀ (A B: Type) (c: cmd) (t: Semantics.trace) (m: mem) (l: locals)
  (σ: list (string * (list string * list string * cmd)))
  (post: B → predicate) (vars: list string)
  (nd: ND.M A) (a: A) (k: A → ND.M B),
  nd a →
  {{ t; m; l; σ }} c {{ ndspec_k post (k a) }} →
  {{ t; m; l; σ }} c {{ ndspec_k post (mbindn vars nd k) }}
With these in place, we then define a setup lemma that recognizes postconditions that employ the first lift and changes them into the second lift; finally, we define lemmas that provide support for compiling monadic operations, such as the stack allocation above or, in the case of the nondeterminism monad, calls to arbitrary nondeterministic Bedrock2 functions.
4.6.2.5 Low-level features
Rupicola has intensional encodings of a variety of low-level Bedrock2 features. I discuss inline tables and stack allocation in detail in two case studies in . Rupicola also supports mapping Gallina functions to arbitrary Bedrock2 functions, to be separately compiled and linked.
4.6.3 Unsupported features and limitations
Rupicola does not aim to support all of Gallina, so incompleteness is one of its fundamental limitations: its input language depends on the set of compilation lemmas provided by the user.
This is not a significant issue in most cases, but it does mean that many of the facilities that come for free in Gallina must be translated into individual compilation lemmas. A prime example is pattern matching: in Gallina it is automatically available on all user-defined types, whereas in Rupicola it is not available on any type out of the box (and, as it is a control-flow construct, it requires special care).
Other limitations stem from limitations in Coq's pattern-matching, as well as limitations in Bedrock2. For example, Bedrock2 does not support arbitrary recursion (to rule out stack overflows), and neither does Rupicola (an additional wart in implementing support for raw Gallina fix constructs is that, like let-bindings, a Coq lemma cannot capture the body of a fix; but unlike let bindings where we can define the nlet combinator, for fix we would need a general fixpoint combinator, which is not expressible in Coq). Similarly, Bedrock2 distinguishes statements and expressions, and that design leaks into Rupicola (we need to guess whether a particular binding needs a statement compilation lemma or can be compiled as an expression).
Beyond this, the main pain point with Rupicola is compilation speed, which I discuss in .
4.7 Rupicola step-by-step
To tie everything together, let us revisit our original example one last time, this time peeking under the hood of the compile tactic.
Recall that in we defined specifications upstr and toupper, lowered functional implementations upstr' and toupper', as well as a signature; here I reproduce only upstr' and toupper':
Definition upstr' (s: list byte) :=
  let/n s := ListArray.map (fun b => byte_of_ascii (toupper (ascii_of_byte b))) s in
  s.

Definition toupper' (b: byte) :=
  if byte.wrap (b - "a") <? 26 then byte.and b x5f else b.
Let us now see what happens when we run compile. That tactic is defined to perform three steps: compile_setup; repeat compile_step; and finally compile_done (the last step only checks whether any compilation goals are left and, if so, shows some hints to the user). For brevity, in all goals below |s| stands for (length s), and casts from nat, byte, and Z to word are implicit in the values of local-variable maps (so, for example, #{ … m; k => |s| }# is short for map.put m k (of_nat (length s))).
Proof.

upstr_br2fn := ("upstr", (["s"; "len"], [], ?Goal)) : bedrock_func

let body := ?Goal in
__rupicola_program_marker upstr' ->
forall functions : list bedrock_func, spec_of_upstr (upstr_br2fn :: functions)
The compile_setup tactic unfolds the defn! notation and refactors the postcondition of the compilation relation to make explicit which program we are compiling (based on the implements clause in the call to Derive). For readability we omit uninteresting hypotheses below, but notice how the remaining hypotheses capture all the information about the variables mentioned in the precondition of the program:
compile_setup.
At this point compile_step performs some cleanups. The next lemma to apply is compile_byte_listarray_map, but to be able to apply it we need to (1) change the goal to an nlet_eq goal, (2) infer a loop predicate, and (3) pick a name for temporary variables. compile_step does (1); the compile_map tactic does (2), using the invariant-inference infrastructure, and (3), using a simple gensym. We can peek at the results by invoking the loop-inference logic manually:
Applying the loop-compilation lemma produces a number of goals, most of which are trivial side conditions; for concision I omit most of them here:
upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memword.ok wordupstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memmap.ok BasicC64Semantics.memupstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memmap.ok localsupstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memmap.ok envupstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memSemantics.ext_spec.ok ext_specupstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) mem0 <= Z.of_nat |s| < 2 ^ 64upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memDEXPR mem #{ "s" => p; "len" => wlen; (gs "_from") => 0 }# ?Goal (of_nat |s|)upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memforall (idx : Z) (a0 : list byte) (tr0 : Semantics.trace) (mem0 : BasicC64Semantics.mem) (locals0 : locals), tr0 = tr /\ locals0 = #{ "s" => p; "len" => wlen; (gs "_from") => idx; (gs "_to") => |s| }# /\ (nbytes |s| p a0 ⋆ r) mem0 -> map.get locals0 (gs "_from") = Some (of_Z idx) /\ map.get locals0 (gs "_to") = Some (of_nat |s|)upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memforall (idx idx' : Z) (acc : list byte) (tr0 : Semantics.trace) (mem0 : BasicC64Semantics.mem) (locals0 : locals), tr0 = tr /\ locals0 = #{ "s" => p; "len" => wlen; (gs "_from") => idx; (gs "_to") => |s| }# /\ (nbytes |s| p acc ⋆ r) mem0 -> tr0 = tr /\ #{ … locals0; (gs "_from") => idx' }# = #{ "s" => p; "len" => wlen; (gs "_from") => idx'; (gs "_to") => |s| }# /\ (nbytes |s| p acc ⋆ r) mem0upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memtr = tr /\ #{ "s" => p; "len" => wlen; (gs "_from") => 0; (gs "_to") => |s| }# = #{ "s" => p; "len" => wlen; (gs "_from") => 0; (gs "_to") => |s| }# /\ (nbytes |s| p s ⋆ r) memupstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memlet lp := fun (idx : Z) (args : list byte) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (locals' : locals) => tr' = tr /\ locals' = #{ "s" => p; "len" => wlen; (gs "_from") => idx; (gs "_to") => |s| }# /\ (nbytes |s| p args ⋆ r) mem' in forall (tr0 : Semantics.trace) (mem0 : BasicC64Semantics.mem) (locals0 : locals) (idx : Z), 0 <= idx < Z.of_nat |s| -> let n := Z.to_nat idx in let a := (List.map (fun b : byte => byte_of_ascii (toupper (ascii_of_byte b))) (List.firstn n s) ++ List.skipn n s)%list in tr0 = tr /\ locals0 = #{ "s" => p; "len" => wlen; (gs "_from") => idx; (gs "_to") => |s| }# /\ (nbytes |s| p a ⋆ r) mem0 -> {{ tr0; mem0; locals0; σ }} ?Goal0 {{ lp idx (let/n tmp as gs "_tmp" := ListArray.get a idx in let/n tmp0 as gs "_tmp" := byte_of_ascii (toupper (ascii_of_byte tmp)) in let/n x as "s" := ListArray.put a idx tmp0 in id x) }}upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false (gs "_from") 0 (gs "_to") ?Goal ?Goal0 ?Goal1)))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memlet v := ListArray.map (fun b : byte => byte_of_ascii (toupper (ascii_of_byte b))) s in forall (tr0 : Semantics.trace) (mem0 : BasicC64Semantics.mem) (locals0 : locals), tr0 = tr /\ locals0 = #{ "s" => p; "len" => wlen; (gs "_from") => |s|; (gs "_to") => |s| }# /\ (nbytes |s| p v ⋆ r) mem0 -> {{ tr0; mem0; locals0; σ }} ?Goal1 {{ pred v }}H3: Z.of_nat |s| < 2 ^ 640 <= Z.of_nat |s| < 2 ^ 64H1: wlen = of_nat |s|DEXPR mem #{ "s" => p; "len" => wlen; "_from" => 0 }# ?Goal (of_nat |s|){{ tr; mem0; #{ "s" => p; "len" => wlen; "_from" => idx; "_to" => |s| }#; σ }} ?Goal0 {{ lp idx (let/n tmp as "_tmp" := ListArray.get a idx in let/n tmp0 as "_tmp" := byte_of_ascii (toupper (ascii_of_byte tmp)) in let/n x as "s" := ListArray.put a idx tmp0 in x) }}{{ tr; mem0; #{ "s" => p; "len" => wlen; "_from" => |s|; "_to" => |s| }#; σ }} ?Goal1 {{ pred v }}
The first goal captures the fact that the size of the array is representable as a 64-bit integer (this compilation run assumes a 64-bit machine), which we assumed as part of the function's precondition; the second one requires us to provide an expression that reduces to the length of s; the third, a program that implements the body of the loop; and the fourth, a program implementing the continuation of the original program, which does nothing.
upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false "_from" 0 "_to" ?Goal ?Goal0 (fold_right (fun (v : string) (c : cmd) => cmd.seq (cmd.unset v) c) cmd.skip []))))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) mem0 <= Z.of_nat |s| < 2 ^ 64upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false "_from" 0 "_to" ?Goal ?Goal0 (fold_right (fun (v : string) (c : cmd) => cmd.seq (cmd.unset v) c) cmd.skip []))))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memDEXPR mem #{ "s" => p; "len" => wlen; "_from" => 0 }# ?Goal (of_nat |s|)upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false "_from" 0 "_to" ?Goal ?Goal0 (fold_right (fun (v : string) (c : cmd) => cmd.seq (cmd.unset v) c) cmd.skip []))))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) mem
lp:= fun (idx : Z) (args : list byte) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (locals' : locals) => tr' = tr /\ locals' = #{ "s" => p; "len" => wlen; "_from" => idx; "_to" => |s| }# /\ (nbytes |s| p args ⋆ r) mem': Z -> list byte -> predicate
mem0: BasicC64Semantics.mem
idx: Z
n:= Z.to_nat idx: nat
a:= (List.map (fun b : byte => byte_of_ascii (toupper (ascii_of_byte b))) (List.firstn n s) ++ List.skipn n s)%list: list byte
H8: (nbytes |s| p a ⋆ r) mem0
H2: 0 <= idx
H6: idx < Z.of_nat |s|{{ tr; mem0; #{ "s" => p; "len" => wlen; "_from" => idx; "_to" => |s| }#; σ }} ?Goal0 {{ lp idx (let/n tmp as "_tmp" := ListArray.get a idx in let/n tmp0 as "_tmp" := byte_of_ascii (toupper (ascii_of_byte tmp)) in let/n x as "s" := ListArray.put a idx tmp0 in x) }}upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false "_from" 0 "_to" ?Goal ?Goal0 (fold_right (fun (v : string) (c : cmd) => cmd.seq (cmd.unset v) c) cmd.skip []))))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) memDEXPR mem #{ "s" => p; "len" => wlen; "_from" => 0 }# ?Goal (of_nat |s|)upstr_br2fn:= ("upstr", (["s"; "len"], [], noreassign (noskips (cmd_loop_fresh false "_from" 0 "_to" ?Goal ?Goal0 (fold_right (fun (v : string) (c : cmd) => cmd.seq (cmd.unset v) c) cmd.skip []))))): bedrock_func
H: __rupicola_program_marker upstr'
σ: list bedrock_func
p, wlen: word
s: list byte
r: BasicC64Semantics.mem -> Prop
tr: Semantics.trace
mem: BasicC64Semantics.mem
pred:= fun l : list byte => wp_bind_retvars [] (fun (rets : list word) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (_ : locals) => rets = [] /\ tr' = tr /\ (nbytes |s| p l ⋆ r) mem'): list byte -> Semantics.trace -> BasicC64Semantics.mem -> locals -> Prop
H1: wlen = of_nat |s|
H3: Z.of_nat |s| < 2 ^ 64
H4: (nbytes |s| p s ⋆ r) mem
lp:= fun (idx : Z) (args : list byte) (tr' : Semantics.trace) (mem' : BasicC64Semantics.mem) (locals' : locals) => tr' = tr /\ locals' = #{ "s" => p; "len" => wlen; "_from" => idx; "_to" => |s| }# /\ (nbytes |s| p args ⋆ r) mem': Z -> list byte -> predicate
mem0: BasicC64Semantics.mem
idx: Z
n:= Z.to_nat idx: nat
a:= (List.map (fun b : byte => byte_of_ascii (toupper (ascii_of_byte b))) (List.firstn n s) ++ List.skipn n s)%list: list byte
H8: (nbytes |s| p a ⋆ r) mem0
H2: 0 <= idx
H6: idx < Z.of_nat |s|{{ tr; mem0; #{ "s" => p; "len" => wlen; "_from" => idx; "_to" => |s| }#; σ }} ?Goal0 {{ lp idx (let/n tmp as "_tmp" := ListArray.get a idx in let/n tmp0 as "_tmp" := byte_of_ascii (toupper (ascii_of_byte tmp)) in let/n x as "s" := ListArray.put a idx tmp0 in x) }}
The expression compilation goal is straightforward, since the variable "len" contains the value we need.
DEXPR mem (map.of_list [("_from", of_Z 0); ("len", of_nat |s|); ("s", p)]) ?Goal (of_nat |s|)

map.get (map.of_list [("_from", of_Z 0); ("len", of_nat |s|); ("s", p)]) ?s = Some (of_nat |s|)

map.get (map.of_list [("_from", of_Z 0); ("len", of_nat |s|); ("s", p)]) "len" = Some (of_nat |s|)

map.list_assoc_str "len" [("_from", of_Z 0); ("len", of_nat |s|); ("s", p)] = Some (of_nat |s|)

reflexivity.
Then we move on to a more interesting part: compiling the body of the loop. The first step is to compile the array lookup; its side conditions are that we (1) have access to a pointer to an array whose contents match those of the list we are reading from, as evidenced by a separation-logic predicate; (2) have a way to compute an expression yielding that pointer; (3) have a way to compute the index being accessed; and (4) that the index be in bounds. After that, we are left with a single continuation goal, corresponding to the rest of the program:
apply compile_nlet_as_nlet_eq; eapply compile_byte_sizedlistarray_get.

(nbytes ?len ?a_ptr a ⋆ ?R) mem0

DEXPR mem0 #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s| }# ?Goal ?a_ptr

DEXPR mem0 #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s| }# ?Goal0 (of_Z idx)

idx < Z.of_nat ?len

{{ tr; mem0; #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s|; "_tmp" => v }#; σ }} ?k_impl
{{ lp idx (let/n tmp as "_tmp" := byte_of_ascii (toupper (ascii_of_byte v)) in
           let/n x as "s" := ListArray.put a idx tmp in
           x) }}
At this point we can rewrite the unoptimized toupper into its optimized counterpart and inline the result, which we did using Hint Rewrite and Hint Unfold in the automated version; a sketch of the kind of hint involved follows, together with the rewritten goal.
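As a rough sketch, the rewrite setup might pair a correctness lemma for the optimized computation with a registration command along these lines (the lemma name, proof script, and hint database here are illustrative, not the ones from the actual development):

Lemma toupper_opt (b: byte) :
  byte_of_ascii (toupper (ascii_of_byte b)) =
  (if byte.wrap (b - "a"%byte) <? 26 then byte.and b "_"%byte else b).
Proof. destruct b; reflexivity. Qed. (* sketch: exhaustive case analysis on all 256 bytes *)

Hint Rewrite toupper_opt : compiler_cleanup.

After the rewrite, the goal becomes: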
{{ tr; mem0; #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s|; "_tmp" => v }#; σ }} ?k_impl {{ lp idx (let/n tmp as "_tmp" := if byte.wrap (v - "a"%byte) <? 26 then byte.and v "_" else v in let/n x as "s" := ListArray.put a idx tmp in x) }}
The result is a conditional, which requires a predicate-inference trick similar to the one we saw previously; for concision I omit the details and use the compile_if tactic directly. This time we find ourselves with four goals: one for the test expression, two for the branches of the if, and one for its continuation:
apply compile_nlet_as_nlet_eq; compile_if.

DEXPR mem0 #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s|; "_tmp" => v }#
  ?c_expr (word.b2w (byte.wrap (v - "a"%byte) <? 26))

{{ tr; mem0; #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s|; "_tmp" => v }#; σ }}
  ?t_impl {{ val_pred (let/n x as "_tmp" := byte.and v "_" in id x) }}

{{ tr; mem0; #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s|; "_tmp" => v }#; σ }}
  ?f_impl {{ val_pred (let/n x as "_tmp" := v in id x) }}

{{ tr0; mem1; locals; σ }} ?k_impl {{ lp idx (let/n x as "s" := ListArray.put a idx v0 in x) }}
The first three goals are solved by the expression compiler; for succinctness I show details only for the first one. The purpose of the conversion to map.of_list that the first tactic performs is to speed up later calls to expr_compile_var.
(* The locals map #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s|; "_tmp" => v }#
   is abbreviated #{ … }# below. *)
{ - DEXPR mem0 #{ … }# ?e1 (of_Z (byte.wrap (v - "a"%byte)))
    DEXPR mem0 #{ … }# ?e2 (of_Z 26)
    0 <= byte.wrap (v - "a"%byte) < 2 ^ 64
    0 <= 26 < 2 ^ 64

    (* byte.wrap unfolds to a masked subtraction: *)
    DEXPR mem0 #{ … }# ?e1 (of_Z (Z.land (v - "a"%byte) 255))
    + DEXPR mem0 #{ … }# ?e1 (of_Z (v - "a"%byte))
      DEXPR mem0 #{ … }# ?e20 (of_Z 255)
      * DEXPR mem0 #{ … }# ?e1 (of_byte v)
        DEXPR mem0 #{ … }# ?e21 (of_byte "a"%byte)
        DEXPR mem0 (map.of_list [("_tmp", of_byte v); ("_to", of_nat |s|);
                                 ("_from", of_Z idx); ("len", of_nat |s|); ("s", p)])
              ?e1 (of_byte v)
        expr_compile_var.
      * apply expr_compile_Z_literal.
    + apply expr_compile_Z_literal.
  - apply expr_compile_Z_literal.
  - eapply byte_range_64.
  - lia. }
1,2: solve [repeat compile_step].
As for the last goal, all that remains is to compile ListArray.put into a store through a pointer; the reasoning is very similar to the ListArray.get case, so I omit it:
{{ tr0; mem1; locals; σ }} ?k_impl
{{ lp idx (let/n x as "s" := ListArray.put a idx v0 in x) }}

(* Side conditions of the store lemma: *)
HasDefault byte
(nbytes ?len ?a_ptr a ⋆ ?R) mem1
DEXPR mem1 locals ?Goal ?a_ptr
DEXPR mem1 locals ?Goal0 (of_nat (cast idx))
DEXPR mem1 locals ?val_expr (of_byte v0)
Z.of_nat (cast idx) < Z.of_nat ?len

(* Continuation, with the array updated in memory: *)
let v1 := ListArray.put a idx v0 in
forall mem' : BasicC64Semantics.mem,
  (nbytes ?len ?a_ptr v1 ⋆ ?R) mem' ->
  {{ tr; mem'; #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s|; "_tmp" => v0 }#; σ }}
    ?k_impl {{ lp idx v1 }}
This final goal simply requires unifying the postcondition and the precondition (i.e. hypotheses in the context):
apply compile_unsets with (vars := ["_tmp"]).

{{ tr; mem'; #{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s| }#; σ }} ?Goal {{ lp idx v1 }}

apply compile_skip.

tr = tr /\
#{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s| }# =
#{ "s" => p; "len" => |s|; "_from" => idx; "_to" => |s| }# /\
(nbytes |s| p v1 ⋆ r) mem'

tauto. Qed.
5 Evaluation
I claim that Rupicola's novelty is its combination of extensibility, foundational proofs, and performance. The first and third claims are measurable. To support them, I evaluated Rupicola from three angles: programmer experience, expressivity, and performance. For the first two I used case studies, and for the third, performance benchmarks.
5.1 Programmer experience and expressivity
5.1.1 Case study: Extending Rupicola
Extending a traditional compiler can be a daunting task: compilers sometimes support extensions of a very restricted kind (e.g. single-language rewrites), but these are not sufficient: for Rupicola to generate code whose performance matches that of handwritten programs, users need to be able to plug in new translation strategies, new logic, and new decision procedures.
Implementing such complex extensions in a traditional compiler would typically require writing new compilation passes or extending existing ones by directly modifying the implementation of the compiler itself. Rupicola is intended to make this much easier, and this section gives some evidence to that effect.
Experience suggests that the corresponding effort in Rupicola is minimal: once users develop sufficient familiarity with our framework, they find it manageable to teach the compiler new lemmas to support the patterns that they are interested in (I have observed this anecdotally as more students became involved in this project). I summarize some examples of estimated effort, in development time and lines of code, in .
| Domain | Operation | Lemma (lines) | Proof (lines) | Time (min) |
|---|---|---|---|---|
| nondet | alloc, peek | 26+24 | 17+11 | 13+6 |
| cells | get, put | 22+23 | 5+3 | 7+3 |
| | iadd | 31 | 7 | 8 |
| io | read, write | 25+26 | 7+10 | 11+8 |
Adding support for new monads is also straightforward, though naturally a bit more complicated. As a concrete example, I estimate that adding support for a writer monad starting from a blank file required about an hour and a half: a bit over 15 minutes to define the monad and prove its properties (17 lines of code, 5 lines of proofs); 30 minutes to set up the compilation of that monad (56 lines of code, 8 lines of proofs); 20 minutes to add a Gallina primitive and compilation lemmas for it (mapping writes to I/O-trace operations at the Bedrock2 level; 50 lines of code, 15 lines of proofs); 15 minutes to write a small example and compile it (4 lines of Gallina model, 6 lines for the Bedrock2 signature, and 1 line for the compilation “proof”: compile.); and about 3 seconds to derive the actual code [20]. The same example written by imitating other monad examples would probably have taken roughly a third to half as long.
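For illustration, here is a minimal sketch of what such a writer monad could look like in Coq; the names and exact definitions below are hypothetical, not the ones from the actual development:

Require Import Coq.Lists.List Coq.Strings.Byte.
Import ListNotations.

(* A computation that returns an A and accumulates a trace of output bytes. *)
Record Writer (A: Type) := mkWriter { val: A; trace: list byte }.
Arguments mkWriter {A}. Arguments val {A}. Arguments trace {A}.

Definition wret {A} (a: A) : Writer A := mkWriter a [].
Definition wbind {A B} (ma: Writer A) (k: A -> Writer B) : Writer B :=
  let mb := k (val ma) in mkWriter (val mb) (trace ma ++ trace mb).
Definition wwrite (b: byte) : Writer unit := mkWriter tt [b].

(* The monad laws follow by computation and simple list lemmas, e.g.: *)
Lemma wbind_ret {A B} (a: A) (k: A -> Writer B) : wbind (wret a) k = k a.
Proof. unfold wbind, wret; destruct (k a); reflexivity. Qed.

Compilation support then consists of lemmas relating wbind and wwrite to Bedrock2 programs that emit the corresponding I/O-trace events.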
5.1.2 Case study: Implementing compiler extensions to support new low-level patterns
The section above describes in detail two of the most significant compiler extensions that we implemented. Here I give evidence of Rupicola's usability through two additional examples, both of which are instances of one of the most interesting kinds of extensions for a relational compiler: extensions that expose features of the low-level language in the shallowly embedded world. Specifically, I look at stack allocation and inline tables.
Implementation efforts for both of these extensions were led by Dustin Jamner, with extensions by the author of this thesis; I report on them here as further evidence that Rupicola is usable by experienced Coq users, and as interesting case studies of Rupicola extensions.
5.1.2.1 Stack allocation
Bedrock2 supports (lexically scoped) stack allocations: a block of code can be wrapped in a binding construct giving it access to a pointer to a block of compile-time constant-size memory allocated on the stack. This is particularly useful for any program that needs access to a small working area, and unlike a global buffer it does not pollute external specifications (beyond changing the function's stack-space requirements, which Bedrock2 tracks).
As a case study (and because a larger development using Rupicola was planning to make use of it), we extended Rupicola to provide access to this feature in Coq programs. We exposed it under three APIs.
For allocating space intended to be fully initialized by a single operation or function call, we use a semantically transparent annotation, stack. When Rupicola sees let/n x := stack (term) in …, it generates a stack allocation in Bedrock2 and resumes compilation with the plain program let/n x := term in … and a memory context containing an uninitialized block of memory pointed to by "x". The size of the block to allocate is determined by looking up a type-class instance, and the compilation of term is then expected to initialize the allocated block fully.

For allocating space that is expected to be fully initialized in multiple small steps, we use the buffer API discussed in , with consecutive calls to push followed by a final call that converts the buffer into an array. The idea generalizes: we construct a separation-logic predicate that captures the initialized parts of the allocated space, with APIs that progressively grow the initialized section, and finally an API that converts from that predicate to a simpler one that applies only to fully initialized values.

For allocating uninitialized space that is not expected to be fully initialized right away, we use a monadic computation returning a list of fixed length containing arbitrary values in Gallina, and we map this to a stack allocation of the same width in Bedrock2. In cases where it is possible to prove that a monadic computation involving stack allocations is in fact deterministic (e.g. because the code overwrites all positions in the nondeterministic list that the allocation returns), the nondeterminism can be restricted to the code fragment performing the allocation, so the local effect does not leak.
5.1.2.2 Inline tables
Inline tables are another Bedrock2 feature that is usefully exposed at the functional level; they are const arrays local to a Bedrock2 function, useful for implementing lookup and translation tables.
The Gallina API that we implemented is the same as that of arrays, except that only one operation (get) is available. Crucially, the API does not impede reasoning about the code: simply unfolding the definition of InlineTable.get reveals that it is just the function nth on lists.
The Gallina API accepts any type in the array; it is the user's responsibility to then show how these values can be cast to a scalar type (bytes or machine words). This flexibility makes it possible to store values of types such as Fin.t (a subset integer type equivalent to {i | i < n}) in an inline table (as long as n is less than \(256\) for a byte or \(2^\textsf{width}\) for a word): we use this in the UTF-8 decoder to store arrays of indices into another array, which makes it trivial to statically encode the fact that all values in the first array are valid indices into the second array.
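As a minimal illustration of the idea (this toy example is mine, not code from the UTF-8 decoder): a table whose entries have type Fin.t 4 can, by construction, only contain valid indices into a four-entry table.

Require Import Coq.Vectors.Fin Coq.Lists.List.
Import ListNotations.

(* A table of indices: the type Fin.t 4 guarantees each value is < 4. *)
Definition idxs : list (Fin.t 4) :=
  [Fin.F1; Fin.FS Fin.F1; Fin.FS (Fin.FS (Fin.FS Fin.F1))].

(* Bounds come for free from the type, with no side condition to prove: *)
Lemma idx_in_bounds (i: Fin.t 4) : proj1_sig (Fin.to_nat i) < 4.
Proof. destruct (Fin.to_nat i) as [x hx]; exact hx. Qed.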
We have Rupicola compilation lemmas to load either a single byte or a full machine word from these tables. Due to the way the semantics of Bedrock2 are written, the effort for machine words is much greater than for bytes (hundreds versus tens of lines). Most of the required lemmas are due to an idiosyncrasy of the semantics of inline tables: they concern basic properties of an otherwise seldom-used set of Bedrock2 functions and are irrelevant to Rupicola (the plan is to offer these lemmas to the authors of Bedrock2 for merging into that repository, which will bring the longer proof back to tens of lines).
5.1.3 Case study: Rupicola's expression compiler
This section and the next attempt to give a sense of the effort involved in developing and using relational compilers.
Rupicola is really two relational compilers rolled into one: one targeting Bedrock2's statements and one targeting its expressions. Originally, however, we assumed that the expression part of the compilation process was so simple that it would not warrant the cost of relational compilation. Instead, we compiled expressions by reifying them into an AST type and then using a very simple verified compiler targeting Bedrock2's expression language, and we planned to handle all necessary extensions by plugging in new cases in our reflection tactics and proofs.
We were wrong: over time, this reflective compiler grew more complicated, accumulating machinery intended to make it easier to extend to more types and more operations; and we needed to extend it constantly, because programmers use a wide range of numeric types in Gallina and expect to be able to map operations on them to low-level expressions (rather than to a sequence of individual statements, each performing a single operation).
I switched to a relational compiler at a time when the reflective compiler had support only for machine words, operations on Z, and a limited subset of Boolean operations. The original reflective compiler was written with conciseness in mind, requiring about 450 lines of code (including 200 lines of tactics and 60 lines of typeclass definitions), and we estimated that adding support for byte operations would require about 100 more lines of tactics, 100 lines of proofs, and a comparable amount of refactoring of existing tactics. In addition, the implementation was fairly technical, so extensions had to be handled by someone intimately familiar with the code base.
The relational compiler that I replaced it with was about 250 lines of code (of which about 30 Hint commands to assemble the lemmas into a compiler) and then quickly grew to about 400 lines to support bytes, Booleans, integers, two representations of natural numbers, and mixed expressions (with casts between different types). None of these extensions required deep expertise; in fact, shown below is all of the code that we needed to support byte.and, the change that daunted us into switching to relational compilation:
Lemma expr_compile_byte_and (m: mem) (l: locals) (b1 b2: byte) (e1 e2: expr) :
  DEXPR m l e1 (of_byte b1) ->
  DEXPR m l e2 (of_byte b2) ->
  DEXPR m l (expr.op and e1 e2) (of_byte (byte.and b1 b2)).
Proof. rewrite byte_morph_and; apply expr_compile_word_and. Qed.

Hint Extern 1 => simple eapply expr_compile_byte_and; shelve : compiler.
Even better, the switch to relational compilation allowed plugging in support for transformations with complex side conditions trivially — operations such as arithmetic shifts [21] or array dereferences [22], for example.
Surprisingly, the performance cost (in compilation time, not in the performance of compiled programs) was never more than a factor of two, yielding an overall slowdown of 30% in the worst case: significant, but smaller than I had feared, given the performance benefits I had hoped for when originally adopting a reflective approach.
5.1.4 Case study: End-to-end verification with Rupicola
Narrowly speaking, the exact techniques that programmers employ to generate input suitable for compilation with Rupicola are out of the scope of this thesis: Rupicola is built to be agnostic to them. Authors who find themselves missing a feature may choose to implement a compiler extension to support it (confident in the knowledge that they will not break the rest of the compiler), or they may choose to lower the unsupported constructs into ones that Rupicola does support, if their development process lends itself to that (in a refinement-based pipeline, for example). As a concrete example, when this paragraph was originally written, Rupicola had support for left folds on lists but not right ones: a programmer employing the latter could have proven a new compilation lemma or, if their program permitted it, lowered it either to a left-fold variant or to one of the more basic iteration primitives that Rupicola supported (most likely iterating over a range of numbers, retrieving the \(n\)-th element of the list at each iteration). A sketch of the left-fold lowering appears below.
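For instance, the Coq standard library already contains the equivalence needed to justify this particular lowering; the wrapper lemma below is mine, not code from the Rupicola development:

Require Import Coq.Lists.List.

(* A right fold is a left fold over the reversed list (stdlib: fold_left_rev_right). *)
Lemma fold_right_via_fold_left {A B} (f: A -> B -> B) (i: B) (l: list A) :
  fold_right f i l = fold_left (fun acc x => f x acc) (rev l) i.
Proof. symmetry; apply fold_left_rev_right. Qed.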
Still, asking authors to lower their programs leaves the question of the expressivity of Rupicola's input language: how easy is it to massage high-level functional programs into ones that Rupicola will accept? In general, I have found it very easy, for two reasons. First, all reasoning happens between shallowly embedded programs, so the verification experience is one that interactive theorem provers excel at: proving equivalences between relatively small pure functions that operate on inductive data types. Second, in many cases, the lowering that Rupicola requires is really a form of transparent or semitransparent program annotation: for example, we specify which object we intend to mutate by using a variant of let annotated with a variable name, or we specify that a particular object should be stack-allocated by wrapping its initial value with the stack function. Because these annotations are semantically irrelevant (they are just identity functions with extra arguments), unfolding them recovers the original program.
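Concretely, the annotations have roughly the following shape (a simplified sketch; the real definitions carry a bit more typing information):

Require Import Coq.Strings.String.

(* let/n x as "x" := v in body   is (roughly) notation for   nlet ["x"] v (fun x => body). *)
Definition nlet {A B} (vars: list string) (v: A) (body: A -> B) : B := body v.
Definition stack {A} (v: A) : A := v.

Both definitions unfold away, leaving the unannotated program behind.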
To illustrate these points on a concrete example, I implemented and verified the TCP/IP checksum algorithm, a one's-complement sum of 16-bit values (unsigned addition with carries added back in).
I chose this specific program because it proved particularly vexing in previous work: when we implemented this function in the context of Narcissus [Narcissus+Delaware+ICFP2019], we did not manage, even with a careful (and unverified) extraction setup, to achieve satisfactory performance for it. Instead, we resorted to replacing it as a whole by an unverified but sufficiently fast implementation in OCaml.
For this instance, I sought to make the specification as readable and easily auditable as possible (for example, instead of relying on a bitvector type, I used Coq's built-in byte and Z types).
Definition onec_add16 (z1 z2: Z) :=
  let sum := z1 + z2 in
  (Z.land sum 0xffff + (Z.shiftr sum 16)).

Definition ip_checksum (bs: list byte) :=
  let c := List.fold_left onec_add16 (List.map le_combine (chunk 2 bs)) 0xffff in
  Z.land (Z.lnot c) 0xffff.
There is a subtlety in the program above: even though IP checksums are defined on 2-byte blocks, the input may contain an odd number of octets. The specification handles this case gracefully (the last chunk that it receives is only one byte long).
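For instance (a hand-worked illustration, assuming the evident behavior of chunk and le_combine; the constants below are my own computations, not values from the thesis):

(* On a three-byte input the final block is one byte long:
     chunk 2 [b0; b1; b2] = [[b0; b1]; [b2]]
   and le_combine reads the lone byte with no high part.

   Worked example on the two-byte input [0x01; 0x02]:
     le_combine [x01; x02]         = 0x0201   (little-endian)
     onec_add16 0xffff 0x0201      = 0x0201   (0x10200 with the carry folded back in)
     Z.land (Z.lnot 0x0201) 0xffff = 0xfdfe   (final complement) *)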
I then separately wrote a Rupicola-ready version of the same program (shown below); this version separates iteration over the even-length prefix of the input from the optional last byte. In a more traditional development workflow, this is the functional model against which one would verify a handwritten low-level implementation (direct verification against the high-level spec would be possible, but it would fold all the complexity of the two-step process into a single larger proof).
Proving the equivalence of these two versions is straightforward and proceeds in two steps, neither of which uses lemmas particularly specific to IP checksums. First, we show that we can re-merge the loop and the final conditional step; this works because the nth function that we use in the loop returns a default value when asked for an out-of-bounds element [23]. Then, we translate the re-merged loop over an integer range into a fold, and from there the proof is straightforward. The total amount of proof that is specific to IP checksums is a few tens of lines, including some properties of the original programs needed to guarantee that one's-complement sums fit in the finite types that we use in the implementation [24].
Definition ip_checksum_upd (c16: word) (b0 b1: byte) :=
  let/n w16 := b0 |w (b1 <<w 8) in
  let/n c17 := c16 +w w16 in
  let/n c16 := (c17 &w 0xffff) +w (c17 >>w 16) in
  c16.

Definition ip_checksum' (bs: list byte) : word :=
  let/n c16 := 0xffff in
  (* Main loop *)
  let/n c16 := nd_ranged_for_all 0 (Z.of_nat (length bs) / 2)
    (fun c16 idx =>
       let/n b0 := ListArray.get bs (2 * idx) in
       let/n b1 := ListArray.get bs (2 * idx + 1) in
       let/n c16 := ip_checksum_upd c16 b0 b1 in
       c16) c16 in
  (* Final iteration *)
  let/n c16 :=
    if Nat.odd (length bs) then
      let/n b0 := ListArray.get bs (length bs - 1) in
      ip_checksum_upd c16 b0 x00
    else c16 in
  (* Clean up *)
  let/n c16 := (~w c16) &w 0xffff in
  c16.
Finally, I wrote a signature for the function and ran the compiler. The derivation completes in a few seconds, and it requires a single extension: a lemma proving that if \(n \ge 0\) is odd, then \(n - 1 \ge 0\), which is needed to show that the final array access (at length bs - 1) is in bounds.
Instance spec_of_ip_checksum : spec_of "ip_checksum" :=
fnspec! "ip_checksum" data_ptr wlen / data R ~> chk, {
requires tr mem :=
wlen = of_nat (length data) /\
(bytes data_ptr data ⋆ R) mem;
ensures tr' mem' :=
tr' = tr /\ chk = ip_checksum' data /\
(bytes data_ptr data ⋆ R) mem'
}.
Overall, implementing and verifying IP checksums in Rupicola was a matter of a few hours, and the resulting performance is on par with a hand-coded version in C — but with proofs!
5.1.5 Case study: Third party contributions
Beyond the examples in its own repository, Rupicola is currently being used by members of the Fiat-Crypto project [FiatCrypto+Erbsen+MIT2017] [FiatCrypto+Philipoom+MIT2018] [FiatCrypto+Erbsen+IEEESP2019] to extend the framework beyond individual finite field operations. This section briefly summarizes these efforts to give a sense of developments in Rupicola that have happened beyond my own efforts.
5.1.5.1 Montgomery ladder
Starting from code written by Jade Philipoom, Dustin Jamner is compiling an implementation of the Montgomery ladder, an algorithm for constant-time multiplication on elliptic curves [Ladder+Montgomery+MC1987] [Curve25519+Bernstein+PKC2006]. The implementation calls out to separately compiled Fiat-Crypto functions for arithmetic operations. It requires a number of Rupicola extensions, including support for compiling individual arithmetic operations (that code is parametric over the choice of field parameters) and for stack allocation.
Below is the input that is fed to Rupicola. Its main loop calls a separate Gallina function ladderstep — it, too, is compiled using Rupicola.
Definition montladder (sz: nat) (testbit: nat -> bool) (u: E) : E :=
let/n X1 := stack 1 in let/n Z1 := stack 0 in
let/n X2 := stack u in let/n Z2 := stack 1 in
let/n swap := false in
let/n (X1, Z1, X2, Z2, swap) :=
iter_down sz (fun i '⟨X1, Z1, X2, Z2, swap⟩ =>
let/n s_i := testbit i in
let/n swap := xorb swap s_i in
let/n (X1, X2) := cswap swap X1 X2 in
let/n (Z1, Z2) := cswap swap Z1 Z2 in
let/n (X1, Z1, X2, Z2) := ladderstep u X1 Z1 X2 Z2 in
let/n swap := s_i in
⟨X1, Z1, X2, Z2, swap⟩)
⟨X1, Z1, X2, Z2, swap⟩ in
let/n (X1, X2) := cswap swap X1 X2 in
let/n (Z1, Z2) := cswap swap Z1 Z2 in
let/n r := Z1 ^ -1 in
let/n r := X1 * r in
r.
One feature that shines in this use case is Rupicola's handling of mutation and allocation: while the code reads like regular Gallina code, the names chosen for the let bindings and the calls to stack indicate which variables should be mutated and when allocation should be performed. When reasoning about the code, simply unfolding the definitions of nlet and stack erases all traces of Rupicola, and as a result the proof connecting this lowered implementation to the pure Gallina spec is a matter of just a few lines.
This ladderstep function plugs into a larger derivation, which also serves as a useful case study of the convenience of our approach to loops. The original implementation of the derivation used a complex loop lemma that required the user to define ghost state and a custom invariant, and consequently the compilation process was only partly automated: in fact, the first implementation of the derivation was close to 200 lines of a complex mix of domain-specific automation and manual proofs, plus roughly 50 lines of invariant definitions and 80 lines of additional tactics. We performed three simplifications:
1. Use a better (more structural) encoding of the relation between modular bignums and integers. In memory, bignums are arrays of bytes, which correspond to a certain integer value modulo some constant M. The original derivation used a separation-logic predicate that made the byte representation of bignums explicit, so all arithmetic had to reason about casts between byte arrays and integers. We introduced a new separation-logic predicate to encapsulate the correspondence, essentially hiding the interpretation of bytes under an existential (bignum z := ∃ bs: list bytes, eval bs = z mod M; a sketch appears after this list), as well as compilation lemmas to translate modular arithmetic on Z into operations on these new bignums (the actual predicate also hides encoding details related to bounds on these bignums).

2. Use automatic predicate inference. The previous change was enough to eliminate all need for ghost state and custom invariants. Had it not been, we would have been able to prove properties about partial runs of the loop body at the Gallina level and use those as part of the derivation, still without having to reason at the Bedrock2 level.

3. Use nlet bindings to drive mutation-versus-allocation decisions (the original implementation used annotations on the separation-logic predicates). This did not require much effort: the original code already mostly conformed to Rupicola's assumptions about naming, even though it did not use nlet.
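A sketch of such an existential predicate, in Bedrock2's separation-logic vocabulary (the name bignum and the eval function are placeholders; the real predicate also tracks bounds):

(* Assumed context: word, mem, ptsto, array, sep, emp, and Lift1Prop.ex1 from
   bedrock2/coqutil, plus eval : list byte -> Z interpreting a byte array. *)
Definition bignum (M: Z) (p: word) (z: Z) : mem -> Prop :=
  Lift1Prop.ex1 (fun bs: list byte =>
    sep (emp (eval bs = z mod M))
        (array ptsto (word.of_Z 1) p bs)).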
The result is a one-line derivation (compile.) supported by a handful of new bignum compilation lemmas, about 10 lines of logic to drive the application of one of these lemmas, and a few additional compilation hints to plug in rewrites and a linear arithmetic solver. As a bonus, the new version of the code does more than the original: unlike its predecessor, it takes care of allocating its own scratch space instead of requiring it to be passed in.
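For reference, the entry point of such a derivation looks roughly like this, modeled on the upstr example whose goals appear earlier in this chapter (the exact defn! notation may differ slightly from the actual development):

Derive upstr_br2fn SuchThat
  (defn! "upstr" ("s", "len") { upstr_br2fn },
   implements upstr')
  As upstr_br2fn_ok.
Proof.
  compile. (* the entire derivation *)
Qed.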
5.1.5.2 Modular exponentiation
Separately, I have been helping Ashley Lin derive optimized code for in-place modular exponentiation with known exponents in Rupicola, using a multiply-and-square algorithm. Her implementation is verified from end to end.
The derivation starts from a high-level specification of exponentiation. The code uses an axiomatic specification of field elements stating that to_Z (pow x n) = (to_Z x) ^ n mod M (where to_Z converts a field element to Z and pow is the field exponentiation), so here the entire specification is simply pow x n.
The lowering phase is semi-automatic. We start with a traditional multiply-and-square algorithm defined by induction over the binary decomposition of the exponent (below, the type positive is a binary encoding of positive natural numbers: …~0 matches numbers whose lowest bit is 0, and …~1 matches numbers whose lowest bit is 1). Its proof of correctness is trivial by induction.
Fixpoint pow_sq (x: E) (n: positive) : E :=
  match n with
  | 1 => x
  | n~0 => let/n r := pow_sq x n in r ^ 2
  | n~1 => let/n r := pow_sq x n in let/n r := r ^ 2 in r * x
  end.

(* Correctness goal: pow_sq x n = x ^ Z.pos n.
   After induction on n and simplification, the three cases are: *)
(x ^ Z.pos n) ^ 2 * x = x ^ Z.pos n~1
(x ^ Z.pos n) ^ 2 = x ^ Z.pos n~0
x = x ^ 1
(* which the induction hypothesis IHn : pow_sq x n = x ^ Z.pos n and the
   power laws reduce to reflexive equalities: *)
all: reflexivity. Qed.
Unfolding this definition for a specific exponent yields a tree of nested let bindings; for example, for n := 71 (and naturally, unfolding the let/ns yields the expected Gallina code). A sketch of the shape of the result is shown below.
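A hand-linearized sketch of the unfolded term (the actual unfolding nests these bindings; since 71 = 0b1000111, three squarings are followed by three square-and-multiply steps):

(* Illustrative linearization of pow_sq x 71. *)
let/n r := x in            (* x^1  *)
let/n r := r ^ 2 in        (* x^2  *)
let/n r := r ^ 2 in        (* x^4  *)
let/n r := r ^ 2 in        (* x^8  *)
let/n r := r ^ 2 in        (* x^16 *)
let/n r := r * x in        (* x^17 *)
let/n r := r ^ 2 in        (* x^34 *)
let/n r := r * x in        (* x^35 *)
let/n r := r ^ 2 in        (* x^70 *)
r * x                      (* x^71 *)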
A subsequent phase of rewriting then transforms that definition into one that is suitable for plugging into Rupicola.
Minimal extensions to Rupicola are needed to compile these programs: we simply reuse lemmas developed for the Montgomery ladder to map field arithmetic to Bedrock2 primitives.
5.1.5.2.1 Run-length encoded modular exponentiation
The naive implementation above generates an amount of code proportional to \(\log_2(n)\) for an exponent \(n\). For numbers whose binary representation contains long runs of zeros or ones, a better implementation is possible: we can use a loop to perform the corresponding operation (either square, or square-and-multiply) as many times as the corresponding digit (0 or 1) is repeated. For example, for n := 71, the code above performs three square operations followed by three square-and-multiply operations, which we can compress using two loops (Nat.iter n f x applies the function f n times to x). A sketch follows.
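A hypothetical sketch of this run-length encoded version for n := 71 (the actual definition in the development may differ in its details):

Definition pow_71_rl (x: E) : E :=
  let/n r := x ^ 2 in                           (* first square, directly into r *)
  let/n r := Nat.iter 2 (fun r => r ^ 2) r in   (* remaining squares: run of 0s  *)
  let/n r := Nat.iter 3                         (* square-and-multiply: run of 1s *)
      (fun r => let/n r := r ^ 2 in r * x) r in
  r.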
Careful readers will notice that we perform a square operation before the first loop. This is to avoid copying the input x into r before the first loop: instead, we can square x directly into the output argument r and thereby save a copy.
This code can similarly be fed to Rupicola, this time with an additional lemma to map Nat.iter to a loop (). For large exponents, this yields significant space savings; we are using this approach to implement modular inversion, which, in the field that we are working with, corresponds to exponentiation by \(2^{255} - 17\).
Proving the correctness of pow_rl for all inputs is relatively straightforward, but it is not even necessary. Instead, we can just recycle the proof of correctness of pow_sq, applied to the specific constant that we are interested in. This suffices because pow_rl executes operations in the exact same order as pow_sq:
(* Goal: *) forall x : E, pow_rl x (2 ^ 255 - 17) = x ^ Z.pos (2 ^ 255 - 17)

(* Rewriting with the correctness of pow_sq leaves *)
pow_rl x (2 ^ 255 - 17) = pow_sq x (2 ^ 255 - 17)

(* which holds by computation: *)
reflexivity. Qed.
5.1.5.3 Future work
These efforts are intended to converge and eventually produce efficient implementations of Poly1305 and ChaCha20 verified from end to end. Andres Erbsen wrote high-level specifications, and we jointly wrote a lowered version in Rupicola, which I proved against the original specifications (some of that code appears in ). Work is in progress to compile that code to Bedrock2 using Rupicola.
5.2 Performance benchmarks
Flexibility and extensibility are not the only metrics that Rupicola optimizes for: both are intended to enable users to generate code that competes with handwritten C programs on performance.
To measure the performance of code generated by Rupicola, we took a collection of tasks for which existing C implementations were available, implemented corresponding programs in Coq, and used Rupicola to compile them. Here, we give evidence that the performance of the resulting code is on par with C programs. To run these programs we do not use Bedrock2's compiler to RISC-V; instead we use a trivial pretty-printer to C to feed our programs to a regular C compiler (it would be possible to use Bedrock2's compiler or CompCert for greater assurance, albeit at a performance cost).
We chose programs from a variety of domains, including string manipulation, hash functions, and packet-manipulating (network) programs. Not discussed in the following are an additional suite of dozens of programs testing features around arithmetic, monadic extensions, and stack allocation (a subset of which are covered in ).
gives a short description of each program that we benchmarked and shows the results of benchmarking (running on an Intel Core i7-4810MQ CPU @ 2.80GHz). As usual, benchmarks involving C compilers are very sensitive to small encoding decisions, so we measure performance across three compilers: overall the differences both in favor and against Rupicola are within the expected fluctuations across optimizing compilers, though we do suffer from a missed vectorization opportunity in upstr with GCC ( discusses this result in more detail).
| Name | Description | Source | Lemmas | Hints | End-to-End | Arithmetic | Inline | Arrays | Loops | Mutation |
|---|---|---|---|---|---|---|---|---|---|---|
| crc32 | Error-detecting code (cyclic redundancy check) | 31 | 16 | 3 | ✓ | ✓ | ✓ | ✓ | | |
| utf8 | Branchless UTF-8 decoding | 56 | - | 6 | ✓ | ✓ | ✓ | | | |
| m3s | Scramble part of the Murmur3 algorithm | 11 | - | - | ✓ | | | | | |
| upstr | In-place string uppercase | 21 | - | 6 | ✓ | ✓ | ✓ | ✓ | ✓ | |
| ip | IP (one's-complement) checksum (RFC 1071) | 37 | 3 | 7 | ✓ | ✓ | ✓ | ✓ | | |
| fasta | In-place DNA sequence complement | 19 | 6 | 5 | ✓ | ✓ | ✓ | ✓ | ✓ | |
| fnv1a | Fowler-Noll-Vo hash | 35 | - | 2 | ✓ | ✓ | ✓ | | | |
Performance benchmarks: Rupicola vs. handwritten C. Error bars indicate 95% bootstrap confidence intervals over 1000 runs. The large fluctuations in upstr are due to inconsistent vectorization.
Performance benchmarks: Rupicola and handwritten C vs OCaml extracted from Coq specifications on one example. Error bars indicate 95% bootstrap confidence intervals over 1000 runs. The first plot uses a linear scale; the second one shows the same data with a logarithmic scale. The \(x\) axis on the second plot starts at 1 cycle/byte. Of four variants of the OCaml code generated by Coq (extracting the specs or the implementation, with or without additional extraction commands) only the fastest (specs with Coq integers unsoundly extracted to native ints) is shown, as the others have algorithmic complexity issues and hence do not complete in a reasonable amount of time.
Comparing the performance of the original Coq code extracted to OCaml using Coq's native extraction features versus the C code produced by Rupicola yields results that are very problem-dependent: in most cases, extracting with Rupicola leads to algorithmic complexity changes (e.g. changing a linear nth lookup into a constant-time pointer dereference). When complexity is unchanged, a reasonable approximation is a speedup of 30 to 200× versus plain Coq extraction (typically closer to the latter). An example is given in . It is possible to improve the performance of the OCaml code using potentially unsound extraction commands (as shown in , which maps Coq's unbounded integers to native integers), but only up to a point; and each additional customization of Coq's native extraction process is one opportunity for subtle bugs.
5.2.1 Performance considerations: C compilers and Bedrock2's C pretty-printer
Bedrock2 is a low-level language, but it is not equivalent to C: in some ways it is higher-level (in particular, its control flow is more structured), but in other ways it is much lower level: it has a single type, machine words, and all operations go through that type. Additionally, the semantics of Bedrock2 and (ISO) C differ: Bedrock2 has a very simple view of memory (a map from locations to bytes), whereas C attaches types to pointers and has complex restrictions on aliasing memory and accessing it through pointers of different types.
None of this is a concern when using the native (and verified) Bedrock2 toolchain ([Lightbulb+Erbsen+PLDI2021]). When pretty-printing Bedrock2 to C, however, these differences become problematic.
It might be possible to reconstruct enough information from a piece of Bedrock2 code to produce a richly typed C program, but this would require relatively complex reasoning and hence additional verification effort; instead, the authors of Bedrock2 chose to write a very simple translator that produces C code in which all variables have type uintptr_t, and all memory accesses go through a memcpy to avoid alignment issues.
The result is most likely not valid according to the ISO specification of C, a dubious distinction that Bedrock2's pretty-printer shares with lots of systems software: low-level C code tends to be written in the dialect of C supported by whichever compilers a project targets, not in ISO C [25]. I find it best to think of this C pretty-printing approach as a quick way to run programs compiled with Rupicola, rather than as a lasting piece of a trustworthy pipeline: in the long run, we will want to either verify the pretty-printer to C, pretty-print to a language with fewer pitfalls, or improve the Bedrock2 compiler enough to have competitive performance (hence my note in that the traditional compilation pass from Bedrock2 to assembly should be “ideally, verified”). [26]
5.2.1.1 Memory loads and stores
Modern C compilers are very good at optimizing these sorts of programs, but we did have to go through a few iterations before we found a way to phrase memory accesses that compilers had no trouble with. The original implementation used the following (this discussion assumes a little-endian platform):
static inline uintptr_t _br2_load(uintptr_t a, size_t sz) {
  uintptr_t r = 0;
  memcpy(&r, (void*)a, sz); // read sz bytes from a into the low bytes of r
  return r;
}
Unfortunately, GCC 9 does not recognize the fact that this memcpy will not read from pointer &r, and hence generates code that allocates space for r in memory and zeroes it out.
Explicitly masking the return value helps GCC (it recognizes that the mask is superfluous and optimizes it away, and additionally removes the unwanted store to &r), but this hurts performance in Clang 10, which does not eliminate the mask:
static inline uintptr_t _br2_load(uintptr_t a, size_t sz) {
  uintptr_t r = 0;
  memcpy(&r, (void*)a, sz);
  // r's high bytes are already zero, so the mask is a no-op
  uintptr_t mask = (uintptr_t)-1 >> (8 * (sizeof(uintptr_t) - sz));
  return r & mask;
}
A commonly recommended alternative is to perform byte-wide reads and reassemble wider values with shifts and ors [ByteOrder+Tunney+Blog2021]:
#define READ8(S) ((255 & (S)[0]))
#define READ16LE(S) ((255 & (S)[1]) << 8 | (255 & (S)[0]))
#define READ32LE(S) \
((uint32_t)(255 & (S)[3]) << 030 | (uint32_t)(255 & (S)[2]) << 020 | \
(uint32_t)(255 & (S)[1]) << 010 | (uint32_t)(255 & (S)[0]) << 000)
#define READ64LE(S) \
((uint64_t)(255 & (S)[7]) << 070 | (uint64_t)(255 & (S)[6]) << 060 | \
(uint64_t)(255 & (S)[5]) << 050 | (uint64_t)(255 & (S)[4]) << 040 | \
(uint64_t)(255 & (S)[3]) << 030 | (uint64_t)(255 & (S)[2]) << 020 | \
(uint64_t)(255 & (S)[1]) << 010 | (uint64_t)(255 & (S)[0]) << 000)
static inline uintptr_t _br2_load(uintptr_t a, size_t sz) {
switch (sz) {
case 1: return READ8((unsigned char*)a);
case 2: return READ16LE((unsigned char*)a);
case 4: return READ32LE((unsigned char*)a);
case 8: return READ64LE((unsigned char*)a);
default: __builtin_unreachable();
}
}
Unfortunately, while both GCC 9+ and Clang will optimize each branch of this function to a single memory load of the right width, that optimization interacts poorly with inlining in GCC, and as a result the inlined version of this function sometimes ends up performing individual byte-wide loads when compiled with GCC 10. Eventually, we settled on the following definition, which is correctly optimized by GCC 9+ and Clang 10+:
static inline uintptr_t _br2_load(uintptr_t a, uintptr_t sz) {
switch (sz) {
case 1: { uint8_t r = 0; memcpy(&r, (void*)a, 1); return r; }
case 2: { uint16_t r = 0; memcpy(&r, (void*)a, 2); return r; }
case 4: { uint32_t r = 0; memcpy(&r, (void*)a, 4); return r; }
case 8: { uint64_t r = 0; memcpy(&r, (void*)a, 8); return r; }
default: __builtin_unreachable();
}
}
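This subsection's title also mentions stores; the store direction is symmetric. The following is a sketch of a store helper in the same per-width style (a reconstruction for illustration, not necessarily the exact code emitted by Bedrock2's pretty-printer):

#include <stdint.h>
#include <string.h>

// Sketch: store counterpart to _br2_load. Truncating the value to the
// access width and memcpy-ing it out sidesteps alignment and
// strict-aliasing issues, as on the load side.
static inline void _br2_store(uintptr_t a, uintptr_t v, uintptr_t sz) {
  switch (sz) {
    case 1: { uint8_t  w = (uint8_t)v;  memcpy((void*)a, &w, 1); break; }
    case 2: { uint16_t w = (uint16_t)v; memcpy((void*)a, &w, 2); break; }
    case 4: { uint32_t w = (uint32_t)v; memcpy((void*)a, &w, 4); break; }
    case 8: { uint64_t w = (uint64_t)v; memcpy((void*)a, &w, 8); break; }
    default: __builtin_unreachable();
  }
}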
Naturally, none of this mess would be needed were we compiling directly to a lower-level language than C.
5.2.1.2 Machine integers
Beyond the implementation of memory loads and stores, Bedrock2's use of uintptr_t when pretty-printing to C also introduces variations compared to handwritten C code. This is particularly visible in the benchmark results for upstr: the code that Rupicola generates for that benchmark is poorly optimized by GCC (but runs at a speed comparable to the handwritten version with Clang).
That performance issue is readily explained: GCC simply misses a vectorization opportunity. It can be reproduced on the following simplified program:
#include <stdint.h>
void uintptr_mask(uintptr_t ubytes, int len) {
for (int i = 0; i < len; i++)
((char*) ubytes)[i] = 0x5f & ((char*) ubytes)[i];
}
Compiling this code with GCC 9.4.0 (gcc-9 -O3 -fopt-info-vec-all -S uintptr_mask.c -o /dev/null) produces the following output (GCC 10 and 11.1.0 produce similar output):
$ gcc-9 -O3 -fopt-info-vec-all -S uintptr_mask.c -o /dev/null
<stdin>:4:3: missed: couldn't vectorize loop
<stdin>:5:25: missed: not vectorized: compilation time alias: _4 = *_3;
*_3 = _5;
<stdin>:3:6: note: vectorized 0 loops in function.
This is in contrast to the following program, which GCC 9+ vectorizes successfully:
#include <stdint.h>
void uintptr_maskv(uintptr_t ubytes, int len) {
char* bytes = (char*)ubytes;
for (int i = 0; i < len; i++)
bytes[i] = 0x5f & bytes[i];
}
$ gcc-9 -O3 -fopt-info-vec-all -S uintptr_maskv.c -o /dev/null
<stdin>:5:3: optimized: loop vectorized using 16 byte vectors
<stdin>:3:6: note: vectorized 1 loops in function.
Clang 10, in contrast, vectorizes both programs, and indeed we see no performance difference between the Rupicola version of upstr and the corresponding handwritten code in Clang:
$ clang-10 -O3 -Rpass=loop-vectorize -S uintptr_mask.c -o /dev/null
uintptr_mask.c:4:3: remark: vectorized loop↵
(vectorization width: 16, interleaved count: 2)↵
[-Rpass=loop-vectorize]
for (int i = 0; i < len; i++)
^
$ clang-10 -O3 -Rpass=loop-vectorize -S uintptr_maskv.c -o /dev/null
uintptr_maskv.c:5:3: remark: vectorized loop↵
(vectorization width: 16, interleaved count: 2)↵
[-Rpass=loop-vectorize]
for (int i = 0; i < len; i++)
^
5.2.1.3 Speed ups
The discussion above highlights one instance where a compiler struggles to optimize Rupicola's output as efficiently as it does handwritten code. The reverse also happens occasionally, but it is harder to make a systematic case for it: any output that Rupicola produces could reasonably be produced, character for character, by hand, and then Rupicola would no longer have an edge — in that sense, a program produced by Rupicola can never really beat a handwritten program.
What I have observed, however, is cases where Rupicola's implementation of memory accesses plays better with compiler implementations. The original (handwritten) implementation of the ip benchmark, for instance, used memcpy to load potentially unaligned 16-bit values from memory; the Rupicola implementation that I wrote at first, in contrast, used two separate 8-bit loads and then combined them (it was easier to write the code that way in Rupicola). It turns out that, for this particular example, Clang handles the latter better than the former, which led to a roughly 30% speedup when using Rupicola and Clang over handwritten C with Clang. Of course, the pattern used in Rupicola is also expressible in C, so the results in show the manually corrected handwritten code, which runs at the same speed as Rupicola.
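The two phrasings are easy to compare in isolation. The sketch below shows both (the helper names are illustrative, not the benchmarks' actual code; the byte-wise version assumes a little-endian layout):

#include <stdint.h>
#include <string.h>

// memcpy-based unaligned 16-bit load, as in the original handwritten ip code
static inline uint16_t load16_copy(const unsigned char *p) {
  uint16_t r;
  memcpy(&r, p, 2);
  return r;
}

// two byte-wide loads recombined with shift and or, as in the first
// Rupicola version
static inline uint16_t load16_bytes(const unsigned char *p) {
  return (uint16_t)((uint16_t)p[0] | ((uint16_t)p[1] << 8));
}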
6 Discussion
6.1 Compilation speed
The programs that Rupicola produces are fast, but Rupicola itself is not. On simple, straight-line code, it compiles on the order of tens of bindings per second. On programs that require more complex context manipulations, or on programs that need to invoke solvers to discharge compilation side conditions (e.g. programs with loops), it gets much slower.
Compiling the entirety of Rupicola's distribution on an 8-core machine (dozens of example programs) takes on the order of minutes, and examples like the UTF-8 decoder take tens of seconds to compile. The result is a compiler whose performance is sufficient for the programs that this thesis focuses on (and a few orders of magnitude faster than Fiat-to-Facade [FiatToFacade+PitClaudel+MIT2016]!), but still painfully slow.
Where is all this time spent? Depending on the example, profiling suggests that between 50 and 80% of it is spent waiting for Coq's autorewrite tactic to realize that none of anywhere from 3 to 10 equations apply to the current goals [27] (the interesting part of Rupicola's work mostly happens in the step_with_db line in the profile shown below, which takes about 5% of the actual execution time):
tactic local total calls max
────────────────────────────────────────┴──────┴──────┴───────┴─────────┘
─compile ------------------------------- 0.0% 100.0% 1 38.922s
└compile_step -------------------------- 0.0% 99.9% 229 2.501s
├─compile_autocleanup with (ident) ---- 0.4% 56.1% 205 0.954s
│└autorewrite with (ne_preident_list) ( 55.1% 55.2% 386 0.945s
├─compile_solve_side_conditions ------- 38.0% 38.3% 333 1.546s
│ ├─compile_autocleanup with (ident) -- 0.3% 21.3% 145 1.182s
│ │└autorewrite with (ne_preident_list) 20.5% 20.5% 207 1.173s
│ ├─solve_map_get_goal ---------------- 0.0% 10.4% 12 1.539s
│ │└solve_map_get_goal_step ----------- 4.4% 10.4% 33 1.529s
│ │ ├─solve_map_get_goal_refl --------- 0.0% 6.0% 3 1.523s
│ │ │└reify_map ----------------------- 0.2% 5.6% 3 1.470s
│ │ │└set_change (uconstr) with (uconst 0.0% 5.3% 3 1.433s
│ │ │└set (sx := x) ------------------- 5.0% 5.0% 3 1.393s
│ │ └─rewrite map.get_put_diff -------- 4.2% 4.2% 9 0.210s
│ └─step_with_db (ident) -------------- 0.0% 5.7% 62 0.684s
│ └unshelve (tactic1) ---------------- 0.0% 5.7% 103 0.684s
│ └typeclasses eauto (nat_or_var_opt) 0.1% 5.7% 103 0.684s
│ └cbn ------------------------------- 4.3% 4.3% 57 0.609s
└─compile_triple ---------------------- 0.0% 5.2% 24 1.215s
└compile_unset_and_skip -------------- 0.0% 3.1% 5 1.209s
└compile_cleanup_post ---------------- 0.2% 3.1% 7 0.250s
└compile_autocleanup with (ident) ---- 0.1% 2.7% 16 0.210s
└autorewrite with (ne_preident_list) ( 2.6% 2.6% 30 0.203s
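Profiles like the one above come from Coq's built-in Ltac profiler; a minimal recipe (the derivation in the comment is a placeholder):

Set Ltac Profiling.    (* start collecting per-tactic timings *)
(* ... run the derivation, e.g. compile the program as usual ... *)
Show Ltac Profile.     (* print the cumulative profile *)
Reset Ltac Profile.    (* clear the counters before the next measurement *)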
These profiles also do not suggest that there would be much to be gained in migrating from Coq's venerable Ltac1 tactic language to Ltac2. It is possible to speed things up quite a bit by being more discerning in applying autorewrite, but this adds complexity for users, who would have to grasp a more complex set of hint databases. Instead, I have chosen to go for maximal ease-of-use.
Performance issues like the ones that Rupicola suffers from are a common plague of projects that rely heavily on Coq's proof language. They stem primarily from a mix of asymptotically suboptimal algorithms and unoptimized implementation, which are not particularly surprising (Coq was not originally designed with this kind of heavy automation in mind).
Other performance issues pop up from time to time, stemming from unexpected pitfalls that are often related to reduction: to give two brief examples, at one point a colleague tracked down the source of a tenfold slowdown to Coq eagerly unfolding the noskip function described above while checking a proof [28]; at another, I noticed that each unfolding hint added to a Rupicola database was causing an exponential increase in proof time [29] (there is, thankfully, a workaround for the problem).
6.2 Incremental compiler construction and backtracking
Rupicola's design makes it easy to build compilers incrementally. In that context, an important feature is the ability to debug incomplete compilation runs: users need to be able to straightforwardly determine which construct Rupicola failed to compile, and how to add support for it.
Let us look at an example to see what concretely happens when Rupicola does not know how to compile a source code construct. We look at a simple program that takes as input an array of bytes, reinterprets it as an array of little-endian machine words, and counts the number of matches for a given search term in the resulting array.
Definition count_ws (data: ListArray.t byte) (needle: word) :=
let/n r := 0 in
let/n data := bs2ws data in
let/n r := ListArray.fold_left (fun r w64 =>
let/n hit := word.eqb w64 needle in
let/n r := r + Z.b2z hit in
r)
data 0 in
let/n data := ws2bs data in
r.
Instead of processing its input byte by byte, the program starts by casting the input into a list of 64-bit words, processes these words, and finally casts the data back to a list of bytes. Without loading appropriate libraries, Rupicola will not recognize these patterns:
Instance spec_of_count_ws : spec_of "count_ws" :=
  fnspec! "count_ws" ptr wlen needle / (bs: list byte) R ~> r,
  { requires tr mem :=
      wlen = of_nat (length bs) /\
      Z.of_nat (length bs) < 2 ^ width /\
      (Datatypes.length bs mod bytes_per_word = 0)%nat /\
      (bytes ptr bs ⋆ R) mem;
    ensures tr' mem' :=
      tr' = tr /\
      r = of_Z (count_ws bs needle) /\
      (bytes ptr bs ⋆ R) mem' }.

Context (bytes_per_word_nz : bytes_per_word <> 0%nat).
Context (bytes_per_word_range : 0 < Z.of_nat bytes_per_word < 2 ^ width).

Proof.

The derivation goal, with the ambient typeclass context elided, exhibits the function skeleton and the program left to compile (where v := 0):

count_ws_br2fn := ("count_ws", (["data"; "len"; "needle"], ["r"], ?Goal)) : bedrock_func

{{ tr; mem0; #{ "data" => ptr; "len" => of_Z (Z.of_nat (Datatypes.length bs));
                "needle" => needle; "r" => of_Z v }#; functions }}
?k_impl
{{ pred (let/n data as "data" := bs2ws bs in
         let/n r as "r" :=
           ListArray.fold_left
             (fun (r : Z) (w64 : word) =>
                let/n hit as "hit" := word.eqb w64 needle in
                let/n r0 as "r" := r + Z.b2z hit in
                r0) data 0 in
         let/n _ as "data" := ws2bs data in
         r) }}
In fact, compilation stops immediately, and Rupicola prints a message indicating that we need new lemmas. Inspecting the goal indicates that Rupicola already compiled the first binding (the initialization of the counter r) and then ran into trouble with the second binding (the call to bs2ws), so we add a lemma for it:
The lemma's statement, as displayed by Coq (ambient context elided; t, m, l, σ, and bs are quantified in the enclosing context):

let v := bs2ws bs in
forall (P : list word -> Type)
       (pred : P v -> Semantics.trace -> mem -> locals -> Prop)
       (k : nlet_eq_k P v) (K : cmd) (r : mem -> Prop)
       (bs_var : string) (ptr : word),
  (bytes ptr bs ⋆ r) m ->
  (Datatypes.length bs mod bytes_per_word)%nat = 0%nat ->
  (forall m' : mem,
     (words ptr v ⋆ r) m' ->
     {{ t; m'; l; σ }} K {{ pred (k v eq_refl) }}) ->
  {{ t; m; l; σ }} K {{ pred (let/n x as bs_var eq:Heq := v in k x Heq) }}

Proof. intros; seprewrite_in bytes_as_words H; eauto. Qed.
Once we have defined this new lemma, we can plug it in and make a bit more progress:
(ambient context elided; v := 0)

forall m' : mem,
  (words ptr (bs2ws bs) ⋆ R) m' ->
  {{ tr; m'; #{ "data" => ptr; "len" => of_Z (Z.of_nat (Datatypes.length bs));
                "needle" => needle; "r" => of_Z v }#; functions }}
  ?k_impl
  {{ pred (let/n r as "r" :=
             ListArray.fold_left
               (fun (r : Z) (w64 : word) =>
                  let/n hit as "hit" := word.eqb w64 needle in
                  let/n r0 as "r" := r + Z.b2z hit in
                  r0) (bs2ws bs) 0 in
           let/n _ as "data" := ws2bs (bs2ws bs) in
           r) }}
This time Rupicola stops because we have not given it a way to compile ListArray.fold_left; the default is fine, so we import the UnsizedListArrayCompiler module:
(contexts elided; the function skeleton now contains the compiled loop (cmd_loop_fresh with a word load, the comparison against "needle", and the accumulation into "r"), and three goals remain)

0 <= Z.of_nat (Datatypes.length (bs2ws bs)) < 2 ^ width

DEXPR m' #{ "data" => ptr; "len" => of_Z (Z.of_nat (Datatypes.length bs));
            "needle" => needle; "r" => of_Z v; (gs "_gs_from") => of_Z 0 }#
      ?e (of_Z (Z.of_nat (Datatypes.length (bs2ws bs))))

{{ tr; mem1; #{ "data" => ptr; "len" => of_Z (Z.of_nat (Datatypes.length bs));
                "needle" => needle; "r" => of_Z v0;
                (gs "_gs_from") => of_Z (Z.of_nat (Datatypes.length (bs2ws bs)));
                (gs "_gs_to") => of_Z (Z.of_nat (Datatypes.length (bs2ws bs))) }#; functions }}
?c
{{ pred (let/n _ as "data" := ws2bs (bs2ws bs) in v0) }}

(where v0 := ListArray.fold_left (fun r w64 => let/n hit as "hit" := word.eqb w64 needle in let/n r0 as "r" := r + Z.b2z hit in r0) (bs2ws bs) 0)
This time we see that we are missing three components: a side condition about the length of the result of converting bytes to words; an expression compilation goal to compute that length for the loop; and a final goal to cast the list of words back to a list of bytes. We can make progress on the first two by registering a hint:
(contexts elided; the hint discharges the length side condition and lets Rupicola compile the loop bound, instantiating ?e with expr.op bopname.divu "len" (word.unsigned (of_Z (Z.of_nat bytes_per_word))); one goal remains)

{{ tr; mem1; #{ "data" => ptr; "len" => of_Z (Z.of_nat (Datatypes.length bs));
                "needle" => needle; "r" => of_Z v0;
                (gs "_gs_from") => of_Z (Z.of_nat (Datatypes.length bs) / Z.of_nat bytes_per_word);
                (gs "_gs_to") => of_Z (Z.of_nat (Datatypes.length bs) / Z.of_nat bytes_per_word) }#; functions }}
?Goal
{{ pred (let/n _ as "data" := ws2bs (bs2ws bs) in v0) }}
And finally, all that is left is the call that casts the data back, which requires a lemma very similar to the previous one; with that lemma registered as a hint, the derivation closes:
repeat compile_step. Qed.
Once this process of interactive development is complete, we can move our new lemmas out of the proof (and possibly into a library, if they are — like here — generally applicable). The five hints that make up the proof are our compiler.
This interactive debugging and compiler construction experience is one of the main reasons why backtracking is so unappealing in the context of Rupicola [30]: as long as there is no backtracking, the user can be confident that the compilation goal they are looking at is the furthest that Rupicola could progress. In a compiler with backtracking, the user instead has to start by reconstructing the failed compilation path, and then understand why that path failed (this is a common problem with Coq's eauto tactic, and in fact with many automated theorem proving technologies: eauto is very pleasant to use when it works, but when a call to it fails, e.g. due to a change in a theorem statement, the debugging experience is quite poor).
Another important reason for avoiding backtracking is predictability: we want users to be in complete control of the code-generation process, and hence to have at most one lemma applicable to each source-code pattern loaded at any time. It is the responsibility of the user to annotate their code to indicate which pattern applies, and to load the appropriate compilation modules.
Finally, the last reason for avoiding backtracking is performance: going down unsuccessful compilation paths wastes time.
6.3 Nondeterminism
The predecessor to Rupicola, Fiat-to-Facade, hardcoded the nondeterminism monad: it inherited that trait from the Fiat system [Fiat+Delaware+POPL2015], and we initially worried that Rupicola would not have the same flexibility.
How Rupicola deals with nondeterminism depends on its kind:
6.3.1 Erasable nondeterminism
Some datastructures expose a deterministic interface while relying on nondeterminism internally. A fixed-size stack, for example, contains a data section and some uninitialized space to grow into. Methods of the stack do not provide access to the uninitialized section, so the stack exposes a deterministic interface.
Other structures operate deterministically, but we may prefer to abstract certain details of their implementation. For example, a binary search tree used to implement a set datastructure with insert, remove, and contains methods will answer contains queries deterministically, even if its exact layout is unknown (e.g. we may not know which element is at the root).
In these cases, it is possible to work with deterministic models of the data structure and to capture the nondeterminism at the separation-logic predicate level. Specifically, we can model both the stack and the tree as the list of elements that they contain, and hide the nondeterminism using existential quantification within our representation predicates. For stacks, we might write the following, which is essentially what we do with the buffer API of :
Definition stack_at addr capacity model :=
fun m: mem => ∃ suffix,
length (model ++ suffix) = capacity /\
bytes addr (model ++ suffix) m.
For sets, we might write something similar to the following:
Definition set_at {E} addr element_at (model: list E) :=
  fun m => ∃ t: tree E,
    is_bst t /\
    is_permutation model (tree_elements t) /\
    tree_at addr element_at t m.
6.3.2 Observable nondeterminism
Other datastructures expose non-determinism to their callers.
This may be because the structure actually implements nondeterministic operations (maybe because one of its operations uses concurrent programming under the hood), or it may be because the model that we chose omits details of the implementation (perhaps to allow changes in the implementation).
For example, if we were to add a peek operation that returns the root of our binary search tree, we would get different results depending on the layout of the tree, which did not matter for contains tests.
Nondeterminism stemming from underspecification is common, but only up to a point: if we want to be able to verify a low-level program that implements an operation, we need a sufficiently precise representation invariant. An invariant that abstracts away the hash function used to build a hash table by existentially quantifying over it, for example, would not permit us to prove the correctness of a lookup operation.
When nondeterminism or underspecification is present, we write Rupicola programs in the nondeterminism monad, and we adjust representation predicates and function postconditions accordingly. For separation-logic predicates, we can generalize any deterministic predicate element_at over a family of possible objects:
Definition nondet_at {A} addr
(element_at: word -> A -> mem -> Prop)
(val: A -> Prop) mem :=
∃ a, val a /\ element_at addr a mem.
For function postconditions, where we would have previously asserted that the output of a Bedrock2 function should equal a given Gallina value, we assert instead that the value returned by the Bedrock2 program should belong to the set of values allowed by the nondeterministic Gallina program ().
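To make this concrete, here is a sketch (with hypothetical names, not part of Rupicola's library) of the peek operation from the binary-search-tree example, specified as the set of results it may return:

From Coq Require Import List.

(* Any element of the model is an acceptable result of peek: a Bedrock2
   implementation may return the root of whichever tree happens to
   represent the model. *)
Definition peek_spec {E} (model: list E) : E -> Prop :=
  fun e => In e model.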
6.4 Trusted base
Which moving parts does one have to trust when running a program compiled with Rupicola? Roughly the following:
- Rupicola's inputs, or any higher-level specifications
Rupicola proves that its outputs match its inputs, but no more — bugs in the latter are dutifully replicated in the former. If Rupicola's inputs are verified against — or generated from — higher-level specifications, then these specifications need to be trusted as well. In the example of the IP checksum code, this means trusting the high-level Gallina implementation.
- Bedrock2's pretty-printer to C — or to RISC-V
The conversion of Rupicola's outputs from Bedrock2 to C is done using a small (~200 lines) but unverified pretty-printer written in Gallina. When compiling with Bedrock2's verified compiler instead, the trusted base is reduced to the Coq notations that are used to print a byte dump of the assembled RISC-V code.
- Any unverified portions of the lower-level compilation toolchain
Going through C to compile and run Rupicola requires trusting the lower-level compilation toolchain: compiler, linker, and assembler. This is not an issue when compiling with Bedrock2's verified compiler.
- Coq's proof checker
Rupicola's guarantees are only as good as those provided by the Coq proof assistant: bugs in its proof checker could lead it to accept incorrect proofs.
- The environment in which programs execute, and assumptions about it
Issues in the operating system, if any, or hardware components of the machine running programs compiled with Rupicola can derail an otherwise-verified execution.
6.5 Contributions and credits
I have been lucky to collaborate with a wonderful group of people at MIT; parts of Rupicola's development are due to them.
First and foremost is Jade Philipoom. We brainstormed many aspects of Rupicola's design together early on and pair-programmed some of its core definitions, and she contributed several of Rupicola's early examples and library functions, especially cryptography-related ones. Jade and I also spent a lot of time brainstorming ways to support complex borrowing and mutation schemes that are not part of this thesis.
Dustin Jamner later worked on the implementation of Rupicola's support for inline tables and stack allocation.
Andres Erbsen wrote and ran most of Rupicola's benchmarks.
A rough count attributing each source-code line to the last author who edited it (git blame) at the current head of Rupicola's git repository, excluding a large Bedrock2-related refactoring, returns the following results:
- 741 lines: Dustin Jamner
- 2759 lines: Jade Philipoom
- 15930 lines: Clément Pit-Claudel
Outside of Rupicola's repository, Ashley Lin is compiling modular exponentiation routines, and I am working with Dustin Jamner and Andres Erbsen to use Rupicola to derive implementations of cryptographic primitives.
Parts of this dissertation are under review for publication as a conference paper co-authored with Jade Philipoom, Dustin Jamner, Andres Erbsen, and Adam Chlipala.
8 Conclusion
This thesis introduced the unifying framework of relational compilation and presented Rupicola, a relational-compilation toolkit that leverages modular compiler extensions to derive high-performance, verified low-level programs automatically from functional sources. Rupicola is unique in its combination of extensibility, foundational proofs, and performance. We are in the process of extending it to support further application domains, and we are looking into integrating its verified outputs into existing widely used libraries. I hope that, in the long run, the techniques presented in this thesis will provide a solid foundation for systems verified from end to end and worthy of trust.
9 Bibliography
Sidney Amani, Alex Hixon, Zilin Chen, Christine Rizkallah, Peter Chubb, Liam O'Connor, Joel Beeren, Yutaka Nagashima, Japheth Lim, Thomas Sewell, Joseph Tuong, Gabriele Keller, Toby Murray, Gerwin Klein, and Gernot Heiser. Cogent: verifying high-assurance file system implementations. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '16, 175–188. New York, NY, USA, 2016. Association for Computing Machinery. URL: https://doi.org/10.1145/2872362.2872404, doi:10.1145/2872362.2872404.
Abhishek Anand, Andrew Appel, Greg Morrisett, Zoe Paraskevopoulou, Randy Pollack, Olivier Savary Belanger, Matthieu Sozeau, and Matthew Weaver. CertiCoq: A verified compiler for Coq. In CoqPL'17: The Third International Workshop on Coq for PL. January 2017.
Andrew W. Appel. Verified software toolchain. In Proceedings of the 20th European Conference on Programming Languages and Systems: Part of the Joint European Conferences on Theory and Practice of Software, ESOP'11/ETAPS'11, 1–17. Berlin, Heidelberg, 2011. Springer-Verlag.
David Aspinall. Proof General: A generic tool for proof development. In International Conference on Tools and Algorithms for Construction and Analysis of Systems, TACAS 2000, Held as Part of the European Joint Conferences on the Theory and Practice of Software, ETAPS 2000, Berlin, Germany, March 25 - April 2, 2000, 38–42. 2000. doi:10.1007/3-540-46419-0_3.
Daniel J. Bernstein. Curve25519: new Diffie-Hellman speed records. In Public Key Cryptography - PKC 2006, 9th International Conference on Theory and Practice of Public-Key Cryptography, New York, NY, USA, April 24-26, 2006, Proceedings, 207–228. 2006. URL: https://doi.org/10.1007/11745853_14, doi:10.1007/11745853_14.
Thomas Bourgeat, Clément Pit-Claudel, Adam Chlipala, and Arvind. The essence of Bluespec: a core language for rule-based hardware design. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, 243–257. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://pit-claudel.fr/clement/papers/koika-PLDI20.pdf, doi:10.1145/3385412.3385965.
Edwin C. Brady. Idris, a general-purpose dependently typed programming language: design and implementation. Journal of Functional Programming, 23(5):552–593, 2013. URL: https://doi.org/10.1017/S095679681300018X, doi:10.1017/S095679681300018X.
Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. Verifying concurrent, crash-safe systems with perennial. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27-30, 2019, 243–258. 2019. URL: https://doi.org/10.1145/3341301.3359632, doi:10.1145/3341301.3359632.
Pierre Chambart, Mark Shinwell, Damien Doligez, and OCaml Contributors. Optimization with Flambda. Feb 2016. URL: https://ocaml.org/manual/flambda.html.
Arthur Charguéraud. Characteristic formulae for the verification of imperative programs. In Proceedings of the 16th ACM SIGPLAN International Conference on Functional Programming, ICFP 2011, Tokyo, Japan, September 19-21, 2011, 418–430. 2011. URL: https://doi.org/10.1145/2034773.2034828, doi:10.1145/2034773.2034828.
Adam Chlipala. The Bedrock structured programming system: combining generative metaprogramming and Hoare logic in an extensible program verifier. In ACM SIGPLAN International Conference on Functional Programming, ICFP 2013, Boston, MA, USA - September 25 - 27, 2013, 391–402. 2013. doi:10.1145/2500365.2500592.
Adam Chlipala, Benjamin Delaware, Samuel Duchovni, Jason Gross, Clément Pit-Claudel, Sorawit Suriyakarn, Peng Wang, and Katherine Ye. The end of history? Using a proof assistant to replace language design with library design. In Benjamin S. Lerner, Rastislav Bodík, and Shriram Krishnamurthi, editors, 2nd Summit on Advances in Programming Languages (SNAPL 2017), volume 71 of Leibniz International Proceedings in Informatics (LIPIcs), 3:1–3:15. Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2017/7123/, doi:10.4230/LIPIcs.SNAPL.2017.3.
Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The Lean theorem prover (system description). In Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings, 378–388. 2015. URL: https://doi.org/10.1007/978-3-319-21401-6_26, doi:10.1007/978-3-319-21401-6_26.
Benjamin Delaware, Clément Pit-Claudel, Jason Gross, and Adam Chlipala. Fiat: deductive synthesis of abstract data types in a proof assistant. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages - POPL '15, 689–700. ACM Press, 2015. URL: https://pit-claudel.fr/clement/papers/fiat-POPL15.pdf, doi:10.1145/2676726.2677006.
Benjamin Delaware, Sorawit Suriyakarn, Clément Pit-Claudel, Qianchuan Ye, and Adam Chlipala. Narcissus: correct-by-construction derivation of decoders and encoders from binary formats. Proceedings of the ACM on Programming Languages, 3(ICFP):82:1–82:29, July 2019. URL: https://pit-claudel.fr/clement/papers/narcissus-ICFP19.pdf, doi:10.1145/3341686.
Edsger W. Dijkstra. A constructive approach to the problem of program correctness. Circulated privately, August 1967. URL: https://www.cs.utexas.edu/users/EWD/ewd02xx/EWD209.PDF.
Andres Erbsen. Crafting certified elliptic curve cryptography implementations in Coq. Master's thesis, Massachusetts Institute of Technology, 2017. URL: https://dspace.mit.edu/handle/1721.1/112843.
Andres Erbsen, Samuel Gruetter, Joonwon Choi, Clark Wood, and Adam Chlipala. Integration verification across software and hardware for a simple embedded system. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 604–619. New York, NY, USA, 2021. Association for Computing Machinery. URL: https://doi.org/10.1145/3453483.3454065.
Andres Erbsen, Jade Philipoom, Jason Gross, Robert Sloan, and Adam Chlipala. Simple high-level code for cryptographic arithmetic - with proofs, without compromises. 2019 IEEE Symposium on Security and Privacy (SP), May 2019. URL: http://dx.doi.org/10.1109/sp.2019.00005, doi:10.1109/sp.2019.00005.
Yannick Forster and Fabian Kunze. A certifying extraction with time bounds from Coq to call-by-value λ-calculus. In Interactive Theorem Proving - 10th International Conference, ITP 2019, Portland, OR, USA, 17:1–17:19. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Apr 2019. Also available as arXiv:1904.11818.
M. J. C. Gordon and T. F. Melham. Introduction to HOL: A Theorem Proving Environment for Higher Order Logic. Cambridge University Press, USA, 1993. ISBN 0521441897.
David Greenaway, June Andronick, and Gerwin Klein. Bridging the gap: automatic verified abstraction of C. In International Conference on Interactive Theorem Proving, ITP 2012, Princeton, NJ, USA, August 13-15, 2012, 99–115. 2012. doi:10.1007/978-3-642-32347-8_8.
Peter Hawkins, Alex Aiken, Kathleen Fisher, Martin C. Rinard, and Mooly Sagiv. Data representation synthesis. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, June 4-8, 2011, 38–49. 2011. doi:10.1145/1993498.1993504.
Son Ho, Oskar Abrahamsson, Ramana Kumar, Magnus O. Myreen, Yong Kiam Tan, and Michael Norrish. Proof-producing synthesis of CakeML with I/O and local state from monadic HOL functions. Lecture Notes in Computer Science, pages 646–662, 2018. doi:10.1007/978-3-319-94205-6_42.
Paul Hudak, John Hughes, Simon L. Peyton Jones, and Philip Wadler. A history of Haskell: being lazy with class. In Proceedings of the Third ACM SIGPLAN History of Programming Languages Conference (HOPL-III), San Diego, California, USA, 9-10 June 2007, 1–55. 2007. URL: https://doi.org/10.1145/1238844.1238856, doi:10.1145/1238844.1238856.
Lars Hupel and Tobias Nipkow. A verified compiler from Isabelle/HOL to CakeML. In Programming Languages and Systems - 27th European Symposium on Programming, ESOP 2018, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2018, Thessaloniki, Greece, April 14-20, 2018, Proceedings, 999–1026. 2018. URL: https://doi.org/10.1007/978-3-319-89884-1_35, doi:10.1007/978-3-319-89884-1_35.
Jacques-Henri Jourdan, François Pottier, and Xavier Leroy. Validating LR(1) parsers. In Helmut Seidl, editor, Programming Languages and Systems - 21st European Symposium on Programming, ESOP 2012, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2012, Tallinn, Estonia, March 24 - April 1, 2012, Proceedings, 397–416. Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
Ramana Kumar, Eric Mullen, Zachary Tatlock, and Magnus O. Myreen. Software verification with ITPs should use binary code extraction to reduce the TCB - (short paper). In Jeremy Avigad and Assia Mahboubi, editors, Interactive Theorem Proving - 9th International Conference, ITP 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, July 9-12, 2018, Proceedings, volume 10895 of Lecture Notes in Computer Science, 362–369. Springer, 2018. URL: https://doi.org/10.1007/978-3-319-94821-8_21, doi:10.1007/978-3-319-94821-8_21.
Ramana Kumar, Magnus O. Myreen, Michael Norrish, and Scott Owens. CakeML: a verified implementation of ML. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2014, San Diego, CA, USA, January 20-21, 2014, 179–192. 2014. doi:10.1145/2535838.2535841.
Peter Lammich. Refinement to Imperative/HOL. In International Conference on Interactive Theorem Proving, ITP 2015, Nanjing, China, August 24-27, 2015, 253–269. 2015. doi:10.1007/978-3-319-22102-1_17.
Peter Lammich. Generating verified LLVM from Isabelle/HOL. In International Conference on Interactive Theorem Proving, ITP 2019, September 9-12, 2019, Portland, OR, USA, 22:1–22:19. 2019. doi:10.4230/LIPIcs.ITP.2019.22.
K. Rustan M. Leino. Dafny: an automatic program verifier for functional correctness. Logic for Programming, Artificial Intelligence, and Reasoning, pages 348–370, 2010.
K. Rustan M. Leino and Clément Pit-Claudel. Trigger selection strategies to stabilize program verifiers. In Swarat Chaudhuri and Azadeh Farzan, editors, Computer Aided Verification: 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17-23, 2016, Proceedings, Part I, volume 9779 of Lecture Notes in Computer Science, pages 361–381. Springer International Publishing, July 2016. URL: https://pit-claudel.fr/clement/papers/dafny-trigger-selection-CAV16.pdf, doi:10.1007/978-3-319-41528-4_20.
Sorin Lerner, Todd D. Millstein, Erika Rice, and Craig Chambers. Automated soundness proofs for dataflow analyses and transformations via local rules. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2005, Long Beach, California, USA, January 12-14, 2005, 364–377. 2005. doi:10.1145/1040305.1040335.
Xavier Leroy. Formal certification of a compiler back-end or: programming a compiler with a proof assistant. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2006, Charleston, South Carolina, USA, January 11-13, 2006, 42–54. 2006. doi:10.1145/1111037.1111042.
Pierre Letouzey. A new extraction for Coq. In International Workshop on Types for Proofs and Programs, TYPES 2002, Berg en Dal, The Netherlands, April 24-28, 2002, 200–219. Springer Berlin Heidelberg, 2002. URL: http://dx.doi.org/10.1007/3-540-39185-1_12, doi:10.1007/3-540-39185-1_12.
John M. Li and Andrew W. Appel. Deriving efficient program transformations from rewrite rules. Proc. ACM Program. Lang., August 2021. URL: https://doi.org/10.1145/3473579, doi:10.1145/3473579.
Andreas Lööw and Magnus O. Myreen. A proof-producing translator for Verilog development in HOL. In Proceedings of the 7th International Workshop on Formal Methods in Software Engineering, FormaliSE@ICSE 2019, Montreal, QC, Canada, May 27, 2019, 99–108. 2019. URL: https://doi.org/10.1109/FormaliSE.2019.00020, doi:10.1109/FormaliSE.2019.00020.
Andreas Lööw. Lutsig: a verified Verilog compiler for verified circuit development. Proceedings of the 10th ACM SIGPLAN International Conference on Certified Programs and Proofs, Jan 2021. URL: http://dx.doi.org/10.1145/3437992.3439916, doi:10.1145/3437992.3439916.
Andreas Lööw, Ramana Kumar, Yong Kiam Tan, Magnus O. Myreen, Michael Norrish, Oskar Abrahamsson, and Anthony Fox. Verified compilation on a verified processor. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, Jun 2019. URL: http://dx.doi.org/10.1145/3314221.3314622, doi:10.1145/3314221.3314622.
Zohar Manna and Richard J. Waldinger. A deductive approach to program synthesis. ACM Transactions on Programming Languages and Systems, 2(1):90–121, January 1980. doi:10.1145/357084.357090.
Guido Martínez, Danel Ahman, Victor Dumitrescu, Nick Giannarakis, Chris Hawblitzel, Cătălin Hriţcu, Monal Narasimhamurthy, Zoe Paraskevopoulou, Clément Pit-Claudel, Jonathan Protzenko, Tahina Ramananandro, Aseem Rastogi, and Nikhil Swamy. Meta-F*: proof automation with SMT, tactics, and metaprograms. In Luís Caires, editor, Programming Languages and Systems - 28th European Symposium on Programming, ESOP 2019, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2019, Prague, Czech Republic, April 6-11, 2019, Proceedings, 30–59. Springer International Publishing, 2019. URL: https://pit-claudel.fr/clement/papers/meta-fstar-ESOP19.pdf, doi:10.1007/978-3-030-17184-1_2.
Peter L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation, 48(177):243–264, 1987. URL: http://dx.doi.org/10.1090/s0025-5718-1987-0866113-7, doi:10.1090/s0025-5718-1987-0866113-7.
Eric Mullen, Stuart Pernsteiner, James R. Wilcox, Zachary Tatlock, and Dan Grossman. Œuf: minimizing the Coq extraction TCB. In Proceedings of the 7th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2018, 172–185. New York, NY, USA, 2018. Association for Computing Machinery. URL: https://doi.org/10.1145/3167089, doi:10.1145/3167089.
Magnus O. Myreen, Michael J. C. Gordon, and Konrad Slind. Decompilation into logic - improved. In Formal Methods in Computer-Aided Design, FMCAD 2012, Cambridge, UK, October 22-25, 2012, 78–81. 2012. URL: https://ieeexplore.ieee.org/document/6462558/.
Magnus O. Myreen and Scott Owens. Proof-producing synthesis of ML from higher-order logic. In ACM SIGPLAN International Conference on Functional Programming, ICFP 2012, Copenhagen, Denmark, September 9-15, 2012, 115–126. 2012. doi:10.1145/2364527.2364545.
Magnus O. Myreen and Scott Owens. Proof-producing translation of higher-order logic into pure and stateful ML. Journal of Functional Programming, 24(2-3):284–315, Jan 2014. URL: http://dx.doi.org/10.1017/s0956796813000282, doi:10.1017/s0956796813000282.
George C. Necula. Translation validation for an optimizing compiler. Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation - PLDI ’00, 2000. URL: http://dx.doi.org/10.1145/349299.349314, doi:10.1145/349299.349314.
Liam O'Connor, Zilin Chen, Christine Rizkallah, Vincent Jackson, Sidney Amani, Gerwin Klein, Toby Murray, Thomas Sewell, and Gabriele Keller. Cogent: uniqueness types and certifying compilation. Journal of Functional Programming, 31:25, 2021. doi:10.1017/S095679682100023X.
Liam O'Connor, Christine Rizkallah, Zilin Chen, Sidney Amani, Japheth Lim, Yutaka Nagashima, Thomas Sewell, Alex Hixon, Gabriele Keller, Toby C. Murray, and Gerwin Klein. COGENT: certified compilation for a functional systems language. CoRR, 2016. URL: https://arxiv.org/abs/1601.05520.
Zoe Paraskevopoulou, John M. Li, and Andrew W. Appel. Compositional optimizations for CertiCoq. Proc. ACM Program. Lang., August 2021. URL: https://doi.org/10.1145/3473591, doi:10.1145/3473591.
Lionel Parreaux. Type-Safe Metaprogramming and Compilation Techniques For Designing Efficient Systems in High-Level Languages. PhD thesis, EPFL, Lausanne, 2020. URL: http://infoscience.epfl.ch/record/281735, doi:10.5075/epfl-thesis-10285.
Christine Paulin-Mohring. Extraction de programmes dans le Calcul des Constructions. PhD thesis, Université Paris-Diderot - Paris VII, January 1989. URL: https://tel.archives-ouvertes.fr/tel-00431825.
Christine Paulin-Mohring and Benjamin Werner. Synthesis of ML programs in the system Coq. J. Symb. Comput., 15(5–6):607–640, May 1993. URL: https://doi.org/10.1016/S0747-7171(06)80007-6, doi:10.1016/S0747-7171(06)80007-6.
Jade Philipoom. Correct-by-construction finite field arithmetic in Coq. Master's thesis, Massachusetts Institute of Technology, 2018. URL: https://dspace.mit.edu/handle/1721.1/119582.
Clément Pit-Claudel. Compilation using correct-by-construction program synthesis. Master's thesis, Massachusetts Institute of Technology, August 2016. URL: http://pit-claudel.fr/clement/MSc/.
Clément Pit-Claudel. Untangling mechanized proofs. In Proceedings of the 13th ACM SIGPLAN International Conference on Software Language Engineering, SLE 2020, 155–174. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://pit-claudel.fr/clement/papers/alectryon-SLE20.pdf, doi:10.1145/3426425.3426940.
Clément Pit-Claudel and Thomas Bourgeat. An experience report on writing usable DSLs in Coq. In CoqPL'21: The Seventh International Workshop on Coq for PL. April 2021. URL: https://pit-claudel.fr/clement/papers/koika-dsl-CoqPL21.pdf.
Clément Pit-Claudel, Thomas Bourgeat, Stella Lau, Adam Chlipala, and Arvind. Effective simulation and debugging for a high-level hardware language using software compilers. In Tim Sherwood, Emery Berger, and Christos Kozyrakis, editors, Proceedings of the Twenty-Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, Virtual, April 19-23, 2021, ASPLOS 2021. Association for Computing Machinery, 2021. URL: https://pit-claudel.fr/clement/papers/cuttlesim-ASPLOS21.pdf.
Clément Pit-Claudel and Pierre Courtieu. Company-Coq: taking Proof General one step closer to a real IDE. In CoqPL'16: The Second International Workshop on Coq for PL. January 2016. URL: https://hdl.handle.net/1721.1/101149, doi:10.5281/zenodo.44331.
Clément Pit-Claudel, Peng Wang, Benjamin Delaware, Jason Gross, and Adam Chlipala. Extensible extraction of efficient imperative programs with foreign functions, manually managed memory, and proofs. In Nicolas Peltier and Viorica Sofronie-Stokkermans, editors, Automated Reasoning: 10th International Joint Conference, IJCAR 2020, Paris, France, July 1–4, 2020, Proceedings, Part II, volume 12167 of Lecture Notes in Computer Science, 119–137. Springer International Publishing, July 2020. URL: https://pit-claudel.fr/clement/papers/fiat-to-facade-IJCAR20.pdf, doi:10.1007/978-3-030-51054-1_7.
A. Pnueli, M. Siegel, and E. Singerman. Translation validation. In Bernhard Steffen, editor, Tools and Algorithms for the Construction and Analysis of Systems - 4th International Conference, TACAS'98 Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS'98 Lisbon, Portugal, March 28 – April 4, 1998 Proceedings, 151–166. Berlin, Heidelberg, 1998. Springer Berlin Heidelberg.
Jonathan Protzenko and Son Ho. Zero-cost meta-programmed stateful functors in F*. CoRR, 2021. URL: https://arxiv.org/abs/2102.01644, arXiv:2102.01644.
Jonathan Protzenko, Bryan Parno, Aymeric Fromherz, Chris Hawblitzel, Marina Polubelova, Karthikeyan Bhargavan, Benjamin Beurdouche, Joonwon Choi, Antoine Delignat-Lavaud, Cedric Fournet, et al. EverCrypt: a fast, verified, cross-platform cryptographic provider. 2020 IEEE Symposium on Security and Privacy (SP), May 2020. URL: http://dx.doi.org/10.1109/sp40000.2020.00114, doi:10.1109/sp40000.2020.00114.
Jonathan Protzenko, Jean Karim Zinzindohoué, Aseem Rastogi, Tahina Ramananandro, Peng Wang, Santiago Zanella Béguelin, Antoine Delignat-Lavaud, Catalin Hritcu, Karthikeyan Bhargavan, Cédric Fournet, and Nikhil Swamy. Verified low-level programming embedded in F*. Proceedings of the ACM on Programming Languages, 1(ICFP):17:1–17:29, 2017. doi:10.1145/3110261.
Alex Reinking, Ningning Xie, Leonardo de Moura, and Daan Leijen. Perceus: garbage free reference counting with reuse. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Jun 2021. URL: http://dx.doi.org/10.1145/3453483.3454032, doi:10.1145/3453483.3454032.
John C. Reynolds. Separation logic: A logic for shared mutable data structures. In IEEE Symposium on Logic in Computer Science, LICS 2002, 22-25 July 2002, Copenhagen, Denmark, 55–74. 2002. doi:10.1109/LICS.2002.1029817.
Dennis Ritchie. Noalias comments to X3J11. 1988. URL: https://usenetarchives.com/view.php?id=comp.lang.c&mid=PDc3NTNAYWxpY2UuVVVDUD4.
Thomas Arthur Leck Sewell, Magnus O. Myreen, and Gerwin Klein. Translation validation for a verified OS kernel. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2013, Seattle, WA, USA, June 16-19, 2013, 471–482. 2013. doi:10.1145/2491956.2462183.
Armando Solar-Lezama. The sketching approach to program synthesis. In Asian Symposium on Programming Languages and Systems, APLAS 2009, Seoul, Korea, December 14-16, 2009, 4–13. 2009. doi:10.1007/978-3-642-10672-9_3.
Antal Spector-Zabusky, Joachim Breitner, Christine Rizkallah, and Stephanie Weirich. Total Haskell is reasonable Coq. In Proceedings of the 7th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2018, Los Angeles, CA, USA, January 8-9, 2018, 14–27. 2018. URL: https://doi.org/10.1145/3167092, doi:10.1145/3167092.
Guy L. Steele Jr. and Sebastiano Vigna. LXM: better splittable pseudorandom number generators (and almost as fast). Proceedings of the ACM on Programming Languages, 5(OOPSLA):1–31, Oct 2021. URL: http://dx.doi.org/10.1145/3485525, doi:10.1145/3485525.
Gordon Stewart, Lennart Beringer, Santiago Cuellar, and Andrew W. Appel. Compositional CompCert. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015, 275–287. 2015. doi:10.1145/2676726.2676985.
Nikhil Swamy, Catalin Hritcu, Chantal Keller, Aseem Rastogi, Antoine Delignat-Lavaud, Simon Forest, Karthikeyan Bhargavan, Cédric Fournet, Pierre-Yves Strub, Markulf Kohlweiss, Jean Karim Zinzindohoue, and Santiago Zanella Béguelin. Dependent types and multi-monadic effects in F*. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, 256–270. 2016. URL: https://doi.org/10.1145/2837614.2837655, doi:10.1145/2837614.2837655.
Zachary Tatlock and Sorin Lerner. Bringing extensibility to verified compilers. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2010, Toronto, Ontario, Canada, June 5-10, 2010, 111–121. 2010. doi:10.1145/1806596.1806611.
The Coq Development Team. The Coq Proof Assistant: Reference Manual, version 8.13. January 2021. URL: https://doi.org/10.5281/zenodo.4501022, doi:10.5281/zenodo.4501022.
Sam Tobin-Hochstadt, Vincent St-Amour, Ryan Culpepper, Matthew Flatt, and Matthias Felleisen. Languages as libraries. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, June 4-8, 2011, 132–141. 2011. doi:10.1145/1993498.1993514.
Jean-Baptiste Tristan, Paul Govereau, and Greg Morrisett. Evaluating value-graph translation validation for LLVM. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, 295–305. New York, NY, USA, 2011. Association for Computing Machinery. URL: https://doi.org/10.1145/1993498.1993533, doi:10.1145/1993498.1993533.
Jean-Baptiste Tristan and Xavier Leroy. Formal verification of translation validators: a case study on instruction scheduling optimizations. In Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '08, 17–27. New York, NY, USA, 2008. Association for Computing Machinery. URL: https://doi.org/10.1145/1328438.1328444, doi:10.1145/1328438.1328444.
Jean-Baptiste Tristan and Xavier Leroy. Verified validation of lazy code motion. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '09, 316–326. New York, NY, USA, 2009. Association for Computing Machinery. URL: https://doi.org/10.1145/1542476.1542512, doi:10.1145/1542476.1542512.
Justine Tunney. The byte order fiasco. 2021. URL: https://justine.lol/endian.html.
Philip Wadler. Comprehending monads. Mathematical Structures in Computer Science, 2(4):461–493, 1992. doi:10.1017/S0960129500001560.
Peng Wang. The Facade language. Technical Report, MIT CSAIL, 2016. URL: https://people.csail.mit.edu/wangpeng/facade-tr.pdf.
Peng Wang, Santiago Cuellar, and Adam Chlipala. Compiler verification meets cross-language linking via data abstraction. In ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2014, part of SPLASH 2014, Portland, OR, USA, October 20-24, 2014, 675–690. 2014. doi:10.1145/2660193.2660201.
Yasunari Watanabe, Kiran Gopinathan, George Pîrlea, Nadia Polikarpova, and Ilya Sergey. Certifying the synthesis of heap-manipulating programs. Proc. ACM Program. Lang., August 2021. URL: https://doi.org/10.1145/3473589, doi:10.1145/3473589.
Victor Yodaiken. How ISO C became unusable for operating systems development. In Proceedings of the 11th Workshop on Programming Languages and Operating Systems, PLOS '21, 84–90. New York, NY, USA, 2021. Association for Computing Machinery. URL: https://doi.org/10.1145/3477113.3487274, doi:10.1145/3477113.3487274.
Vadim Zaliva and Matthieu Sozeau. Reification of shallow-embedded DSLs in Coq with automated verification. In CoqPL'19: The Fifth International Workshop on Coq for PL. 2019.
Teng Zhang, John Wiegley, Theophilos Giannakopoulos, Gregory Eakman, Clément Pit-Claudel, Insup Lee, and Oleg Sokolsky. Correct-by-construction implementation of runtime monitors using stepwise refinement. In Xinyu Feng, Markus Müller-Olm, and Zijiang Yang, editors, Proceedings of the 4th International Symposium Dependable Software Engineering: Theories, Tools, and Applications - SETTA '18, 31–49. Springer International Publishing, 2018. URL: https://pit-claudel.fr/clement/papers/monitors-SETTA18.pdf, doi:10.1007/978-3-319-99933-3_3.
Théo Zimmermann and Hugo Herbelin. Coq's Prolog and application to defining semi-automatic tactics. In Type Theory Based Tools. Paris, France, January 2017. URL: https://hal.archives-ouvertes.fr/hal-01671994.