Assignment 3
Submission Instructions This assignment is due at 5pm on the date indicated in the class calendar. In order to submit your code, pack yourq1a.sq, q1b.sq, q2.py
files into a tar file
and submit through canvas (same as last time).
Collaboration: You are allowed to collaborate with one other person for this assignment, but each of you is responsible for submitting your own solution. If you do collaborate with someone else, make sure to name your collaborator in your submission.
If you have any questions, please ask!
Problem 1 (55 pts)
Your goal is to use Synquid to synthesize some non-trivial functions. We suggest that you first study the tutorial examples in the interactive Synquid demo. Your goal for this exercise is to develop a function that can serialize a binary search tree into a sorted list. Below are the definitions of the two main datastructures. The code block also includes some useful measures that will be helpful for the transformation.Part a)
Your goal for this part of the assignment is to synthesize a functiontransform: BST a -> SortedList a
that
takes in a BST
defined as above, and outputs
a SortedList
containing the exact same values.
You need to set up the synthesis problem so that
Synquid synthesizes the transform function with the desired property.
For this part, you are allowed to use a component
merge
that takes as input two sorted lists and produces
a merged sorted list that contains all the elements in the two input lists.
You do not have to synthesize merge, you are only providing its interface.
For example, you could provide the interface to sqrt without synthesizing it as such.
Note the lack of a definition.
Part b)
For this part of the assignment, your goal is to synthesize a version oftransform
that does not use
the merge
function from earlier.
Instead, you will have to synthesize a helper function
helper:: BST a -> SortedList a -> SortedList a
.
The high-level intuition is that the helper function takes as
the second input a partially constructed list SortedList a
and produces its output by appending all the elements in the BST
to the partially constructed list. Note that this imposes
some strong constraints on the types of helper
.
For example, since all the elements in the BST will be appended
to the SortedList, it is important that all the elements in the
SortedList be less than all the elements in the BST.
Hints:
Part b) is significantly harder than part a), but here are a few hints that will help you solve this.- You will need to synthesize one function in addition to helper.
- Expressing the property that all the elements in the temporary list passed to helper are smaller than all the elements in the BST is tricky. In order to do this, you will have to define a measure that you can use to express the property that the elements in the SortedList are all bounded by the BST.
-
The syntax of measures is a bit quirky. For example, note from the examples that the
syntax for a measure implicitly assumes that it is defined over an unnamed recursive datatype.
measure vals :: SortedList a -> Set a where Nil -> [] Cons x xs -> vals xs + [x] It is possible to define measures with more than one parameter, but the last parameter must be the recursive datatype over which the measure is defined and be unnamed, for example:measure something :: x:a -> SortedList a -> Set a where Cases for the SortedList that may use parameter x. -
Reasoning about these measures is tricky, and the solver will need a bit of help.
For example, one thing that the type checker is going to have a
hard time figuring out on its own is that all the elements in the
left subtree are bounded by all the elements on the right subtree.
This can be done using the measure mentioned in the previous hint, but that's going
to require some slight changes to the
data
declaration for the BST. -
Another fact that the type checker will not be able to discover on its own is that
if all elements in a BST are greater than x, then x is a lower bound for the BST.
In general, if there is a fact about a value that I don't think the type checker
will discover on its own, I can create a function that enforces that fact.
For example, if at some point I want the solver to use the fact that
a > b and b > c implies a > c
, I can write a functionlemma :: a : Int -> b: {Int | a > _v} -> c: {Int | b > _v} -> {Int | _v == a && _v > c }
The synthesizer can easily come up with an implementation for this function, but more importantly, by calling(lemma a b c)
, the synthesizer knows it can get a value equal toa
, with the property that it is greater than bothb
andc
without having to reason about why that is true (that reasoning happened when synthesizing a body oflemma
). For simple arithmetic reasoning like this, it is never really necessary to add a lemma like this, but it will be necessary for the more complex property mentioned earlier.Keep in mind that while the output is just one of the inputs, and the purpose is just to introduce a dependent type, the actual output value does matter as it restricts the contexts where the lemmma can be used. For easiest effect it's probably best to just duplicate the lemma for each argument that you're proving a property about.
lemma_a :: a : Int -> b: {Int | a > _v} -> c: {Int | b > _v} -> {Int | _v == a && _v > c } lemma_c :: a : Int -> b: {Int | a > _v} -> c: {Int | b > _v} -> {Int | _v == c && a > _v } - Once you have all the elements in place, it takes about a minute to synthesize everything on the web interface.
q1a.sq
and q1b.sq
.
Problem 2 (45 pts)
For this problem, you will be using the LLama model to do some simple LLM-based program synthesis.
The skeleton can be found here. As a first step, you should install the dependencies.
q2.py
and set the port to the value posted on piazza. Finally, add the following line
next to the line setting the port
q2.py
.
The model used in this experiment is the relatively small codellama/CodeLlama-7b-Instruct-hf model. It is an open
source model that is trained specifically to work with the generation of code.
Part a)
After installing the requirements you can explore the dataset and run queries on the model by running the following
command
text
is the question, example_test
is an example of the function's desired behavior,
code
is a reference solution, and test_list
are used to
test the function. Note that only the text
and example_test
should be
used to synthesize the function.
Additionally, run
python3 q2.py a
.
You should get an accuracy of about 40%. Note that we run the model multiple times but just
getting *a* correct answer is enough to count as correct. This is because we have no
way of knowing which of the multiple runs is the correct one.
Part b)
Implement the one-shot prompting strategy in q2.py. You can test your implementation by running
python3 q2.py b
.
You should get a similar accuracy to before; with this prompt the one-shot strategy doesn't help much (even seems to hurt a bit).
Part c)
Implement a syntactic constraint, in check_syntactic
. You can test your implementation by running
python3 q2.py c
You'll notice your accuracy didn't improve much. This is because the model is already pretty good at
generating syntactically correct code.
Part d)
Implement a semantic constraint, in check_single_test
. You can test your implementation by running
python3 q2.py d
You should see a significant improvement in accuracy, to about 60%. This is because even a single test case
can filter out a lot of the incorrect completions.
As a fun (optional) exercise, try to see if you can improve further. One strategy that has been proposed is
prompting the model separately to generate additional test cases, and then using those to filter the completions.
Part e)
One pitfall of LLMs for code is that unlike techniques described in the class, they aren't robust to renamings
or other small changes. For this part, see if you can come up with two prompts that are semantically equivalent
but use different wording, such that the model gets high accuracy on one but low accuracy on the other. (For
this example we will use the part (a) algorithm, with no semantic filtering).
Try to get the two to have a gap of at least 30%. You can test your implementation by running
python3 q2.py e
.
Hint: feel free to be mean and use terms in non-standard ways. Just make sure a reasonable person who closely read the question would agree that the two prompts are asking for the same thing.