# Assignment 3

Submission Instructions This assignment is due at 5pm on the date indicated in the class calendar. In order to submit your code, pack your q1a.sq, q1b.sq, q2.py files into a tar file and submit through canvas (same as last time).

Collaboration: You are allowed to collaborate with one other person for this assignment, but each of you is responsible for submitting your own solution. If you do collaborate with someone else, make sure to name your collaborator in your submission.

## Problem 1 (55 pts)

Your goal is to use Synquid to synthesize some non-trivial functions. We suggest that you first study the tutorial examples in the interactive Synquid demo.

Your goal for this exercise is to develop a function that can serialize a binary search tree into a sorted list. Below are the definitions of the two main datastructures. The code block also includes some useful measures that will be helpful for the transformation. data BST a where Empty :: BST a Node :: x: a -> l: BST {a | _v < x} -> r: BST {a | x < _v} -> BST a data SortedList a where Nil :: SortedList a Cons:: x: a -> xs:SortedList { a | _v > x } -> SortedList a termination measure len :: SortedList a -> {Int | _v >= 0} where Nil -> 0 Cons x xs -> 1 + len xs termination measure height :: BST a -> {Int | _v >= 0} where Empty -> 0 Node x l r -> 1 + if (height l) < (height r) then (height r) else (height l) measure keys :: BST a -> Set a where Empty -> [] Node x l r -> keys l + keys r + [x] measure vals :: SortedList a -> Set a where Nil -> [] Cons x xs -> vals xs + [x] Note! The error messages you get from Synquid are not always very helpful. For example, the below error implies that there is some type mismatch in your code, but really indicates that the synthesizer couldn't find a solution Cannot match shape 'SortedList' with shape 'BST' when checking Nil :: BST {SortedList A2|True && _v < (Nil)} in \l . \b . match Node Nil Nil ?? with

### Part a)

Your goal for this part of the assignment is to synthesize a function transform: BST a -> SortedList a  that takes in a BST defined as above, and outputs a SortedList containing the exact same values. You need to set up the synthesis problem so that Synquid synthesizes the transform function with the desired property.

For this part, you are allowed to use a component merge that takes as input two sorted lists and produces a merged sorted list that contains all the elements in the two input lists. You do not have to synthesize merge, you are only providing its interface.

For example, you could provide the interface to sqrt without synthesizing it as such. Note the lack of a definition. sqrt:: x: Int -> {Int | _v * _v <= x && x < (_v + 1) * (_v + 1)}

### Part b)

For this part of the assignment, your goal is to synthesize a version of transform that does not use the merge function from earlier. Instead, you will have to synthesize a helper function helper:: BST a -> SortedList a -> SortedList a. The high-level intuition is that the helper function takes as the second input a partially constructed list SortedList a and produces its output by appending all the elements in the BST to the partially constructed list. Note that this imposes some strong constraints on the types of helper. For example, since all the elements in the BST will be appended to the SortedList, it is important that all the elements in the SortedList be less than all the elements in the BST.

#### Hints:

Part b) is significantly harder than part a), but here are a few hints that will help you solve this.
• You will need to synthesize one function in addition to helper.
• Expressing the property that all the elements in the temporary list passed to helper are smaller than all the elements in the BST is tricky. In order to do this, you will have to define a measure that you can use to express the property that the elements in the SortedList are all bounded by the BST.
• The syntax of measures is a bit quirky. For example, note from the examples that the syntax for a measure implicitly assumes that it is defined over an unnamed recursive datatype. measure vals :: SortedList a -> Set a where Nil -> [] Cons x xs -> vals xs + [x] It is possible to define measures with more than one parameter, but the last parameter must be the recursive datatype over which the measure is defined and be unnamed, for example: measure something :: x:a -> SortedList a -> Set a where Cases for the SortedList that may use parameter x.
• Reasoning about these measures is tricky, and the solver will need a bit of help. For example, one thing that the type checker is going to have a hard time figuring out on its own is that all the elements in the left subtree are bounded by all the elements on the right subtree. This can be done using the measure mentioned in the previous hint, but that's going to require some slight changes to the data declaration for the BST.
• Another fact that the type checker will not be able to discover on its own is that if all elements in a BST are greater than x, then x is a lower bound for the BST. In general, if there is a fact about a value that I don't think the type checker will discover on its own, I can create a function that enforces that fact. For example, if at some point I want the solver to use the fact that a > b and b > c implies a > c, I can write a function lemma :: a : Int -> b: {Int | a > _v} -> c: {Int | b > _v} -> {Int | _v == a && _v > c } The synthesizer can easily come up with an implementation for this function, but more importantly, by calling (lemma a b c), the synthesizer knows it can get a value equal to a, with the property that it is greater than both b and c without having to reason about why that is true (that reasoning happened when synthesizing a body of lemma). For simple arithmetic reasoning like this, it is never really necessary to add a lemma like this, but it will be necessary for the more complex property mentioned earlier.

Keep in mind that while the output is just one of the inputs, and the purpose is just to introduce a dependent type, the actual output value does matter as it restricts the contexts where the lemmma can be used. For easiest effect it's probably best to just duplicate the lemma for each argument that you're proving a property about. lemma_a :: a : Int -> b: {Int | a > _v} -> c: {Int | b > _v} -> {Int | _v == a && _v > c } lemma_c :: a : Int -> b: {Int | a > _v} -> c: {Int | b > _v} -> {Int | _v == c && a > _v }

• Once you have all the elements in place, it takes about a minute to synthesize everything on the web interface.

You should submit your answers to this problem in two separate files named q1a.sq and q1b.sq.

## Problem 2 (45 pts)

For this problem, you will be using the LLama model to do some simple LLM-based program synthesis. The skeleton can be found here. As a first step, you should install the dependencies. pip3 install -r requirements.txt Then, go into q2.py and set the port to the value posted on piazza. Finally, add the following line next to the line setting the port openai.api_key = "EMPTY" For this assignment, you should submit q2.py. The model used in this experiment is the relatively small codellama/CodeLlama-7b-Instruct-hf model. It is an open source model that is trained specifically to work with the generation of code.

### Part a)

After installing the requirements you can explore the dataset and run queries on the model by running the following command python3 -i q2.py then run >>> datum(0) to see the first example in the dataset. It should look something like this { 'text': 'Write a function to find the longest chain which can be formed from the given set of pairs.', 'example_test' : 'assert max_chain_length([Pair(5, 24), Pair(15, 25),Pair(27, 40), Pair(50, 60)], 4) == 3', 'code': 'class Pair(object): \r\n\tdef __init__(self, a, b): \r\n\t\tself.a = a \r\n\t\tself.b = b \r\ndef max_chain_length(arr, n): \r\n\tmax = 0\r\n\tmcl = [1 for i in range(n)] \r\n\tfor i in range(1, n): \r\n\t\tfor j in range(0, i): \r\n\t\t\tif (arr[i].a > arr[j].b and\r\n\t\t\t\tmcl[i] < mcl[j] + 1): \r\n\t\t\t\tmcl[i]=mcl[j] + 1\r\n\tfor i in range(n): \r\n\t\tif (max < mcl[i]): \r\n\t\t\tmax=mcl[i] \r\n\treturn max', 'test_list' : ['assert max_chain_length([Pair(1, 2), Pair(3, 4),Pair(5, 6), Pair(7, 8)], 4)==4', 'assert max_chain_length([Pair(19, 10), Pair(11, 12),Pair(13, 14), Pair(15, 16), Pair(31, 54)], 5) == 5' ]} Where text is the question, example_test is an example of the function's desired behavior, code is a reference solution, and test_list are used to test the function. Note that only the text and example_test should be used to synthesize the function.

Additionally, run >>> get_completions("def is_prime(n):", num_completions=3, max_tokens=100, temperature=0.8) to query the model. Play around with the model and dataset to get a feel for how it works. Then implement the parts of q2.py that are marked with "Question 2a" in the comments. Here you are implementing the zero-shot prompting strategy, as well as a quick evaluation of the model. You can test your implementation by running python3 q2.py a. You should get an accuracy of about 40%. Note that we run the model multiple times but just getting *a* correct answer is enough to count as correct. This is because we have no way of knowing which of the multiple runs is the correct one.

### Part b)

Implement the one-shot prompting strategy in q2.py. You can test your implementation by running python3 q2.py b. You should get a similar accuracy to before; with this prompt the one-shot strategy doesn't help much (even seems to hurt a bit).

### Part c)

Implement a syntactic constraint, in check_syntactic. You can test your implementation by running python3 q2.py c You'll notice your accuracy didn't improve much. This is because the model is already pretty good at generating syntactically correct code.

### Part d)

Implement a semantic constraint, in check_single_test. You can test your implementation by running python3 q2.py d You should see a significant improvement in accuracy, to about 60%. This is because even a single test case can filter out a lot of the incorrect completions. As a fun (optional) exercise, try to see if you can improve further. One strategy that has been proposed is prompting the model separately to generate additional test cases, and then using those to filter the completions.

### Part e)

One pitfall of LLMs for code is that unlike techniques described in the class, they aren't robust to renamings or other small changes. For this part, see if you can come up with two prompts that are semantically equivalent but use different wording, such that the model gets high accuracy on one but low accuracy on the other. (For this example we will use the part (a) algorithm, with no semantic filtering). Try to get the two to have a gap of at least 30%. You can test your implementation by running python3 q2.py e.

Hint: feel free to be mean and use terms in non-standard ways. Just make sure a reasonable person who closely read the question would agree that the two prompts are asking for the same thing.