Assignment 1
Submission Instructions This assignment is due at 5pm on the date indicated in the class calendar. The starter code for the assignment can be found here. In order to submit your code, pack yourBottomUpEnumerator.jl
, Question2.jl
,
question3a.js
, and question3b.txt
files into a tar file and submit through Canvas (we will create a Canvas assignment as the deadline approaches).
Collaboration policy
You are allowed to consult with at most one other student
while working on the pset, but every student is responsible for
writing their own code and submitting their own work. If you
do consult with another student, you should explicitly acknowledge
them in your submission (you can do that in a comment in BottomUpEnumerator.jl
).
Setup instructions
First, you should install julia using the instructions here. Then, install the packages you will need for this assignment by running the following commands in the julia REPL:Problem 1 (40 pts)
The goal of this assignment is for you to experiment with some of the bottom up inductive synthesis algorithm discussed in class. The code defining the language and interpreter is in thesrc
directory, under Combinata.jl
and Interpreter.jl
. You should
only modify the BottomUpEnumerator.jl
file; you will submit only this file for Question 1.
We have defined a simple language for the following expressions:
interpret()
that you can use to evaluate expressions in this language.
Make sure to use make_coord
, make_rect
, make_triangle
, and make_circle
rather than calling the constructors directly. These functions will check that the coordinates are valid and throw an error if they are not.
To see examples of how to render shapes for debugging, you can look at the Render.jl
file, specifically the
function render_examples
.
Problem 1a
As a starter task, implement the functiongrow
in BottomUpEnumerator.jl
. This function
takes a list of shapes and returns a list of shapes containing
- All the shapes in the input list
- And all the shapes that can be obtained by applying
SUnion
,SIntersection
, orSMirror
once to shapes in the input list
Problem 1b
Your goal is now to implement a new the rest of the bottom up enumeration algorithm. Specifically, implementelim_equivalents
, which takes a list of shapes and returns a list of shapes containing
all the shapes in the input list, but with all equivalent shapes removed. Two shapes are equivalent if they
have the same input/output behavior on the given points. The Dict
feature in Julia
will be useful for this task.
Then, implement synthesize
, which takes a list of input/output pairs and returns a
shape that matches the input/output pairs.
You can test your implementation by running the following command on your command line:
Many of these tests might take several seconds, but not much longer. The entire question-1b suite takes around 20-90 seconds on our laptops. If your tests are taking much longer than this (> 5 minutes), you should consider revising your implementation.
Hint for the test flipped_triangle
: make sure that the Context independent equivalence property holds for your implementation.
Problem 2 (30 pts)
Consider the following grammar for a problem:last2Digits(x + 3) ++ last3Digits(x + 537) ++ last4Digits(x + 82) ++ last5Digits(x + 87)
applied to 9928 should return 31465001010015 (31 ++ 465 ++ 0010 ++ 100015).
The question marks ?? indicate unknown constants.
You should assume the function operates only over the integers.
Even though the space of potential expression is very large, the search space has a lot of
structure. Our goal for this problem is to come up with a representation of the search
space and a search strategy that exploits this structure and allows us to efficiently solve for a program in
this space given a set of input/output pairs. In particular, the goal is to find a way
to factor the search so it is possible to solve for the unknowns without having
to search the exponentially large space.
Your goal is to implement a function with the interface below.
inputoutputs
is a list of input output pairs (each input/output pair
represented as a tuple of Int64, String). The function should return an AST for the synthesized
program using the same format as example_lkd
.
The key requirement is that your function
should execute in linear time with respect to the size of this input list.
Your code should be well commented to clearly explain why your algorithm is linear
time and complete (guaranteed to find a solution if one exists).
Hints:
- If the last 3 digits of (x + y) are z, then the last 3 digits of (z - x) are the last 3 digits of y.
- For a given program, the locations at which the output breaks into terms are fixed.
- A solution that has to iterate through the input list a large number of times is still linear time as long as the number of passes is bounded by a constant.
Problem 3 (30 pts)
For this problem, you will be working with an implementation of a top-down stochastic synthesis algorithm, and will be applying it to solve symbolic regression problems. Start by installing Node.js and npm using the instructions here. Then, from the question3/js-synth
directory, install the dependencies with:
To make sure everything is set up properly, ensure that the following command runs successfully without crashing. It could take up to a minute or so to run, and it's okay if some of the outputs say "INCORRECT".
Problem 3a
Define a grammar for a symbolic regression language in thelang
variable of question3/question3a.js
.
Refer to the annotated example grammar in question3/js-synth/languages/simplmaplang.js
to understand the format.
Your language should have:
- Arithmetic operations
- Trigonometric functions
- Exponentiation
- If-then-else conditional operator (with a boolean condition supporting less-than comparisons)
- Integer constants 0 through 5
- Input variable
x
(this will be handled for you by the library)
"float->float"
, and are listed in problemDefinitions
at the top of question3/gen_data.js
.
Note: Primitive function implementations should not throw errors, however NaN values are fine and the loss function will assign them the highest possible loss.
You can try out your implementation by running (from question3/
)
You can then launch a local server to view the results by running (from question3/
)
Finally, for a slower more thorough evaluation, you can benchmark your implementation by running
Expected performance: Stochastic synthesis algorithms will vary between runs, so we don't expect every run to find an exact solution. Additionally, synthesis can get stuck in local optima that prevent finding the exact solution. We therefore don't expect your implementation to solve all of the problems exactly, nor to do so every time. It should be possible to solve many of the easier problems exactly fairly often, but it is okay to solve the later problems much less frequently - and having an average loss that is below 0.1 for each problem is reasonable.
For reference, a benchmarking run of our implementation (a fairly general grammar, not particularly specialized to the problems) gave the following results:Problem 3b
Language models can often struggle with the kind of reasoning required for symbolic regression from input-output examples. Try prompting a language model of your choice (e.g. ChatGPT) to see if it can solve some of these problems - pick 3 problems of varying difficulty to try.
Write your own prompt, and be sure to provide the grammar along with (some or all of) the input-output examples for the problem (you can find these in problems.json
).
question3b.txt
.