# Assignment 3: Generating Graphics Programs with (Vision-)Language Models

Welcome to Assignment 3! In this assignment, you will build an end-to-end pipeline that uses a vision-language model (VLM) to generate programs in a simple graphics domain, where the specification is given as a desired output image. If the term "vision-language modeling" seems scary, don't worry.
We will be working with the model at a high level of abstraction, at which the nuances between VLMs and regular LLMs fade away.

**Implementation Note**: This notebook comes with pre-defined function skeletons in each answer cell. You may modify these as you see fit, but please do stick to implementing things in their right place (that is, in the answer cell for the task that requires this logic) - this will make your TA's life significantly easier and lessen the risk of misunderstandings during the grading process!

## Assignment Structure

This assignment consists of **4 main parts**:
- **Part 1: Zero-Shot Prompting** - Designing effective prompts for this task
- **Part 2: Data Generation & QLoRA Finetuning** - Create synthetic training data and finetune the model to improve performance
- **Part 3: Data Quality & Model Enhancement** - Identify and address limitations from Part 2 with a view towards the training data
- **Part 4: Extensions (Optional)** - Open-ended exploration for bonus credit

Since we're a bit low on compute, you will use `unsloth` with `4bit` quantization for all model operations. For detailed guidance on model loading, inference, and QLoRA training with Gemma-3 Vision, refer to [this notebook by Unsloth](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B)-Vision.ipynb).

**Notes on Evaluation**: We will provide you with a base `Metrics` class that defines the metrics you will have to use in your experiments. We also provide the test dataset you will use for quantitative evaluation, as well as a small set of challenging test programs that you can use for some rough qualitative analysis of the model's performance. You must use these metrics and data to evaluate your models.

## Deliverables & Grading

Each part will involve the following deliverables:
- **Code**: Working implementations demonstrating the concepts
- **Experiments and analysis**: Quantitative evaluation on test set plus qualitative analysis on the challenging test input, along with your analysis of the results - What is the performance? How does it compare to other approaches or baselines defined earlier? Did it match your expectations and, if not, why might that be?
- **Reflections**: A few more questions prodding you to explain your implementation and design decisions. The idea is that even if your solution itself is incorrect or suboptimal, we will be able to gauge your understanding from the reflections, thus potentially earning you higher marks.

At the end, you will submit this notebook with all your code executed (with the output visible, so we don't need to rerun it) and the reflection questions answered.

Let's get started!

## Setup

Before we begin coding, there are some things we need to set up to make sure you can access the models. Follow the instructions below to the letter, and report on Canvas Discussions if you run into any issues.

### Runtime
We expect most students will use Google Colab for this assignment (although if you have a GPU with sufficient VRAM available elsewhere, feel free to use that – we expect to need around 12-16GB of VRAM).
If you are using Colab, we recommend you connect to the `Python 3 - L4 GPU` runtime type. **Set the runtime in** `Runtime > Change runtime type`.

Note that the L4 GPU will consume about 2 compute credits per hour (the rate is variable, so your mileage may vary). This means that you may only have 50 or so compute hours available with a Google Colab Pro membership. This should be sufficient for the purposes of completing this assignment, but to err on the side of caution **we recommend always making a first pass at the code locally**, either in a local Jupyter instance or by using [Jupytext](https://jupytext.readthedocs.io/en/latest/) to convert to/from `.py` scripts (for those of you whose editors work better with raw Python files). We also advise you to manually shut down idling resources whenever possible, to preserve your compute credits.

### Obtaining access to `google/gemma-3-4b-it`
For this assignment we will be using Google's Gemma-3-4B-it (instruction tuned) model. This is a rather small VLM designed for on-device usage, which suits us well since we do not have access to much compute.

To start off with, navigate to https://huggingface.co/google/gemma-3-4b-it and obtain access to the model. (Note that you need to make a huggingface account first, if you do not already have one.)

**Warning: sometimes getting access to models can take a while, so make sure you submit your request as soon as possible.**

### Setting up your huggingface token
Next, navigate to https://huggingface.co/settings/tokens and make a new token with `read` permission; the name does not matter. Copy this token, and (back in Colab) click on the "key" icon (should be called "Secrets", if you are using a screen reader) and add the token under the name `HF_TOKEN`. Enable the "notebook access" toggle.

### Optional: add model/dataset caching to your google drive
By default, nothing in Colab is persistent, which makes it a bit annoying if you have to redownload large models and datasets from the huggingface hub every time you restart Colab.

If you want to, you can try to enable caching in your Google Drive by setting `False` to `True` in the below code snippet. This will save you a lot of time spent waiting around, at the cost of using up some of the storage in your Drive.

In [None]:
# @title Set False to True below to attempt to mount Google Drive in Colab
# @markdown This will set the Hugging Face cache directory to a location in your Google Drive.
# @markdown If you are not using Google Colab, setting `HF_HOME` in your environment is sufficient.
import os

if False:
    try:
        from google.colab import drive

        drive.mount("/content/drive")

        # Set the HF_HOME environment variable to your desired cache location in Google Drive
        os.environ["HF_HOME"] = "/content/drive/MyDrive/Colab_HF_hub"

        os.makedirs(os.environ["HF_HOME"], exist_ok=True)
    except ImportError:
        # if not in colab, assume HF_HOME is set in the environment
        pass

    # .get() avoids a KeyError if HF_HOME is not set outside Colab
    print(f"Hugging Face cache directory set to: {os.environ.get('HF_HOME', '(not set)')}")

### Imports
We are now almost ready to begin coding. The one thing left to do is to install the necessary packages from pip, and then import them into the notebook. Run the below cells. You can ignore the following error if it appears:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
```

In [None]:
# @title Install required packages
!pip install lark-parser shapely
!pip install "unsloth[colab-new]"

In [None]:
# @title Base imports
# @markdown These imports likely cover everything you will need for the assignment, but feel free to change them if you want.
from unsloth import FastVisionModel, get_chat_template
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
from dataclasses import dataclass
import random
import torch
from transformers import AutoProcessor
import numpy as np
from datasets import Dataset, load_dataset
from PIL import Image, ImageDraw
from IPython.display import display
from lark import Lark, Transformer
from shapely.geometry import Point, box, Polygon
from shapely.affinity import rotate

In [None]:
# @title Set seeds for reproducibility
SEED = 6930
random.seed(SEED)
np.random.seed(SEED)
torch.random.manual_seed(SEED)

torch._dynamo.config.cache_size_limit = 32

## The Domain-Specific Language (DSL)

Having finished setting up, let us now introduce the DSL we will be working with. It defines 2D shape compositions on a 16x16 canvas (rendered at 256x256 resolution by default for nicer-looking results).

### DSL Syntax

- **Circles**: `(Circle radius x y)` - Creates a circle with given radius centered at (x, y)
- **Rectangles**: `(Quad x y width height angle)` - Creates a rectangle centered at (x, y) with given width, height, and rotation angle
- **Boolean Operations**:
  - `(+ shape1 shape2 ...)` - Union of two or more shapes (variadic, requires at least 2 arguments)
  - `(- shape1 shape2 ...)` - Difference of two or more shapes (left-associative, requires at least 2 arguments)

### Parameters
- `radius`, `x`, `y`, `width`, `height`: integers between 0 and 15 (inclusive)
- `angle`: integer multiple of 45 degrees, from 0 to 315 (inclusive)
- Coordinate system: (0,0) is bottom-left, (15,15) is top-right
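
To make the coordinate convention concrete, here is a minimal sketch of how a canvas point maps to image pixels. It mirrors the scaling used by the rasterizer provided later in this notebook, where the canvas y-axis points up while the image y-axis points down. The helper name `canvas_to_pixel` is hypothetical and only used for illustration.

```python
def canvas_to_pixel(x: float, y: float, size: int = 256) -> tuple[float, float]:
    """Map a point on the 16x16 canvas to image pixel coordinates.

    Hypothetical helper for illustration only; the image origin is assumed
    to be the top-left corner, so the y-axis is flipped.
    """
    scale = size / 16.0
    return (x * scale, size - y * scale)


print(canvas_to_pixel(0, 0))  # bottom-left of the canvas -> (0.0, 256.0)
print(canvas_to_pixel(8, 8))  # middle of the canvas -> (128.0, 128.0)
```

Note that the bottom-left canvas corner ends up at the bottom of the image (large pixel y), not at the top.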

### Examples
- `(Circle 3 8 8)` - A circle with radius 3 centered at (8, 8)
- `(Quad 5 5 4 6 45)` - A 4x6 rectangle centered at (5, 5) rotated 45 degrees
- `(+ (Circle 2 4 4) (Circle 2 12 12) (Circle 3 8 8))` - Union of three circles
- `(- (Quad 8 8 8 8 0) (Circle 3 8 8) (Circle 1 6 6))` - A square with two circular holes
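
The semantics of these examples can be sketched without any geometry library by asking whether a single point lies inside a shape. The snippet below is an illustrative sketch, not the notebook's interpreter: the function `contains` is hypothetical, it operates on nested tuples shaped like the ASTs built by the provided parser, and it assumes that positive angles rotate counter-clockwise (as in Shapely's `rotate`).

```python
import math


def contains(ast, px: float, py: float) -> bool:
    """Return True if the point (px, py) lies inside the shape described by `ast`."""
    kind = ast[0]
    if kind == "circle":
        _, r, x, y = ast
        return math.hypot(px - x, py - y) <= r
    if kind == "quad":
        _, x, y, w, h, theta = ast
        # Undo the (assumed counter-clockwise) rotation, then test the axis-aligned box.
        t = math.radians(theta)
        dx, dy = px - x, py - y
        local_x = dx * math.cos(t) + dy * math.sin(t)
        local_y = -dx * math.sin(t) + dy * math.cos(t)
        return abs(local_x) <= w / 2 and abs(local_y) <= h / 2
    if kind == "+":
        # Union: the point is in at least one operand.
        return any(contains(operand, px, py) for operand in ast[1])
    if kind == "-":
        # Left-associative difference: in the first operand, in none of the rest.
        first, *rest = ast[1]
        return contains(first, px, py) and not any(
            contains(operand, px, py) for operand in rest
        )
    raise ValueError(f"Unknown AST node: {ast}")


# (- (Quad 8 8 8 8 0) (Circle 3 8 8) (Circle 1 6 6))
square_with_holes = ("-", [("quad", 8, 8, 8, 8, 0), ("circle", 3, 8, 8), ("circle", 1, 6, 6)])
print(contains(square_with_holes, 11, 11))  # True: in the square, outside both holes
print(contains(square_with_holes, 8, 8))    # False: inside the first hole
```

Because `(- a b c)` is evaluated as `(a - b) - c`, it is equivalent to removing the union of `b` and `c` from `a`, which is what the `"-"` branch checks.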

In [None]:
# @title Defining the grammar, parser, and interpreter
# @markdown Since the focus of this assignment is the use of the LLM and not the DSL itself,
# @markdown these functions have been implemented for you.
# @markdown Understanding how they work might still be useful later on.
grammar = r"""
    start: s?

    ?s: binop
      | circle
      | quad

    binop: "(" op s s+ ")"
    op: add | sub
    add: "+"
    sub: "-"

    circle: "(" "Circle" number number number ")"
    quad: "(" "Quad" number number number number angle ")"

    number: INT     // Will validate range in Transformer
    angle: INT      // Will also validate this

    %import common.INT
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start="start")


class ASTBuilder(Transformer):
    def start(self, args):
        # start: s? - can be empty
        return args[0] if args else None

    def s(self, args):
        # s: binop | circle | quad
        return args[0]

    def binop(self, args):
        # binop: (op s s+) - handles two or more operands
        op = args[0]
        operands = args[1:]
        return (op, operands)

    def op(self, args):
        # op: add | sub
        return args[0]

    def add(self, _):
        # add: "+"
        return "+"

    def sub(self, _):
        # sub: "-"
        return "-"

    def circle(self, args):
        # circle: (Circle number number number)
        r, x, y = args
        return ("circle", r, x, y)

    def quad(self, args):
        # quad: (Quad number number number number angle)
        x, y, w, h, theta = args
        return ("quad", x, y, w, h, theta)

    def number(self, n):
        val = int(n[0])
        if not (0 <= val <= 15):
            raise ValueError(f"Number out of range: {val}")
        return val

    def angle(self, a):
        # Only the range is validated here; multiples of 45 are not enforced
        angle = int(a[0])
        if not (0 <= angle < 360):
            raise ValueError(f"Angle out of range: {angle}")
        return angle


def shape_from_ast(ast):
    if ast is None:
        # Empty program produces empty shape (white canvas)
        return Polygon()
    if ast[0] == "circle":
        _, r, x, y = ast
        return Point(x, y).buffer(r)  # circle
    elif ast[0] == "quad":
        _, x, y, w, h, theta = ast
        rect = box(x - w / 2, y - h / 2, x + w / 2, y + h / 2)
        return rotate(rect, theta, origin=(x, y))
    elif ast[0] == "+":
        # Handle variadic union (requires at least 2 operands)
        _, operands = ast
        if len(operands) < 2:
            raise ValueError("Union operator requires at least 2 operands")
        result = shape_from_ast(operands[0])
        for operand in operands[1:]:
            result = result.union(shape_from_ast(operand))
        return result
    elif ast[0] == "-":
        # Handle variadic difference (left-associative: a - b - c = (a - b) - c, requires at least 2 operands)
        _, operands = ast
        if len(operands) < 2:
            raise ValueError("Difference operator requires at least 2 operands")
        result = shape_from_ast(operands[0])
        for operand in operands[1:]:
            result = result.difference(shape_from_ast(operand))
        return result
    else:
        raise ValueError(f"Unknown AST node: {ast}")


def rasterize_shape(final_shape, size=256, debug=False) -> Image.Image:
    """
    Rasterize the Shapely geometry into a size x size black/white image.
    Black (pixel value 0) = filled, white (pixel value 255) = background.
    """
    # Create white background
    img = Image.new("L", (size, size), 255)  # "L" = grayscale, 255=white
    draw = ImageDraw.Draw(img)

    scale = size / 16.0  # 16x16 grid → upscaled

    if debug:
        print(f"Shape type: {final_shape.geom_type}")
        print(f"Shape bounds: {final_shape.bounds}")
        print(f"Shape is empty: {final_shape.is_empty}")
        print(f"Shape area: {final_shape.area}")

    def draw_polygon(geom, fill=0):
        if geom.is_empty:
            if debug:
                print("Geometry is empty, skipping")
            return
        if geom.geom_type == "Polygon":
            pts = [(x * scale, size - y * scale) for x, y in geom.exterior.coords]
            if debug:
                print(
                    f"Drawing polygon with {len(pts)} points: {pts[:5]}..."
                )  # Show first 5 points
            draw.polygon(pts, fill=fill)
            for interior in geom.interiors:
                pts = [(x * scale, size - y * scale) for x, y in interior.coords]
                draw.polygon(pts, fill=255)  # White holes
        elif geom.geom_type == "MultiPolygon":
            if debug:
                print(f"Drawing MultiPolygon with {len(geom.geoms)} parts")
            for g in geom.geoms:
                draw_polygon(g, fill=fill)
        elif geom.geom_type == "GeometryCollection":
            if debug:
                print(f"Drawing GeometryCollection with {len(geom.geoms)} parts")
            for g in geom.geoms:
                draw_polygon(g, fill=fill)

    draw_polygon(final_shape, fill=0)  # Black = 0
    # make sure image is RGB
    if img.mode != "RGB":
        img = img.convert("RGB")
    return img


class RenderingException(Exception):
    pass


def render_dsl(program: str, size=256, show=False, debug=False) -> Image.Image:
    """
    Render a DSL program into an image.
    :param program: DSL program as a string.
    :param size: Size of the output image (default 256x256).
    :param show: If True, display the image inline.
    :param debug: If True, print debugging information.
    :return: PIL Image object.
    """
    if debug:
        print(f"Rendering program: '{program}'")

    # Handle empty programs
    if not program.strip():
        # Empty program produces white canvas
        final_shape = Polygon()
    else:
        try:
            ast = parser.parse(program)
            if debug:
                print(f"Parsed AST: {ast}")
            ast = ASTBuilder().transform(ast)
            if debug:
                print(f"Transformed AST: {ast}")
            final_shape = shape_from_ast(ast)
        except Exception as e:
            # Chain the original exception so the root cause stays visible
            raise RenderingException(f"Could not render program {program}") from e

    img = rasterize_shape(final_shape, size=size, debug=debug)

    if show:
        display(img)

    return img

In [None]:
# @title Load the test dataset and define the qualitative test case
# @markdown You must use these data for evaluation.

qualitative_test_programs = [
    # An exclamation mark - easy-ish
    "(+ (Quad 8 10 2 8 0) (Quad 8 3 2 2 0))",
    # A grid of rotated quads - medium-ish
    "(+ (Quad 4 4 2 4 0) (Quad 12 4 2 4 45) (Quad 4 12 2 4 90) (Quad 12 12 2 4 135))",
    # A smiley face - quite challenging
    "(+ (- (Circle 5 7 5) (Circle 5 7 6)) (Quad 7 6 2 2 45) (Circle 2 4 10) (Circle 2 10 10))",
    # Biological hazard sign-esque - very challenging
    "(- (+ (- (Circle 5 8 8) (Circle 4 8 8)) (- (Circle 4 8 12) (Circle 3 8 13)) (- (Circle 4 5 6) (Circle 3 4 5)) (- (Circle 4 11 6) (Circle 3 12 5))) (Circle 2 8 8))",
]

print("Test programs for qualitative analysis:")
for test_program in qualitative_test_programs:
    print(test_program)
    test_image = render_dsl(test_program, show=True)
    print()

test_dataset = load_dataset("theoxo/6.5130-2025-test-set")["test"]
print(f"\nLoaded {len(test_dataset)} test samples")
print("\nSample test inputs:")
for i in (0, 20, 47):
    print(f"\nTest input {i + 1}:")
    display(test_dataset[i]["image"])

In [None]:
# @title `Metrics` infrastructure for evaluation
# @markdown The final piece of infrastructure code we will provide you with is this Metrics class.
# @markdown Given a list of generated programs as well as targets (programs and images),
# @markdown it will compute an aggregated set of metrics for you to use during your analysis.
# @markdown You must use these metrics for your experiments, so it is absolutely essential that you understand their significance.
@dataclass
class Metrics:
    pixel_accuracy: float
    parse_rate: float
    precision: float
    recall: float
    f1_score: float

    def __init__(
        self, generated_programs: list[str], targets: list[dict[str, str | Image.Image]]
    ):
        """
        Calculate metrics from generated programs and target images.
        Properly aggregates counts before calculating precision, recall, and F1.

        Args:
            generated_programs: List of generated DSL programs
            targets: List of target dictionaries containing 'program' and 'image'

        Returns:
            Metrics object with calculated values
        """
        # Collect metrics per sample for pixel accuracy, and parse rate
        pixel_accuracies = []
        parse_rates = []

        # Aggregate counts for precision, recall, and F1
        total_true_positives = 0
        total_false_positives = 0
        total_false_negatives = 0

        for program, target in zip(generated_programs, targets):
            target_image = target["image"]
            target_array = np.array(target_image)
            target_black = target_array == 0
            try:
                generated_image = render_dsl(program, show=False)
                # Convert to numpy arrays for comparison
                gen_array = np.array(generated_image)

                # Pixel accuracy (exact match)
                pixel_accuracy = float(np.mean(gen_array == target_array))
                pixel_accuracies.append(pixel_accuracy)

                # Parse rate (1.0 if program parsed successfully, 0.0 otherwise)
                parse_rates.append(1.0)

                # Black pixel metrics (black pixels have value 0, white pixels have value 255)
                generated_black = gen_array == 0
                true_positives = np.sum(generated_black & target_black)
                false_positives = np.sum(generated_black & ~target_black)
                false_negatives = np.sum(~generated_black & target_black)
            except RenderingException:
                # If rendering fails, we consider the pixel accuracy as 0 and parse rate as 0
                pixel_accuracies.append(0.0)
                parse_rates.append(0.0)
                # We also consider the output as 100% wrong,
                # ie that it predicts black pixels where it should not
                # and white pixels where it should not
                true_positives = 0
                false_positives = np.sum(~target_black)
                false_negatives = np.sum(target_black)
            finally:
                # Aggregate counts across all samples
                total_true_positives += true_positives
                total_false_positives += false_positives
                total_false_negatives += false_negatives

        # Calculate aggregated precision, recall, and F1
        if total_true_positives + total_false_positives > 0:
            precision = total_true_positives / (
                total_true_positives + total_false_positives
            )
            recall = total_true_positives / (
                total_true_positives + total_false_negatives
            )
        else:
            precision = float(total_true_positives == 0 and total_false_positives == 0)
            recall = 0.0

        if precision + recall > 0:
            f1_score = 2 * (precision * recall) / (precision + recall)
        else:
            f1_score = 0.0

        self.pixel_accuracy = (
            sum(pixel_accuracies) / len(pixel_accuracies) if pixel_accuracies else 0.0
        )
        self.parse_rate = sum(parse_rates) / len(parse_rates) if parse_rates else 0.0
        self.precision = precision
        self.recall = recall
        self.f1_score = f1_score

    def __str__(self) -> str:
        return (
            f"Metrics(pixel_accuracy={self.pixel_accuracy:.4f}, "
            f"parse_rate={self.parse_rate:.4f}, "
            f"precision={self.precision:.4f}, "
            f"recall={self.recall:.4f}, "
            f"f1_score={self.f1_score:.4f})"
        )

## Part 1: Zero-Shot Prompting & Evaluation

We are now finally ready to start building our system!

### Objectives
In this first part, you will:
1. Establish baseline performance measurements
2. Design an effective zero-shot prompt for graphics program generation in this DSL
3. Evaluate your prompting strategy and compare its performance to the baseline
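
Before building the baseline, it can help to see how the pixel-level metrics behave on a toy input. The sketch below is a simplified, pure-Python stand-in for the provided `Metrics` class, not a replacement for it: the function `pixel_metrics` is hypothetical, masks are flat lists in which `1` marks a black (filled) pixel, and the guard against an empty target is an addition of this sketch.

```python
def pixel_metrics(predictions: list[list[int]], targets: list[list[int]]) -> dict[str, float]:
    """Aggregate counts over all samples first, then compute precision/recall/F1."""
    tp = fp = fn = 0
    accuracies = []
    for pred, tgt in zip(predictions, targets):
        accuracies.append(sum(p == t for p, t in zip(pred, tgt)) / len(tgt))
        tp += sum(1 for p, t in zip(pred, tgt) if p and t)
        fp += sum(1 for p, t in zip(pred, tgt) if p and not t)
        fn += sum(1 for p, t in zip(pred, tgt) if not p and t)

    if tp + fp > 0:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0  # guard added in this sketch
    else:
        # No black pixels were predicted at all: same fallback as in `Metrics`.
        precision, recall = 1.0, 0.0

    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return {
        "pixel_accuracy": sum(accuracies) / len(accuracies),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# One 10-pixel target with 2 black pixels, compared against an all-white prediction.
target = [[1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]
all_white = [[0] * 10]
print(pixel_metrics(all_white, target))
```

On this toy input the all-white prediction reaches a pixel accuracy of 0.8 while its recall and F1 are both 0, which is why pixel accuracy alone is a weak signal when most of the canvas is background.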

### Considerations

**Prompt Design**: Create a prompt structure that effectively instructs the VLM to generate DSL programs from images. Consider:
- How to describe the DSL syntax and semantics
- How to format and organize the contents of the prompt
- How to extract programs from the model's responses
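
As an illustration of the last point, one possible (and by no means required) extraction strategy is to return the first balanced parenthesised span in the response. The function name `first_balanced_expression` is hypothetical, and the sketch makes no attempt to check that the span is a valid DSL program.

```python
def first_balanced_expression(text: str) -> str:
    """Return the first balanced parenthesised span in `text`, or "" if there is none."""
    start = text.find("(")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    return text[start : i + 1]
        # Never balanced from this "(" onward; try the next opening parenthesis.
        start = text.find("(", start + 1)
    return ""


response = "Sure! Here is the program:\n(+ (Circle 2 4 4) (Circle 3 8 8))\nHope this helps."
print(first_balanced_expression(response))  # (+ (Circle 2 4 4) (Circle 3 8 8))
```

A known weakness of this approach is that a parenthetical remark in the model's prose, such as "(see below)", would be returned instead of the program, so it may be worth combining it with output-formatting instructions in the prompt.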

**Technical Setup**:
- Use `unsloth` with `4bit` quantization for model loading. Refer to [this Unsloth notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B)-Vision.ipynb) for guidance, but *note that the model you should use is `google/gemma-3-4b-it`* (which is not quite the same model as they use).
- Test your approach on both the test dataset and the qualitative test programs defined above.

### Deliverables
1. **Code**: Model loading and inference, zero-shot prompt design, and evaluation functions
2. **Analysis**: Quantitative and qualitative evaluation
3. **Reflection**: Discussion of your prompt design choices and evaluation approach

In [None]:
# @title Part 1.1: Implement and evaluate a naive baseline
# TODO: Implement and evaluate a naive baseline which always predicts the empty program.
# The empty program corresponds to predicting all-white images, no matter the input.
# Having a naive baseline is always *the most important* step of experimental research,
# because without it you cannot know if your system is doing something sensible or not!
# We get the following results: `Baseline metrics: Metrics(pixel_accuracy=0.8214, parse_rate=1.0000, precision=1.0000, recall=0.0000, f1_score=0.0000)`
def evaluate_baseline(test_dataset: Dataset) -> Metrics:
    # Baseline always predicts empty program (all-white images)
    return Metrics([], [])  # FIXME


baseline_metrics = evaluate_baseline(test_dataset)
print(f"Baseline metrics: {baseline_metrics}")

In [None]:
# @title Part 1.2: Implement model loading and inference
# TODO: Load the model using unsloth with 4bit quantization
# TODO: Implement generation infrastructure
# Refer to: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B)-Vision.ipynb
def load_model_and_processor(
    model_name: str = "google/gemma-3-4b-it",
) -> tuple[FastVisionModel, AutoProcessor]:
    pass  # FIXME


def extract_program_from_text(text: str) -> str:
    """Extracts the DSL program from the generated text."""
    pass  # FIXME


# The chat-style prompt format used in the Unsloth notebook referenced previously
# implies this somewhat convoluted type for the prompts
type Message = dict[str, str | list[dict[str, str | Image.Image]]]
type Prompt = list[Message]


def generate(
    model: FastVisionModel,
    processor: AutoProcessor,
    prompts: list[Prompt],
    images: list[list[Image.Image]] | None,
    temperature: float = 1.0,
    top_p: float = 0.95,
    top_k: int = 64,
    do_sample: bool = True,
    num_return_sequences: int = 1,
    max_new_tokens: int = 256,
    **other_forward_kwargs,
) -> list[str]:
    # Note: your generation method should operate on batches of inputs, yielding a batch of outputs
    # - this will save you a lot of time down the line.

    # Feel free to make this assumption about the inputs; depending on your implementation
    # it might save you some trouble.
    assert len(prompts) == 1 or num_return_sequences == 1, (
        "Either batch size must be 1 or num_return_sequences must be 1"
    )

    pass  # FIXME

In [None]:
# @title Part 1.3: Design your zero-shot prompts
# TODO: Implement your zero-shot prompting strategy
# Your function should take an image and return a properly formatted prompt
# Consider: DSL description, examples, output formatting instructions
def zero_shot_prompt(image: Image.Image) -> Prompt:
    pass  # FIXME

In [None]:
# @title Part 1.4: Evaluate your zero-shot approach
# TODO: Test your prompt on the challenging test program
# TODO: Evaluate quantitatively on the full test dataset
# TODO: Compare against the baseline metrics
# TODO: Optionally, generate any and all figures/tables that you want to refer to in your discussion below.
# This will likely take a few minutes to run.
def evaluate_quantitatively_on_test_set(
    model: FastVisionModel,
    processor: AutoProcessor,
    test_data: Dataset = test_dataset,
    batch_size: int = 16,
) -> Metrics:  # For the full test set, just return the metrics (for quantitative analysis)
    pass  # FIXME


def evaluate_qualitatively_on_test_programs(
    model: FastVisionModel, processor: AutoProcessor
) -> tuple[
    Metrics, list[str], list[Image.Image | None]
]:  # Return metrics, generated programs, and rendered output images (if successful)
    # Should evaluate on `qualitative_test_programs`
    pass  # FIXME


model, processor = load_model_and_processor()
zero_shot_metrics = evaluate_quantitatively_on_test_set(model, processor)
print(f"Zero-shot metrics: {zero_shot_metrics}")

# Generate prediction for the qualitative test cases
qual_metrics, predicted_programs, predicted_images = (
    evaluate_qualitatively_on_test_programs(model, processor)
)
for gt_program, predicted_program, predicted_image in zip(
    qualitative_test_programs, predicted_programs, predicted_images
):
    print("-" * 50)
    print(f"Target program: {gt_program}")
    print("Target image:")
    render_dsl(gt_program, show=True)
    print(f"\nZero-shot predicted program: {predicted_program}")
    if predicted_image is not None:
        print("Zero-shot predicted image:")
        display(predicted_image)
    else:
        print("Failed to render predicted program.")
print(f"Metrics on qual. test instances: {qual_metrics}")

# TODO: Optionally, any tables/figures that you want to refer to in your discussion below can be generated here.

### Part 1.5: Analysis/Discussion of Results

(*TODO: Describe and discuss the results you observed in Part 1.4.*)

### Part 1.6: Reflections for Part 1

- Discuss your implementation for this part. Did you run into any challenges whilst implementing the code? Were there any parts that you were less or more confident about how to approach?

- What strategy did you settle on for extracting the program from the model's response? Did you try any variations of it, and if so what made you settle on this one?

- When constructing zero-shot prompts (i.e., in the complete absence of training data), there might still be some information about the domain or the task that we can give to the model. What, if any, such information did you include in your prompt? Why?

## Part 2: Evolutionary Data Generation & QLoRA Finetuning

We now have a sense of how well the model is able to perform this task out of the box. Chances are, depending on your chosen prompt, that the model's performance in Part 1 was not particularly impressive! In this part, we will try to improve upon it by leveraging (QLoRA) finetuning.
Since the DSL is completely novel, and we have an interpreter for it,
we can most effectively obtain training data by synthesizing it.
However, instead of manually writing that synthesizer, we will first experiment with getting the model itself to do the heavy lifting by leveraging an evolutionary-search-style data generation process.

The key features involved are thus:
1. **Population**: Maintain a collection of DSL programs; this can be initialized with a few "seed" programs
2. **Fitness Function**: In this case, evaluate how "informative" each program would be as training data
3. **Selection**: Choose high-scoring programs for mutation
4. **Mutation**: Use the VLM to generate variations of selected programs
5. **Evolution**: Iterate until you have a high-quality dataset
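
The overall shape of such a loop can be sketched as follows. This is a minimal illustration under several assumptions, not a reference solution: `evolve_sketch`, `toy_mutate`, and `toy_is_valid` are hypothetical names, selection is uniformly random (no fitness function), the mutation step is a random stand-in for a VLM call, and validity is approximated by a parenthesis count rather than by actually rendering the program.

```python
import random
from typing import Callable, Optional


def evolve_sketch(
    seeds: list[str],
    mutate: Callable[[str], str],
    is_valid: Callable[[str], bool],
    barred: set[str],
    target_size: int,
    max_steps: int = 10_000,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Collect unique, valid programs by repeatedly mutating population members."""
    if rng is None:
        rng = random.Random(0)
    population = list(dict.fromkeys(seeds))  # de-duplicate while keeping order
    seen = set(population) | set(barred)
    accepted: list[str] = []
    for _ in range(max_steps):
        if len(accepted) >= target_size:
            break
        parent = rng.choice(population)  # uniform selection
        child = mutate(parent)
        if child in seen or not is_valid(child):
            continue  # skip duplicates, barred programs, and invalid programs
        seen.add(child)
        population.append(child)
        accepted.append(child)
    return accepted


# Toy stand-ins so that the sketch runs without a model or a renderer.
toy_rng = random.Random(1)


def toy_mutate(program: str) -> str:
    r, x, y = toy_rng.randint(1, 5), toy_rng.randint(0, 15), toy_rng.randint(0, 15)
    return f"(+ {program} (Circle {r} {x} {y}))"


def toy_is_valid(program: str) -> bool:
    return program.count("(") == program.count(")")


programs = evolve_sketch(
    ["(Circle 3 8 8)"], toy_mutate, toy_is_valid, barred=set(), target_size=5
)
for p in programs:
    print(p)
```

In this sketch the seed programs only serve as parents and are not themselves returned; whether to include them in the final dataset is a design decision left open here. The `max_steps` cap keeps the loop from running forever if the mutation step stops producing new valid programs.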

### Objectives
In this second part, you will:
1. Implement evolutionary-style synthetic data generation
2. Use the evolved dataset for QLoRA finetuning
3. Evaluate whether this led to an improvement over the zero-shot performance in Part 1


### Considerations

**Fitness Function**: In practice the fitness function is an integral part of evolutionary search, since it is what guides the search procedure. However, in this part, we will not use a fitness function; instead, you should simply filter out candidate programs which are not syntactically valid (i.e., which cannot be rendered). We will return to the topic of constructing a fitness function in the (optional) Part 4.

**QLoRA Finetuning**:
Once you have obtained the data, you will have to use QLoRA to fit the model to it. Thanks to the maturity of the ecosystem, this does not involve much work, essentially boiling down to:
- Configuring LoRA parameters for the vision-language model
- Processing your dataset into the format expected by the trainer
- Training on your evolved dataset using the high-level supervised fine-tuning interface provided by `trl` and `unsloth`

Refer to the previously referenced Unsloth Gemma-3 Vision notebook for implementation guidance. One important aspect of that notebook that you should not neglect to incorporate into your solution is `train_on_responses_only`; this parameter is actually pretty important!
It ensures that the model is trained only on the assistant's responses, not on the user prompts.
That is, you are only training the model to generate the program given the image, not to learn
the distribution of images!
In most applications, learning only the conditional distribution leads to better downstream performance.
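
Conceptually, training on responses only amounts to masking the prompt positions in the label sequence so that they do not contribute to the loss. The snippet below is a simplified illustration of that idea, not a description of how Unsloth implements it: the function `mask_prompt_tokens` is hypothetical and the token ids are made up. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_tokens(input_ids: list[int], response_start: int) -> list[int]:
    """Return labels in which every token before `response_start` is ignored by the loss."""
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


# Made-up ids: four prompt tokens (image + instruction) followed by three response tokens.
input_ids = [101, 7, 8, 9, 42, 43, 44]
print(mask_prompt_tokens(input_ids, response_start=4))
# [-100, -100, -100, -100, 42, 43, 44]
```

Only the last three positions carry a training signal here, so the model is optimized to produce the program tokens given the prompt rather than to reproduce the prompt itself.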

**Runtimes**:
Our implementation of problem 2.1 takes a bit over half an hour to run, and fine-tuning in 2.2 can likewise take a considerable amount of time (scaling with the dataset size). Given this, you may want to test your generator by generating and fine tuning on a smaller number of programs, to ensure you catch any crashes and issues that might come up. Of course, your final version should generate the number of programs set in the skeleton code.

**Out of Memory errors**: If you encounter "out of memory" errors in this or any other part of the assignment, two common solutions are lowering the batch size and/or restarting the notebook kernel to clear any leaked memory. For example, rerunning cells more than once can sometimes leak memory until the kernel is restarted (or the memory is otherwise freed). Note however that restarting the kernel will require rerunning earlier cells, or caching/saving intermediate results – which is why we highly recommend saving the dataset after generating it in 2.1.
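
Short of restarting the kernel, a small cleanup helper can sometimes recover memory between experiments. The following is a minimal sketch under the assumption that the large objects (for example an old trainer or model) have already been deleted with `del`; the helper name `free_memory` is hypothetical.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def free_memory() -> None:
    """Run garbage collection and, if CUDA is available, release cached GPU memory."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


free_memory()
```

This only releases memory that is no longer referenced anywhere in the notebook, so it does not help while a variable still points at the model or trainer.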

### Deliverables
1. **Code**: Evolutionary algorithm and QLoRA training code
2. **Analysis**: Comparison of model performance before/after evolution and training
3. **Reflection**: Discussion of evolutionary approach and fitness function design

In [None]:
# @title Part 2.1: Implement evolutionary data generation
# TODO: Implement the evolutionary algorithm:
# - Population management
# - VLM-based mutation with carefully crafted prompts
# - Filter out programs that cannot be successfully rendered (syntactic validity)
# - Evolution loop until you have sufficient valid programs
# *Important*:
# - Your method should generate exactly as many training samples as indicated by `training_set_size`.
# - It should return a `Dataset` with columns `program`, `image` (same as the test dataset).
# - Only include programs that are syntactically valid and can be rendered.
# - DO NOT include any programs that are present in the `barred_programs` set;
#   we will use this to ensure that you do not include any test programs in your training data.
# You might find it helpful to cache/save this dataset to your huggingface account,
# if you find that having to regenerate it every time you return to the pset is slowing you down.
def evolve(
    model, seed_programs: list[str], barred_programs: set[str], training_set_size: int
) -> Dataset:
    pass  # FIXME


# DO NOT CHANGE the below code - it runs the evolution process
# with a fixed set of parameters, so that we can fairly compare results.
seed_programs = [
    "(Circle 3 8 8)",
    "(Quad 5 5 4 6 45)",
    "(- (Circle 5 5 5) (Circle 2 8 8))",
    "(+ (Quad 3 3 2 8 0) (Quad 5 5 4 6 45))",
    "(+ (- (Circle 3 8 8) (Circle 2 8 8)) (Quad 5 5 4 6 45))",
]
training_ds = evolve(
    model,
    seed_programs,
    set(p["program"] for p in test_dataset) | set(qualitative_test_programs),
    500,
)
print(f"Evolved dataset (first 10 rows): {training_ds[:10]}")
# TODO: Optionally, add saving/loading the dataset remotely

In [None]:
# @title Part 2.2: QLoRA finetuning setup and training
# TODO: Configure QLoRA parameters for the vision-language model
# TODO: Prepare your evolved dataset for training (process it into a chat-style format)
# TODO: Set up the SFTTrainer and train your model
# *Important*:
# - You must train your adapter for exactly one training epoch.
# - There are lots of hyperparameters to set here, both for the QLoRA adapter and the SFTTrainer. You don't need to do a hyperparameter sweep;
#   pick values that are supported by papers and resources you find online and which seem sensible to you in this context.
# - On an L4 GPU, you should be able to fit a batch size of 16, but you can try to lower this if you run into out-of-memory issues.
def finetune(
    model: FastVisionModel,
    processor: AutoProcessor,
    dataset: Dataset,
    num_training_epochs: int = 1,
) -> FastVisionModel:
    pass  # FIXME


finetuned_model = finetune(model, processor, training_ds)
print("Finished finetuning model.")

In [None]:
# @title Part 2.3: Evaluate your finetuned model
# TODO: Evaluate quantitatively on the full test dataset and qualitatively on the challenging test programs, as in Part 1
# TODO: Compare performance: baseline vs zero-shot vs finetuned model
# You should be able to reuse the evaluation functions you implemented in Part 1 here.

# TODO: Additionally, any tables/figures that you want to refer to in your discussion below can be generated here as well.

### Part 2.4: Analysis/Discussion of Results

(*TODO: Describe and discuss the results you observed in Part 2.3.*)

### Part 2.5: Reflections for Part 2

- How did you design the evolutionary data generation process? What did you do to encourage the model to generate diverse variations of existing programs?

- Discuss the parameters you chose for the QLoRA adapter and the SFTTrainer. What made you settle on these in particular? (Note that we do not expect you to have carried out extensive hyperparameter search; rather, choosing values that appear attractive a priori or are supported by other resources is a good approach.)

- Do you think training for more epochs would improve your results? Why or why not?

- In the above, we finetuned the model to directly predict the program from the input image. Can you think of any reasons why this might not actually be desirable if we had been finetuning a recent state-of-the-art general-purpose model such as DeepSeek-R1?

## Part 3: Data Quality & Model Enhancement

Part 2 introduced you to evolutionary data generation and QLoRA finetuning.
Hopefully this made it apparent to you that in well-controlled environments such as generating programs in a DSL, obtaining a reasonable amount of training data is not particularly difficult.
However, you may still have found that your finetuned model showed less than ideal capacity, or even performed worse than the base model did!
One solution to that problem could be to simply generate and train on substantially more data,
but that would require more compute than we have access to here.

In this part we will instead take a data-quality-centric view to improving performance while keeping training cost low.
The key idea is to identify sources of low-quality training data and address them systematically.

To begin with, you will have to analyze your evolved dataset from Part 2 to identify potential quality issues, such as degeneracy and redundancy.
Then, you will be asked to *manually* implement a data generator that avoids these issues, which we will then put to the test.
Why manually, you ask? Wasn't the whole point of doing evolutionary data generation in the first place to avoid having to do that?
The answer is that in the real world, when working with synthetic data, it is often paramount to combine multiple generation strategies.
As you may have found in Part 2, evolutionary data generation can be a powerful tool, but it may also produce low-quality data if the model's bias is not well-aligned with the task at hand.
On the other hand, purely handcrafted/heuristic data generators may not achieve good coverage of the data distribution.
To keep the length of this problem set manageable, we will only explore the two extremes of purely evolutionary and purely handcrafted data generation in the main parts of the problem set.
If you are interested in exploring more sophisticated hybrid approaches, the (optional) Part 4 is for you!
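
As a starting point for this kind of analysis, it helps to work on a parsed representation of each program rather than on raw strings. The sketch below is one possible approach, assuming the grammar suggested by the seed programs: S-expressions in which `+` and `-` take exactly two operands and everything else is a shape. The names `parse`, `depth`, and `has_identical_operands` are hypothetical, and malformed input is not handled.

```python
def parse(program: str):
    """Parse an S-expression program into nested lists of string tokens."""
    tokens = program.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ")"
        return node

    return read()


def depth(node) -> int:
    """Nesting depth of +/- operations; a bare shape has depth 0."""
    if node[0] not in ("+", "-"):
        return 0
    return 1 + max(depth(child) for child in node[1:])


def has_identical_operands(node) -> bool:
    """True if any +/- node combines two structurally equal subtrees."""
    if node[0] not in ("+", "-"):
        return False
    left, right = node[1], node[2]
    return (
        left == right
        or has_identical_operands(left)
        or has_identical_operands(right)
    )


tree = parse("(+ (- (Circle 3 8 8) (Circle 2 8 8)) (Quad 5 5 4 6 45))")
print(depth(tree))  # 2
print(has_identical_operands(parse("(+ (Circle 3 8 8) (Circle 3 8 8))")))  # True
```

Structural equality is only a cheap proxy for degeneracy: two differently written operands can still render identically, which is something a comparison of rendered images would catch and this check would not.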


### Objectives
In this part, you will:
1. Analyze your evolved dataset to identify potential quality issues
2. Implement a handcrafted/heuristic data generator that avoids these issues
3. Evaluate whether the improved data leads to better downstream model performance

### Deliverables
1. **Code**: Dataset analysis code, data generator code
2. **Analysis**: Comparison between dataset quality with and without your improvements, as well as resulting downstream model performance
3. **Reflection**: Discussion of possible types/sources of low-quality data, how this factored into your data generator design, and of the resulting findings

In [None]:
# @title Part 3.1: Analyze your evolved dataset
# TODO: Implement dataset analysis functions to identify quality issues in your evolved dataset.
# The function should return a dictionary, where the keys indicate the metric (e.g., "meaningless_ops_frequency").
#
# You will have to come up with your own set of analyses, but below are some suggestions you can use.
# Note: Don't go too overboard! A good answer might only compute five or so metrics about the dataset.
# It is more important to show that you have thought critically about which metrics might be relevant
# to the task than that your analysis is super extensive.
#
# **Degenerate Program Detection:**
# - Count programs with identical overlapping shapes (e.g., "(+ (Circle 3 8 8) (Circle 3 8 8))")
# - Find meaningless differences (subtracting shapes that don't overlap)
# - An effective method for identifying these instances is to perform "AST surgery", removing
#   operands from the operation and seeing if the rendered image changes - if it does not,
#   the operation is likely meaningless.
#
# **DSL Construct Usage:**
# - Frequency distribution: circles vs. quads vs. union vs. difference operations
# - Program complexity: distribution of operation nesting depths (0, 1, 2+ levels)
# - Parameter distributions: are radius/coordinate/angle values diverse or clustered?
#
# **Visual Quality Metrics:**
# - Canvas coverage: what % of programs use less than 5% or more than 75% of canvas?
# - Shape positioning: are shapes clustered in corners or well-distributed?
def analyze_dataset_quality(dataset: Dataset) -> dict:
    """
    Analyze various quality metrics of the evolved dataset.
    Should return a dictionary with analysis results.
    Focus on the metrics listed above to identify specific quality issues.
    """
    pass  # FIXME

In [None]:
# @title Part 3.2: Design a handcrafted data generator
# TODO: Implement a synthetic data generator that avoids the quality issues you identified in Part 3.1.
# *Important*:
# - As before, your method should generate exactly as many training samples as indicated by `training_set_size`.
# - It should return a `Dataset` with columns `program`, `image` (same as the test dataset).
# - DO NOT include any programs that are present in the `barred_programs` set.
def generate_synthetic_dataset(
    barred_programs: set[str], training_set_size: int
) -> Dataset:
    pass  # FIXME


improved_training_ds = generate_synthetic_dataset(
    set([p["program"] for p in test_dataset]) | set(qualitative_test_programs), 500
)
print(f"Generated improved dataset (first 10 rows):\n{improved_training_ds[:10]}")
improved_training_ds_metrics = analyze_dataset_quality(improved_training_ds)
print(f"Improved DS metrics: {improved_training_ds_metrics}")

In [None]:


# @title Part 3.3: Compare datasets and evaluate improvement
# TODO: Train model on your improved dataset, then compare the metrics to those obtained in Part 2.3
# To make it easier for us to understand your code, please use the `compare_dataset_performance` function below
# (which you will have to implement) to compute the difference in performance between the two datasets.
# (Which metrics improved, which got worse, and by how much?)


def compare_dataset_performance(
    old_metrics: Metrics, improved_metrics: Metrics
) -> dict:
    pass  # FIXME


# Compare performance
print("Training on improved DS")
try:
    # First need to reload the model, as it was finetuned above
    # (Unsloth/transformers/peft can be a bit weird about whether the model is updated in-place or not,
    #  but the below should be safe regardless)
    model = model.unload()
except Exception:
    model, processor = load_model_and_processor()
finetuned_improved = finetune(model, processor, improved_training_ds)
improved_metrics = evaluate_quantitatively_on_test_set(
    finetuned_improved, processor, test_dataset
)
print(f"Improved metrics: {improved_metrics}")
print("Comparing performance of original vs. improved datasets...")
old_metrics = dict()  # FIXME: Replace with metrics from previous finetuning
comparison_results = compare_dataset_performance(old_metrics, improved_metrics)
print("Comparison results:")
print("\n".join(f"- {k}: {v}" for k, v in comparison_results.items()))
print("Quantitative performance comparison completed.")

# Generate prediction for the qualitative test cases
qual_metrics, predicted_programs, predicted_images = (
    evaluate_qualitatively_on_test_programs(finetuned_improved, processor)
)
for gt_program, predicted_program, predicted_image in zip(
    qualitative_test_programs, predicted_programs, predicted_images
):
    print("-" * 50)
    print(f"Target program: {gt_program}")
    print("Target image:")
    render_dsl(gt_program, show=True)
    print(f"\nFinetuned model's predicted program: {predicted_program}")
    if predicted_image is not None:
        print("Finetuned model's predicted image:")
        display(predicted_image)
    else:
        print("Failed to render predicted program.")
print(f"Metrics on qual. test instances: {qual_metrics}")

In [None]:
# TODO: Any additional analysis, visualizations, or experiments for Part 3

### Part 3.4: Analysis/Discussion of Results

(*TODO: Describe and analyze the results from your dataset quality analysis and improvement experiments in Part 3.3.*)

### Part 3.5: Reflections for Part 3

- What quality issues did you identify in your original evolved dataset?

- How did you design your heuristic data generation strategy to avoid these issues? What factors did you decide to optimize for, and why?

- Did you obtain a dataset of higher quality after your modifications (per your analysis function)?

- Did the improved dataset lead to better downstream performance? If not, why might that be?

## (Bonus) Part 4: Exploration

Congrats on finishing all the required parts of this assignment!

For those of you who are interested, here is an opportunity to push yourself a bit extra and try your hand at some more open-ended exploration.
This section will earn you up to **20** marks, allowing your grade for the assignment to go all the way up to **120**.
If your final score exceeds 100, the additional marks will be rolled over to your next assignment.

Your task is to go a bit beyond what we have done in the main parts of this assignment, while continuing your exploration of evolutionary data generation, QLoRA finetuning, and data quality.
Specifically, your task is to investigate, implement, experiment with, and report on **one** of the below topics:
1. [0-8 marks]: One alternative way of improving the quality of your training data that does not rely on going all the way to handcrafted data generation would be to augment your evolutionary data generation process with a *fitness function*. In the style of AlphaEvolve, the fitness function (which should assign high scores to higher-quality samples) would then be used to bias the search procedure in terms of which examples are shown to the model and/or by informing the model of which examples are better. Implement and experiment with one such fitness function of your choice, and report on the results.
2. [0-8 marks]: As mentioned briefly in Part 3, in practice, hybrid strategies that combine different data generation methods are often the most effective. One such strategy would be to first generate $N < 500$ seed programs using your handcrafted data generator from Part 3, and then use evolutionary data generation to generate the remaining $500 - N$ programs. Implement and experiment with this strategy, and report on the results.
3. [0-20 marks]: Do both of the above as well as the combined experiment (i.e., three experiments, the last of which combines both ideas).
4. [0-20 marks]: Come up with your own idea for improving the data generation or finetuning process, implement it, and report on the results.
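
To illustrate how a fitness function could bias the search in option 1, here is a minimal sketch of fitness-proportional parent selection. This is just one common selection scheme, not a description of how AlphaEvolve works; `select_parents` and `toy_fitness` are hypothetical names, and the toy fitness (which simply rewards programs containing more operations) is a placeholder for whatever quality measure you design.

```python
import random


def select_parents(
    population: list[str], fitness, k: int, rng: random.Random
) -> list[str]:
    """Sample k parents with probability proportional to their fitness score."""
    weights = [max(fitness(p), 1e-6) for p in population]  # keep weights positive
    return rng.choices(population, weights=weights, k=k)


def toy_fitness(program: str) -> float:
    """Hypothetical stand-in: rewards programs that contain more operations."""
    return 1.0 + program.count("(+") + program.count("(-")


population = [
    "(Circle 3 8 8)",
    "(- (Circle 5 5 5) (Circle 2 8 8))",
    "(+ (- (Circle 3 8 8) (Circle 2 8 8)) (Quad 5 5 4 6 45))",
]
parents = select_parents(population, toy_fitness, k=4, rng=random.Random(0))
```

The selected parents would then be shown to the model as examples to mutate. Sampling with replacement means high-fitness programs can appear several times, which is the intended bias.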

Marks will be awarded based on the rigor of your experimental design, the depth of your analysis, and the insights you are able to draw from your results.
Beware: To avoid low-balling, partial attempts will not be awarded any marks, and completed but unsuccessful attempts (i.e., those that do not result in improved performance) may be awarded only a small number of marks (at the TA's discretion).

No matter what you decide to do, make sure to include **all of your code** and **all of your results** in this section. Have fun!

In [None]:
# @title Part 4
# TODO: Put your code and your discussion here! Feel free to create more cells as needed, and to combine code cells and markdown cells.
# Make sure your investigation is neatly formatted and easy to follow!

## Assignment Submission

Congratulations on completing Assignment 3! Please submit this notebook with all your code executed and the reflection questions answered. Please double check by opening the notebook anew that all the cells' outputs have been included correctly, as we will not be able to run your code for you.