Paper: arXiv
Code: GitHub
Authors: Clément Bonnet, Matthew V Macfarlane
Date of Publication: 13th November 2024

Overview

In this paper, the authors use Deep-Learning-based program synthesis to tackle ARC-AGI. Given a set of input/output pairs, program synthesis aims to generate a program (or function) that transforms the inputs into the outputs. The authors use an encoder to map the input-output pairs into a latent space, which represents the distribution of all programs that could explain the transformation from input to output. They then sample from this latent space and pass the latent variable to the decoder, which takes the new test input along with the latent vector and auto-regressively generates the required test output.
The authors also use a few methods to search through the latent space so that the most suitable latent variable is used.

Background

Program Synthesis

Let $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{n}$ be the set of few input/output pairs (the task to be solved).

A program $f$ is considered to solve the task associated with $\mathcal{X}$ if it satisfies: $f(x_i) = y_i \;\; \forall i \in \{1, \dots, n\}$.

Let $f^*$ represent the true function that generated the input-output pairs. In this paper, the authors explore the problem where we are given a task $\mathcal{X}$, along with an additional input $x_{n+1}$, and must predict $y_{n+1} = f^*(x_{n+1})$.

The goal is not to explain the generating function $f^*$ but to find a program that generalises to the new additional input $x_{n+1}$. ARC-AGI falls exactly in this category.
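To make the setting concrete, here is a minimal sketch in Python; the names (`solves_task`, `spec`, `candidate`) are illustrative, not from the paper.

```python
# A "program" is any function f; it solves the task X = {(x_i, y_i)} if
# f(x_i) == y_i for every pair in the specification.

def solves_task(f, spec):
    """Return True if the program f reproduces every output in the specification."""
    return all(f(x) == y for x, y in spec)

# Example: the hidden generating function is "reverse the list".
spec = [([1, 2, 3], [3, 2, 1]), ([4, 5], [5, 4])]
candidate = lambda xs: xs[::-1]

assert solves_task(candidate, spec)
# The real goal is generalisation: apply the found program to a new input.
print(candidate([7, 8, 9]))  # -> [9, 8, 7]
```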

Variational Auto-Encoders

In the framework of VAEs, given i.i.d. samples $x$ from a dataset, we try to infer $p(z \mid x)$, where $z$ represents the latent variable: we want the latent variable that explains the input samples. The latent variable is usually of much lower dimension than the input. Since $p(z \mid x)$ is generally intractable, we approximate it with a neural network, $q_\phi(z \mid x)$. We maximize the evidence lower bound (ELBO):

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

By maximising this lower bound, the log-likelihood increases. The first term in the ELBO captures the likelihood of reconstructing $x$ from the latent variable $z$. The KL-divergence term regularises $q_\phi(z \mid x)$ to stay close to the prior distribution $p(z)$ (usually a standard Gaussian). So given an input $x$, the neural network predicts $q_\phi(z \mid x)$, and a decoder reconstructs the input from a sampled $z$; the model is trained by maximising the ELBO shown above.
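As a rough sketch of what maximising the ELBO looks like in code (written with JAX; `encoder` and `decoder_log_likelihood` are placeholder callables, not the authors' implementation):

```python
import jax
import jax.numpy as jnp

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.
    return 0.5 * jnp.sum(jnp.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo(x, encoder, decoder_log_likelihood, key):
    mu, log_var = encoder(x)                      # parameters of q_phi(z | x)
    eps = jax.random.normal(key, mu.shape)        # reparameterization trick
    z = mu + jnp.exp(0.5 * log_var) * eps
    return decoder_log_likelihood(x, z) - kl_to_standard_normal(mu, log_var)
```

Training maximises this quantity (or equivalently minimises its negative) over the dataset.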

Latent Program Network (LPN)

In this paper, the authors introduce the Latent Program Network (LPN), an algorithm that trains a neural network end-to-end to take a specification of input-output pairs and generate the output for a newly given input.

Latent Program Inference

LPN consists of an Encoder and a Decoder, which play a similar role as in a VAE.

Encoder
The encoder is trained to predict the latent variable given the set of I/O pairs. It maps an input-output pair $(x_i, y_i)$ to a distribution $q_\phi(z \mid x_i, y_i)$ in the latent space. This distribution represents all the possible programs that could explain the transformation of the input into the output. The encoder is thus trained to learn an abstract representation of programs in a continuous latent space.
In practice, they model the latent space with a multivariate Gaussian distribution: the encoder predicts the mean and a diagonal covariance matrix of this Gaussian, from which the latent variable is sampled.
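A minimal sketch of this sampling step, assuming a hypothetical `encode_pair` function that stands in for the encoder transformer and returns the mean and per-dimension log-variance:

```python
import jax
import jax.numpy as jnp

def sample_program_latent(encode_pair, x, y, key):
    mu, log_var = encode_pair(x, y)               # parameters of q_phi(z | x, y)
    eps = jax.random.normal(key, mu.shape)
    return mu + jnp.exp(0.5 * log_var) * eps      # z ~ q_phi(z | x, y)
```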

Decoder
The decoder is responsible for mapping the latent variable $z$ sampled from the distribution predicted by the encoder, together with a given input $x$, to a distribution over all possible outputs $y$, i.e. $p_\theta(y \mid x, z)$. Even though the mapping between I/O pairs is deterministic, a probabilistic framework is used so that it remains compatible with maximum-likelihood learning.
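A sketch of how the decoder log-likelihood $\log p_\theta(y \mid x, z)$ could be computed from token logits; `decoder_logits` is a hypothetical stand-in for the decoder-only transformer used later in the paper:

```python
import jax
import jax.numpy as jnp

def output_log_likelihood(decoder_logits, x, y_tokens, z):
    # decoder_logits returns next-token logits of shape (seq_len, vocab_size)
    # given the input x, the (teacher-forced) output tokens and a latent z.
    logits = decoder_logits(x, y_tokens, z)
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # Sum the log-probability assigned to each true output token.
    return jnp.sum(jnp.take_along_axis(log_probs, y_tokens[:, None], axis=-1))
```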

Decoding examples when conditioned on various latent variables

Conditioning the decoder on different points of the latent space leads to different outputs being generated.

Latent Optimization

But there is a problem that needs to be addressed: the latent variable predicted by the encoder is often simply not good enough. It may not encode the high-level details necessary to explain the I/O transformation. Especially when the task is very novel, the encoder may fail to produce the right latent program, which, when fed to the decoder, would generate the wrong output. Therefore, the authors include a stage in which they optimize the latent variable $z$. Starting from the encoder's prediction $z_0$, they carry out a search for a better latent $z^*$. This is analogous to system 1/system 2 thinking: the encoder producing an initial guess is the system 1 part, while the search for a better $z$ is the system 2 part.

Search Methods for Latent Optimization

Given input-output pairs $\{(x_i, y_i)\}_{i=1}^{n}$, the search process attempts to find the latent $z^*$ that satisfies:

$$z^* = \arg\max_{z} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i, z)$$

This means we search for the latent that most likely makes the decoder generate the right outputs given the corresponding inputs. By finding a latent that can explain all of the given input-output pairs, the solution to this optimization problem is more likely to generalise to a new input-output pair.
The authors use two methods of search, namely Random Search and Gradient Ascent.

Random Search
The authors sample latent variables from either the prior distribution or a distribution centred on the encoder's initial prediction, and select the latent that gives the highest log-likelihood of the input-output pairs according to the decoder.

Random search asymptotically converges to the true maximum-likelihood latent. However, its efficiency drops as the dimensionality of the latent space grows. It can still be used even when the decoder is non-differentiable.
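A sketch of random search under these assumptions; `log_likelihood(z, pairs)` is a placeholder returning the decoder's summed log-likelihood over the given pairs:

```python
import jax
import jax.numpy as jnp

def random_search(log_likelihood, pairs, key, latent_dim, num_samples=100):
    # Draw candidate latents from the standard-normal prior (one could also
    # sample around the encoder's initial guess instead).
    candidates = jax.random.normal(key, (num_samples, latent_dim))
    # Score each candidate by how well the decoder explains all I/O pairs.
    scores = jax.vmap(lambda z: log_likelihood(z, pairs))(candidates)
    return candidates[jnp.argmax(scores)]
```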

Gradient Ascent
Since the decoder is a differentiable neural network, its log-likelihood $\log p_\theta(y \mid x, z)$ is also differentiable with respect to $z$. So we can use gradient ascent to search for the $z$ that maximises the log-likelihood.

In practice, they use the best latent found during the search, which may not always be the last latent in the optimization trajectory.
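A sketch of this gradient-ascent search, keeping the best latent seen so far; `log_likelihood` is the same placeholder as above, and the step size is an arbitrary choice:

```python
import jax

def gradient_ascent(log_likelihood, pairs, z0, num_steps=5, lr=0.1):
    # log_likelihood(z, pairs) returns sum_i log p_theta(y_i | x_i, z) and is
    # assumed differentiable with respect to z.
    grad_fn = jax.grad(lambda z: log_likelihood(z, pairs))
    z, best_z, best_score = z0, z0, log_likelihood(z0, pairs)
    for _ in range(num_steps):
        z = z + lr * grad_fn(z)                 # plain gradient-ascent update
        score = log_likelihood(z, pairs)
        if score > best_score:                  # keep the best latent seen so far
            best_z, best_score = z, score
    return best_z
```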

Gradient field in the latent space

Gradient field of the decoder log-likelihood


Overview of the entire algorithm

Training

To train an LPN system end-to-end, the authors assume access to a dataset like ARC-AGI, where each task consists of input-output pairs generated by the same program. To simulate the test-time condition of predicting the output for a new input, the training procedure reconstructs each output $y_j$ from its input $x_j$ and all the other input-output pairs $\{(x_i, y_i)\}_{i \neq j}$. The pair $(x_j, y_j)$ itself is not encoded while predicting $y_j$; every other input-output pair, along with $x_j$, is used.

When predicting $y_j$, the authors first sample latents $z_i \sim q_\phi(z \mid x_i, y_i)$ from the encoder for all $i \neq j$. They aggregate them by computing their mean, $\bar{z}_j$, and then perform gradient ascent on $\bar{z}_j$ to obtain a refined latent $z_j^*$. Finally, they compute the negative log-likelihood of the correct output $y_j$ using the corresponding input $x_j$ and the refined latent $z_j^*$; that is, they use the cross-entropy loss of the decoder logits with $y_j$ as the label.

The losses used follow the VAE setup above: a reconstruction (cross-entropy) term and a KL regularisation of the encoder's distribution towards the prior:

$$\mathcal{L} = -\sum_{j} \log p_\theta\!\left(y_j \mid x_j, z_j^*\right) \;+\; \beta \sum_{i} D_{\mathrm{KL}}\!\left(q_\phi(z \mid x_i, y_i) \,\|\, p(z)\right)$$
They use the standard re-parameterization trick while training.
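Putting the pieces together, here is a rough sketch of the leave-one-out training loss described above; `encode_pair`, `log_likelihood`, `refine` and the KL weight `beta` are placeholders rather than the authors' exact implementation:

```python
import jax
import jax.numpy as jnp

def lpn_training_loss(encode_pair, log_likelihood, refine, pairs, key, beta=1e-3):
    total = 0.0
    for j, (x_j, y_j) in enumerate(pairs):
        others = [p for i, p in enumerate(pairs) if i != j]     # leave pair j out
        latents, kls = [], []
        for i, (x_i, y_i) in enumerate(others):
            mu, log_var = encode_pair(x_i, y_i)
            eps = jax.random.normal(jax.random.fold_in(key, j * 100 + i), mu.shape)
            latents.append(mu + jnp.exp(0.5 * log_var) * eps)   # reparameterized sample
            kls.append(0.5 * jnp.sum(jnp.exp(log_var) + mu**2 - 1.0 - log_var))
        z_bar = jnp.mean(jnp.stack(latents), axis=0)            # aggregate by the mean
        z_star = refine(z_bar, others)                          # 0-5 gradient-ascent steps
        total += -log_likelihood(z_star, [(x_j, y_j)])          # reconstruction NLL of y_j
        total += beta * sum(kls)                                # KL regularisation
    return total / len(pairs)
```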

Entire training algorithm

Training with gradient ascent to optimize the latent is significantly more expensive, so the authors use only 0-5 gradient-ascent steps during training; at test time, the number of gradient-ascent steps can be increased as needed.

ARC-AGI Experiments

The Architecture

They use a standard encoder-only transformer for the encoder and a decoder-only transformer for auto-regressive decoding. The authors don't use any pre-trained models and train only on the ARC-AGI training dataset.

Overview of the architecture used for ARC-AGI
This image is worth 16x16 words: the entire architecture is essentially self-explanatory from it.

If you are keen on minute implementation details, authors have open-sourced their code. They also provide extensive details in the paper.

Training Loss curves on ARC-AGI
Inference-time scaling: as inference compute increases, performance improves
GA above stands for Gradient Ascent. For test-time latent optimization, they used Adam with a cosine-decay learning-rate scheduler, and as test-time compute is increased, performance improves.
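A rough sketch of such a test-time optimizer using optax; the learning rate and step count here are placeholder values, not the paper's settings:

```python
import jax
import optax

def test_time_optimize(log_likelihood, pairs, z0, num_steps=200, peak_lr=0.1):
    # Adam with a cosine-decay learning-rate schedule, minimising the negative
    # log-likelihood of the demonstration pairs with respect to the latent z.
    schedule = optax.cosine_decay_schedule(init_value=peak_lr, decay_steps=num_steps)
    optimizer = optax.adam(learning_rate=schedule)
    opt_state = optimizer.init(z0)
    neg_ll = lambda z: -log_likelihood(z, pairs)
    z = z0
    for _ in range(num_steps):
        grads = jax.grad(neg_ll)(z)
        updates, opt_state = optimizer.update(grads, opt_state)
        z = optax.apply_updates(z, updates)
    return z
```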
However, performance on the evaluation/validation set is still very poor. They used GA200 (200 gradient-ascent steps) when testing on the private test set, and it scored only 3%. The authors claim that the model's loss had not yet converged and that the model could be improved further, but they lacked the compute needed.
LPN fails to generalise to new tasks, as the ARC-AGI scores show, but it at least learns the tasks it was trained on: it manages to learn and execute 180 programs from the ARC-AGI training set.

Thoughts

I really liked the idea of searching in the latent space. But in this setting, we need a loss function that we can optimise at test time. This is perfect for ARC-AGI, but I don't know how to extend this setup to the broad range of tasks we expect AGI to handle. Alternatively, we can simply scale up test-time compute using LLMs, as done in OpenAI o1-like models, to carry out the search process in general domains without being restricted to narrow tasks.
Clément Bonnet currently works at Ndea, a company founded by François Chollet to research Deep-Learning-guided program synthesis. I'm looking forward to research in this direction. It looks very promising nonetheless.