## Introduction

Current AI has been dominated by the pretraining-finetuning paradigm across language, vision, and other disciplines. However, several techniques layered on top of the pretraining-finetuning pipeline have helped achieve state-of-the-art results. Among the myriad of such methods, I will expand on the concept of test time compute and the impact it might have on expanding the frontier of what models can do.

The objectives of this essay are the following:

- Show how the impact of *test time compute* on accuracy is equivalent to a 30x increase in the model's parameter count.
- Hint at the fact that it might be a way to *lift the veil of ignorance* regarding LLMs, as per the quote of OpenAI’s CEO Sam Altman.

## Formalization of Test Time Compute

The idea of *test time compute* can be traced back to the paper
by (Cobbe et al., 2021) from OpenAI. The goal of the paper was to
compare the accuracy of finetuning against a proposed verifier
architecture for mathematical reasoning. The paper’s contributions
can be summarized as: (a) the GSM8K dataset of math problems,
released to allow benchmarking of advances in mathematical reasoning;
(b) a generator-verifier architecture; and (c) empirical evidence
that the generator-verifier architecture can achieve the same
accuracy as a 30x increase in parameter size when compared to a
finetuned baseline.

### Architecture

The verifier architecture is pretty simple: for every problem, the
*generator* produces *j* candidate solutions. These solutions are
then fed to another network, the *verifier*, which is trained to
predict the correctness of each solution. Figure 4 of the original
paper shows a diagram of the pipeline, where *i* indexes each
training example in the dataset.
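At inference, the pipeline reduces to a best-of-*j* selection: generate candidates, score them, keep the highest-scoring one. The following is a minimal sketch of that selection logic; `generate_solutions` and `verifier_score` are hypothetical stubs standing in for the paper's finetuned generator and trained verifier, not its actual models.

```python
# Hypothetical stubs: in the paper, the generator and verifier are
# finetuned GPT-3 models; here they are placeholders so the best-of-j
# selection logic is runnable on its own.

def generate_solutions(problem, j):
    # Generator stub: sample j candidate solutions for the problem.
    return [f"candidate {k} for {problem}" for k in range(j)]

def verifier_score(problem, solution):
    # Verifier stub: return an estimated probability of correctness.
    # (Dummy heuristic; the real verifier is a trained model.)
    return 1.0 / (1.0 + len(solution))

def solve(problem, j=100):
    # Best-of-j selection: keep the candidate the verifier scores highest.
    candidates = generate_solutions(problem, j)
    return max(candidates, key=lambda s: verifier_score(problem, s))
```

The key design point is that the generator never needs to be right on its first try; it only needs a correct solution *somewhere* among the *j* samples for the verifier to surface it.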

Regarding the generator architecture, the paper uses GPT-3 at two sizes: 6 billion parameters (6B) and 175 billion parameters (175B). The verifier was likewise tested at 6B and 175B, but the most impressive improvement came from the 6B model, as it achieved performance similar to the 30x larger 175B model.

Regarding the finetuning baseline, the details can be found in the paper, but it’s worth remarking that the generator used with the verifier is finetuned for 2 epochs, as this achieves the best test@100 accuracy (the fraction of problems solved when the model is allowed 100 guesses). This value of 100, the number *j* of solutions chosen for the paper, is also backed by supporting experiments.

### Results

It’s not rare for the gist of a paper to be summarized in a single figure (as Andrew Ng informally remarked in one of his classes at Stanford), and I believe that this paper is no exception. The following is Figure 5 from the original paper, showing how a 6 billion parameter model with verification achieves better performance than a 175B finetuned model. Not only is the verifier-based model 30x smaller in parameter count, but its scaling curve is also much steeper than that of finetuning.

### Test time compute

The idea of *test time compute* emerges from the number of solutions
that the verifier considers at inference. The verifier architecture,
in addition to being more promising than the finetuning baseline
for mathematical reasoning, makes it possible to choose the amount
of compute spent at test time on each problem by setting the number
of solutions the generator produces. In other words, it shifts the
role of compute from training alone to inference.
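This knob can be illustrated with a toy best-of-*j* experiment (again with hypothetical stand-ins: random guesses for the generator, distance-to-target for the verifier). The only thing being varied is *j*, i.e. the compute spent at inference; the "model" never changes.

```python
import random

def best_of_j(target, j, rng):
    # "Generator": j random guesses; "verifier": closeness to the target.
    candidates = [rng.randrange(50) for _ in range(j)]
    return min(candidates, key=lambda c: abs(c - target))

def accuracy(j, trials=500, seed=0):
    # Fraction of trials where best-of-j selection lands on the answer.
    rng = random.Random(seed)
    hits = sum(best_of_j(25, j, rng) == 25 for _ in range(trials))
    return hits / trials

# Same "model", different test-time budgets: sampling more candidates
# per problem buys accuracy with inference compute, not parameters.
low, high = accuracy(1), accuracy(50)
```

Even in this caricature, the trade is visible: `accuracy(50)` is far above `accuracy(1)`, with zero additional training.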

## Veil of Ignorance

A big problem with foundation models, and with pushing the state
of the art, is the immense cost in compute, resources, and, as a
result, time that it takes to achieve better results. In other
words, to achieve state-of-the-art results with LLMs, one has to
invest in pretraining a humongous model for several months and then
evaluate it on the dataset in question, the compute for the latter
being *infinitely smaller* (this is an essay, so yes, I am using
*infinitely smaller* in a literary sense). *Test time compute* can
therefore be used to *shift* part of the compute from training to
inference, and in doing so achieve state-of-the-art results.

There has been quite some speculation recently about OpenAI’s latest
developments, in addition to the whole corporate debacle, regarding
Q* and lifting the veil of ignorance while using a smaller model.
In an article, the idea of *test time computation* is linked to
developments within OpenAI that led Ilya Sutskever to shift his
attention to AI safety.

## Conclusion

In conclusion, I believe that verifiers are a quite interesting way to push the frontier of what models can do by fundamentally shifting the role that compute plays in the deep learning pipeline. Their simple architecture and training also make them a promising avenue of research.

## References

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser,
L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., &
Schulman, J. (2021). Training Verifiers to Solve Math Word Problems.
*arXiv preprint arXiv:2110.14168*.