Test Time Compute and Verifiers



Current AI has been dominated by the paradigm of pretraining and finetuning, across language, vision, and other disciplines. However, various techniques have helped achieve state-of-the-art results on top of the pretraining-finetuning pipeline. Among the myriad of such methods, I will expand on the concept of test time compute and the impact it might have on expanding the frontier of what models can do.

The objectives of this essay are the following:

  1. Show how the impact of test time compute on accuracy can be equivalent to a 30x increase in the model's parameter count.
  2. Hint at the fact that it might be a way to lift the veil of ignorance regarding LLMs, as per a quote from OpenAI’s CEO Sam Altman.

Formalization of Test Time Compute

The idea of test time compute can be traced back to the paper by (Cobbe et al., 2021) from OpenAI. The goal of the paper was to compare the accuracy of finetuning against a proposed verifier architecture for mathematical reasoning. The paper’s contributions can be summarized as: (a) the GSM8K dataset of math problems, which allows benchmarking of advances in mathematical reasoning; (b) a generator-verifier architecture; (c) empirical evidence that the generator-verifier architecture can achieve the same accuracy as a 30x increase in parameter count when compared to a finetuned baseline.


The verifier architecture is pretty simple: for every problem, the generator samples j candidate solutions. These solutions are then fed to a second network, the verifier, which is trained to predict the correctness of each solution. Figure 4 of the original paper diagrams the pipeline, where i indexes each training example in the dataset.
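The pipeline above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `generate_solutions` and `verifier_score` are hypothetical stand-ins for sampling from the generator LM and querying the trained verifier, with random numbers in place of real model outputs.

```python
import random

def generate_solutions(problem, j, rng):
    # Stand-in for sampling j candidate solutions from the generator LM.
    return [f"{problem} -> candidate {k}" for k in range(j)]

def verifier_score(solution, rng):
    # Stand-in for the verifier's predicted probability that a solution
    # is correct (in the paper, a trained scalar head on the LM).
    return rng.random()

def best_of_j(problem, j, seed=0):
    # Generate j solutions, score each with the verifier, and return
    # the candidate the verifier ranks highest.
    rng = random.Random(seed)
    candidates = generate_solutions(problem, j, rng)
    return max(candidates, key=lambda s: verifier_score(s, rng))
```

The key design point is that the generator never needs to know which answer is right; ranking is delegated entirely to the verifier at inference time.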

Regarding the generator architecture, the paper uses GPT-3 at two sizes: 6 billion parameters (6B) and 175B. The verifier was also tested at both 6B and 175B, but the most impressive improvement came from the 6B model, as it achieved performance comparable to the 30x larger 175B model.

Regarding the finetuning baseline, the details can be found in the paper, but it’s worth remarking that the generator used with the verifier is finetuned for 2 epochs, as that achieves the best test@100 accuracy (the fraction of problems solved when the model is allowed 100 guesses). This choice of 100 guesses, the value of j used in the paper, is also supported by accompanying experiments.


It’s not rare for the gist of a paper to be summarized in a single figure (as Andrew Ng informally remarked in one of his classes at Stanford), and I believe this paper is no exception. The following is Figure 5 from the original paper, showing how a 6 billion parameter model with verification achieves better performance than a 175B finetuned model. Not only is the verifier model 30x smaller in parameter count, but its scaling curve is also much steeper than that of finetuning.

Test time compute

The idea of test time compute emerges from the number of solutions that the verifier considers at inference. The verifier architecture, in addition to being more promising than the finetuning baseline for mathematical reasoning, turns the amount of compute spent at test time into a tunable knob: the number of solutions the generator samples per problem. In other words, it shifts part of the power of compute from training to inference.
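A back-of-the-envelope model makes this knob concrete. Suppose each sampled solution is independently correct with probability p and, as an idealizing assumption, the verifier always ranks a correct solution first when one exists; then spending more test-time compute (a larger j) directly raises the solve rate:

```python
def solve_probability(p_correct, j):
    # Under the idealized assumptions above, the problem is solved
    # unless all j independent samples are wrong: 1 - (1 - p)^j.
    return 1 - (1 - p_correct) ** j
```

Even a generator that is right only 10% of the time solves the vast majority of problems at j = 100 under this toy model; in practice the paper's verifier is imperfect, so real gains are smaller and eventually flatten, but the direction of the effect is the same.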

Veil of Ignorance

A big problem of foundational models, and of pushing the state of the art, is the immense cost in compute, resources, and consequently time that it takes to achieve better results. In other words, to achieve state-of-the-art results with LLMs, one has to invest in pretraining a humongous model for several months and then test it on the dataset in question, the compute of the latter being infinitely smaller (this is an essay, so yes, I am using *infinitely smaller* in a literary sense). As a result, *test time compute* can be used to shift part of the compute used only for training to inference, making it possible to achieve state-of-the-art results.

There has been quite some speculation recently on OpenAI’s latest developments, in addition to the whole corporate debacle, regarding Q* and lifting the veil of ignorance with a smaller model. One article links the idea of test time computation to developments within OpenAI that reportedly led Ilya Sutskever to shift his attention to AI safety.


In conclusion, I believe that verifiers are a quite interesting way to push the frontier of what models can do by fundamentally shifting the role that compute plays in the deep learning pipeline. Their simple architecture and training also make them a promising avenue for further research.


Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.