Scaling Autonomous Research to Thousands of Agents

Scaling autonomous research systems

When we use autonomous research systems, we want high-quality results, and we want them quickly. But current systems leave us waiting hours or even days for results, and often trap themselves within local minima. We believe that scaling autonomous research will let us discover groundbreaking research in minutes.

This post details our learnings from coordinating thousands of concurrent research agents that break world records in mathematics, neuroscience, biology, and theoretical machine learning.

The final system included a planner agent, which coordinated as many worker agents as needed, each of which coordinated hundreds to thousands of subagents. We used this system to set the world record on ProteinGym in 30 minutes.

The final planner, worker, and subagent system architecture
The final system: a planner agent coordinates as many worker agents as needed, each of which coordinates hundreds to thousands of subagents.

Existing systems like autoresearch take the best program at every step and edit it. A major limitation of this design is that it is prone to getting stuck in local minima.

If we were to concurrently launch 100 experiments using autoresearch, it would only keep track of the single best performing program. Instead, if we keep track of multiple interesting candidate programs, we can branch off of each one and launch experiments that lead to more unique and higher-performing results.

The MAP-Elites algorithm is one way to store multiple candidate programs, and it is used by systems such as AlphaEvolve, ShinkaEvolve, and OpenEvolve. It maintains a grid of programs in which each cell holds the best-performing program for one quantile of a tracked characteristic. This has an inherent issue: when we used this approach, our databases had many empty cells.

A regular-grid MAP-Elites archive: candidates concentrate in a few regions, leaving many cells empty.
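
To make the empty-cell problem concrete, here is a minimal sketch of a regular-grid archive. It is not the implementation used by any of the systems named above: the equal-width bins, the two-dimensional descriptor, the random scores, and the names `cell_index` and `insert` are all assumptions made for illustration.

```python
import random

def cell_index(descriptor, bins_per_dim, low=0.0, high=1.0):
    """Map a descriptor vector to a tuple of bin indices on a regular grid."""
    idx = []
    for x in descriptor:
        frac = (x - low) / (high - low)
        idx.append(min(bins_per_dim - 1, max(0, int(frac * bins_per_dim))))
    return tuple(idx)

def insert(archive, program, score, descriptor, bins_per_dim):
    """Keep the program only if its cell is empty or it beats the incumbent."""
    key = cell_index(descriptor, bins_per_dim)
    if key not in archive or score > archive[key][0]:
        archive[key] = (score, program)
        return True
    return False

rng = random.Random(0)
archive = {}
bins = 10
for i in range(500):
    # Hypothetical descriptors concentrated near (0.2, 0.8), clamped to [0, 1]
    d = (min(1, max(0, rng.gauss(0.2, 0.05))),
         min(1, max(0, rng.gauss(0.8, 0.05))))
    insert(archive, f"program_{i}", rng.random(), d, bins)

filled = len(archive)
print(f"{filled} of {bins * bins} cells filled")
```

Because the toy descriptors cluster in one corner of the space, only a small fraction of the 100 cells ever receives a program, which is the pattern described above.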

Here, we observe a highly uneven distribution of programs assigned to cells, so we experimented with tessellation to dynamically assign cells based on clusterings of historical datapoints. This is inspired by the evolutionary algorithm introduced in "Using Centroidal Voronoi Tessellations to Scale Up the Multidimensional Archive of Phenotypic Elites Algorithm" (Vassiliades et al., 2018).

Cluster-Elites in motion: as points stream in, the adaptive archive re-clusters on all history so every niche stays filled.

Using tessellation for our framework, we observed a 36.2% increase in sampling efficiency.
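
One way to picture cluster-based cells is the sketch below. It re-fits centroids to the full history with plain k-means and keeps the best program per nearest centroid. This is an assumption-laden illustration, not the authors' implementation: the choice of Lloyd's k-means, Euclidean distance, re-clustering from scratch on every call, and the names `kmeans`, `nearest`, and `build_archive` are all hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids fitted to the points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(c) / len(g) for c in zip(*g))
    return centroids

def build_archive(history, k):
    """Re-cluster all history, then keep the best program per cluster."""
    centroids = kmeans([d for _, _, d in history], k)
    archive = {}
    for score, program, d in history:
        j = nearest(d, centroids)
        if j not in archive or score > archive[j][0]:
            archive[j] = (score, program)
    return centroids, archive

rng = random.Random(1)
# Hypothetical history of (score, program, descriptor) triples
history = [(rng.random(), f"program_{i}",
            (rng.gauss(0.2, 0.05), rng.gauss(0.8, 0.05)))
           for i in range(500)]
centroids, archive = build_archive(history, k=16)
print(f"{len(archive)} of {len(centroids)} cells filled")
```

Since the centroids are fitted to where candidates actually fall, cells follow the data rather than a fixed grid, so far fewer of them should sit empty on data that concentrates in a few regions.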

Equipped with a way to store several important programs, we iteratively launched 64 concurrent agents, each given a memory of their past experiments, and had them test and improve their ideas. Interestingly, a similar concept has been explored in the traditional evolutionary algorithms literature (Flageat et al., 2023). We hypothesize that many techniques proven in the traditional evolutionary algorithms literature would transfer to autonomous research with LLMs.

Concurrent research agents
Concurrent agents branching from a memory of prior experiments.
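
The loop described above can be sketched roughly as follows. Only the agent count of 64 comes from the post. The thread pool, the three rounds, the archive size of 8, and the function `propose_and_evaluate` (a stand-in for an LLM editing a program and scoring it) are assumptions for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_AGENTS = 64  # from the post; everything else here is illustrative

def propose_and_evaluate(parent, memory, rng):
    """Hypothetical stand-in for an LLM editing `parent` and scoring the result."""
    return {"program": parent["program"] + "+edit",
            "score": parent["score"] + rng.uniform(-0.05, 0.10)}

def agent_step(agent_id, round_id, candidates, memory):
    rng = random.Random(agent_id * 1000 + round_id)  # deterministic toy randomness
    parent = rng.choice(candidates)       # branch from any stored candidate
    child = propose_and_evaluate(parent, memory, rng)
    memory.append(child)                  # each agent remembers its own attempts
    return child

archive = [{"program": "seed", "score": 0.0}]
memories = {a: [] for a in range(NUM_AGENTS)}

for round_id in range(3):
    snapshot = list(archive)              # agents read a fixed view per round
    with ThreadPoolExecutor(max_workers=NUM_AGENTS) as pool:
        futures = [pool.submit(agent_step, a, round_id, snapshot, memories[a])
                   for a in range(NUM_AGENTS)]
        children = [f.result() for f in futures]
    # Keep several candidates rather than only the single best one
    archive = sorted(archive + children,
                     key=lambda c: c["score"], reverse=True)[:8]

print(f"best score after 3 rounds: {archive[0]['score']:.3f}")
```

The toy archive here is a simple top-8 list to keep the sketch short. A cell-based archive like the ones discussed earlier would slot into the same place.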

Using the same model as the OpenEvolve baseline, this led to large gains in both iteration speed and sample efficiency.

Iteration speedup, sample efficiency, and results table for Aster versus the OpenEvolve baseline (circle packing).

Our system set new state-of-the-art results on the Erdős minimum overlap problem, the TriMul GPU kernel, a single-cell RNA denoising task, and the NanoGPT speedrun. It also matched the best human solution on ZAPBench neural-activity prediction using less than 1/190th of the compute.

Single-cell embeddings before and after denoising. Fluorescence image of the larval zebrafish brain.
This system set records on computational biological denoising, neural-activity prediction in the larval zebrafish brain, and more.

To further improve sample efficiency, we used a more complex agent with tool use and file-editing access at every step. This increased both sample efficiency and iteration speedup.

Iteration speedup, sample efficiency, and results table for Aster with a custom agent versus Aster and OpenEvolve
Adding a tool-using custom agent improved iteration speedup and sample efficiency further, relative to Aster alone and the OpenEvolve baseline.

Equipped with a stronger engine, we moved from smaller tasks to genuine machine learning research. We used Aster to find novel optimizers, scoring each one by the training loss of an LLM trained with it on the Dolma pretraining dataset for 20 minutes on an A100 (lower is better). Every experiment used 10 parallel agents for 6 iterations, converging in two hours, and we repeated the setup with three frontier models. Below are the resulting cross-entropy losses of the optimizers that the models generated.

Optimizer search cross-entropy losses
Cross-entropy losses for optimizers generated by several models.
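
As a rough budget check on this setup, under the assumption that each iteration is one 20-minute training run per agent (the post does not spell this out), the numbers work out to:

$$T_{\text{wall}} = 6 \times 20\ \text{min} = 120\ \text{min} = 2\ \text{h}, \qquad N_{\text{runs}} = 10 \times 6 = 60, \qquad 60 \times 20\ \text{min} = 20\ \text{A100-hours per model}$$

This matches the stated two-hour convergence time. The run count and A100-hours follow only if the assumption holds.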

The resulting optimizers all performed well in this setup, but the gains did not hold up under validation, which suggests they overfit to the search setup: at larger scales, the optimizers were equivalent to AdamW. For future work, we hypothesize that the final reward likely needs to be a fit to a scaling ladder, as described in this scaling-law transfer paper.

Effectively coordinating thousands of agents

Our previous approach worked well for tasks with a single narrow goal, for instance searching over optimizers or making architectural modifications to a neuroscience model. However, when we pointed our system at researching hyperparameters, architectures, optimizers, and data simultaneously, it collapsed because it could not compare experiments that made different kinds of changes.

We conceptualize this phenomenon as the "surface area of the code that gets modified." Exploring a small surface area, for example only tuning hyperparameters, risks missing important changes on other surfaces. Exploring a large surface area, for example tuning hyperparameters, searching architectures, and researching optimizers all at once, leaves the system unable to tell which changes led to an increase in benchmark score. The next step was therefore a mechanism to explore several surface areas concurrently and combine the results.

We chose a difficult task, ProteinGym (DMS Substitutions), to test a system that can research multiple surface areas simultaneously. ProteinGym is a benchmark that measures how accurately a model can predict the effects of protein mutations.

A wild-type protein sequence with one residue substituted, and a description of the ProteinGym DMS-substitutions benchmark
The ProteinGym DMS-substitutions task: predicting the fitness effect of single amino-acid substitutions.

Specifically, we chose to make changes to a frozen model, ProSST (K=2048), to improve its performance. We chose a frozen model to eliminate training costs. ProSST scored a 0.507 on the benchmark (Mean Average Spearman), while the best model on the leaderboard scored a 0.518.
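
For readers unfamiliar with the metric, the sketch below computes a Spearman correlation per assay and averages the results. It is a simplification under stated assumptions: the benchmark's actual aggregation may group assays before averaging, the data here is made up, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(pred, truth):
    """Spearman correlation: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rp, rt = average_ranks(pred), average_ranks(truth)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5

def mean_spearman(assays):
    """Plain mean of per-assay Spearman correlations (simplified aggregation)."""
    return sum(spearman(p, t) for p, t in assays) / len(assays)

# Two made-up assays: (predicted mutation scores, measured fitness)
assays = [
    ([-1.2, -0.3, -2.5, -0.9], [0.4, 0.9, 0.1, 0.5]),  # identical ordering
    ([-0.5, -1.0, -1.5], [0.2, 0.8, 0.5]),             # partly reversed
]
print(mean_spearman(assays))
```

Because only ranks matter, any change to the scoring function that reorders mutations can move this metric without retraining the model.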

Using our initial engine, we stalled while attempting to increase our score from 0.507 to 0.511. We wanted to surpass 0.518, the top score, through principled scientific means. However, we still needed a way to split the problem into separate surface areas that could be explored in parallel.

Before dividing the problem, we needed to be able to improve any single surface area from a given prompt. First, we created a single worker agent that manages one surface area, such as an optimizer.

Single worker loop
A worker loop for improving one research surface area.

In effect, this worker loop did the prompt engineering for us: given a simple input prompt, for example "improve the optimizer," it achieved similar results on constrained tasks.

Afterwards, we needed to attack multiple surface areas at once. To do this, we took inspiration from Cursor's work on parallelizing software engineering, which involved a central planner agent that owned multiple subtasks.

Our resulting structure mimics Cursor's:

  1. A planner agent gives each worker agent an overall goal.
  2. Every worker passes a handoff detailing what was done, its learnings, and concerns back to the planner.
  3. The planner synthesizes the best results into the "current best program."
  4. The planner sends the current best program back to the workers to improve from, so that the best ideas from the most promising unexplored directions are combined.

Planner-worker coordination for parallel research.
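
The four steps above can be sketched as a small loop. This is a toy under heavy assumptions, not the actual system: the `Handoff` fields, the fixed per-surface gains, the rule of merging every change with a positive gain, and the use of a thread pool are all hypothetical, and a real worker would coordinate many subagents instead of returning a canned result.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Handoff:
    surface: str       # which surface area the worker owned
    change: str        # what was done
    score_gain: float  # measured effect on the benchmark score
    learnings: str
    concerns: str = ""

# Toy gains per surface area; purely illustrative numbers
TOY_GAINS = {"scoring": 0.010, "architecture": 0.004, "data": -0.002}

def worker(surface, best_program, round_id):
    """Hypothetical worker: improve one surface area, then report back."""
    gain = TOY_GAINS[surface] / (round_id + 1)
    return Handoff(surface, f"{surface}-v{round_id}", gain,
                   learnings=f"tried {surface} edits on top of {sorted(best_program)}")

def planner_synthesize(best_program, handoffs):
    """Merge every change that helped into the current best program."""
    merged = dict(best_program)
    for h in handoffs:
        if h.score_gain > 0:
            merged[h.surface] = h.change
    return merged

surfaces = ["scoring", "architecture", "data"]  # step 1: goals from the planner
best_program = {}
for round_id in range(2):
    with ThreadPoolExecutor(max_workers=len(surfaces)) as pool:
        # step 2: every worker returns a handoff
        handoffs = list(pool.map(
            lambda s: worker(s, best_program, round_id), surfaces))
    # steps 3 and 4: synthesize, then hand the result back for the next round
    best_program = planner_synthesize(best_program, handoffs)

print(best_program)
```

Treating gains as independent and simply merging them is the biggest simplification here. Changes on different surfaces can interact, so a real planner would need to re-evaluate the combined program.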

Using this architecture to improve our ProteinGym score through inference-time changes alone, our system autonomously searched over several surface areas, for instance short extended pretraining and interpretability, and arrived at a solution that changed only the score-ranking function.

This final architecture used 2 planners and 10 workers, each worker using 100x parallelism, for a total of about 1000 concurrent LLM calls. We ran it for 30 minutes, during which our score climbed from 0.507 to 0.524. The system's final result was a simple change: switching the score of every mutation from the log of the model's probability of that mutation:

$$s_i(a) = \log p_i(a)$$

to a score that accounts for the confidence of the model:

$$s_i(a) = \log \sum_{b:p_i(b)\le p_i(a)} p_i(b) - \log \frac{\left|\{b:p_i(b)\le p_i(a)\}\right|}{20}$$

The details of this finding are beyond the scope of this post. It was made by a worker investigating the scoring function, a surface area assigned by the planner model.
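
For concreteness, the two scoring rules can be written out as below. This is a direct transcription of the formulas for a single position, not the authors' code: the function names and the example distribution are made up, and `probs` is assumed to map each of the 20 amino acids to the model's probability at position i.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def log_prob_score(probs, a):
    """Baseline: s_i(a) = log p_i(a)."""
    return math.log(probs[a])

def tail_score(probs, a, alphabet_size=20):
    """New score: log of the probability mass at or below p_i(a),
    minus the log of the fraction of residues in that set."""
    tail = [p for p in probs.values() if p <= probs[a]]
    return math.log(sum(tail)) - math.log(len(tail) / alphabet_size)

# Made-up distribution: three likely residues, 17 equally unlikely ones
probs = {aa: 0.18 / 17 for aa in AMINO_ACIDS}
probs.update({"L": 0.5, "I": 0.2, "V": 0.12})

for aa in ["L", "I", "W"]:
    print(aa, round(log_prob_score(probs, aa), 3), round(tail_score(probs, aa), 3))
```

Algebraically, the new score equals the log of 20 times the mean probability over the set of residues no more likely than a. It is therefore 0 for the most likely residue at any position, and 0 for every residue when the model's distribution is uniform.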

Conclusions

Our overarching takeaway is that scaling the number of concurrent inference calls dramatically speeds up the discovery of significant solutions, turning searches that once took hours or days into ones that finish in minutes. We believe extreme agent parallelization is the future of research: it took our system less than 30 minutes to surface a novel, genuinely interesting finding that advanced the state-of-the-art. Research must move from benchmark-driven to open-ended work, where systems define and pursue their own promising directions rather than optimizing a single fixed metric.

Contributors

  • Emmett Bicker
  • Olivia Long