In this paper, a 7B language model trained with reinforcement learning learns to orchestrate larger frontier models like GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro.
It does so by writing natural-language subtasks, assigning each to one of the workers, and specifying which previous outputs that worker sees in context.
The resulting system outperforms every individual frontier model on benchmarks including GPQA Diamond, LiveCodeBench, and AIME25, while averaging about three model calls per question—fewer than the multi-agent pipelines and self-reflection loops it beats.
The work provides evidence that prompt engineering and pipeline design, currently done by hand in commercial AI products, can be learned end-to-end through reward signals alone.
Read with an AI tutor:
PDF:
显示更多