Everyone is talking about self-improving agents. The harder question is how to measure whether an agent is actually getting better.
That is why we built EvoAgentBench: a benchmark for agent self-evolution.
It tests whether agents can learn from past trajectories, extract reusable skills/memory, and improve on held-out tasks across 5 domains:
- information retrieval
- reasoning and problem decomposition
- software engineering
- code implementation
- knowledge work
917 train tasks. 288 test tasks.
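A rough sketch of what that measurement looks like: score the agent on the held-out tasks with no skills, let it extract skills from the train tasks, then score it again. Every name below (`Task`, `Agent`, `success_rate`, `evolution_gain`, `extract_skills`) is a hypothetical placeholder for illustration, not the EvoAgentBench API, and exact-match scoring is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types, not taken from EvoAgentBench:
# a task is (prompt, expected_answer); an agent maps (prompt, skills) to an answer.
Task = Tuple[str, str]
Agent = Callable[[str, Sequence[str]], str]
SkillExtractor = Callable[[Agent, Sequence[Task]], List[str]]


def success_rate(agent: Agent, tasks: Sequence[Task], skills: Sequence[str]) -> float:
    """Fraction of tasks whose answer exactly matches the expected answer."""
    if not tasks:
        raise ValueError("need at least one task")
    solved = sum(agent(prompt, skills) == expected for prompt, expected in tasks)
    return solved / len(tasks)


def evolution_gain(
    agent: Agent,
    extract_skills: SkillExtractor,
    train_tasks: Sequence[Task],
    test_tasks: Sequence[Task],
) -> float:
    """Held-out success with skills learned from train, minus the no-skill baseline."""
    baseline = success_rate(agent, test_tasks, skills=[])
    skills = extract_skills(agent, train_tasks)  # learning only ever sees train tasks
    evolved = success_rate(agent, test_tasks, skills=skills)
    return evolved - baseline
```

The point of the split is that skills are extracted only from train tasks, so any gain on the test tasks reflects transfer rather than memorization.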
In the included Omni-MATH run, skill injection moved the 27B model from 21% to 65% and the 397B model from 25% to 66%.
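To put those numbers in perspective, here is the arithmetic as a small sketch. The before/after percentages are the ones quoted above; the helper name `gains` is made up for illustration.

```python
# Before/after percentages quoted for the Omni-MATH run, keyed by model size.
results = {"27B": (21, 65), "397B": (25, 66)}


def gains(before: float, after: float):
    """Return (absolute gain in percentage points, after/before ratio)."""
    return after - before, after / before


for model, (before, after) in results.items():
    points, ratio = gains(before, after)
    print(f"{model}: +{points} points, {ratio:.1f}x")
# prints:
# 27B: +44 points, 3.1x
# 397B: +41 points, 2.6x
```

Both models land within one point of each other after skill injection, even though they start four points apart.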
EvoAgentBench has 2K+ all-time downloads on Hugging Face. The benchmark card, task splits, evaluation files, and citation are open there.
Self-improving agents need more than vibes. They need measurement.