EverMind (@evermind)

2026.06.21 02:08

Everyone is talking about self-improving agents. The harder question is how to measure whether an agent is actually getting better.

That is why we built EvoAgentBench: a benchmark for agent self-evolution.

It tests whether agents can learn from past trajectories, extract reusable skills/memory, and improve on held-out tasks across 5 domains:
- information retrieval
- reasoning and problem decomposition
- software engineering
- code implementation
- knowledge work

917 train tasks. 288 test tasks.

In the included Omni-MATH run, skill injection moved 27B from 21% to 65%, and 397B from 25% to 66%.

EvoAgentBench has 2K+ all-time downloads on Hugging Face. The benchmark card, task splits, evaluation files, and citation are open there.

Self-improving agents need more than vibes. They need measurement.
