EverMind (@evermind) “Everyone is talking about self-improving agents. The harder question is how to m” — TopicDigg

EverMind
@evermind
Self-evolving memory across Agent and platform.
Joined November 2025
21 Following    3.4K Followers
Everyone is talking about self-improving agents. The harder question is how to measure whether an agent is actually getting better.

That is why we built EvoAgentBench: a benchmark for agent self-evolution.

It tests whether agents can learn from past trajectories, extract reusable skills/memory, and improve on held-out tasks across 5 domains:
- information retrieval
- reasoning and problem decomposition
- software engineering
- code implementation
- knowledge work

917 train tasks. 288 test tasks.

In the included Omni-MATH run, skill injection moved 27B from 21% to 65%, and 397B from 25% to 66%.

EvoAgentBench has 2K+ all-time downloads on Hugging Face. The benchmark card, task splits, evaluation files, and citation are open there.

Self-improving agents need more than vibes. They need measurement.
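
To put the quoted Omni-MATH numbers in perspective, here is a minimal sketch of the arithmetic behind them. It is not EvoAgentBench's scoring code: the function names `gain_points` and `relative_gain` are hypothetical, and the only inputs are the four percentages stated in the post.

```python
def gain_points(before_pct: float, after_pct: float) -> float:
    """Absolute improvement in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """How many times higher the score is after skill injection."""
    return after_pct / before_pct


# Scores quoted in the post (percent, before and after skill injection).
# The labels "27B" and "397B" are copied as written; the post does not
# name the underlying models.
quoted_runs = {"27B": (21, 65), "397B": (25, 66)}

for label, (before, after) in quoted_runs.items():
    points = gain_points(before, after)
    ratio = relative_gain(before, after)
    print(f"{label}: +{points} points, {ratio:.2f}x the baseline score")
```

By this arithmetic the 27B run gains 44 percentage points and the 397B run gains 41, so the two models end within one point of each other despite the size difference.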