
Sumanth (@Sumanth_077) “AI agents trained on a fixed benchmark eventually just memorize it! Once an agen” — TopicDigg

Sumanth
@Sumanth_077
Simplifying LLMs, RAG, Machine Learning & AI Agents for you! • ML Developer Advocate • Shipping Open Source AI apps
Joined July 2021
870 Following    76.6K Followers
AI agents trained on a fixed benchmark eventually just memorize it!

Once an agent learns to pass a fixed set of test scenarios, the benchmark stops teaching it anything new. It's also nothing like the real, messy, unpredictable conditions agents actually operate in.

Patronus AI's Generative Simulators take a different approach. Instead of generating just the task, they co-generate three things together every time.

1. The task itself - what the agent actually needs to do.
2. The world dynamics - how the simulated environment behaves and reacts to each action the agent takes.
3. The reward function - how that specific run gets scored as success or failure.

The reason all three matter together: if only the task changes but the reward function stays fixed, the agent can still find shortcuts that game the scoring.

By generating the task, the environment's behavior, and the grading criteria as one consistent unit, the simulator can keep creating new scenarios and still know how to evaluate them correctly. That's what keeps the environment from going stale.

The reported results: 30-40% model lift on long-horizon tasks, a corpus of 1M+ world data artifacts, 85% UI/UX feature parity with real products in their simulated environments, and 5K+ expert contributors across industries validating these simulations.

Key capabilities:
• Generative Simulators co-generate task, world dynamics, and reward function together
• Percival detects 20+ failure modes across agentic traces
• Lynx for hallucination detection, GLIDER as a 3B parameter judge model
• 30-40% performance lift on long-horizon tasks
• 1M+ world data artifacts across domains

I've shared the link in the replies!
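To make the co-generation idea concrete, here is a toy sketch in Python. It is not Patronus AI's implementation or API; every name in it (`Scenario`, `generate_scenario`, `run_episode`, `adaptive_policy`) is hypothetical, and the "world" is just a counter. It only illustrates the point made above: the task, the world dynamics, and the reward function are drawn together from one seed, so each fresh scenario comes with grading that matches it.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """One co-generated unit: task text, world dynamics, and reward (all hypothetical)."""
    task: str
    target: int
    increment: int
    step: Callable[[int, str], int]   # world dynamics: (state, action) -> next state
    reward: Callable[[int], float]    # scores the final state of this specific run


def generate_scenario(seed: int) -> Scenario:
    """Draw task, dynamics, and reward from the same seed so they stay consistent."""
    rng = random.Random(seed)
    target = rng.randint(5, 20)       # what the agent must reach
    increment = rng.randint(1, 3)     # how the world reacts to "inc"

    def step(state: int, action: str) -> int:
        if action == "inc":
            return state + increment
        if action == "dec":
            return state - 1
        return state                  # unknown actions leave the world unchanged

    def reward(final_state: int) -> float:
        # The grader is generated alongside the task, so it checks this run's target.
        return 1.0 if final_state == target else 0.0

    task = f"Starting from 0, bring the counter to {target}."
    return Scenario(task, target, increment, step, reward)


def run_episode(scenario: Scenario, policy: Callable[[int, Scenario], str],
                max_steps: int = 50) -> float:
    """Roll the policy out in the scenario's world and score the final state."""
    state = 0
    for _ in range(max_steps):
        action = policy(state, scenario)
        if action == "stop":
            break
        state = scenario.step(state, action)
    return scenario.reward(state)


def adaptive_policy(state: int, scenario: Scenario) -> str:
    """A policy that reads the scenario instead of replaying a memorized answer."""
    if state < scenario.target:
        return "inc"
    if state > scenario.target:
        return "dec"
    return "stop"


if __name__ == "__main__":
    for seed in range(3):
        scenario = generate_scenario(seed)
        print(scenario.task, "->", run_episode(scenario, adaptive_policy))
```

In this sketch, a policy that always stops immediately scores 0.0 on every seed, because each generated reward checks its own target rather than a fixed answer. A real simulator would generate far richer environments, but the shape is the same: new seed, new task, new dynamics, new grader, all in agreement.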
Today, we’re excited to announce our $50M Series B, led by @GreenfieldVC, with participation from @lightspeedvp and @notablecap. 🚀

At Patronus AI, we develop simulations and evals to train and improve AI. The first phase of AI was built on static benchmarks, but that era is over. As agents are used to solve longer and longer tasks, they need to practice in dynamic, living worlds to get better. Simulations are the critical infrastructure powering this next phase.

As a company, we’re behind the most influential research and products in AI evaluation, like FinanceBench, Lynx, and Percival. And things have moved at the speed of light since.⚡ We partner with the world's leading frontier AI labs and enterprises, and our revenue has grown more than 15x over the past year.

Additionally, today, we’re introducing a preview of the first Digital World Model for AI agent training and simulation: Patronus-DWM. Digital World Models are language diffusion world models that predict realistic environment behaviors and steer agent actions across digital workflows.

Just as physical world models predict how objects move through space, we’re developing the equivalent for the digital world: predicting how agents act in digital workflows, then using that to scale the creation of high-quality training data for LLMs. Digital World Models help us push the frontier of ultra long horizon workflows, and unlock a new class of self-improving RL environments. This is our scalable approach to simulating all of the world’s intelligence.

The round was also joined by @datadoghq, @SamsungVentures, @gokulr, @factorialcap, and a large cohort of amazing AI leaders across @AnthropicAI, @OpenAI, @GoogleDeepMind, @nvidia, @Recursive_SI, and more.✨

It has been the ride of a lifetime. But we’re just getting started. The best is yet to come.

"Do not go gentle into that good night, Rage, rage against the dying of the light" - Dylan Thomas (1954)