I'd also like to share a few thoughts from the past six months of building QUEST that didn’t make it into the paper, but might still be useful to the community.
> Mid-training is surprisingly powerful, especially for smaller models.
In our early experiments, we applied mid-training to smaller models like Qwen3-8B and saw surprisingly large gains (~3pp over pure SFT) with only around 100K tasks.
The effect became weaker as model size increased, but for smaller Deep Research agents, carefully designed mid-training tasks can still go a long way.
How should those tasks be designed? That’s a much longer story. If you’re interested, check out the “Unsuccessful Attempts” section in our paper. Honestly, it’s probably my favorite section in the whole paper.
> For agentic RL, infrastructure matters way more than fancy tricks.
This might sound like a cliche in 2026, but I still want to yell it loudly.
We spent almost two months just making our RL pipeline stable. Agentic RL is essentially a giant chain of dependencies. The judge model needs to work. The search and retrieval services need to work. The cache needs to work. And all of them need to keep working for days. A tiny bug in any part of the pipeline can ruin an entire training run.
We ended up spending a surprising amount of time designing fallback mechanisms for everything. Not because it’s elegant, but because otherwise your training will randomly die at 3 AM after running for days. The secret isn’t fancy RL tricks. The secret is making sure your pipeline survives when things inevitably break.
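To make "fallback mechanisms" a bit more concrete, here is a minimal sketch of the general pattern, not our actual pipeline code. The helper name `call_with_fallbacks`, its parameters, and the idea of returning a neutral default when every provider fails are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallbacks(
    providers: Sequence[Callable[[], T]],
    retries: int = 2,
    base_delay: float = 1.0,
    default: Optional[T] = None,
) -> Optional[T]:
    """Try each provider in order; retry each with exponential backoff.

    Hypothetical helper: `providers` could be, for example, a primary judge
    endpoint followed by a backup one. If all of them fail, `default` is
    returned so the caller can skip the sample instead of crashing the run.
    """
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider()
            except Exception:
                # Back off before retrying the same provider; after the last
                # attempt, fall through to the next provider in the list.
                if attempt < retries:
                    time.sleep(base_delay * (2 ** attempt))
    return default
```

Swallowing every exception is deliberate in this sketch: for a multi-day run, losing one sample is usually cheaper than losing the whole job. In practice you would also want to log each failure.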
> Session-level training is probably the reason we can afford long-horizon training.
Everyone working on Deep Research agents knows the pain: context is expensive. Compared to traditional reasoning tasks or short-horizon agents, long-horizon Deep Research training burns through GPU resources much faster.
For most academic groups, GPUs are still the biggest bottleneck. In QUEST, we use session-level training and limit each training instance to 32K tokens. It’s not the most glamorous idea in the paper, but it helps make large-scale long-horizon training much more practical.
> Cache saved us a ridiculous amount in API spend.
We spent a lot of effort building our caching system. At first, it sounded like an engineering optimization. Later, it became a necessity. Many websites visited during data synthesis show up again during RL training. So do many search queries. Without caching, you’re effectively paying for the same information over and over again.
The funny part is that the cache becomes even more valuable when your experiments fail. RL runs crash. Services fail. You restart things. But cached results stay.
In fact, the more failed runs you have, the higher your cache hit rate gets. It’s a slightly sad but comforting story: every failed run leaves behind a few more cache entries for the next one. By our final training run, the cache hit rate had reached ~40%, which translated into a significant reduction in API costs.
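For readers who want a picture of the idea, here is a minimal sketch of a tool-call cache that outlives individual runs. It is an illustration under assumptions, not the QUEST caching system: the `ToolCache` class, its key scheme, and the `fetch` callback are hypothetical names.

```python
import hashlib
import json
from typing import Any, Callable, Dict, MutableMapping, Optional


class ToolCache:
    """Cache tool results (search, page visits) keyed by tool name and arguments."""

    def __init__(self, store: Optional[MutableMapping[str, Any]] = None) -> None:
        # `store` is any dict-like object with string keys. Passing the same
        # persistent store to every run is what lets a crashed run leave its
        # entries behind for the next one.
        self.store: MutableMapping[str, Any] = {} if store is None else store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> str:
        # Sort keys so the same arguments always hash to the same cache key.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool: str, args: Dict[str, Any], fetch: Callable[..., Any]) -> Any:
        key = self._key(tool, args)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        # Only successful results are stored; an exception propagates uncached.
        result = fetch(**args)
        self.store[key] = result
        self.misses += 1
        return result

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Here a plain dict stands in for the store; a disk-backed mapping such as one opened with the standard library's `shelve` module could be passed in instead so entries survive restarts.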
How painful it was to build. How rewarding it was to finish. We hope you’re as excited to meet QUEST as we are to finally share it.
🚀Excited to announce QUEST today!
Benchmark numbers will fade, but the know-how behind them endures.
With QUEST:
See how we synthesize diverse Deep Research tasks through a unified framework at scale.
See how we train Deep Research agents through three stages: Mid-Training → SFT → RL.
See how our caching system dramatically reduces API costs during training.
See how our context management enables unbounded deep research.
We’ve released everything we can!