What if we model test time adaptive sampling as MDP?
In our recent work, RL-Guided Adaptive sampling, we model the test time sampling as a MDP. Then we train a 4-layer MLP on CPU as controller.
This lightweight framework dynamically balances answer correctness, latency, and computation cost only rely on light statistics! 🚀
@zhengtoong @ruiliu0 @ChengsongH31219 @hongtuzhu1 @HongtuZ20093
📄 Paper:
💻 Code: