NVIDIA just made AI detect objects 10x faster by deleting one step.
It's called LocateAnything, and it removes the biggest bottleneck no one else was fixing in vision-language models.
Normally a vision-language model writes each bounding box one coordinate token at a time, so 100 objects can mean thousands of tokens before an answer arrives. NVIDIA scrapped that: its Parallel Box Decoding predicts the whole box in a single forward pass, as one atomic unit.
→ 12.7 boxes/sec on one H100
→ 10x faster than Qwen3-VL
→ +3.8% F1 on LVIS, accuracy up, not down
→ 3B params, runs on one consumer GPU
Treating the box as one unit keeps its coordinates tied together, which is why accuracy climbed instead of falling.
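To make the token math concrete, here is a rough step-counting sketch. It is not NVIDIA's code: the function names, the 4-coordinate box layout, and the `tokens_per_coord` and `extra_tokens_per_box` parameters are assumptions chosen for illustration, and it only counts sequential decoding steps rather than measuring real speed.

```python
# Illustrative step counting only; this is NOT the LocateAnything implementation.
# Assumption: a box is 4 coordinates (x1, y1, x2, y2).
COORDS_PER_BOX = 4


def token_by_token_steps(num_boxes, tokens_per_coord=1, extra_tokens_per_box=0):
    """Sequential decoding steps when every coordinate token is generated one at a time.

    tokens_per_coord: hypothetical number of tokens a tokenizer spends on one
        coordinate (e.g. several digit tokens for a number written as text).
    extra_tokens_per_box: hypothetical overhead such as labels or separators.
    """
    return num_boxes * (COORDS_PER_BOX * tokens_per_coord + extra_tokens_per_box)


def parallel_box_steps(num_boxes, extra_tokens_per_box=0):
    """Sequential decoding steps when each whole box is emitted in one forward pass."""
    return num_boxes * (1 + extra_tokens_per_box)


if __name__ == "__main__":
    num_boxes = 100
    for tokens_per_coord in (1, 3, 5):
        sequential = token_by_token_steps(num_boxes, tokens_per_coord=tokens_per_coord)
        parallel = parallel_box_steps(num_boxes)
        print(
            f"{tokens_per_coord} token(s) per coordinate: "
            f"{sequential} steps token-by-token vs {parallel} steps box-at-a-time"
        )
```

Under these assumed settings, 100 boxes cost between 400 and 2,000 sequential steps token by token, versus 100 steps when each box is one unit. The real speedup depends on the tokenizer, the model, and the hardware, so treat this as intuition for where the savings come from, not as a reproduction of the reported numbers.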
One model handles detection, GUI grounding, OCR, and document understanding, ready for computer-use agents, robotics, and document pipelines.
100% open source: weights, code, demo, and paper are all live.