Hao Zhang (@HaoZhang623)

2026.06.05 07:58

I have a hypothesis about the current 3D vs non-3D debate: perhaps the endgame is that these two paths eventually merge.

Non-3D approaches currently feel incredibly strong. Scale data, scale compute, scale parameters, and progress comes surprisingly quickly: 20 → 40 → 60 → 80. Realism improves. Motion improves. Benchmarks improve.

But going from 80 → 100 feels fundamentally different. Physical correctness, multiview consistency, object permanence, contact dynamics, and long-horizon reasoning feel much harder. My intuition is that 0 → 80 is mostly a scaling problem, while 80 → 100 may be a world-state problem.

Current large-scale data gives us enormous amounts of observations: pixels, videos, actions. But it gives us much less geometry, depth, pose, contact, physical constraints, or object state. As models become stronger, perhaps the bottleneck slowly shifts from “Can models fit the data?” to “Does the data contain enough world state?”

This is why I think 3D matters, not necessarily as the final representation, but as data infrastructure: multiview capture, simulation, synthetic interaction, counterfactual rollouts, state supervision. These systems don’t simply create more data; they create data with higher information density.

Which creates an interesting possibility: maybe non-3D systems win early because scaling observations is easy, while 3D systems catch up later because scaling world state is harder.

And eventually, perhaps the distinction disappears. Sufficiently strong non-3D systems may need to implicitly learn world structure, while sufficiently strong 3D systems must learn appearance, dynamics, and semantics.

Perhaps the real question is not “3D vs non-3D?” but “How do we scale world states?” Before intelligence can scale, the data engine needs to scale first.
