Saining Xie (@sainingxie) “📸latest in our cambrian series: cambrian-p, p for pose. i think pose is probabl”

2026.05.27 02:12

📸latest in our cambrian series: cambrian-p, p for pose. i think pose is probably the minimal sufficient 3d signal (and it’s easy to get!) that we need for robust video multimodal models -- jointly modeling frames and pose turns image sequences into a globally grounded structure.

Jihan Yang (@jihanyang13)

2026.05.26 23:14

Camera pose matters for video understanding! Today's MLLMs excel at recognizing activities, but still struggle with the underlying space and ego/object dynamics in video. We trace this gap to a missing piece: camera pose. Introducing Cambrian-P: a multimodal LLM natively grounded in camera pose. (1/n)
