I started using the concept in 2016 (e.g. in my NIPS 2016 keynote, in which I called it a "world simulator").
I published papers on video prediction in 2016.
This was meant to be a key step to train world models.
Ha & Schmidhuber appeared in 2018.
The slide below is from a talk I gave at Brown in Nov 2017.
Full deck here:
We were hoping to train world models through video prediction.
At the time, we were using generative architectures.
We tried latent-variable models and GAN-style training.
But they never quite worked on natural video.
Around 2021, I realized that predicting at the pixel level was not a good idea.
That's when the JEPA concept emerged: find an abstract representation within which predictions are performed.