BERT is just a Single Text Diffusion Step! (1/n)
When I first read about language diffusion models, I was surprised to find that their training objective is just a generalization of masked language modeling (MLM), something we’ve been doing since BERT came out in 2018.
My first thought was, “can we fine-tune a BERT-like model to do text generation?”
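To make the "generalization of MLM" claim concrete, here is a minimal sketch in plain Python of how the two corruption steps differ. It assumes the common masked-diffusion setup where a masking rate is sampled per example, whereas BERT-style MLM uses a fixed rate of roughly 15%. The helper names (`mask_tokens`, `mlm_corrupt`, `diffusion_corrupt`) are hypothetical, and the sketch leaves out details such as BERT's random-token and keep-as-is replacements and any loss reweighting by the masking rate.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate, rng):
    """Mask each token independently with probability mask_rate.

    Returns the corrupted sequence and the positions the model
    would be trained to predict.
    """
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets.append(i)
        else:
            corrupted.append(tok)
    return corrupted, targets

def mlm_corrupt(tokens, rng, mask_rate=0.15):
    # BERT-style MLM (simplified): one fixed masking rate for every example.
    return mask_tokens(tokens, mask_rate, rng)

def diffusion_corrupt(tokens, rng):
    # Masked-diffusion-style (assumed setup): sample a masking rate t in (0, 1]
    # per example, so the model sees everything from lightly to fully masked text.
    t = 1.0 - rng.random()  # rng.random() is in [0, 1), so t is in (0, 1]
    corrupted, targets = mask_tokens(tokens, t, rng)
    return corrupted, targets, t

if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "the quick brown fox jumps over the lazy dog".split()
    print(mlm_corrupt(tokens, rng))
    print(diffusion_corrupt(tokens, rng))
```

Seen this way, the MLM corruption is the special case where the masking rate is pinned to a single value, which is the sense in which a BERT training step resembles one diffusion step.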