Nice new paper improving image generation and (generative) unsupervised representation learning: it uses a ViT instead of a CNN to improve VQGAN into a new "ViT-VQGAN" image patch tokenizer. The tokens are then fed into a GPT for image generation or linear probing.
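
For intuition, here is a rough sketch of the tokenization step only: cut an image into patches and replace each patch with the id of its nearest codebook vector. It is a toy under loud assumptions, not the paper's method. The names `patchify` and `quantize` are hypothetical, the codebook is random rather than learned, and raw pixel patches stand in for the ViT encoder features that the real tokenizer would quantize.

```python
import random


def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping
    patch vectors in row-major order. Assumes H and W are multiples of patch."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry
    (squared Euclidean distance). The indices are the discrete tokens."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(v, codebook[k]))
            for v in vectors]


rng = random.Random(0)
# Toy 8x8 grayscale image; a real model would encode patches with a ViT first.
image = [[rng.random() for _ in range(8)] for _ in range(8)]
# Hypothetical codebook: 32 random codes of dimension 16 (one 4x4 patch).
codebook = [[rng.random() for _ in range(16)] for _ in range(32)]

tokens = quantize(patchify(image, 4), codebook)  # 4 patches -> 4 token ids
print(tokens)
```

The resulting list of integer ids is the kind of sequence an autoregressive model such as a GPT can be trained on, predicting each token from the ones before it.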