The non-obvious crux of the shift is an empirical finding, emergent only at scale, and well-articulated in the GPT-3 paper ( Basically, Transformers demonstrate the ability of "in-context" learning. At run-time, in the activations. No weight updates.
显示更多