“Very, very innovative and dynamic landscape,”“In China for the world!”
This is how Ola Källenius, Chairman of the Board of Management of Mercedes-Benz Group AG, describes the deep collaboration between Mercedes-Benz and China’s tech leaders.
显示更多
📽️ New 4 hour (lol) video lecture on YouTube:
"Let’s reproduce GPT-2 (124M)"
The video ended up so long because it is... comprehensive: we start with empty file and end up with a GPT-2 (124M) model:
- first we build the GPT-2 network
- then we optimize it to train very fast
- then we set up the training run optimization and hyperparameters by referencing GPT-2 and GPT-3 papers
- then we bring up model evaluation, and
- then cross our fingers and go to sleep.
In the morning we look through the results and enjoy amusing model generations. Our "overnight" run even gets very close to the GPT-3 (124M) model. This video builds on the Zero To Hero series and at times references previous videos. You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.
Github. The associated GitHub repo contains the full commit history so you can step through all of the code changes in the video, step by step.
Chapters.
On a high level Section 1 is building up the network, a lot of this might be review. Section 2 is making the training fast. Section 3 is setting up the run. Section 4 is the results. In more detail:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperpamaters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo
显示更多
🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention)
On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32, with no flash attention yet, and slightly stale PyTorch (2.1.0).
- It is a direct implementation of the training loop and backpropagation in C/CUDA.
- It compiles and runs instantly. No more "hit run then wait for tens of seconds for unknown reasons", for mountains of inscrutable abstractions to build a Universe.
- It deletes the need for the Python interpreter and a deep learning library.
- It allocates all the memory a single time at the start.
- It's pretty cool.
How:
Getting this to work required us to write a lot of custom CUDA kernels, and doing this manually (instead of using Tensor ops of aten/PyTorch and torch.compile etc.) is a bit like programming in assembly. And you spend quality time looking at more assembly (CUDA PTX/SASS). But this also means we get to hyperoptimize the code and possibly explore optimizations that torch.compile might find difficult to, which is awesome. Examples of optimizations that went in over the last few days:
- we're being clever with our memory consumption in the backward pass, only using a few buffers we need to propagate the gradients, saving memory capacity.
- one fused classifier kernel does the last layer forward pass, the loss, and kicks off the backward pass.
- many improvements to all the kernels involved, including e.g. gains from carefully constraining execution within the autoregressive mask in attention
- cuBLAS(Lt) calls for all heavy lifting matmuls, and fused bias accumulation
Big credits to two CUDA experts who appeared from somewhere on the internet to help this open source project, ngc92 and ademeure. We're hanging out of Github and Discords of CUDAMODE and my NN Zero to Hero.
Next steps:
- more optimizing of our (fp32) kernels, and especially switch to flash attention.
- mixed precision training (fp16 to start).
- multi-gpu training (DDP to start).
- data & evals to set up a proper GPT-2 training runs
- 🚀 repro GPT-2 (1.6B) training run.
- more modern architectures etc. (Llama 3?)
- writing, videos, exercises on building all of this from scratch.
Figure 1: eye candy: timing profile of the kernels (one layer). NVIDIA cutlass kernels with solid compute throughput taking up a lot of the running time => nice.
显示更多