Search results for "DDP"

Posts containing DDP
“Very, very innovative and dynamic landscape.” “In China for the world!” This is how Ola Källenius, Chairman of the Board of Management of Mercedes-Benz Group AG, describes the deep collaboration between Mercedes-Benz and China’s tech leaders.
Second-round photo ticket: right around the red mark!
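See you soon!! ٩(ˊᗜˋ*)و✨❤️‍🔥   Fri, Sep 19, 2025: League of Legends - Spirit Blossom Syndra. Sun, Sep 21, 2025: Valorant - Clove.   A fan signing for the first 100 people is planned, so please show lots of interest!!!🥰   #e스포츠# #동대문# #DDP# #GES2025#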
Happy birthday, Aerith 💐🌼🌸 #エアリス誕生祭2025#
Don't you think being dressed in formal festive attire is truly something special? 👏 - Western clothing edition - #RAphoto#
📽️ New 4 hour (lol) video lecture on YouTube: "Let’s reproduce GPT-2 (124M)"

The video ended up so long because it is... comprehensive: we start with an empty file and end up with a GPT-2 (124M) model:
- first we build the GPT-2 network
- then we optimize it to train very fast
- then we set up the training run optimization and hyperparameters by referencing GPT-2 and GPT-3 papers
- then we bring up model evaluation, and
- then cross our fingers and go to sleep.

In the morning we look through the results and enjoy amusing model generations. Our "overnight" run even gets very close to the GPT-3 (124M) model. This video builds on the Zero To Hero series and at times references previous videos. You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

GitHub. The associated GitHub repo contains the full commit history so you can step through all of the code changes in the video, step by step.

Chapters. On a high level Section 1 is building up the network, a lot of this might be review. Section 2 is making the training fast. Section 3 is setting up the run. Section 4 is the results. In more detail:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo
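
Two items in the chapter list are easy to illustrate in a few lines: the "nice number" vocab size (50257 → 50304) and the "warmup + cosine decay" learning rate schedule. The sketch below is not taken from the video or its repo. It assumes that "nice" means rounding up to a multiple of 128 (which does yield 50304), and the function names `round_up` and `get_lr`, along with every schedule parameter, are hypothetical choices for illustration.

```python
import math


def round_up(n, multiple):
    """Round n up to the nearest multiple (assumed meaning of a 'nice' size)."""
    return ((n + multiple - 1) // multiple) * multiple


def get_lr(step, max_lr, min_lr, warmup_steps, max_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    All parameters are illustrative; the post does not state the values used.
    """
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    # Cosine coefficient goes smoothly from 1 down to 0.
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```

With these definitions, `round_up(50257, 128)` returns 50304, matching the numbers in the chapter title. The schedule would typically be queried once per optimizer step and written into the optimizer's learning rate.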
🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention)

On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32, with no flash attention yet, and slightly stale PyTorch (2.1.0).
- It is a direct implementation of the training loop and backpropagation in C/CUDA.
- It compiles and runs instantly. No more "hit run then wait for tens of seconds for unknown reasons", for mountains of inscrutable abstractions to build a Universe.
- It deletes the need for the Python interpreter and a deep learning library.
- It allocates all the memory a single time at the start.
- It's pretty cool.

How: Getting this to work required us to write a lot of custom CUDA kernels, and doing this manually (instead of using Tensor ops of aten/PyTorch and torch.compile etc.) is a bit like programming in assembly. And you spend quality time looking at more assembly (CUDA PTX/SASS). But this also means we get to hyperoptimize the code and possibly explore optimizations that torch.compile might find difficult, which is awesome.

Examples of optimizations that went in over the last few days:
- we're being clever with our memory consumption in the backward pass, only using a few buffers we need to propagate the gradients, saving memory capacity.
- one fused classifier kernel does the last layer forward pass, the loss, and kicks off the backward pass.
- many improvements to all the kernels involved, including e.g. gains from carefully constraining execution within the autoregressive mask in attention
- cuBLAS(Lt) calls for all heavy lifting matmuls, and fused bias accumulation

Big credits to two CUDA experts who appeared from somewhere on the internet to help this open source project, ngc92 and ademeure. We're hanging out on GitHub and on the Discords of CUDAMODE and my NN Zero to Hero.

Next steps:
- more optimizing of our (fp32) kernels, and especially switch to flash attention.
- mixed precision training (fp16 to start).
- multi-gpu training (DDP to start).
- data & evals to set up a proper GPT-2 training run
- 🚀 repro GPT-2 (1.6B) training run.
- more modern architectures etc. (Llama 3?)
- writing, videos, exercises on building all of this from scratch.

Figure 1: eye candy: timing profile of the kernels (one layer). NVIDIA cutlass kernels with solid compute throughput taking up a lot of the running time => nice.
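
For readers who landed here searching for DDP: the "multi-gpu training (DDP to start)" item refers to distributed data parallel training, where each worker holds a full copy of the model, computes gradients on its own shard of the batch, and the gradients are averaged across workers before every replica applies the same update. The following is a minimal single-process simulation of that idea in plain Python, under stated assumptions. It is not llm.c's code and does not use PyTorch's API. The toy model `y = w * x` and the names `grad_mse`, `all_reduce_mean`, and `ddp_step` are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error w.r.t. w for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


def ddp_step(w, xs, ys, world_size, lr):
    """One simulated data-parallel SGD step; returns each replica's new weight.

    Assumes len(xs) is divisible by world_size so all shards are equal in size.
    """
    shard = len(xs) // world_size
    local_grads = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each worker only sees its own slice of the batch.
        local_grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # Gradient synchronization: average the per-worker gradients.
    synced = all_reduce_mean(local_grads)
    # Every replica applies the identical update, so weights stay in sync.
    return [w - lr * g for g in synced]
```

Because the shards are equal in size, the averaged gradient matches the gradient computed on the full batch, so the replicas behave like one worker with a larger batch. In a real multi-GPU setup the averaging would be a collective communication call rather than a Python loop.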