Search posts and users related to DDP

2026.03.23 12:41

“Very, very innovative and dynamic landscape,” “In China for the world!” This is how Ola Källenius, Chairman of the Board of Management of Mercedes-Benz Group AG, describes the deep collaboration between Mercedes-Benz and China’s tech leaders.

0

31

203

52

双拼推友@LostXtui

2026.02.26 12:18

Stay safe

0

88

2.6K

96

JILL‎🤍🩶@JILL_mw

2025.11.24 08:47

Second-round photo ticket: around the red mark!

0

57

3

甜怡ヤンイ댱이🐹✨@DyangYi

2025.09.19 02:06

See you soon 🥹 Anyway, it's Syndra #동대문# #DDP# #e스포츠#

0

16

158

10

甜怡ヤンイ댱이🐹✨@DyangYi

2025.09.18 14:00

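See you soon!! ٩(ˊᗜˋ*)و✨❤️‍🔥
Sept 19, 2025 (Fri): League of Legends - Spirit Blossom Syndra
Sept 21, 2025 (Sun): VALORANT - Clove
A fan signing for the first 100 arrivals is planned, so please show lots of interest!!!🥰
#e스포츠# #동대문# #DDP# #GES2025#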

0

15

202

20

ないる🐰🐾@nairuru

2025.07.10 11:16

💜💭

0

35

1.9K

111

ソフィー 🌸@PeachMilky_Cos

2025.02.06 15:32

Happy birthday, Aerith 💐🌼🌸 #エアリス誕生祭2025#

0

28

4.4K

277

るき@BOOTH開始🌟@G_Ale_2000

2024.06.27 12:30

Don't you think being dressed in your finest is something truly special? 👏 -Western clothing edition- #RAphoto#

0

1

36

2

Andrej Karpathy@karpathy

2024.06.09 23:41

📽️ New 4 hour (lol) video lecture on YouTube: "Let’s reproduce GPT-2 (124M)"

The video ended up so long because it is... comprehensive: we start with empty file and end up with a GPT-2 (124M) model:
- first we build the GPT-2 network
- then we optimize it to train very fast
- then we set up the training run optimization and hyperparameters by referencing GPT-2 and GPT-3 papers
- then we bring up model evaluation, and
- then cross our fingers and go to sleep.

In the morning we look through the results and enjoy amusing model generations. Our "overnight" run even gets very close to the GPT-3 (124M) model. This video builds on the Zero To Hero series and at times references previous videos. You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

Github. The associated GitHub repo contains the full commit history so you can step through all of the code changes in the video, step by step.

Chapters. On a high level Section 1 is building up the network, a lot of this might be review. Section 2 is making the training fast. Section 3 is setting up the run. Section 4 is the results. In more detail:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo
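
The "nice/ugly numbers" chapter title in the post above lists a vocab size change from 50257 to 50304. The post does not say how 50304 was chosen. One plausible reading, assumed here, is that the size is rounded up to a multiple of a power of two. The helper name `round_up_to_multiple` and the multiples 64 and 128 are illustrative assumptions, not taken from the post.

```python
def round_up_to_multiple(n: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= n (hypothetical helper)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


# Vocab size quoted in the post, before padding.
original_vocab = 50257

# Assumed multiples; the post itself does not name one.
padded_64 = round_up_to_multiple(original_vocab, 64)
padded_128 = round_up_to_multiple(original_vocab, 128)

print(padded_64, padded_128)
```

Under this assumption both multiples give the padded size quoted in the chapter title, which is consistent with, though not proof of, that reading.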

0

413

15.4K

2.2K

Andrej Karpathy@karpathy

2024.04.19 18:21

🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention)

On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32, with no flash attention yet, and slightly stale PyTorch (2.1.0).
- It is a direct implementation of the training loop and backpropagation in C/CUDA.
- It compiles and runs instantly. No more "hit run then wait for tens of seconds for unknown reasons", for mountains of inscrutable abstractions to build a Universe.
- It deletes the need for the Python interpreter and a deep learning library.
- It allocates all the memory a single time at the start.
- It's pretty cool.

How: Getting this to work required us to write a lot of custom CUDA kernels, and doing this manually (instead of using Tensor ops of aten/PyTorch and torch.compile etc.) is a bit like programming in assembly. And you spend quality time looking at more assembly (CUDA PTX/SASS). But this also means we get to hyperoptimize the code and possibly explore optimizations that torch.compile might find difficult to, which is awesome.

Examples of optimizations that went in over the last few days:
- we're being clever with our memory consumption in the backward pass, only using a few buffers we need to propagate the gradients, saving memory capacity.
- one fused classifier kernel does the last layer forward pass, the loss, and kicks off the backward pass.
- many improvements to all the kernels involved, including e.g. gains from carefully constraining execution within the autoregressive mask in attention
- cuBLAS(Lt) calls for all heavy lifting matmuls, and fused bias accumulation

Big credits to two CUDA experts who appeared from somewhere on the internet to help this open source project, ngc92 and ademeure. We're hanging out on Github and Discords of CUDAMODE and my NN Zero to Hero.

Next steps:
- more optimizing of our (fp32) kernels, and especially switch to flash attention.
- mixed precision training (fp16 to start).
- multi-gpu training (DDP to start).
- data & evals to set up a proper GPT-2 training run
- 🚀 repro GPT-2 (1.6B) training run.
- more modern architectures etc. (Llama 3?)
- writing, videos, exercises on building all of this from scratch.

Figure 1: eye candy: timing profile of the kernels (one layer). NVIDIA cutlass kernels with solid compute throughput taking up a lot of the running time => nice.
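
The post above lists multi-GPU training via DDP (distributed data parallel) as a next step. As a rough illustration of the general idea only, and not of how llm.c or PyTorch implement it, the sketch below simulates DDP in a single process: each worker computes a gradient on its own equal-sized shard of the batch, and the gradients are then averaged, which stands in for an all-reduce. The toy model `y = w * x`, the data, and all function names are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error with respect to w for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def simulated_ddp_grad(w, xs, ys, world_size):
    """Split the batch evenly across simulated workers and average their gradients."""
    assert len(xs) % world_size == 0, "this sketch assumes equal-sized shards"
    shard = len(xs) // world_size
    local_grads = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(world_size)
    ]
    # Averaging the per-worker gradients stands in for an all-reduce (mean).
    return sum(local_grads) / world_size


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full_batch_grad = grad_mse(0.5, xs, ys)
ddp_grad = simulated_ddp_grad(0.5, xs, ys, world_size=2)
print(full_batch_grad, ddp_grad)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why each worker can apply the same update and keep its copy of the weights in sync.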

0

148

5.1K

521

Search results related to "DDP"