
Search results related to "124】M-line"

Content containing 124】M-line
[M-line Music #124] M-line Special 2023: 「青春のセレナーデ」, 「LOVE LIKE CRAZY」, 「すっぴん」, M-line tour diary. MCs: 宮本佳林 and 小片リサ. #夏焼雅# #宮本佳林# #小片リサ# #ビタスイ# #佐藤優樹# #宮崎由加# #小関舞# #稲場愛香# #段原瑠々# #石田亜佑美# #ハロプロ# #mlinemusic#
Example here is the llm.c GPT-3 (124M) training on FineWeb (figure cropped at 250B tokens), we seem to surpass GPT-3 HellaSwag (green line) at ~150B tokens, per paper expected this to be at 300B tokens. Will re-run with FineWeb-Edu. I do want to be a bit careful on conclusions though because HellaSwag is just one eval, mostly targeting English sentences and a multiple choice of their likely continuations in "tricky" settings. It may be that the GPT-2/3 datasets were a lot broader (e.g. more multilingual than FineWeb, or a lot more math/code than FineWeb, etc.). So it's likely we want to expand the set of evals to make more confident statements and comparisons.
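The multiple-choice setup described above can be sketched roughly as follows: each candidate continuation gets a length-normalized score (the mean per-token log-probability under the model), and the highest-scoring candidate counts as the prediction. The log-probabilities are made-up stand-ins for real model outputs, and `mean_logprob`, `pick_continuation` and `accuracy` are hypothetical helpers, not part of llm.c or any eval harness.

```python
from typing import List, Sequence, Tuple


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized score: average log-probability per token."""
    if not token_logprobs:
        raise ValueError("a continuation needs at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def pick_continuation(candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate continuation with the highest mean log-probability."""
    scores = [mean_logprob(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: List[Tuple[Sequence[Sequence[float]], int]]) -> float:
    """Fraction of examples where the picked continuation matches the label."""
    correct = sum(pick_continuation(cands) == label for cands, label in examples)
    return correct / len(examples)


if __name__ == "__main__":
    # Illustrative numbers only: three candidate endings for one context.
    candidates = [[-2.0, -1.0], [-0.5, -0.7, -0.6], [-3.0]]
    print("predicted ending:", pick_continuation(candidates))
```

Normalizing by length keeps long continuations from being penalized just for having more tokens; a real harness may report both normalized and unnormalized accuracy.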
Day 24 of llm.c: we now do multi-GPU training, in bfloat16, with flash attention, directly in ~3000 lines of C/CUDA, and it is FAST! 🚀 We're running ~7% faster than PyTorch nightly, with no asterisks, i.e. this baseline includes all modern & standard bells-and-whistles: mixed precision training, torch compile and flash attention, and manually padding vocab. (Previous comparisons included asterisks like *only inference, or *only fp32 etc.) Compared to the current PyTorch stable release 2.3.0, llm.c is actually ~46% faster.

My point in these comparisons is just to say "llm.c is fast", not to cast any shade on PyTorch. It's really amazing that PyTorch trains this fast in a fully generic way, with ability to cook up and run ~arbitrary neural networks and run them on a ton of platforms. I see the goals and pros and cons of these two projects as different, even complementary. Actually I started llm.c with my upcoming education videos in mind, to explain what PyTorch does for you under the hood.

How we got here over the last ~1.5 weeks - added:
✅ mixed precision training (bfloat16)
✅ many kernel optimizations, including e.g. a FusedClassifier that (unlike current torch.compile) does not materialize the normalized logits.
✅ flash attention (right now from cudnn)
✅ Packed128 data structure that forces the A100 to utilize 128-bit load (LDG.128) and store (STS.128) instructions.

It's now also possible to train multi-GPU - added:
✅ First version of multi-gpu training with MPI+NCCL
✅ Profiling the full training run for NVIDIA Nsight Compute
✅ PR for stage 1 of ZeRO (optimizer state sharding) merging imminently

We're still at "only" 3,000 lines of code of C/CUDA. It's getting a bit less simple, but still a bit better than ~3 million. We also split off the fp32 code base into its own file, which will be pure CUDA kernels only (no cublas or cudnn or etc), and which I think would make a really nice endpoint of a CUDA course. You start with the gpt2.c pure CPU implementation, and see how fast you can make it by the end of the course on GPU, with kernels only and no dependencies.

Our goal now is to create a reliable, clean, tested, minimal, hardened and sufficiently optimized LLM stack that reproduces the GPT-2 miniseries of all model sizes, from 124M to 1.6B, directly in C/CUDA. A lot more detail on: "State of the Union [May 3, 2024]"
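To make the "124M to 1.6B" range concrete, here is a small sketch that counts parameters for the smallest and largest GPT-2 models. The layer counts and widths come from the publicly released GPT-2 checkpoints, not from the post; `GPT2Config` and `count_params` are hypothetical names, and the count assumes the output head shares weights with the token embedding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPT2Config:
    n_layer: int
    n_embd: int
    vocab_size: int = 50257   # GPT-2 BPE vocabulary
    block_size: int = 1024    # maximum context length


def count_params(cfg: GPT2Config) -> int:
    """Parameter count with the LM head tied to the token embedding."""
    d = cfg.n_embd
    # Token embedding + learned position embedding.
    embeddings = cfg.vocab_size * d + cfg.block_size * d
    # Per transformer block:
    #   attention qkv (3d^2 + 3d) + attention projection (d^2 + d)
    #   MLP up (4d^2 + 4d) + MLP down (4d^2 + d)
    #   two LayerNorms (4d)
    per_block = 12 * d * d + 13 * d
    final_layernorm = 2 * d
    return embeddings + cfg.n_layer * per_block + final_layernorm


if __name__ == "__main__":
    for name, cfg in [
        ("GPT-2 small", GPT2Config(n_layer=12, n_embd=768)),
        ("GPT-2 XL", GPT2Config(n_layer=48, n_embd=1600)),
    ]:
        print(f"{name}: {count_params(cfg) / 1e6:.1f}M parameters")
```

The same formula applies to the intermediate sizes by changing `n_layer` and `n_embd`.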
🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention). On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32, with no flash attention yet, and slightly stale PyTorch (2.1.0).
- It is a direct implementation of the training loop and backpropagation in C/CUDA.
- It compiles and runs instantly. No more "hit run then wait for tens of seconds for unknown reasons", for mountains of inscrutable abstractions to build a Universe.
- It deletes the need for the Python interpreter and a deep learning library.
- It allocates all the memory a single time at the start.
- It's pretty cool.

How: Getting this to work required us to write a lot of custom CUDA kernels, and doing this manually (instead of using Tensor ops of aten/PyTorch and torch.compile etc.) is a bit like programming in assembly. And you spend quality time looking at more assembly (CUDA PTX/SASS). But this also means we get to hyperoptimize the code and possibly explore optimizations that torch.compile might find difficult to, which is awesome.

Examples of optimizations that went in over the last few days:
- we're being clever with our memory consumption in the backward pass, only using a few buffers we need to propagate the gradients, saving memory capacity.
- one fused classifier kernel does the last layer forward pass, the loss, and kicks off the backward pass.
- many improvements to all the kernels involved, including e.g. gains from carefully constraining execution within the autoregressive mask in attention
- cuBLAS(Lt) calls for all heavy lifting matmuls, and fused bias accumulation

Big credits to two CUDA experts who appeared from somewhere on the internet to help this open source project, ngc92 and ademeure. We're hanging out on Github and Discords of CUDAMODE and my NN Zero to Hero.

Next steps:
- more optimizing of our (fp32) kernels, and especially switch to flash attention.
- mixed precision training (fp16 to start).
- multi-gpu training (DDP to start).
- data & evals to set up proper GPT-2 training runs
- 🚀 repro GPT-2 (1.6B) training run.
- more modern architectures etc. (Llama 3?)
- writing, videos, exercises on building all of this from scratch.

Figure 1: eye candy: timing profile of the kernels (one layer). NVIDIA cutlass kernels with solid compute throughput taking up a lot of the running time => nice.
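The idea of constraining work to the autoregressive mask can be illustrated with a plain CPU sketch: for query position i only keys 0..i are ever scored, so roughly half of the T×T score matrix is never computed. This is an assumption-laden illustration in pure Python, not the llm.c kernel; `causal_attention` and `score_count` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def causal_attention(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """Single-head attention that only visits positions allowed by the causal mask."""
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for i in range(len(q)):
        # Keys after position i are masked out, so we skip them entirely
        # instead of computing a score and then setting it to -inf.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in range(i + 1)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights)) / norm
            for c in range(len(v[0]))
        ])
    return out


def score_count(seq_len: int) -> int:
    """Number of query-key scores actually computed under the causal mask."""
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    T = 1024
    print(f"causal scores: {score_count(T)} vs full matrix: {T * T}")
```

A GPU kernel would apply the same bound per tile or per thread block rather than per row, but the saving comes from the same observation.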
Didn't tweet nanoGPT yet (quietly getting it to good shape) but it's trending on HN so here it is :) : Aspires to be simplest, fastest repo for training/finetuning medium-sized GPTs. So far confirmed it reproduced GPT-2 (124M). 2 simple files of ~300 lines
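Cloudflare just charged me for last month's bill: $7. I checked, and last month all my vibe products together saw 6M in traffic, used nearly 200 GB of bandwidth, and had 1.24 million unique visitors, all for only $7...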
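They even dare to call 4,900 now

Before starting tonight's evening report I was still looking at BNB's absurd recent price action. It has broken through $1,300, and the latest total supply is around 140 million coins, so the total market cap is close to $180 billion. What does $180 billion mean? As a side-by-side comparison, it is Xiaomi's market cap.

Many people mistake BNB for Binance stock. It isn't; it is just a platform utility token issued by Binance, comparable to the Q coins issued by Tencent, except that BNB has a capped total supply and Q coins do not. The thought that Binance's platform token has reached the market cap of a Xiaomi feels surreal.

For my long-term value positions in blockchain I hoard only three coins: the first is BTC, the second is ETH, and the third is BNB. SOL is not in my long-term allocation. I originally held 1,000 BNB and have been selling in batches since the start of the year, leaving fewer than 500 now. By a rough count I missed out on about $350,000 of gains, but once a plan is made I execute it firmly; whether it was right or wrong can only be settled once the cycle ends.

Tonight I won't write about the Middle East affair. To encourage readers to interact, I turned on "auto-feature" for highlighted comments, and when I logged into the backend today I found readers fighting among themselves. Many can't even manage the basics of presenting facts and reasoning; within three sentences they start labeling each other and making personal attacks, which was quite off-putting.

I don't mind dissent or harsh reviews, otherwise I wouldn't give readers so much freedom to speak, but I overestimated some people's manners in public. The toxic squabbling affects other readers' reading experience, so I'm reining this in for now.

……

It's Nobel season again these two days. There are six prizes each year: Physics, Chemistry, Physiology or Medicine, Literature, Peace, and Economic Sciences. The first five were set out in Nobel's 1895 will; only the economics prize was established later, after 1968, with the prize money provided separately by Sweden's central bank.

The Nobel Prizes have been awarded since 1901, 124 years now, and pay out a lot of prize money every year. This year each prize is about $1 million, so six come to $6 million, and I was curious whether the money would ever run out.

Nobel originally left an estate of 31 million Swedish kronor, certainly a fortune in 1900, with purchasing power of roughly $800 million to $1 billion today. For over a hundred years the Nobel Foundation has kept investing, and the assets have now grown to 5.6 billion kronor, roughly $550 million. Haha, barely keeping pace with inflation. Oh wait, no: every year they take money out for the prizes and also run the award ceremonies and so on, so on its own terms the fund has definitely beaten inflation.

I looked up the Nobel fund's investment allocation: 55% in global equities, 25% in hedge funds, 10% in real estate, and 10% in low-risk bonds and cash. Returns were +11.2% in 2024, +10.7% in 2023, -2% in 2022 and +18.4% in 2021, with an average annualized return of +8.2% over the past 10 years. For a large pool of capital that is not a bad result.

The prizes announced so far this year include Medicine (Mary E. Brunkow, Fred Ramsdell, Shimon Sakaguchi), one of whom is a Japanese scientist. Their research discovered the immune system's regulatory T cells, and immune tolerance is being used to develop new drugs and treatments.

The Physics prize has also been announced (John Clarke, Michel H. Devoret, John M. Martinis). I can't make sense of this one however hard I look; I just know it is related to quantum physics, and I wonder whether the A-share quantum computing sector will ride the attention.

Counting this year's, Japan now has 29 Nobel laureates, the pride of Asia. I'd say people in China, Japan and Korea are about equal in intelligence; Japan wins so often because it became a developed country earliest and was already able to pour large sums into research in the 1970s and 1980s. Nobel Prizes typically lag by about 20 years, so Japanese scientists began winning frequently after 2000.

As China's national strength grows and research investment keeps increasing, an optimistic estimate is that Chinese scientists will start to emerge one after another after 2050.

……

International gold prices touched $4,000 today and the latest quote is sitting at 3,990+; one shiver and it was up there.

Goldman Sachs today raised its gold price target for the end of 2026 from $4,300 to $4,900, an increase of 14%. Goldman sees two streams of money eagerly buying gold: passive inflows into various gold ETFs, and central banks actively building gold positions.

Tied in with this news, China's gold reserves at the end of September stood at 74.06 million ounces, the 11th consecutive month of increases.

In the future young people getting married will be able to buy less and less gold jewelry, but the homes they can afford will get bigger and bigger, so overall they still come out even.

That's all for tonight. Tomorrow is the last day of the long holiday, so you can start preparing mentally for going back to work the day after. One good thing for stock investors, though, is that after the holiday we will most likely be able to go on making money.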
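# $BC

**Report date**: May 16, 2026
**Token name**: BC Coin ($BC)
**Current price**: $0.00858 (+6.24% 24h)
CA: BCNT4t3rv5Hva8RnUtJUJLnxzeFAabcYp8CghC1SmWin
**Market cap**: $85.8M
**Blockchain**: Solana + TON

- **Scale**: 129,000 daily active users, $90.89M in daily wagers, 7.335 million bets per day
- **Product matrix**: 1,000+ slots, live dealer tables, Crash games, original games, sports betting
- **Technology**: built on a Provably Fair system and audited RNG, with instant deposits and withdrawals
- **Compliance**: holds gaming licenses in multiple jurisdictions and follows responsible gambling principles

Distribution method: zero-cost airdrop (no ICO)

| Allocation | Share | Amount |
|------|------|------|
| Liquidity mining | 50% | 5.0B BC |
| Community airdrop | 20% | 2.0B BC |
| LDP (liquidity pool) | 10% | 1.0B BC |
| Advisors | 10% | 1.0B BC |
| Marketing | 10% | 1.0B BC |

**Actual on-chain status (live, May 2026)**:

| Category | Amount | Share | Status |
|------|------|------|------|
| Circulating supply | 3,542,200,432 BC | 35.42% | Freely tradable |
| └─ Staked in BC Engine | ~1,293,000,000 BC | 12.93% | Locked in staking |
| └─ Truly free float | ~2,249,200,432 BC | 22.49% | Actually available to trade |
| Locked supply | 6,200,000,000 BC | 62.00% | Pending release |
| Burned | 257,799,568 BC | 2.57% | Permanently removed |
| **Total** | **10,000,000,000 BC** | **100%** | |

### 3.3 Unlock schedule and dilution pressure

**Sources of the 35.42% already in circulation**:
- Three large liquidity mining unlock events (2024-2025)
- Part of the LDP injected into DEXs
- Early airdrop distribution

**Composition of the 62% locked supply still to be unlocked**:

| Source | Estimated amount | Unlock mechanism | Expected timeline |
|------|---------|---------|-----------|
| Remaining liquidity mining | ~2.0B BC | Unlocked in batches | 2025-2027 |
| LDP reserve | 1.0B BC | DEX liquidity injection | Released as needed |
| Marketing reserve | 1.0B BC | Triggered by marketing campaigns | Ongoing release |
| Advisor allocation | 1.0B BC | **24-month linear unlock** | 2024-2026 |
| Held in distribution wallets | ~0.2B+ BC | Not fully distributed | Irregular |

**Dilution pressure estimate**:

```
Current circulating supply: 3.542B BC
Pending unlock:             5.5B+ BC (conservative estimate)
Dilution multiple:          155% of current circulating supply
Fully diluted valuation (FDV): $85.8M × (100/35.42) = $242.2M
FDV / circulating market cap ratio: 2.82x
```

### Buyback and burn mechanism

**Mechanism design**:
- **Frequency**: executed weekly
- **Funding source**: platform revenue / reserves
- **Method**: buyback on the open market + smart-contract burn
- **Additional burn source**: early-unstake penalty (EARLY UNSTAKE BURN)

**Burn track record**:
- Cumulative burn: 257,799,568 BC (2.57% of total supply)
- Share of circulating supply: 7.28%
- Burn disclosed for 2025: 250 million BC

**The race between deflation and dilution**:

| Metric | Value |
|------|------|
| Annual burn rate (estimate) | 100-200 million BC |
| Total pending unlock | 5.5B+ BC |
| Time needed to fully offset | **27-55 years** |
| Burn-to-unlock ratio | **1:27 to 1:55** |

### Yield estimate (based on historical round data)

**Data source**: actual payout data for BC Engine rounds #891-#910, 20 rounds in total

| Statistic | Value |
|--------|------|
| Average payout pool per round | 2,700 BCD |
| Smallest payout pool (#891) | 1,793.89 BCD |
| Largest payout pool (#896) | 4,450.36 BCD |
| Payout pool range | -33.6% to +64.8% |

**Simulated yield at different staking sizes**:

| Your stake | Share of total staked | Yield per round | Daily yield | Annualized yield | APY |
|-----------|------------|---------|--------|---------|-----|
| 129,260 BC (~$1,108) | 0.01196% | 0.323 BCD | $7.75 | $2,829 | **255%** |

### Early-unstake penalty (deflation accelerator)

BC Engine has an EARLY UNSTAKE BURN mechanism:
- Unstaking early means forfeiting part of the $BC
- The penalized $BC goes straight into the burn pool
- This both protects long-term stakers and accelerates deflation

---

## Platform operating health assessment

### 5.1 Core operating data, 24 hours

| Metric | Value | Industry comparison |
|------|------|---------|
| Total wagered | $90,894,620 | Top tier |
| Online users | 129,321 | Very healthy |
| Number of bets | 7,335,370 | High-frequency activity |
| Winnings paid | $89,407,578 | Payout rate 98.4% |
| Platform gross profit (estimate) | ~$1.49M/day | Strong |
| House edge (estimate) | ~1.6% | Sustainable |

### 5.2 User and liquidity metrics

| Metric | Value | Interpretation |
|------|------|------|
| Number of holders | 486,678 | Very widely distributed, highly decentralized |
| 24h trading volume | $747.2K | Thin liquidity |
| Turnover rate | 0.87% | Low turnover = holders reluctant to sell |
| Staking rate | 36.5% of circulating supply | High staking lock-up reduces sell pressure |

## Valuation analysis

### 6.1 Relative valuation

**Benchmark against traditional casino stocks**:

| Company | Market cap | Annual revenue | Price-to-sales (P/S) |
|------|------|--------|-------------|
| Evolution Gaming | ~$25B | ~$2B | 12.5x |
| Flutter Entertainment | ~$35B | ~$12B | 2.9x |
| DraftKings | ~$15B | ~$4B | 3.75x |
| **($BC)** | **$85.8M** | **~$400M (estimate)** | **0.21x** |

Even on a conservative estimate of $300M in annual revenue, $BC's price-to-sales ratio is only **0.29x**, which is **1/10 to 1/40** of traditional casino stocks.

**Benchmark against crypto casino competitors**:

| Token | Market cap | FDV | FDV/revenue |
|------|------|-----|---------|
| Rollbit (RLB) | ~$300M | ~$600M | ~10x |
| **BC ($BC)** | **$85.8M** | **$242M** | **0.6x** |

$BC's FDV/revenue ratio is only **1/16** of its competitor's.

### 6.2 Absolute valuation (discounted income)

Assumptions:
- Current annualized in-house profit: $219M
- Growth rate: 15% (first 3 years) → 5% (perpetual)
- Discount rate: 20% (crypto risk premium)
- Token dilution: 3 billion BC unlocked over the next 3 years

**DCF valuation**:

```
3-year profit forecast (dilution-adjusted)
Year 1: $219M × 1.15 × 0.85 = $214M
Year 2: $214M × 1.15 × 0.90 = $222M
Year 3: $222M × 1.15 × 0.93 = $237M

Terminal value: $237M × 1.05 / (0.20-0.05) = $1.66B

Enterprise value: ~$1.2-1.5B (wide range)
Value per BC: $0.12-0.15 (before circulation adjustment)
```

**Valuation conclusion**: even after accounting for dilution, a reasonable value range for $BC may be **$0.05-0.15**, so the current $0.00858 may be significantly undervalued. This valuation depends heavily on the sustainability of platform profit and on how dilution is managed.

## Final conclusion

**$BC is a high-risk, high-return "yield-bearing infrastructure token"** whose investment case rests on three pillars:

1. **Very high staking yield (200-380% APY)**: BC Engine's yield mechanism, paid in the BCD stablecoin, is cleverly designed and highly attractive in the current low-rate environment
2. **Very low valuation base (0.2x price-to-sales)**: relative to platform revenue and industry benchmarks, there is room for a 5-20x re-rating
3. **Backed by real platform revenue**: $91M in daily wagers and $200M+ in annualized profit make for solid fundamentals

**But the critical risks cannot be ignored either**:
- **5.5B+ BC pending unlock** = 155% of current circulating supply
- **The burn rate is only 1/27 to 1/55 of the unlock rate**
- **62% of supply is still held by the team/protocol**

@bcgame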
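The dilution and valuation arithmetic in the report can be rechecked with a few lines of code. This is a minimal sketch that only replays the report's own inputs; the function names are hypothetical, and nothing here validates the inputs themselves or the report's conclusions.

```python
def fully_diluted_valuation(market_cap: float, circulating_share: float) -> float:
    """Scale the circulating market cap up to the full token supply."""
    if not 0.0 < circulating_share <= 1.0:
        raise ValueError("circulating_share must be in (0, 1]")
    return market_cap / circulating_share


def years_to_offset(pending_unlock: float, annual_burn: float) -> float:
    """Years of burning needed to cancel out the tokens still to be unlocked."""
    return pending_unlock / annual_burn


def terminal_value(final_year_profit: float, perpetual_growth: float, discount_rate: float) -> float:
    """Gordon-growth terminal value, as used in the report's DCF box."""
    if discount_rate <= perpetual_growth:
        raise ValueError("discount rate must exceed perpetual growth")
    return final_year_profit * (1.0 + perpetual_growth) / (discount_rate - perpetual_growth)


if __name__ == "__main__":
    # Inputs copied from the report: $85.8M market cap, 35.42% circulating,
    # 5.5B BC pending unlock, 100-200 million BC burned per year.
    fdv = fully_diluted_valuation(85.8e6, 0.3542)
    print(f"FDV: ${fdv / 1e6:.1f}M, FDV / market cap: {fdv / 85.8e6:.2f}x")
    print(f"Years to offset unlocks: {years_to_offset(5.5e9, 2e8)} to {years_to_offset(5.5e9, 1e8)}")
    print(f"Terminal value: ${terminal_value(237e6, 0.05, 0.20) / 1e9:.2f}B")
```

Because every output is a simple ratio of the inputs, small changes in the assumed burn rate or discount rate move the results a lot, which is worth keeping in mind when reading the ranges above.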
Remember the llm.c repro of the GPT-2 (124M) training run? It took 45 min on 8xH100. Since then, @kellerjordan0 (and by now many others) have iterated on that extensively in the new modded-nanogpt repo that achieves the same result, now in only 5 min! Love this repo 👏 600 LOC
📽️ New 4 hour (lol) video lecture on YouTube: "Let’s reproduce GPT-2 (124M)"

The video ended up so long because it is... comprehensive: we start with empty file and end up with a GPT-2 (124M) model:
- first we build the GPT-2 network
- then we optimize it to train very fast
- then we set up the training run optimization and hyperparameters by referencing GPT-2 and GPT-3 papers
- then we bring up model evaluation, and
- then cross our fingers and go to sleep.

In the morning we look through the results and enjoy amusing model generations. Our "overnight" run even gets very close to the GPT-3 (124M) model. This video builds on the Zero To Hero series and at times references previous videos. You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

Github. The associated GitHub repo contains the full commit history so you can step through all of the code changes in the video, step by step.

Chapters. On a high level Section 1 is building up the network, a lot of this might be review. Section 2 is making the training fast. Section 3 is setting up the run. Section 4 is the results. In more detail:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo
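Two of the chapters above are easy to sketch in a few lines: rounding the vocabulary up to a "nice" size (50257 → 50304) and a learning-rate schedule with linear warmup followed by cosine decay. This is a minimal sketch, not the code from the video: `pad_vocab` and `lr_at` are hypothetical names, the choice of 128 as the padding multiple is an assumption that happens to reproduce 50304, and the schedule values in the usage lines are placeholders.

```python
import math


def pad_vocab(vocab_size: int, multiple: int = 128) -> int:
    """Round the vocabulary size up to the next multiple (assumed here: 128)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


def lr_at(step: int, max_lr: float, min_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; step 0 already gets a small non-zero rate.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


if __name__ == "__main__":
    print("padded vocab:", pad_vocab(50257))
    # Placeholder schedule values, chosen only for illustration.
    for step in (0, 9, 10, 55, 100):
        print(step, lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=10, max_steps=100))
```

The extra embedding rows created by padding are never indexed by real tokens; they exist only so that the matrix dimensions divide evenly into the tile sizes GPU kernels prefer.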