jietang (@jietang) — TopicDigg

jietang@jietang

2026.06.18 12:08

@elonmusk @teortaxesTex won’t take that long

jietang@jietang

2026.06.17 07:36

(Claude, GPT, GLM)

GLM-5.2 Tops Artificial Analysis as the #1 Open-Source Model, Ranking Top 3 Globally

GLM-5.2 launched and went open-source today, delivering a solid scorecard across multiple authoritative third-party benchmarks and arenas.

📊 Artificial Analysis Intelligence Index
A comprehensive evaluation that integrates several authoritative leaderboards spanning coding, reasoning, long context, and more. GLM-5.2 scored 51, ranking among the top of all available models, on par with Claude Opus 4.8, and claiming the #1 spot among open-source models worldwide.

🎨 Code Arena
A real-world head-to-head arena focused on front-end code generation, with Elo rankings produced by blind user voting. GLM-5.2 ranked #2 globally with a score of 1,595.

🏆 DesignArena
A category arena centered on scenarios that combine design and code. GLM-5.2 took the top spot with a score of 1,360.

⚙️ FrontierSWE
A software-engineering benchmark built around the "frontier of human capability," assessing engineering ability across three dimensions: implementation, performance, and research. GLM-5.2 ranked #3 overall.

💪 From front-end development and design-to-code to engineering-grade software tasks, GLM-5.2 consistently lands in the top tier across multiple real-world evaluation scenarios, steadily closing in on the world's strongest models. We'll keep pushing forward in pursuit of an ever-higher ceiling of intelligence.
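
For readers unfamiliar with how an Elo score such as Code Arena's 1,595 turns into a head-to-head expectation, here is a minimal sketch of the textbook Elo formulas. The post only says the rankings are "Elo" from blind votes, so the 400-point scale, the `k_factor` value, and the 1,495 opponent rating are illustrative assumptions, not the arena's actual method or data.

```python
def elo_expected_score(rating_a, rating_b):
    """Textbook Elo expected score of A against B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k_factor=32.0):
    """Update both ratings after one vote. score_a is 1.0 (A wins), 0.5 (tie) or 0.0.

    k_factor is a hypothetical step size; the arena's real value is not stated.
    """
    expected_a = elo_expected_score(rating_a, rating_b)
    delta = k_factor * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# 1595 is the score quoted in the post; 1495 is a made-up opponent rating.
p_win = elo_expected_score(1595, 1495)
print(f"expected score vs. a model rated 100 points lower: {p_win:.3f}")

# Two equally rated models: the winner gains exactly what the loser gives up.
print(elo_update(1500, 1500, score_a=1.0))  # (1516.0, 1484.0)
```

Under these assumptions a 100-point gap corresponds to winning roughly 64% of blind votes, which is why small differences near the top of such a leaderboard should be read cautiously.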

jietang@jietang

2026.06.16 23:12

We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context.

GLM-5.2's new capabilities include:

Solid 1M Context: A solid 1M-token context that stably sustains long-horizon work.
Advanced Coding with Flexible Effort: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency.
Improved Architecture: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length. We also improve GLM-5.2's MTP layer for speculative decoding, increasing the acceptance length by up to 20%.
Pure Open: An MIT open-source license with no regional limits; technical access without borders.

Supporting long-horizon tasks starts with making long context engineering-usable: the model must maintain quality across long, messy coding-agent trajectories, not just accept more tokens. A 1M context is easy to claim, but much harder to keep reliable under real engineering pressure. To this end, we substantially expanded 1M-context training for coding-agent scenarios, covering large-scale implementation, automated research, performance optimization, and complex debugging. The result is a long-context system that is not only wide in scope, but solid in execution: a practical substrate for sustained engineering work.

This capability is reflected in GLM-5.2's performance on three long-horizon coding benchmarks.

FrontierSWE measures whether an agent can complete open-ended technical projects at the scale of hours to tens of hours, spanning systems optimization, large-scale code construction, and applied ML research. On this benchmark, GLM-5.2 trails Opus 4.8 by only 1%, while edging out GPT-5.5 by 1% and Opus 4.7 by 11%.

On PostTrainBench, where each agent is given an H100 GPU and evaluated by how much it can improve small models through post-training, GLM-5.2 outperforms both Opus 4.7 and GPT-5.5, ranking second only to Opus 4.8.

On SWE-Marathon, an ultra-long-horizon software engineering benchmark covering tasks such as building compilers, optimizing kernels, and developing production-grade services, GLM-5.2 still has room to grow, trailing Opus 4.8 by 13% while remaining second only to the Opus series.

Across all three benchmarks, GLM-5.2 is the highest-ranked open-source model, showing that its 1M context has translated into practical long-horizon delivery capability.
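
The IndexShare idea mentioned in the announcement (one indexer reused across every four sparse attention layers) can be illustrated with a toy sketch. This is not the GLM-5.2 implementation: `indexer_top_k`, `sparse_attention`, and `run_layers` are hypothetical helpers, the dot-product scoring rule is an assumption, and all layers here share one set of keys and values for brevity. It only shows where the saving comes from: the full-sequence scoring pass runs once per group of layers instead of once per layer.

```python
import math
import random

GROUP_SIZE = 4  # from the post: one indexer shared across every four sparse attention layers


def indexer_top_k(query, keys, k):
    """Hypothetical indexer: score every key against the query, keep the k best positions."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected key positions."""
    logits = [sum(q * x for q, x in zip(query, keys[i])) for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
            for d in range(dim)]


def run_layers(query, keys, values, num_layers, k, share):
    """Run a stack of toy layers; return the final vector and the number of indexer calls."""
    indexer_calls = 0
    selected = None
    hidden = query
    for layer in range(num_layers):
        # Without sharing, every layer re-scores the whole sequence.
        # With sharing, only the first layer of each group does.
        if not share or layer % GROUP_SIZE == 0:
            selected = indexer_top_k(hidden, keys, k)
            indexer_calls += 1
        attended = sparse_attention(hidden, keys, values, selected)
        hidden = [h + a for h, a in zip(hidden, attended)]  # residual connection
    return hidden, indexer_calls


random.seed(0)
dim, seq_len = 8, 64
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
query = [random.gauss(0, 1) for _ in range(dim)]

_, calls_plain = run_layers(query, keys, values, num_layers=8, k=16, share=False)
_, calls_shared = run_layers(query, keys, values, num_layers=8, k=16, share=True)
print(calls_plain, calls_shared)  # 8 2
```

The indexer is the only step whose cost grows with the full sequence length, so skipping it in three of every four layers matters most at very long contexts. The toy counts calls only; it does not attempt to reproduce the 2.9× per-token FLOPs figure quoted in the post.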
