Arena.ai (@arena)

Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring →
216 Following    169.9K Followers
Exciting news: GLM-5.2 (Max) ranks #2 in Code Arena: Frontend, with +29pt over Claude Opus 4.7 (Thinking) and only behind Fable 5! GLM-5.2 is the best open model vs Kimi-K2.6 and Minimax-M3 by a large margin.
- #2 React and #4 HTML sub-leaderboards
- Ranks as the top model in nearly all sub categories: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, and Simulations.
Congrats @Zai_org for the incredible milestone!
GLM-5.2 (Max) by @Zai_org ranks #10 on the new Agent Arena leaderboard, closely matching Claude-Opus-4.8 (non-thinking), and is the #1 open model by a wide margin! In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem, and terminal tools to complete complex workflows. The leaderboard measures model performance on outcomes relative to the average model using a causal tracing methodology. Compared to 5.1, GLM-5.2 (Max) climbs from #13 to #10. Its clearest gains are in confirmed task success and in user praise vs. complaint. Bash capabilities and tool hallucination remain stable. There is a tradeoff in steerability compared to the previous model (-6.0% vs. +1.2%). GLM-5.2 remains the same price as GLM-5.1: $1.4/$4.4 per input/output MTokens, with a 1M context window. Huge congrats @Zai_org for the incredible release! See thread for details on how GLM-5.2 (Max) performs across 5 different signals.
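As a rough illustration of the pricing quoted in the post above ($1.4 per million input tokens, $4.4 per million output tokens), the following minimal sketch estimates the cost of a single request. The helper name `estimate_cost_usd` and the token counts are hypothetical and chosen only for this example; actual billing may differ (caching, tiers, and rounding are not modeled).

```python
# Rates as quoted in the post (USD per million tokens).
INPUT_USD_PER_MTOK = 1.4
OUTPUT_USD_PER_MTOK = 4.4


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: approximate cost of one request at the quoted rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # Illustrative token counts, made up for this example.
    cost = estimate_cost_usd(input_tokens=200_000, output_tokens=20_000)
    print(f"Estimated cost: ${cost:.3f}")
```

Because the output rate is roughly three times the input rate, long generations can dominate the bill even when prompts are large.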
Qwen3.7 Max (20250517) debuts at #4 in Code Arena: Frontend, the top-ranked Chinese lab on the board, surpassing GLM-5.1 and now on par with Claude Opus 4.6 on agentic web development tasks. Huge congrats to @Alibaba_Qwen on this achievement!
Qwen3.7 Preview by @Alibaba_Qwen lands on Arena for Text and Vision. In Text Arena, Qwen3.7 Max Preview ranks #13 overall. Alibaba is now the #6 lab in this arena.
- #7 Math
- #9 Expert
- #9 Software & IT
- #10 Coding
In Vision Arena, Qwen3.7 Plus Preview ranks #16 overall, making Alibaba the #5 lab. Congrats to the @Alibaba_Qwen team on the latest progress!
US vs China update. Stanford's AI Index put the US–China gap at 2.7%. Here's what two years of real-world use from the Text Arena shows. Gap three years ago: +278. Today: +29. @AnthropicAI's Claude Opus 4.6 Thinking vs. Baidu's @ErnieforDevs Ernie 5.1 at the top. The US has never lost #1, but the race keeps closing.