
Search results for "harness"

Posts containing "harness"
@turingou Looking at it the other way around: what may be scarce now is a harness for real life.
Honestly, I'm a bit pessimistic. I don't know what's left for me to do with harnesses: when the carriage you want to steer can already drive itself almost anywhere, the act of steering it loses its meaning. The same goes for startups building real-time audio models: ChatGPT only needs to wire its own real-time model into codex in ChatGPT, and in no time you could use a near-full-duplex voice channel to control codex on your Mac/Windows machine or a remote SSH Linux host, and through it any computing resource. We're not even halfway through the year, but for harness startups I think the story is already over…
OpenAI's valuation may be wildly underestimated: it is the only AI company in the world that has a top-tier SOTA model, the most user data for post-training, the best productized cross-platform harness product (codex), and the most abundant compute.
Predictably, once the sandbox concurrency problem is solved, OpenAI will also support running codex in the cloud. Harness engineering, and the products that carry it forward, are being swallowed by the first-layer products sitting just outside the model faster than I imagined, confirming the pessimistic prediction that "if wanman doesn't ship by March, it needn't ship at all."
mcporter 0.11.0 is live. These days I use mcporter mainly as a more stable browser-automation CLI, and for agents to test MCPs without having to restart. I do love that code mode is slowly being adopted by harnesses, so this will be needed less.
I can't understand why Claude bans `claude -p` from using subscription plans. What's the point? Those who want to work around it still can, and those who don't may simply switch to the codex CLI to run their automation harnesses. It's a lose-lose move. Shouldn't the top priority be scaling up their own compute?
Introducing the Cline SDK. We rebuilt the Cline harness for our extension and CLI from scratch using all the lessons learned since creating one of the world's first coding agents in 2024, and are open sourcing it for others to build with today. npm i @cline/sdk 🧵
The first-layer harness products around frontier SOTA models have become so good that I now rarely use plan mode when building new products; in other words, codex/cc can automatically understand intent and execute toward a goal. It's only May 2026, and in just four months progress on harnesses has been astonishing, yet several key problems with cloud harnesses remain unsolved.
wanman is finally moving toward self-hosted git plus self-hosted runners: if harness engineering removes humans from the loop entirely, the speed of many other things becomes the bottleneck. Once that is done, ordinary work executed via wanman run will come with built-in version control; from that angle, knowledge work and software engineering are really the same thing.
Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform on 3 leading benchmarks, plus token usage, cost, and more.

When developers use AI to code, they're choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance.

The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:
➤ SWE-Bench-Pro-Hard-AA: 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI's SWE-Bench Pro.
➤ Terminal-Bench v2: 84 agentic terminal tasks from the Laude Institute, ranging from system administration and cryptography to machine learning. 5 tasks were filtered out due to environment incompatibility.
➤ SWE-Atlas-QnA: 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers.

Analysis of results:
➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.
➤ Open-weights models are competitive but still trail the leaders: GLM-5.1 in Claude Code is the top open-weights result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.
➤ Gemini 3.1 Pro in Gemini CLI underperforms: it scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that Gemini's performance in Gemini CLI remains a relative weak spot in Google's offering.
➤ Cost per task (at API token pricing) varies by more than 30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task, while GLM-5.1 in Claude Code costs $2.26/task. For both models this was driven by high token usage, and in GPT-5.5's case also by a relatively higher per-token cost.
➤ Token usage varies by more than 3x: GLM-5.1 in Claude Code uses the most tokens at 4.8M/task, followed by Kimi K2.6 at 3.7M/task and DeepSeek V4 Pro at 3.5M/task. GPT-5.5 in Codex uses 2.8M tokens/task, substantially more than Opus 4.7 in Claude Code at 1.7M/task. In GLM-5.1's case, higher token usage, cost, and execution time were partly driven by the model entering loops on some tasks.
➤ Cache hit rates remain high but vary materially: they range from 80% to 96% across combinations. Provider routing, harness prompt structure, and cache behavior can materially change the economics of running the same model, given that cached inputs are typically priced at less than 50% of regular input tokens.
➤ Time per task varies by more than 7x: Opus 4.7 in Claude Code is fastest at ~6 minutes/task, while Kimi K2.6 in Claude Code is slowest at ~40 minutes/task. This reflects differences in average turns per task, token usage, and API serving speed: Opus 4.7 needed materially fewer turns to complete a task than any other model, while Kimi K2.6 needed the most.
➤ Cursor made real progress with Composer 2: it scores 48 in Cursor CLI, near the leading open-weights results, while being the cheapest combination measured at $0.07/task. Cursor has stated Composer 2 is built from Kimi K2.5, showing they have made substantial post-training gains.

This is just the start. We are planning to add additional agents (both harnesses and models). Let us know what you would like to see added next.
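The cache-economics point above can be made concrete with a small sketch: blending cached and uncached input-token prices by the cache hit rate. The base price, hit rates, and 50% cache discount below are illustrative assumptions, not figures from the index itself:

```python
# Hypothetical illustration of cache economics. All prices and rates are
# assumed values, not Artificial Analysis's actual methodology or data.

def effective_input_price(base_price_per_mtok: float,
                          cache_hit_rate: float,
                          cache_discount: float) -> float:
    """Blended input-token price per million tokens.

    cache_hit_rate: fraction of input tokens served from cache (0..1).
    cache_discount: cached-token price as a fraction of the base price
                    (the post notes cached inputs typically cost <50%).
    """
    cached = cache_hit_rate * base_price_per_mtok * cache_discount
    uncached = (1.0 - cache_hit_rate) * base_price_per_mtok
    return cached + uncached

# Assumed base price of $3.00 per million input tokens, 50% cache discount:
low_hit = effective_input_price(3.00, 0.80, 0.5)   # 80% hit rate -> $1.80/Mtok
high_hit = effective_input_price(3.00, 0.96, 0.5)  # 96% hit rate -> $1.56/Mtok
```

Under these assumptions, moving from the low end (80%) to the high end (96%) of the reported hit-rate range cuts the blended input price by about 13%, which is why harness prompt structure and provider routing can materially shift per-task cost even with the same model.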