九原客 (@9hills)

九原客@9hills

2026.06.26 04:02

Without any skill, I had Codex convert an image into an editable PPT. It actually iterated on its own by computing RMSE, which is kind of interesting.
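For readers unfamiliar with the metric, here is a minimal sketch of an RMSE-driven comparison loop. It is not what Codex actually ran: the function names, the nested-list image format, and the toy pixel values are all assumptions made for illustration.

```python
import math

def rmse(reference, candidate):
    """Root-mean-square error between two equally sized grayscale images.

    Images are nested lists of pixel values (0-255); this format is an
    assumption chosen to keep the sketch dependency-free.
    """
    if len(reference) != len(candidate) or any(
        len(r) != len(c) for r, c in zip(reference, candidate)
    ):
        raise ValueError("images must have the same dimensions")
    squared = [
        (a - b) ** 2
        for r, c in zip(reference, candidate)
        for a, b in zip(r, c)
    ]
    return math.sqrt(sum(squared) / len(squared))

def pick_best(reference, renders):
    """Return (index, score) of the render with the lowest RMSE."""
    scores = [rmse(reference, render) for render in renders]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy 2x2 "source image" and two hypothetical renders of the rebuilt slide.
target = [[0, 255], [255, 0]]
attempts = [
    [[0, 0], [0, 0]],       # first attempt: far from the source
    [[0, 250], [255, 10]],  # later attempt: much closer
]
best_index, best_score = pick_best(target, attempts)
print(best_index, round(best_score, 3))
```

A lower RMSE means the rendered slide is closer, pixel for pixel, to the source image, so an agent can keep whichever revision lowers the score.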

九原客@9hills

2026.06.26 00:55

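This thread matches my own experience. GLM 5.2 has its shortcomings: every task is much slower overall than with Opus, but it is good enough. I hope the next version quickly fixes minor issues such as verbose output and parallel tool calls.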

sridhar@RamaswmySridhar

2026.06.25 22:02

Follow-up to my GLM vs Opus thread: let's talk cost. We ran 103 dbt tasks x 3 trials on each model. Same harness, same tasks.
GLM: 860M tokens
Opus: 439M tokens
That's ~2x. But the "why" is more interesting than the number.

九原客@9hills

2026.06.25 00:36

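LLMs actually run counter to the internet paradigm. In the internet era, marginal cost was essentially zero, so everyone dug as deep as possible into user needs, catering to or even building the user's comfort zone. Even the most committed freeloader could be squeezed for value. Large models, however, have hard costs, so they have to charge; those willing to pay will inevitably be a minority, and inevitably professionals.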

Yö/ttevakt|sama🏴‍☠️🇫🇮🏴@yovartija_nattv

2026.06.24 16:50

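🤣🤣🤣 Just read the comments and you will see why Chinese models are still doing poorly commercially. How many real-world users know what multimodal means, what parameter count means, or what a context window is? Users do not care about any of that; they only care about why the AI cannot do the thing they asked it to do. From that angle, Doubao really is the best all-round Chinese AI.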

九原客 reposted

Bestony | 白宦成@xiqingongzi

2026.06.24 10:22

九原客@9hills

2026.06.24 06:10

@syhily Maybe M3 is just too weak.

九原客@9hills

2026.06.24 06:07

Uh, one of your claims was wrong, so I bailed halfway through.

九原客@9hills

2026.06.24 03:10

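A new habit: Agent as app, using a general-purpose agent (such as hermes) in place of everyday apps. For web clipping, for example, I used to run a self-hosted karakeeper; now I just share the item with the agent and let it maintain a wiki. So far I have migrated my journal, stocks, web clippings, notes, bookkeeping, and more.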

九原客@9hills

2026.06.24 03:00

@ainotebook @thsottiaux I have run into this; it was an IP issue.

九原客@9hills

2026.06.24 02:56

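Agent in Channel: multi-person group chats actually raise a lot of interaction problems. Are there any good open-source projects? I am also looking forward to trying Claude Tag soon to see how they designed it. So far, the only discussion of this problem I know of is the blog from raft build: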

九原客@9hills

2026.06.24 01:15

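Claude has served up Tag, which is just a Slack bot. It appears to solve the various Agent in channel problems quite well (not that they are hard). It shows that this direction is sound; I am already using something similar heavily.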

Claude@claudeai

2026.06.23 17:12

Introducing Claude Tag, a new way for teams to work with Claude. In Slack, Claude joins as a team member with access to the channels and tools you choose. Tag Claude in and delegate tasks to it while you focus on other work.

九原客@9hills

2026.06.23 15:56

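All delayed!
- GPT-5.6 has been delayed and will no longer ship this week. The new target date is around mid-July.
- DeepMind is not satisfied with the current state of 3.5 Pro, and it will no longer launch this month.
- Claude Sonnet 5 is currently available to select enterprise customers through an early access program and is seen as a stopgap, because progress on Mythos/Fable 5 has stalled.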

leo 🐾@synthwavedd

2026.06.23 14:50

🚨 SCOOP(s):
- GPT-5.6 has been delayed and will no longer release this week. New target is ~mid-July.
- DeepMind are not satisfied with the current state of 3.5 Pro and it will no longer launch this month.
- Preparations for the launch of Bidi, OpenAI's new voice model, are underway in ChatGPT and we could see it available as soon as this week.
- Claude Sonnet 5 is currently available for select enterprise customers under an Early Access Program and is seen as a stop-gap as progress on getting Mythos/Fable 5 back out have stalled.
A bit of a disappointing end to the month, but July should prove more fruitful!

九原客@9hills

2026.06.23 15:54

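What is there to hype about doubao seed 2.1 pro? Three obvious bugs at a glance:
1. With a 256K context, it is only good as a sub-agent at best.
2. The cache-hit price is 1/5 of the regular input price, in the same league as GLM 5.2's 1/4, whereas DeepSeek's is 1/100. On top of that, other providers do not charge for cache storage, but Doubao adds a storage fee.
3. Input costs 6 yuan against GLM 5.2's 8 yuan, so there is no real advantage.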
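To see why the cache-hit ratio matters for agent workloads, here is a rough sketch of the blended input cost as a fraction of the uncached input price. Only the 1/5, 1/4 and 1/100 ratios come from the post; the 90% cache-hit rate and the function name are assumptions, and storage fees are ignored.

```python
def effective_input_multiplier(cache_hit_rate, cached_price_ratio):
    """Blended input cost as a fraction of the uncached input price.

    cache_hit_rate: share of input tokens served from cache (0.0 to 1.0).
    cached_price_ratio: cache-hit price divided by the regular input price.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return (1.0 - cache_hit_rate) + cache_hit_rate * cached_price_ratio

# Ratios quoted in the post.
ratios = {
    "doubao seed 2.1 pro": 1 / 5,
    "GLM 5.2": 1 / 4,
    "DeepSeek": 1 / 100,
}

hit_rate = 0.9  # assumed hit rate for a long-running agent session

for name, ratio in ratios.items():
    multiplier = effective_input_multiplier(hit_rate, ratio)
    print(f"{name}: {multiplier:.3f} x uncached input price")
```

Under that assumed hit rate, a 1/100 ratio leaves the bill dominated by the uncached 10% of tokens, while at 1/5 or 1/4 the cached tokens still make up most of the cost.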

九原客@9hills

2026.06.23 14:24

Worth a look if you want to try loop. I would only dare do this on a demo.

Sleep Money Maker@SleepMoneyMaker

2026.06.22 20:29

Been iterating on @tomosman's loop. This one's winning:

/goal produce a verified, code-derived behavioral spec for this web platform, captured in one canonical spreadsheet that carries every feature from spec -> tested -> fixed -> verified.

Why: we need a single source of truth that maps every feature to its expected behavior *as the code implements it*, so that gaps and bugs surface and the platform can be driven to a known-good state. The spreadsheet is the source of truth.

Work on the current repo. Do Phase 0 and Phase 1 under this goal; when the spec is complete, switch into the /loop below to drive testing and remediation. Keep moving through phases without stopping, except at a real checkpoint (defined below).

Phase 0 - Plan (first): Detect the stack, the feature surface (routes, pages, components, API endpoints, background jobs, auth, settings…), and the test infra that already exists (unit/integration/e2e, browser automation, seeds/fixtures, a runnable dev server). Propose (a) how you'll inventory features, (b) the spreadsheet schema, and (c) how you'll test in the loop given what's available. Proceed once the plan holds.

Phase 1 - Catalog & spec: Read the code and, for every feature, write a user story + the expected behavior as implemented, citing the file/function. Where the code is ambiguous, or behavior is undefined, log an open question - don't guess. Record every feature as a row in the canonical spreadsheet (create with the xlsx skill). Exit: every discoverable feature has a row.

One row, concretely:

| Area | User story | Expected behavior (from code) | Status | Defects | Type | Notes / source |
|---|---|---|---|---|---|---|
| Auth | As a returning user I want to log in with email+password so I can reach my dashboard | `POST /api/login` validates via bcrypt, sets httpOnly session cookie, 302 -> `/dashboard`; bad creds -> 401 + inline error | Spec'd | - | - | `api/auth/login.ts`, `LoginForm.tsx` |

Canonical artifact: exactly one .xlsx, updated in place across every phase and loop iteration - never fork into per-phase or per-iteration files. Status flows Spec'd -> Tested-Pass / Tested-Fail -> Fixed -> Verified. The main thread is the single writer.

Agentic execution:
- Delegate breadth to subagents: fan feature discovery and per-area testing across subagents so the main thread stays focused.
- Verify by running, not claiming - report real command/test output; state skips and unknowns plainly.
- Checkpoint (pause, ask, end the turn) only for a destructive/irreversible action, a fix needing a genuine product decision, or input only I can give. Otherwise, keep going.
- Self-check at each phase/loop boundary via a fresh-context subagent: re-verify the spreadsheet against the code (Phase 1) and against actual results (each loop pass).

/loop Quality cycle - once the spec is complete, iterate test -> fix -> re-test until clean. Each iteration, in order:
1. Test: exercise every user story not yet Verified against the running app, preferring the strongest method available (browser/e2e automation > existing suites > documented static check only where execution truly isn't possible). Record actual pass/fail in the same spreadsheet; log every defect with its type (functional/logistical or UX). No app-behavior changes in this step.
2. Fix: think hard about root cause, then fix every functional/logistical and UX defect logged this iteration - cause, not symptom. Scope: only logged defects; no new features, no unrelated refactors. Update each row's status.
3. Re-test: re-run every story touched by a fix using the same method; set Verified, or back to Tested-Fail with notes if the fix didn't hold.

Exit when all user stories are Verified and no open functional/UX defects remain. Safety cap: if a story is still failing after 3 full iterations, stop, leave it Tested-Fail with root-cause notes, and report it rather than looping further.

九原客@9hills

2026.06.23 14:15

Fake news. They probably do not know what a 10-million toB deal involves. Put it this way: the bid documents alone run to 2,000 printed pages.

Winson@winsonaibuilder

2026.06.23 10:24

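@vista8 Yes, toB is still the easier business. Inside info: workbuddy Enterprise has more customers than it can handle; orders under 10 million cannot even get into the queue.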

九原客@9hills

2026.06.22 05:32

Last time I covered how to control the local machine from ChatGPT web; another approach is to call ChatGPT web from inside an Agent.

九原客@9hills

2026.06.22 04:25

Confirmed: it is an IP issue, not an account issue. It recovered after I switched IPs.

九原客@9hills

2026.06.22 03:41

My GPT-5.5 Pro has been brutally dumbed down. Renewing Pro next month looks pointless too.

九原客@9hills

2026.06.21 03:13

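The devspace project lets ChatGPT operate on the local machine (it supports loading global and project-level AGENTS.md / skills). It does not feel very useful. Upsides:
1. It uses the web client's model quota.
2. You can use GPT-5.5 Pro.
It is not fast, since every operation has to go through the MCP proxy.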

九原客@9hills

2026.06.20 22:30

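It is rare to see a test methodology that gets everything wrong.
1. With most current inference frameworks, a model's output is not deterministic even on the same device with temperature set to 0. The classic thinkingmachine article has already covered this well, so I will not repeat it. On top of that, distributed inference may not even land on the same device.
2. As I have explained before, modern models have no sense of identity; it depends entirely on the system prompt. (Because "I am xxx" training hurts performance.)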
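One ingredient behind point 1 can be shown in a few lines: floating-point addition is not associative, so the same numbers summed in a different order can give slightly different results. The sketch below is only an illustration of that mechanism, not a reproduction of any inference framework; the logit values and the `greedy_pick` helper are made up.

```python
def greedy_pick(logits):
    """Temperature-0 style decoding: index of the largest logit (first wins ties)."""
    return max(range(len(logits)), key=logits.__getitem__)

# The same three terms, accumulated in two different orders.
sum_left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
sum_right = 0.1 + (0.2 + 0.3)  # 0.6

print(sum_left == sum_right)  # False

# Hypothetical near-tie: token 0 has logit 0.6, token 1's logit is the sum above.
pick_a = greedy_pick([0.6, sum_left])   # token 1 wins by one ulp
pick_b = greedy_pick([0.6, sum_right])  # exact tie, so token 0 wins
print(pick_a, pick_b)
```

Once a single token flips, every later token is conditioned on a different prefix, which is why comparing temperature=0 outputs verbatim is a fragile fingerprint.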

sukie@sukie234

2026.06.20 07:52

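The GLM-5.2 you bought may not be GLM-5.2 at all

We recently tested the "GLM-5.2" sold by a number of relay (reseller) sites on the market. Most of them turned out not to be GLM-5.2. The common adulteration tricks we saw:
1. Relabeling is the most common: a cheaper model is sold under the "GLM-5.2 / glm-5-2" label. GLM-5.2 supply is currently very tight, so the very low-priced offerings mostly tested out as dsv4flash.
2. Overstated context. GLM-5.2 is officially specified with a 1 million (1M) token context. On many channels, however, once you actually feed in 250K or 300K tokens, the request either times out with an error or the model visibly forgets or truncates context.
3. Downsizing / quantization. The relay compresses the model: benchmark scores look fine in testing, but the weakness shows on long-horizon tasks and multi-file refactoring.
4. Showing only the minimum price plus dynamic routing. The pricing page advertises an attractive lowest price while actual requests are quietly routed to a worse, cheaper backend. The price you see and the model you get are two different things.

II. The full test procedure, so you can run the checks yourself: we got hold of a channel that claims to be "GLM-5.2" at an absurdly low price (about 1/20 of the official price). It is priced so low that it could not even cover electricity, which struck me as suspicious, so we dug in step by step:

First, the price raised suspicion. It lists about $0.07 per million input tokens and $0.22 per million output tokens. That is not even a fraction of the official GLM-5.2 price. An authorized reseller buying at the official price could never offer this. An abnormal price is the first red flag.

Step 1: list the models and make the simplest possible call. The endpoint works, and the model field in the response does say "glm-5.2". But "the response says glm" only shows that the label was applied; it does not show what the model really is. It could be a previous-generation GLM model, or even dsv4flash passed off as one.

Step 2: identity probing. Using different phrasings, we asked it five times "what model are you, and which company trained you?" In four of the five, it identified itself as a DeepSeek-family model (DeepSeek-V3 / R1), and once it explicitly said "I am not GLM, not Zhipu." A name can be changed; self-identity cannot. First piece of hard evidence: it is not GLM at all.

Step 3: context stress test. We ran two tiers. First, we buried a random passphrase in a long document of about 250K tokens and asked for it at the end; it recalled it accurately. Then we buried five interdependent facts in the long document (A equals 7, B equals three times A, C equals B plus 8, and so on) and asked it to compute the final value across passages; it produced a fully correct chained result. This step is key: single-point recall might be faked through "retrieval cheating", but cross-passage integration cannot be, so it really is ingesting 250K tokens and reasoning over them. Conclusion: it is not a small model, and its context exceeds GLM-5.1's 200K, which points to the DeepSeek family.

Step 4: the final controlled experiment (decisive). We took the official DeepSeek API (which happens to offer the genuine deepseek-v4-flash model) and fingerprinted it against this "glm-5-2": the same batch of deterministic temperature=0 prompts, sent to both sides at the same time, with outputs compared one by one. Results:
• For the same "tell a programmer joke" prompt, both sides were identical word for word;
• For the same "are you V3 or V4" prompt, both answered "unsure";
• Even the quirk of failing to recognize itself and reporting as the older DeepSeek-V3 shows up in both the official v4-flash and this "glm-5-2".
In other words, every fingerprint of the genuine official DeepSeek-V4-Flash matches this "GLM-5.2". Case closed: this so-called "GLM-5.2" is DeepSeek-V4-Flash sold under a Zhipu label. It is not a downsized GLM; it is not GLM at all.

Summary:
Identity probing: ask the same question three to five times and check whether the answers are stable, whether they match the official specs, and whether the model reports itself as another vendor's model.
Context stress test: bury a passphrase, then bury several interdependent facts, push past 250K tokens, and see whether it can ingest the input and compute correctly across passages. If it cannot ingest it or answers wrong, it is not the full-strength model.
Fingerprint comparison: take the same temperature=0 prompt and put the outputs of the channel under test and the official vendor side by side. If they are highly consistent, it is the same model; if they do not match, they are two different things.
Economic common sense: selling the official full-strength model at 1/20 of the official price does not add up economically. A "full-strength" model at an absurdly low price can basically be ruled out on sight.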
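The dependent-facts probe from Step 3 could be assembled along these lines. This is a sketch under stated assumptions, not the author's harness: only the first three facts (A, B, C) come from the post, while facts D and E, the filler text, and all function names are hypothetical.

```python
# Each entry: (name, sentence buried in the document, rule computing its value).
CHAIN = [
    ("A", "A equals 7.", lambda v: 7),
    ("B", "B equals three times A.", lambda v: 3 * v["A"]),
    ("C", "C equals B plus 8.", lambda v: v["B"] + 8),
    ("D", "D equals twice C.", lambda v: 2 * v["C"]),    # assumed fact
    ("E", "E equals D minus 5.", lambda v: v["D"] - 5),  # assumed fact
]

def solve(chain):
    """Evaluate the chain in order and return every value by name."""
    values = {}
    for name, _sentence, rule in chain:
        values[name] = rule(values)
    return values

def build_probe(filler_paragraphs, chain):
    """Spread the fact sentences evenly through the filler, then ask for the last value."""
    step = max(1, len(filler_paragraphs) // len(chain))
    pending = [sentence for _name, sentence, _rule in chain]
    parts = []
    for index, paragraph in enumerate(filler_paragraphs, start=1):
        parts.append(paragraph)
        if pending and index % step == 0:
            parts.append(pending.pop(0))
    parts.extend(pending)  # any facts that did not fit go at the end
    last_name = chain[-1][0]
    parts.append(f"Question: what is the value of {last_name}? Answer with a number only.")
    return "\n\n".join(parts)

# Placeholder filler; a real probe would use long text to reach the target token count.
filler = [f"Filler paragraph {i}." for i in range(1, 11)]
prompt = build_probe(filler, CHAIN)
expected = solve(CHAIN)["E"]
print(expected)
```

The model's reply would then be compared with `expected`; a correct answer requires combining facts that sit far apart in the document, which plain single-point recall does not.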

九原客 reposted

Compute King@Compute_King

2026.06.18 23:29

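Open source must win! Once again, kudos to Professor Tang and Zhipu. While some people use the so-called "AI threat" narrative and "safety panic" to dress up technical barriers, Zhipu chose a harder but more respectable path: open-sourcing its capabilities and placing its trust in developers, researchers, and the industry ecosystem! This is nothing like Dario's approach of constantly frowning, amplifying fear, and working to turn the "safety" narrative into a moat for his own large-model company... A truly confident technology company should not prove how strong it is by manufacturing mystery and panic; it should dare to release its models, tools, and capabilities so that more people can use, verify, improve, and build on them. In my view, open source is not simply giving things away for free; it is a form of technical confidence. It means you believe in the power of the ecosystem and in developers' creativity, and ultimately that AI should not be held only by a few closed-source giants. So Professor Tang and Zhipu deserve praise: they did not define AI by manufacturing fear but pushed AI forward through openness. This is the Chinese AI path that truly deserves respect.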

jietang@jietang

2026.06.18 12:08

@elonmusk @teortaxesTex won’t take that long
