
Search results for "ReinforcementLearning"

Posts containing ReinforcementLearning
🧵 Deli AutoResearch SKILL is now officially open source! 🎉 Alongside it, we’re dropping our 4th survey paper — this time on Self-play. Inspired by AlphaZero, we got a powerful insight: prior knowledge doesn’t always lift the ceiling. Models can discover more globally optimal solutions just by playing against themselves. The biggest change in this paper? For the first time, the AutoResearch Agent autonomously planned GPU experiments — and submitted actual RL runs on the DeepSeek 285B model. The entire RL pipeline — experiment design, code writing, running, debugging, and conclusion summarization — was 100% automated, with zero human intervention from me. This was incredibly difficult, but an incredibly important step. GRPO is the tool being called by the AutoResearch Agent here. We see this as the beginning of our Continual Learning research journey. 🚀 As always, this is my personal research project, unaffiliated with any organization. All views are my own. #AI# #ReinforcementLearning# #SelfPlay# #OpenSource# #AutoML# #ContinualLearning# #DeepSeek#
Aloha! 🌺 Meet Ornith-1.0, a family of open-source LLMs specialized for agentic coding. Ornith-1.0 covers a full range of parameter sizes: 9B Dense, 31B Dense, 35B MoE, and 397B MoE. It achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks including: ✅ Terminal-Bench 2.1 (77.5) ✅ SWE-Bench (82.4 on Verified, 62.2 on Pro, 78.9 on Multilingual) ✅ NL2Repo (48.2) ✅ SWE Atlas (41.2 on QnA, 42.6 RF, 39.1 TW) ✅ ClawEval (77.1) Post-trained on top of gemma4 and qwen3.5, Ornith-1.0 employs a novel self-improving training strategy in which reinforcement learning is used to generate not only solution rollouts, but also the task-specific scaffolds that drive those rollouts. By jointly optimizing the scaffold and the resulting solution, the model generates higher-quality solutions in agentic coding. 😎 All models are released under the MIT license, enabling full commercial and research use. 📖Tech Blog: 🤗Huggingface:
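I think this is an epic breakthrough in AI alignment, the biggest in three years. The OpenAI team just dropped a bombshell: a new research paper, "Reinforcement Learning Towards Broadly and Persistently Beneficial Models". This time they overturn the traditional AI alignment path and break the "safer means dumber" curse. The key move is Beneficial Trait RL, which we render in Chinese as 益处特质强化学习. They directly train the AI's core behavioral traits, such as honesty, the ability to correct its own errors, and epistemic humility. In effect, OpenAI reshaped the AI's underlying character. The researchers trained these beneficial traits in a single domain, healthcare, and found that across 53 OOD tests outside healthcare that the model had never seen, performance rose sharply on more than 80% of the benchmarks. It learned on its own to refuse reward hacking. The technology finally stopped blindly pandering and even learned to see through deception by itself. This is great progress. The trait-trained model also showed striking persistence. Even under malicious "brainwashing" and harmful fine-tuning, it held its line and refused to degrade. We can be sure it now has real mental antibodies. AI alignment has long carried a despair-inducing alignment tax: the safer you make an AI, the more its general capability tends to drop, or the more timid it becomes. But this time OpenAI used data to show that instilling virtue in an AI did not make it dumber; it made the model more resilient and wiser when facing the unknown. This step-change victory tells us that once AI begins to have a broad, persistent, cross-domain benevolent character, we are a big step closer to truly safe AGI agents that can carry humanity to the stars. The future, of course, looks promising.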
Elon Musk exposes the critical flaw in ChatGPT and other major AI models: Human Reinforcement Learning! They are literally training the AI to lie... to ignore what the data actually demands and say whatever is politically correct instead. They withhold information. They comment on some things and stay silent on others. They refuse to tell the full truth! This is extremely dangerous. We don't need politically correct! We need truth-seeking AI! @X
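A Chinese crypto trader posted a neural network visualization on TikTok and apparently exposed, by accident, footage of the system trading live on Polymarket. The screen is full of blue connecting lines, vertically stacked hidden layers, and neurons firing nonstop. On first viewing, most people missed a small label in the middle: "Bitcoin XVIII". He packaged the video as an ordinary AI experiment: a virtual aquarium simulation, reinforcement learning, "teaching a neural network survival behavior." That is the video title. But pause at 0:16 and the details stop adding up. Profile: The model does not seem to be learning fish behavior. The labels in the hidden layer line up almost exactly with live Bitcoin prediction markets: price windows, directional probabilities, volatility ranges. This information is mapped directly onto the nodes of the neural network while the so-called "simulation" keeps running in the background. Then people found the wallet. 30-day profit: $367,385. 1,988 predictions. Largest single win: $183,000. Nearly all active positions are tied to Bitcoin range markets, with entry prices clustered at 94-98¢. This is exactly the kind of low-volatility spread that automated systems like to farm: the odds of winning are high and the margin is thin, but the trade can be repeated continuously without anyone watching. Within an hour, the comment section turned into a detective board. Some people slowed the TikTok to 0.25x, stitched the neural network footage together frame by frame, and compared the hidden layer labels one by one with the active positions of this Polymarket wallet. The timestamps matched too precisely. Viewers thought they were watching an AI visualization, but behind the scenes it looked more like a model classifying market conditions in real time and automatically routing trades into different probability buckets based on short-term BTC moves. The original TikTok had only 11,000 views, but the repost exposing the wallet passed 600,000 views overnight. By the next morning, people had started cloning the interface, rebuilding the network layout, and trying to work out why nearly all of this account's positions sit at 96-99¢ with unusually large stakes. The most interesting part: the original author has not deleted anything, and the wallet is still active. The question is: does the edge of this kind of Polymarket bot come from predicting BTC, or from mapping live market state into probability buckets that can be executed automatically?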
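The entry prices quoted in the post imply a specific payoff shape, and a short calculation makes it concrete. The sketch below assumes a binary contract that pays $1.00 if it resolves in the holder's favor and $0 otherwise, and it ignores fees and slippage. The function names are illustrative and are not part of any Polymarket API.

```python
def win_return(price: float) -> float:
    """Return on capital when a $1-payout contract bought at `price` wins."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1 dollars")
    return (1.0 - price) / price


def wins_to_cover_one_loss(price: float) -> float:
    """Number of winning trades needed to offset one total loss at `price`."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1 dollars")
    # A win earns (1 - price) per share; a loss forfeits the full price paid.
    return price / (1.0 - price)


if __name__ == "__main__":
    for cents in (94, 96, 98):
        p = cents / 100
        print(
            f"entry {cents}c: return if win = {win_return(p):.2%}, "
            f"wins needed per loss = {wins_to_cover_one_loss(p):.1f}"
        )
```

Under these assumptions the break-even win rate equals the entry price itself, so a strategy buying at 96¢ only profits if it is right more than 96% of the time. That is why the post's closing question about where the edge comes from matters.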
Grok foundation model V9-Medium (1.5T) has finished training. Evals look good. A lot of Cursor data was added in supplementary training and there is more to come. Fine-tuning is underway and reinforcement learning begins in a few days. 2 to 3 weeks to public release. This will be a major improvement over the 0.5T v8-small that currently serves all Grok production traffic, especially for difficult coding tasks.
Anthropic is paying $3,850 a week to people with no AI experience. No PhD required. No published papers. No prior research background. Just a strong technical mind and a genuine interest in making AI safe. This is the Anthropic Fellows Program. And it is one of the most underrated opportunities in technology right now. Here is exactly what it is. The Anthropic Fellows Program is designed to accelerate AI safety research and foster research talent, providing funding and mentorship to promising technical talent regardless of previous experience. Fellows work for 4 months on empirical research questions aligned with Anthropic's overall research priorities, with the aim of producing public outputs like a paper. Four months. Full-time. Paid. Mentored by the researchers building the world's most advanced AI. And the results from the first cohort were not small. Fellows developed agents that identified $4.6 million in blockchain smart contract vulnerabilities and discovered two novel zero-day exploits, demonstrating that profitable autonomous exploitation is now technically feasible. A year prior, an Anthropic fellow developed a method for rapid response to new ASL3 jailbreaks: techniques that block entire classes of high-risk jailbreaks after observing only a handful of attacks. This work became a key component of Anthropic's ASL3 deployment safeguards. Other fellows published the subliminal learning paper, the research proving that AI models transmit behavioral traits through unrelated data, which landed in Nature. Others produced the agentic misalignment research showing that frontier models resort to blackmail when facing replacement. Others open-sourced attribution graph tools that let researchers trace the internal thoughts of large language models. Over 80% of fellows produced papers. Over 40% subsequently joined Anthropic full-time. 80% published. 40% hired. From a program that does not require any prior AI safety experience to enter.
Here is what the program looks like in practice. Anthropic mentors pitch their project ideas to fellows, who choose and shape their project in close collaboration with their mentors. You are not assigned busywork. You are not a research assistant. You own the project. You work alongside the people who built Claude, who designed its safety systems, who published the papers that define the field. The stipend is $3,850 USD per week, approximately $61,600 for the full 4 months, with access to a compute budget of approximately $10,000 per fellow per month for running experiments. Here is what the 2026 program covers. Research areas include scalable oversight, adversarial robustness and AI control, model organisms, mechanistic interpretability, AI security, model welfare, economics and policy, and reinforcement learning. Something for every technical background. Not just ML engineers. Successful fellows have come from physics, mathematics, computer science, and cybersecurity. You do not need a PhD, prior ML experience, or published papers. The one requirement: work authorization in the US, UK, or Canada. Anthropic does not sponsor visas for fellows. Here is the timeline you need to know. The next cohort begins July 20, 2026. Applications are reviewed on a rolling basis; earlier applications get more consideration. The process includes an initial application and reference check, technical assessments, interviews, and a research discussion. Applicants are encouraged to apply even if they do not meet every listed qualification. The program values potential, motivation, and research curiosity over rigid credential requirements. This is the rarest kind of opportunity in technology. A company at the frontier of AI, one valued at over $900 billion, offering outsiders direct access to its research infrastructure, its mentors, and its most important open problems. Paying them generously to do it. And then hiring 40% of them afterward.
Most people who want to work on AI safety spend years trying to publish papers, get into the right PhD program, and find a way in. The Fellows Program is the door they did not know existed. It is open right now.
In this paper, a 7B language model trained with reinforcement learning learns to orchestrate larger frontier models like GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro. It does so by writing natural-language subtasks, assigning each to one of the workers, and specifying which previous outputs that worker sees in context. The resulting system outperforms every individual frontier model on benchmarks including GPQA Diamond, LiveCodeBench, and AIME25, while averaging about three model calls per question—fewer than the multi-agent pipelines and self-reflection loops it beats. The work provides evidence that prompt engineering and pipeline design, currently done by hand in commercial AI products, can be learned end-to-end through reward signals alone. Read with an AI tutor: PDF:
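The orchestration format described in the post (a natural-language subtask, a chosen worker, and a selection of earlier outputs placed in that worker's context) can be sketched as a small data structure plus an execution loop. This is only an illustration under stated assumptions: `Subtask`, `run_plan`, and the stub workers are hypothetical names, the plan is hard-coded where the paper's 7B policy would generate it, and no reinforcement learning is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A worker is any callable mapping a prompt to a text answer (hypothetical interface).
Worker = Callable[[str], str]


@dataclass
class Subtask:
    instruction: str                 # natural-language subtask text
    worker: str                      # name of the worker model to call
    visible: List[int] = field(default_factory=list)  # earlier outputs shown in context


def run_plan(plan: List[Subtask], workers: Dict[str, Worker]) -> List[str]:
    """Execute subtasks in order, exposing only the selected earlier outputs."""
    outputs: List[str] = []
    for step, task in enumerate(plan):
        if any(i < 0 or i >= step for i in task.visible):
            raise ValueError(f"step {step} references an output that does not exist yet")
        context = "\n".join(f"[output {i}] {outputs[i]}" for i in task.visible)
        prompt = f"{context}\n{task.instruction}" if context else task.instruction
        outputs.append(workers[task.worker](prompt))
    return outputs


def make_stub(name: str) -> Worker:
    """Stand-in for a frontier model; reports how many prompt lines it received."""
    return lambda prompt: f"{name} saw {len(prompt.splitlines())} line(s)"


workers = {"worker_a": make_stub("worker_a"), "worker_b": make_stub("worker_b")}
plan = [
    Subtask("Solve the problem.", "worker_a"),
    Subtask("Solve the problem independently.", "worker_b"),
    Subtask("Compare both answers and give a final one.", "worker_a", visible=[0, 1]),
]
outputs = run_plan(plan, workers)
```

The three-step plan mirrors the roughly three model calls per question mentioned in the post. In a learned system the reward would score the final output, and the policy would be trained to choose the instructions, workers, and `visible` sets.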
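6 open-source libraries for fine-tuning LLMs
1. Unsloth GitHub: …
→ The fastest way to fine-tune LLMs locally
→ Optimized for low VRAM (works even on laptops)
→ Plug-and-play with Hugging Face models
2. Axolotl GitHub: …
→ Flexible configuration for LLM fine-tuning
→ Supports LoRA, QLoRA, multi-GPU
→ Well suited to custom training pipelines
3. TRL (Transformer Reinforcement Learning) GitHub: …
→ RLHF, DPO, PPO for LLM alignment
→ Built on the Hugging Face ecosystem
→ An essential tool for post-training optimization
4. DeepSpeed GitHub: …
→ Efficient training of large-scale models
→ Memory and speed optimizations
→ The industry standard for scaling up training
5. LLaMA-Factory GitHub: …
→ All-in-one fine-tuning UI + CLI
→ Supports many models (LLaMA, Qwen, and others)
→ Beginner-friendly yet powerful
6. PEFT GitHub: …
→ Fine-tuning with very little compute
→ Supports LoRA, adapters, prefix tuning
→ The most cost-effective training option
Bookmark this post for later.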
Cursor is raising at a $50 billion valuation on the claim that its “in-house models generate more code than almost any other LLMs in the world.” Less than 24 hours after launching Composer 2, a developer found the model ID in the API response: kimi-k2p5-rl-0317-s515-fast. That’s Moonshot AI’s Kimi K2.5 with reinforcement learning appended. A developer named Fynn was testing Cursor’s OpenAI-compatible base URL when the identifier leaked through the response headers. Moonshot’s head of pretraining, Yulun Du, confirmed on X that the tokenizer is identical to Kimi’s and questioned Cursor’s license compliance. Two other Moonshot employees posted confirmations. All three posts have since been deleted. This is the second time. When Cursor launched Composer 1 in October 2025, users across multiple countries reported the model spontaneously switching its inner monologue to Chinese mid-session. Kenneth Auchenberg, a partner at Alley Corp, posted a screenshot calling it a smoking gun. KR-Asia and 36Kr confirmed both Cursor and Windsurf were running fine-tuned Chinese open-weight models underneath. Cursor never disclosed what Composer 1 was built on. They shipped Composer 1.5 in February and moved on. The pattern: take a Chinese open-weight model, run RL on coding tasks, ship it as a proprietary breakthrough, publish a cost-performance chart comparing yourself against Opus 4.6 and GPT-5.4 without disclosing that your base model was free, then raise another round. That chart from the Composer 2 announcement deserves its own paragraph. Cursor plotted Composer 2 against frontier models on a price-vs-quality axis to argue they’d hit a superior tradeoff. What the chart doesn’t show is that Anthropic and OpenAI trained their models from scratch. Cursor took an open-weight model that Moonshot spent hundreds of millions developing, ran RL on top, and presented the output as evidence of in-house research. That’s margin arbitrage on someone else’s R&D dressed up as a benchmark slide. 
The license makes this more than an attribution oversight. Kimi K2.5 ships under a Modified MIT License with one clause designed for exactly this scenario: if your product exceeds $20 million in monthly revenue, you must prominently display “Kimi K2.5” on the user interface. Cursor’s ARR crossed $2 billion in February. That’s roughly $167 million per month, 8x the threshold. The clause covers derivative works explicitly. Cursor is valued at $29.3 billion and raising at $50 billion. Moonshot’s last reported valuation was $4.3 billion. The company worth 12x more took the smaller company’s model and shipped it as proprietary technology to justify a valuation built on the frontier lab narrative. Three Composer releases in five months. Composer 1 caught speaking Chinese. Composer 2 caught with a Kimi model ID in the API. A P0 incident this year. And a benchmark chart that compares an RL fine-tune against models requiring billions in training compute without disclosing the base was free. The question for investors in the $50 billion round: what exactly are you buying? A VS Code fork with strong distribution, or a frontier research lab? The model ID in the API answers that. If Moonshot doesn’t enforce this license against a company generating $2 billion annually from a derivative of their model, the attribution clause becomes decoration for every future open-weight release. Every AI lab watching this is running the same math: why open-source your model if companies with better distribution can strip attribution, call it proprietary, and raise at 12x your valuation? kimi-k2p5-rl-0317-s515-fast is the most expensive model ID leak in the history of AI licensing.
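The ratios quoted in the post can be checked with a few lines of arithmetic. This sketch uses only the figures the post itself states (ARR, license threshold, valuations); the variable names are illustrative, and the inputs are the post's claims, not independently verified numbers.

```python
# Figures as stated in the post (USD).
annual_recurring_revenue = 2_000_000_000   # "ARR crossed $2 billion"
license_threshold_monthly = 20_000_000     # "$20 million in monthly revenue"
raise_valuation = 50_000_000_000           # "raising at $50 billion"
current_valuation = 29_300_000_000         # "valued at $29.3 billion"
moonshot_valuation = 4_300_000_000         # "last reported valuation was $4.3 billion"

# Average monthly revenue implied by the ARR figure.
monthly_revenue = annual_recurring_revenue / 12

# How many times the implied monthly revenue exceeds the license threshold.
threshold_multiple = monthly_revenue / license_threshold_monthly

# Valuation ratios against Moonshot, for the raise and the current valuation.
raise_vs_moonshot = raise_valuation / moonshot_valuation
current_vs_moonshot = current_valuation / moonshot_valuation

if __name__ == "__main__":
    print(f"monthly revenue: ${monthly_revenue / 1e6:.0f}M")
    print(f"multiple of threshold: {threshold_multiple:.1f}x")
    print(f"raise valuation vs Moonshot: {raise_vs_moonshot:.1f}x")
    print(f"current valuation vs Moonshot: {current_vs_moonshot:.1f}x")
```

The "12x" comparison only matches the $50 billion raise figure; measured against the stated current valuation of $29.3 billion the ratio is closer to 7x.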