Thrilled to share that my single-author paper "Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation" has been accepted at ICLR! While I love collaborating, it's incredibly rewarding to see a solo project through to publication. Onwards and upwards! 🚀 #ICLR2024 #MachineLearning #research
Roundhill with a filing for a Neocloud ETF.. which according to AI is a "specialized cloud infrastructure provider that focuses almost entirely on GPU-as-a-Service (GPUaaS) to power AI and machine learning workloads"
Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform on 3 leading benchmarks, plus token usage, cost and more
When developers use AI to code they’re choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance.
The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:
➤ SWE-Bench-Pro-Hard-AA, 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s SWE-Bench Pro
➤ Terminal-Bench v2, 84 agentic terminal tasks from the Laude Institute that range from system administration and cryptography to machine learning. 5 tasks were filtered out due to environment incompatibility
➤ SWE-Atlas-QnA, 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers
Analysis of results:
➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.
➤ Open weights models are competitive, but still trail the leaders: GLM-5.1 in Claude Code is the top open-weight result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.
➤ Gemini 3.1 Pro in Gemini CLI underperforms: Gemini 3.1 Pro in Gemini CLI scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that Gemini’s performance in Gemini CLI remains a relative weak spot for Google’s offering.
➤ Cost per task (API token pricing) varies >30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task, while GLM-5.1 in Claude Code costs $2.26/task. For both models, high token usage contributed to the cost; in GPT-5.5’s case, a relatively high per-token price did too.
➤ Token usage varies >3x: GLM-5.1 in Claude Code uses the most tokens at 4.8M/task, followed by Kimi K2.6 at 3.7M/task and DeepSeek V4 Pro at 3.5M/task. GPT-5.5 in Codex uses 2.8M tokens/task, substantially more than Opus 4.7 in Claude Code at 1.7M/task. In GLM-5.1’s case, higher token usage, cost and execution time were partly driven by the model entering loops on some tasks.
➤ Cache hit rates remain high but vary materially: Cache hit rates range from 80% to 96% across combinations. Provider routing, harness prompt structure and cache behavior can materially change the economics of running the same model, given that cached inputs are typically priced at <50% of regular input tokens.
➤ Time per task varies >7x: Opus 4.7 in Claude Code is fastest at ~6 minutes/task, while Kimi K2.6 in Claude Code is slowest at ~40 minutes/task. Differences in average turns per task, token usage and API serving speed all contribute. Opus 4.7 needed materially fewer turns to complete a task than all other models, while Kimi K2.6 needed the most.
➤ Cursor made real progress with Composer 2: Composer 2 in Cursor CLI scores 48, near the leading open-weight model results, while being the cheapest combination measured at $0.07/task. Cursor has stated Composer 2 is built from Kimi K2.5, showing that they have made substantial post-training gains.
This is just the start. We are planning to add additional agents (both harnesses and models). Let us know what you would like to see added next.
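To make the cache-hit-rate point above concrete, here is a minimal sketch of the blended input-token price at the two ends of the reported 80% to 96% range. The function name, the normalized base price of 1.0, and the two cached-to-regular price ratios (0.5 and 0.1) are illustrative assumptions, not figures from the Index; the post only says cached inputs are typically under 50% of the regular price.

```python
def effective_input_price(base_price: float, hit_rate: float, cached_ratio: float) -> float:
    """Blended price per input token when `hit_rate` of tokens are served from cache.

    `cached_ratio` is the cached-token price as a fraction of the regular
    input price (a hypothetical parameter; real values vary by provider).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    uncached_share = 1.0 - hit_rate
    return base_price * (uncached_share + hit_rate * cached_ratio)


BASE_PRICE = 1.0  # normalized regular input price (assumption)

for cached_ratio in (0.5, 0.1):  # assumed discount levels, for illustration only
    low_hit = effective_input_price(BASE_PRICE, 0.80, cached_ratio)
    high_hit = effective_input_price(BASE_PRICE, 0.96, cached_ratio)
    print(f"cached_ratio={cached_ratio}: 80% hits -> {low_hit:.3f}, "
          f"96% hits -> {high_hit:.3f}, spread {low_hit / high_hit:.2f}x")
```

Under these assumptions, the deeper the cache discount, the more the hit rate matters: at a 0.5 ratio the blended price moves from 0.60 to 0.52 of the base, while at a 0.1 ratio it roughly halves.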
I still give the book Understanding Deep Learning by Simon J.D. Prince a good recommendation, but chapter 21: Deep learning and Ethics was sloppy. It could have been a chapter to really dig in on case studies, but it was just the basic public news story level coverage of bias and such, like:
“In AI, it can be pernicious when this deviation depends on illegitimate factors that impact an output. For example, gender is irrelevant to job performance, so it is illegitimate to use gender as a basis for hiring a candidate. Similarly, race is irrelevant to criminality, so it is illegitimate to use race as a feature for recidivism prediction.”
If they had stuck with “illegitimate”, then it would have been a question of societal choices, but “irrelevant” is a question about data, and your priors shouldn’t be so strong that data can’t move them.
I would like to see a book or course walk through a machine learning problem with the input features being presented as something like car choices: color, style, doors, horsepower, etc. Do lots of analysis over representation, training, and generalization, then swap the feature labels to socially charged ones.
What makes generalization credible in one situation but not the other?
Sufficiently advanced agentic coding is essentially machine learning: the engineer sets up the optimization goal as well as some constraints on the search space (the spec and its tests), then an optimization process (coding agents) iterates until the goal is reached.
The result is a blackbox model (the generated codebase): an artifact that performs the task, that you deploy without ever inspecting its internal logic, just as we ignore individual weights in a neural network.
This implies that all classic issues encountered in ML will soon become problems for agentic coding: overfitting to the spec, Clever Hans shortcuts that don't generalize outside the tests, data leakage, concept drift, etc.
I would also ask: what will be the Keras of agentic coding? What will be the optimal set of high-level abstractions that allow humans to steer codebase 'training' with minimal cognitive overhead?
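As a toy illustration of "overfitting to the spec", the sketch below compares two implementations against the same tests. Everything in it is hypothetical: the task (primality), the function names, and the split between `SPEC_TESTS` (what the agent is optimized against) and `HELD_OUT` (inputs it never saw) are made up for the example.

```python
from typing import Callable, Dict


def is_prime_overfit(n: int) -> bool:
    # "Clever Hans" shortcut: memorizes the primes that appear in the spec tests.
    return n in {2, 3, 5}


def is_prime_general(n: int) -> bool:
    # Trial division: actually implements the intended behavior.
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


# The "training set": tests the optimization process can see.
SPEC_TESTS: Dict[int, bool] = {1: False, 2: True, 3: True, 4: False, 5: True, 9: False}
# The "test set": inputs held back to check generalization.
HELD_OUT: Dict[int, bool] = {7: True, 11: True, 15: False, 25: False}


def pass_rate(impl: Callable[[int], bool], cases: Dict[int, bool]) -> float:
    """Fraction of cases where the implementation matches the expected answer."""
    return sum(impl(n) == expected for n, expected in cases.items()) / len(cases)


for impl in (is_prime_overfit, is_prime_general):
    print(impl.__name__, pass_rate(impl, SPEC_TESTS), pass_rate(impl, HELD_OUT))
```

Both implementations are indistinguishable on the spec tests; only the held-out inputs separate them, which is the same reason ML keeps a test set apart from the training data.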
In today's episode of programming horror...
In the Python docs for random.seed(), we're told
"If a is an int, it is used directly." [1]
But if you seed with 3 or -3, you actually get the exact same rng object, producing the same streams. (TIL). In nanochat I was using the sign as a (what I thought was) clever way to get different rng sequences for train/test splits. Hence gnarly bug because now train=test.
I found the CPython code responsible in cpython/Modules/_randommodule.c [2], where on line 321 we see in a comment:
"This algorithm relies on the number being unsigned. So: if the arg is a PyLong, use its absolute value." followed by
n = PyNumber_Absolute(arg);
which explicitly calls abs() on your seed to make it positive, discarding the sign bit.
But this comment is actually wrong/misleading too. Under the hood, Python calls the Mersenne Twister MT19937 algorithm, which in the general case has 19937 bits of (non-zero) state. Python takes your int (or other objects) and "spreads out" that information across these bits. In principle, the sign bit could have been used to augment the state bits. There is nothing about the algorithm that "relies on the number being unsigned". A decision was made to not incorporate the sign bit (which imo was a mistake). One trivial fix could have been to map n -> 2*abs(n) + int(n < 0).
Finally this leads us to the contract of Python's random, which is also not fully spelled out in the docs. The contract that is mentioned is that:
same seed => same sequence.
But no guarantee is made that different seeds produce different sequences. So in principle, Python makes no promises that e.g. seed(5) and seed(6) are different rng streams. (Though this is quite commonly implicitly assumed in many applications.) Indeed, we see that seed(5) and seed(-5) are identical streams. And you should probably not use them to separate your train/test behaviors in machine learning. One of the more amusing programming horror footguns I've encountered recently. We'll see you in the next episode.
[1]
[2]
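A minimal sketch to reproduce the behavior described above and try the suggested mapping. The helper names `first_draws` and `sign_aware_seed` are made up for illustration, and the collision is assumed to hold on CPython, where the seeding code takes the absolute value of int seeds as quoted.

```python
import random


def first_draws(seed: int, k: int = 5) -> list[float]:
    """Return the first k floats from a fresh generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(k)]


def sign_aware_seed(n: int) -> int:
    """Fold the sign into the seed: n -> 2*abs(n) + int(n < 0).

    Non-negative n maps to even seeds, negative n to odd seeds,
    so n and -n no longer collide after the abs() inside seed().
    """
    return 2 * abs(n) + int(n < 0)


# On CPython, the sign is discarded, so these two streams are identical.
print("seed(3) == seed(-3):", first_draws(3) == first_draws(-3))

# After remapping, 3 -> 6 and -3 -> 7, which seed distinct streams.
print("remapped streams equal:",
      first_draws(sign_aware_seed(3)) == first_draws(sign_aware_seed(-3)))
```

For a train/test split, a simpler option is to skip the sign trick and use two unrelated non-negative seeds, or derive both from one base seed with the mapping above.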
There have been a lot of crazy many-camera rigs created for the purpose of capturing full spatial video.
I recall a conversation at Meta that was basically “we are going to lean in as hard as possible on classic geometric computer vision before looking at machine learning algorithms”, and I was supportive of that direction. That was many years ago, when ML still felt like unpredictable alchemy, and of course you want to maximize your use of the ground truth!
Hardcore engineering effort went into camera calibration, synchronization, and data processing, but it never really delivered on the vision. No matter how many cameras you have, any complex moving object is going to have occluded areas, and “holes in reality” stand out starkly to a viewer not exactly at one of the camera points.
Even when you have good visibility, the ambiguities in multi camera photogrammetry make things less precise than you would like. There were also some experiments to see how good you could make the 3D scene reconstruction from the Quest cameras using offline compute, and the answer was still “not very good”, with quite lumpy surfaces. Lots of 3D reconstructions look amazing scrolling by in the feed on your phone, but not so good blown up to a fully immersive VR rendering and put in contrast to a high quality traditional photo.
You really need strong priors to drive the fitting problem and fill in coverage gaps. For architectural scenes, you can get some mileage out of simple planar priors, but modern generative AI is the ultimate prior.
Even if the crazy camera rigs fully delivered on the promise, they still wouldn’t have enabled a good content ecosystem. YouTube wouldn’t have succeeded if every creator needed a RED Digital Cinema camera.
The (quite good!) stereoscopic 3D photo generation in Quest Instagram is a baby step towards the future. There are paths to stereo video and 6DOF static, then eventually to 6DOF video.
Make everything immersive, then allow bespoke tuning of immersive-aware media.
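Three key Deep Research tools for entering an unfamiliar field
When you step into a brand-new discipline, especially in science, engineering, or technology, structured learning is essential. In my experience, the following three kinds of Deep Research output are the most effective:
1. Textbook (theory + structure)
A well-designed Textbook is the backbone of theoretical learning. It introduces the core concepts systematically, with each chapter building on the last. If you are a beginner, I suggest directly generating a 3- or 6-month course-style textbook (modeled on the US university semester system or a bootcamp pace). That structure matches the rhythm at which the brain learns and lets the knowledge settle naturally.
2. Lab-Only Workbook (practice + hands-on)
Hands-on training is the key to mastering a skill. A Lab-Only workbook focuses purely on practical exercises such as programming, modeling, simulation, data analysis, and system prototyping. It no longer explains concepts; instead it has you learn by doing, turning abstract knowledge into real ability.
3. Chronicles (history + context)
Chronicles record a field's historical evolution, key technological shifts, and strategic turning points: for example, how Machine Learning grew out of statistics, or how Blockchain evolved through cryptography to where it is today. This kind of chronicle not only provides background understanding, it also helps you see the field's evolutionary path and future trends.
Together these three tools (Textbook, Lab-Only, Chronicles) form a complete Deep Research system that balances conceptual understanding, hands-on practice, and historical positioning, helping you build "systematic understanding" from "zero background".
Every night before I go to sleep, I queue up a batch of these tasks.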
If you're a machine learning engineer passionate about European defense, and if you believe (like I do) that mass-produced semi-autonomous drones are the future of warfare, send your resume to talent@harmattan.ai.
Harmattan is a very unique startup that is completely vertically integrated. It owns the end-to-end R&D and manufacturing pipeline for its drones -- hardware, software, even most components, down to the sensors. You'll get to work on cutting-edge systems and invent the future of defense.
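My thinking over the past two days. I worked through my business plan with GPT from scratch. 4o is actually pretty good. It established that what I want to build is a CS knowledge universe for ages 11-18, and I gave it three requirements:
1) project centric
2) interlinked
3) progressive, with no repetition
It suggested building it as a "universe", and we settled on the word Gallacy, a word we coined ourselves = galaxy + knowledge legacy. I'm very happy with it. Then it helped me classify the knowledge entirely in terms of celestial bodies and man-made objects in space, e.g. project = planet, plus wormholes, comets, satellites and so on, each serving as a category in the knowledge taxonomy. This creates the feeling of a system radiating out from a center.
Then it first created 20 project planets. I looked them over and was happy with those too. Then I suggested we create a star category, where each star holds enough knowledge to open the door to an industry. For example, advanced CS courses such as machine learning or blockchain would each be a star.
Then we got into more detailed discussion: how I should manage and present this system. It recommended a database design, an API, and a UI for presenting the universe. I took a shallow stab at it. I had forgotten how tedious graphics work is and regret it deeply: after all that time I only managed to knock out 4 spheres.
As for deep research specifically, I feel it really only pays off on projects that demand in-depth, rigorous research. The process requires continually accumulating sufficiently refined and correct text across models before you can produce the next step: the brainstorm, the business plan, the memos, and the readme. The code is actually secondary for now.
I no longer treat it as a tool; I treat it as a business partner.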