Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform across 3 leading benchmarks, along with token usage, cost and more
When developers use AI to code, they're choosing a model, but they're also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance.
The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:
➤ SWE-Bench-Pro-Hard-AA, 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s SWE-Bench Pro
➤ Terminal-Bench v2, 84 agentic terminal tasks from the Laude Institute that range from system administration and cryptography to machine learning (5 tasks were filtered out due to environment incompatibility)
➤ SWE-Atlas-QnA, 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers
Analysis of results:
➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.
➤ Open weights models are competitive, but still trail the leaders: GLM-5.1 in Claude Code is the top open-weight result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.
➤ Gemini 3.1 Pro in Gemini CLI underperforms: Gemini 3.1 Pro in Gemini CLI scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that Gemini’s performance in Gemini CLI remains a relative weak spot for Google’s offering.
➤ Cost per task (API token pricing) varies >30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task, while GLM-5.1 in Claude Code costs $2.26/task. For both models, high token usage drove costs up; in GPT-5.5's case, a relatively higher per-token cost also contributed.
➤ Token usage varies >3x: GLM-5.1 in Claude Code uses the most tokens at 4.8M/task, followed by Kimi K2.6 at 3.7M/task and DeepSeek V4 Pro at 3.5M/task. GPT-5.5 in Codex uses 2.8M tokens/task, substantially more than Opus 4.7 in Claude Code at 1.7M/task. In GLM-5.1’s case, higher token usage, cost and execution time were partly driven by the model entering loops on some tasks.
➤ Cache hit rates remain high but vary materially: Cache hit rates range from 80% to 96% across combinations. Provider routing, harness prompt structure and cache behavior can materially change the economics of running the same model, given cached input tokens are typically priced at <50% of regular input tokens.
➤ Time per task varies >7x: Opus 4.7 in Claude Code is fastest at ~6 minutes/task, while Kimi K2.6 in Claude Code is slowest at ~40 minutes/task. Differences in average turns per task, token usage and API serving speed all contribute. Opus 4.7 needed materially fewer turns to complete a task than all other models, while Kimi K2.6 needed the most.
➤ Cursor made real progress with Composer 2: Composer 2 in Cursor CLI scores 48, near the leading open-weight model results, while being the cheapest combination measured at $0.07/task. Cursor has stated Composer 2 is built from Kimi K2.5, demonstrating substantial post-training gains.
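To illustrate the cache economics point above: a minimal sketch of blended input-token pricing, assuming a hypothetical price of $3.00 per 1M regular input tokens and cached input billed at 50% of that (actual prices and cache discounts vary by provider):

```python
def blended_input_price(regular_price_per_mtok: float,
                        cache_hit_rate: float,
                        cache_discount: float = 0.5) -> float:
    """Effective price per 1M input tokens when a fraction of
    tokens is served from cache at a discounted rate."""
    cached = cache_hit_rate * cache_discount
    uncached = 1.0 - cache_hit_rate
    return regular_price_per_mtok * (cached + uncached)

# At the extremes of the observed 80%-96% cache hit rates:
print(round(blended_input_price(3.00, 0.80), 2))  # 1.8
print(round(blended_input_price(3.00, 0.96), 2))  # 1.56
```

Even between those two hit rates, effective input cost shifts by ~15%, which is why the same model can price out very differently across harnesses.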
This is just the start. We are planning to add additional agents (both harnesses and models). Let us know what you would like to see added next.
I still give the book Understanding Deep Learning by Simon J.D. Prince a good recommendation, but Chapter 21, Deep Learning and Ethics, was sloppy. It could have been a chapter that really dug into case studies, but it offered only basic, public-news-story-level coverage of bias and the like, e.g.:
“In AI, it can be pernicious when this deviation depends on illegitimate factors that impact an output. For example, gender is irrelevant to job performance, so it is illegitimate to use gender as a basis for hiring a candidate. Similarly, race is irrelevant to criminality, so it is illegitimate to use race as a feature for recidivism prediction.”
If they had stuck with “illegitimate”, then it would have been a question of societal choices, but “irrelevant” is a question about data, and your priors shouldn’t be so strong that data can’t move them.
I would like to see a book or course walk through a machine learning problem with the input features being presented as something like car choices: color, style, doors, horsepower, etc. Do lots of analysis over representation, training, and generalization, then swap the feature labels to socially charged ones.
What makes generalization credible in one situation but not the other?
In today's episode of programming horror...
In the Python docs for random.seed(), we're told
"If a is an int, it is used directly." [1]
But if you seed with 3 or -3, you actually get the exact same rng state, producing the same streams (TIL). In nanochat I was using the sign as (what I thought was) a clever way to get different rng sequences for train/test splits. Hence a gnarly bug, because now train=test.
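A minimal repro of the collision (nanochat's actual split logic differs; this just shows the footgun):

```python
import random

random.seed(3)
a = [random.random() for _ in range(3)]

random.seed(-3)
b = [random.random() for _ in range(3)]

# The sign of the seed is silently discarded, so both
# "different" seeds yield identical streams:
print(a == b)  # True
```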
I found the CPython code responsible in cpython/Modules/_randommodule.c [2], where on line 321 we see in a comment:
"This algorithm relies on the number being unsigned. So: if the arg is a PyLong, use its absolute value." followed by
n = PyNumber_Absolute(arg);
which explicitly calls abs() on your seed to make it positive, discarding the sign bit.
But this comment is actually wrong/misleading too. Under the hood, Python calls the Mersenne Twister MT19937 algorithm, which in the general case has 19937 bits of (not-all-zero) state. Python takes your int (or other objects) and "spreads out" that information across these bits. In principle, the sign bit could have been used to augment the state bits. There is nothing about the algorithm that "relies on the number being unsigned". A decision was made not to incorporate the sign bit (which imo was a mistake). One trivial fix could have been to map n -> 2*abs(n) + int(n < 0).
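That sign-folding mapping can also be applied at the application level as a workaround (signed_seed is a hypothetical helper, not part of the stdlib):

```python
import random

def signed_seed(n: int) -> int:
    # Fold the sign into the magnitude so that n and -n
    # map to distinct non-negative seeds: 3 -> 6, -3 -> 7.
    return 2 * abs(n) + int(n < 0)

random.seed(signed_seed(3))
a = random.random()

random.seed(signed_seed(-3))
b = random.random()

print(a != b)  # True: the streams now differ
```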
Finally, this leads us to the contract of Python's random, which is also not fully spelled out in the docs. The contract that is mentioned is that:
same seed => same sequence.
But no guarantee is made that different seeds produce different sequences. So in principle, Python makes no promises that e.g. seed(5) and seed(6) are different rng streams. (Though this is quite commonly implicitly assumed in many applications.) Indeed, we see that seed(5) and seed(-5) are identical streams. And you should probably not use them to separate your train/test behaviors in machine learning. One of the more amusing programming horror footguns I've encountered recently. We'll see you in the next episode.
[1]
[2]
There have been a lot of crazy many-camera rigs created for the purpose of capturing full spatial video.
I recall a conversation at Meta that was basically “we are going to lean in as hard as possible on classic geometric computer vision before looking at machine learning algorithms”, and I was supportive of that direction. That was many years ago, when ML still felt like unpredictable alchemy, and of course you want to maximize your use of the ground truth!
Hardcore engineering effort went into camera calibration, synchronization, and data processing, but it never really delivered on the vision. No matter how many cameras you have, any complex moving object is going to have occluded areas, and “holes in reality” stand out starkly to a viewer not exactly at one of the camera points.
Even when you have good visibility, the ambiguities in multi-camera photogrammetry make things less precise than you would like. There were also some experiments to see how good you could make the 3D scene reconstruction from the Quest cameras using offline compute, and the answer was still "not very good", with quite lumpy surfaces. Lots of 3D reconstructions look amazing scrolling by in the feed on your phone, but not so good blown up to a fully immersive VR rendering and put in contrast to a high quality traditional photo.
You really need strong priors to drive the fitting problem and fill in coverage gaps. For architectural scenes, you can get some mileage out of simple planar priors, but modern generative AI is the ultimate prior.
Even if the crazy camera rigs fully delivered on the promise, they still wouldn’t have enabled a good content ecosystem. YouTube wouldn’t have succeeded if every creator needed a RED Digital Cinema camera.
The (quite good!) stereoscopic 3D photo generation in Quest Instagram is a baby step towards the future. There are paths to stereo video and 6DOF static, then eventually to 6DOF video.
Make everything immersive, then allow bespoke tuning of immersive-aware media.
GenCast is a diffusion model, similar to the machine learning models which also power generative AI. 🎨
We trained it on 40 years of historical data from @ecmwf, which included variables such as temperature, wind speed, and pressure at various altitudes, enabling it to learn global weather patterns.
Congratulations to the first graduates from the AI for Science Master's program at @AIMSacza 🎓
Last year, we partnered with AIMS to provide full scholarships, equipment and compute to students, giving them access to advanced studies in mathematics, AI and machine learning. 📚
Exciting news: today is my first day @Path_AI as a Sr. Technology Advocate leading Machine Learning and Open Source Advocacy 🥳.
PathAI is using ML to improve cancer patient outcomes and I'm thrilled to be part of the mission!