Ravid Shwartz Ziv (@ziv_ravid)

AI researcher | Meta | NYU. Working on compression, representation learning, and memory. I have an AI podcast!
3.4K following    12.1K followers
I read the GLM-5.2 report and saw they use IndexShare, which is a cool, simple trick.

Regular attention makes every token look at every other token, which is the quadratic cost everyone keeps trying to kill. Sparse attention is a workaround where each token only looks at a small set of relevant tokens instead of all of them. In DSA (DeepSeek Sparse Attention), the way you pick that set is a small "indexer" that scores the keys and keeps the top-k. The indexer stays cheap but still picks well because it's trained to imitate the real attention distribution with a KL loss, and ranking which tokens matter turns out to be a much easier job than computing the exact attention, so it can run in FP8.

The problem is that the indexer is itself quadratic, and it runs at every layer. So at 1M context, most of your compute goes into deciding what to attend to, not into the attention itself. The trick with IndexShare is that instead of running the indexer at every layer, you run one indexer per group of 4 layers and let the other 3 layers reuse its selection. They got 2.9x fewer FLOPs per token at 1M! The bet is that the set of tokens worth attending to barely changes from one layer to the next, so recomputing it every layer is wasted work.

This idea of sharing across layers is not new, of course. Things like HySparse or Kascade do similar reuse but keep a few real dense-attention layers around to compute the "true" selection. GLM takes it one step further and reuses the output of an indexer that was already an approximation, and it holds up because the model is trained that way from mid-training, not switched on at inference. Super simple!
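A toy NumPy sketch of the sharing idea described above, not the GLM implementation. A cheap low-dimensional indexer scores all keys and keeps the top-k, and only the first layer of each group of 4 runs it while the rest reuse the selection. Every name here (`select_topk`, `sparse_attention`, `forward`, `make_layers`, `group_size`, the weight keys) is hypothetical, attention is single-head and non-causal for brevity, and the KL training of the indexer and the FP8 scoring are left out.

```python
import numpy as np

def select_topk(x, w_q_idx, w_k_idx, k):
    """Cheap indexer: score every key in a small dimension, keep top-k ids per query."""
    scores = (x @ w_q_idx) @ (x @ w_k_idx).T           # (T, T): still quadratic in T
    # argpartition puts the k largest scores first (in no particular order)
    return np.argpartition(-scores, k - 1, axis=-1)[:, :k]

def sparse_attention(x, w_q, w_k, w_v, idx):
    """Softmax attention where each query only sees its selected keys."""
    q, key, v = x @ w_q, x @ w_k, x @ w_v               # (T, d) each
    k_sel, v_sel = key[idx], v[idx]                     # (T, k, d)
    logits = np.einsum("td,tkd->tk", q, k_sel) / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)        # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    return np.einsum("tk,tkd->td", p, v_sel)

def make_layers(n_layers, d, d_idx, rng, group_size=4):
    """Random toy weights; only the first layer of each group owns an indexer."""
    layers = []
    for i in range(n_layers):
        layer = {n: rng.standard_normal((d, d)) / np.sqrt(d)
                 for n in ("w_q", "w_k", "w_v")}
        if i % group_size == 0:
            layer["w_q_idx"] = rng.standard_normal((d, d_idx)) / np.sqrt(d)
            layer["w_k_idx"] = rng.standard_normal((d, d_idx)) / np.sqrt(d)
        layers.append(layer)
    return layers

def forward(x, layers, k, group_size=4):
    """Run the stack, recomputing the selection once per group of layers."""
    idx, indexer_calls = None, 0
    for i, layer in enumerate(layers):
        if i % group_size == 0:                         # group leader: run the indexer
            idx = select_topk(x, layer["w_q_idx"], layer["w_k_idx"], k)
            indexer_calls += 1
        # the other layers in the group reuse idx as-is
        x = x + sparse_attention(x, layer["w_q"], layer["w_k"], layer["w_v"], idx)
    return x, indexer_calls

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))                       # 64 tokens, width 32
layers = make_layers(8, d=32, d_idx=8, rng=rng)
out, calls = forward(x, layers, k=16)
print(out.shape, calls)                                 # (64, 32) 2
```

With 8 layers and groups of 4, the quadratic scoring step runs twice instead of eight times, while the per-layer attention cost over the k selected keys is unchanged. Whether the reused selection is good enough is exactly the bet the post describes, and this sketch does not test that.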