Beyond faster training, does Muon learn better features than Adam?
🚀 Ans: Yes. Muon learns features that are more robust to input corruptions and transfer better to downstream tasks.
This advantage is reflected in hidden states:
1⃣ larger logit margins → stronger robustness
2⃣ higher effective rank → richer, more transferable representations
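For readers who want to check these two quantities on their own models, here is a minimal sketch. The post does not define either metric, so the definitions are assumptions: logit margin is taken as the true-class logit minus the largest other logit, and effective rank as the exponential of the entropy of the normalized singular values of a hidden-state matrix. The function names `logit_margin` and `effective_rank` are illustrative and may not match the paper.

```python
import numpy as np


def logit_margin(logits, labels):
    """Per-example margin: true-class logit minus the largest other logit.

    Assumed definition; a positive margin means the example is classified
    correctly, and a larger value means more slack before a flip.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    true_logit = logits[idx, labels]
    others = logits.copy()
    others[idx, labels] = -np.inf  # mask the true class before taking the max
    return true_logit - others.max(axis=1)


def effective_rank(hidden, eps=1e-12):
    """Effective rank of a (n_samples, dim) hidden-state matrix.

    Assumed definition: exp of the Shannon entropy of the singular values
    after normalizing them to sum to 1.
    """
    s = np.linalg.svd(np.asarray(hidden, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))


if __name__ == "__main__":
    # Toy data standing in for real model outputs (hypothetical values).
    rng = np.random.default_rng(0)
    print(logit_margin(rng.normal(size=(4, 10)), np.array([1, 3, 5, 7])))
    print(effective_rank(rng.normal(size=(256, 64))))
```

Comparing the mean margin and the effective rank of the same layer across a Muon-trained and an Adam-trained checkpoint would be one way to probe the claim, under these assumed definitions.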
Paper Link:
A thread 🧵