
Search results for "mlp"

Posts containing mlp
Is it possible to make electronics out of food? And why would anyone want to do that? Find out on this week’s “Babbage” podcast
What if we model test-time adaptive sampling as an MDP? In our recent work, RL-Guided Adaptive Sampling, we model test-time sampling as an MDP and train a 4-layer MLP on CPU as the controller. This lightweight framework dynamically balances answer correctness, latency, and computation cost, relying only on lightweight statistics! 🚀 @zhengtoong @ruiliu0 @ChengsongH31219 @hongtuzhu1 @HongtuZ20093 📄 Paper: 💻 Code:
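To make the idea in the post above concrete, here is a minimal sketch of a stop-or-sample-again controller. It is not the authors' code: the state features (`light_stats`), the layer widths, the two-action space, and the `max_samples` budget are all assumptions chosen for illustration, and the weights are random rather than trained with RL.

```python
import math
import random
from collections import Counter

STOP, SAMPLE_MORE = 0, 1  # hypothetical two-action space for the MDP


def light_stats(answers, max_samples):
    """Hypothetical state: cheap statistics of the answers sampled so far."""
    counts = Counter(answers)
    n = len(answers)
    top_share = counts.most_common(1)[0][1] / n       # agreement with the majority answer
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)    # spread of the answer distribution
    return [top_share, entropy, n / max_samples]      # last feature: share of budget used


def init_mlp(sizes, seed=0):
    """Random (untrained) weights; each entry is (weight rows, biases) for one linear layer."""
    rng = random.Random(seed)
    return [
        ([[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
          for _ in range(n_out)], [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(layers, x):
    """Linear layers with ReLU between them and no activation on the output."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def choose_action(layers, answers, max_samples=16):
    """Greedy action from the controller's scores, with a hard sampling budget."""
    if len(answers) >= max_samples:
        return STOP
    scores = mlp_forward(layers, light_stats(answers, max_samples))
    return STOP if scores[STOP] >= scores[SAMPLE_MORE] else SAMPLE_MORE


controller = init_mlp([3, 16, 16, 16, 2])  # four linear layers
action = choose_action(controller, ["42", "42", "41"])
```

A network this small runs in microseconds on a CPU, which fits the post's point that the controller itself adds negligible cost on top of sampling.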
I came to seize my dream first thing in the morning
It all started with Hermione Granger in Harry Potter… Didn’t know it back then, but that was the beginning.
#あす卒2026# The final performance is finally tomorrow!!!! If your schedule is free, please come to すみだパーク倉 ❤️‍🔥 I'll do my best!!!
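Stage play #リアニ2026# day 3! The ever-earnest 木葉-chan did her best again today! 🍃 Tomorrow there is no performance! I'll rest up and keep running through the remaining shows 💨 Thank you to everyone who came today too! 🤍💚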
#PaperADay# 10 LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

The comments on #PaperADay# 3 recommended this paper as the state-of-the-art JEPA paper, and it does look much better! They acknowledge that much of the prior JEPA research is ad hoc and full of heuristics, but here they make strong theoretical claims of optimality and provide proofs (which I did not read).

The first claim is that an isotropic Gaussian is the unique optimal embedding distribution for both linear and nonlinear probing, minimizing worst-case risk across downstream tasks. I would have taken that on faith with just a “sounds good to me”, but they go into it with details and examples.

Actually getting an isotropic Gaussian in high dimensions is easier said than done. They present Sketched Isotropic Gaussian Regularization (SIGReg) as a well-behaved loss function to achieve this after analyzing a number of different statistical tests, and they claim it beats the curse of dimensionality with linear scalability.

The final loss is just a blend factor to weight the JEPA prediction loss against the SIGReg isotropy loss. This is the one tunable hyperparameter for LeJEPA.

Despite the P in JEPA, they don’t use predictor networks here; they just directly compare view embeddings for the JEPA loss. Predictor networks could still be useful for video sequences, especially when conditioned with action information for agents / robots.

Each training image is augmented to produce 2 global views and 6 local views with different spatial scales but the same set of color and geometric transformations. The loss is the average MSE between the average of the global view embeddings and each of the local view embeddings. I don’t have a good feel for the tradeoffs in their view transforms, which still seem very much in the ad hoc space, but they will determine the nature of what gets filtered out of the representation.
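The view-comparison loss described above can be sketched in a few lines. This follows the post's wording (mean of the global-view embeddings compared against each local-view embedding) rather than the paper's code, and the function names and list-of-floats embedding format are assumptions for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def view_prediction_loss(global_embs, local_embs):
    """Average MSE between the mean global embedding and each local embedding."""
    dim = len(global_embs[0])
    # Element-wise mean of the global-view embeddings acts as the shared target.
    center = [sum(g[d] for g in global_embs) / len(global_embs) for d in range(dim)]
    return sum(mse(center, local) for local in local_embs) / len(local_embs)


# Toy example: 2 global views and 2 local views of one image, embedding dimension 2.
global_views = [[1.0, 0.0], [3.0, 0.0]]   # mean is [2.0, 0.0]
local_views = [[2.0, 0.0], [2.0, 2.0]]    # per-view MSE: 0.0 and 2.0
loss = view_prediction_loss(global_views, local_views)  # 1.0
```

No predictor network appears anywhere: the embeddings are compared directly, which matches the post's observation about the missing "P".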
Learning what doesn’t matter is critical, but the specification of “matters” is only implicit in the view transformations.

LeJEPA itself is architecture independent: anything that digests a batch of samples from a dataset into vectors can be used. Vision transformers, MLPs, ConvNets, etc. The specific augmentations for views would be input-modality specific, but the LeJEPA algorithm could work on audio, images, video, or other things.

They show that the LeJEPA loss on a large foundation model is very indicative of downstream task performance, both directly and with a heuristic that further improves the predictive power of the loss. They also show that it can be used to train from scratch on small datasets with as few as 1000 samples and achieve better results than probing a conventional general foundation model.

I was pleased to see sample code blocks in the paper instead of Greek-laden pseudocode, as well as a GitHub repo.

Appendix D has interesting details on generating good coverage of unit hyperspheres with low-discrepancy samples by transforming Sobol sequences, but this is only for their theoretical analysis, and they show you are better off just making new random hypervectors every batch, with even 16 random vectors outperforming a fixed set of thousands.

Some questions: In the discussion of non-linear probing, only kNN and kernel methods are mentioned, presumably because they keep the theoretical analysis tractable, but would an MLP generally perform better? A JEPA embedding is not fully reversible like NICE or a RevNet, so how does it react to inputs that are far outside the training set? Will novel inputs map to unique embeddings, or could they be collapsed onto the codes from the training set? How would the embeddings evolve in a continuous learning environment, as novel inputs are added to the training mix? Can a JEPA be overtrained: is lower training loss always better, or would there be an optimal early stopping point?
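The "fresh random directions every batch" idea and the single blend factor can be illustrated with a rough sketch. This is not SIGReg itself: the per-direction statistic here is a simple mean-and-variance match against a standard Gaussian, swapped in as a stand-in for the statistical test the paper uses, and the convex-combination form of the blend and all function names are assumptions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit hypersphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sketched_moment_penalty(embeddings, num_directions=16, rng=None):
    """Stand-in isotropy penalty: project onto random 1-D directions and
    penalize deviation from zero mean and unit variance on each projection."""
    rng = rng or random.Random()
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        u = random_unit_vector(dim, rng)  # new direction each call, not a fixed set
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_directions


def blended_loss(prediction_loss, isotropy_loss, lam):
    """Assumed convex blend of the two terms; lam is the one tunable weight."""
    return (1.0 - lam) * prediction_loss + lam * isotropy_loss


# Collapsed embeddings (all zeros) have zero variance in every direction.
collapsed = [[0.0] * 4 for _ in range(8)]
collapsed_penalty = sketched_moment_penalty(collapsed, rng=random.Random(0))  # 1.0
```

Because each projection is one-dimensional, the cost grows linearly with the embedding dimension and the number of directions, which is the intuition behind the linear-scalability claim in the post. The collapsed case also shows why such a term discourages representation collapse: identical embeddings are penalized no matter which directions are drawn.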
MLP Rainbow Dash 🌈🎥