Machine learning models have recently been trending toward ever-larger scale: parameter counts keep growing, yet generalization performance remains strong. Some researchers attribute this generalization to the stochastic noise introduced by stochastic gradient descent (SGD). However, a recent ICLR 2022 submission, "Stochastic Training is Not Necessary for Generalization", shows through extensive experiments that full-batch gradient descent (GD) can reach test accuracy on par with SGD, and that the implicit regularization effect of stochastic noise can be replaced by explicit regularization.
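As a minimal illustration of the idea (an assumption-laden toy, not the paper's ResNet/CIFAR-10 setup): the sketch below runs full-batch gradient descent on a synthetic logistic-regression problem and adds an explicit gradient-norm penalty `lam * ||∇L||²`, the kind of regularizer the paper proposes as a stand-in for the implicit regularization of minibatch noise. The dataset, penalty weight, and finite-difference step are all illustrative choices.

```python
import numpy as np

# Synthetic data (hypothetical toy problem, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = ((X @ w_true + 0.1 * rng.normal(size=200)) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, lam=0.0, eps=1e-2):
    """Logistic loss and gradient; when lam > 0, add the explicit penalty
    lam * ||grad||^2. Its own gradient, 2*H*grad, is approximated with a
    finite-difference Hessian-vector product."""
    p = sigmoid(X @ w)
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    g = X.T @ (p - y) / len(y)
    if lam > 0:
        loss += lam * (g @ g)
        p2 = sigmoid(X @ (w + eps * g))
        g2 = X.T @ (p2 - y) / len(y)
        g = g + lam * 2.0 * (g2 - g) / eps  # d/dw ||grad||^2 ≈ 2 H grad
    return loss, g

# Full-batch GD: every step uses the entire dataset, so there is no
# minibatch noise -- any regularization comes from the explicit penalty.
w = np.zeros(5)
for _ in range(500):
    _, g = loss_and_grad(w, lam=0.01)
    w -= 0.5 * g

acc = np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5))
```

On this toy problem the full-batch run fits the data well; the point of the sketch is only the mechanics of swapping stochastic noise for an explicit gradient penalty, not a reproduction of the paper's large-scale results.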
![](https://image.jiqizhixin.com/uploads/editor/83a9f7b5-f0ed-4226-8830-95978029b3cd/640.png)
![](https://image.jiqizhixin.com/uploads/editor/8f165845-c58a-43e3-a9b5-b6836b8335c6/640.png)
![](https://image.jiqizhixin.com/uploads/editor/9f76b35b-a1ab-4cdb-9b1b-dc3073b1483d/640.png)
![](https://image.jiqizhixin.com/uploads/editor/8764b4a2-f944-476a-83c0-dbca75eb6660/640.png)
![](https://image.jiqizhixin.com/uploads/editor/bf4c7cb2-8956-4496-9d3e-d8c6edc4ebff/640.png)
![](https://image.jiqizhixin.com/uploads/editor/bbd11788-f054-4865-aaf3-93f9b4f8ddaf/640.png)
![](https://image.jiqizhixin.com/uploads/editor/72b29b13-1452-475b-9b07-7dbb1c6a6009/640.png)
![](https://image.jiqizhixin.com/uploads/editor/bf018f95-972a-4607-8ce2-63587d35d86d/640.png)
![](https://image.jiqizhixin.com/uploads/editor/ea780c38-33a1-44ea-a113-d6dad528bf70/640.png)
![](https://image.jiqizhixin.com/uploads/editor/d7a520d1-8f00-49bc-9d91-0a4f0d161299/640.png)
![](https://image.jiqizhixin.com/uploads/editor/adcc1949-98d1-4613-a7ea-88e14dc66bfe/640.png)
![](https://image.jiqizhixin.com/uploads/editor/25b76b43-b46c-475a-8615-cad2aa4403b6/640.png)
![](https://image.jiqizhixin.com/uploads/editor/49ed753e-3349-46c1-8367-a9336e8651f0/640.png)
![](https://image.jiqizhixin.com/uploads/editor/3b01d3e9-6fad-4bc9-a908-9375b64723c0/640.png)
![](https://image.jiqizhixin.com/uploads/editor/1b708439-fff5-4195-98a4-6b2bf80617bd/640.png)
https://www.zhihu.com/question/494388033
https://www.reddit.com/r/MachineLearning/comments/pziubx/r_stochastic_training_is_not_necessary_for/