Achieving generalization is one of the core problems in deep neural networks (DNNs). DNNs have an extremely large number of parameters, resulting in high model complexity. Consequently, any well-conditioned training problem can be fit with DNNs, but this high complexity makes the solution underdetermined: DNNs admit too many solutions for the target training problem. To reduce the solution space of this underdetermined system, numerous regularization concepts have been proposed. In this work, the flat minima theory is adopted as a constraint on the optimization problem. The concept of flat minima was first described in [19, 18]. In this paper, we give more concrete theoretical explanations of why flat minima work better. A classic viewpoint of generalization is output robustness with respect to input perturbations. We analyze the flatness of loss surfaces through the lens of robustness to input perturbations and advocate that gradient descent should be guided toward flatter regions of the loss surface to achieve generalization. In doing so, we show the relation between learning rate and generalization.
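The robustness viewpoint above can be illustrated with a minimal sketch (our own toy example, not the paper's experimental setup): perturb the inputs of a small network with Gaussian noise and measure the average output deviation. Here the weight scale is used as a stand-in for sharper parameterizations, since rescaling a ReLU network amplifies its input sensitivity.

```python
import numpy as np

# Toy illustration (not the paper's method): probe output robustness of a
# tiny ReLU network under small input perturbations. A larger average
# output deviation indicates a less robust, "sharper" solution.
rng = np.random.default_rng(0)

def forward(x, W1, W2):
    h = np.maximum(x @ W1, 0.0)  # one ReLU hidden layer
    return h @ W2

W1 = rng.normal(scale=0.5, size=(4, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))
x = rng.normal(size=(16, 4))

def input_sensitivity(scale, eps=1e-2, trials=100):
    # Scaling the weights mimics a sharper parameterization of the
    # same function family.
    y0 = forward(x, scale * W1, scale * W2)
    devs = []
    for _ in range(trials):
        noise = eps * rng.normal(size=x.shape)
        y1 = forward(x + noise, scale * W1, scale * W2)
        devs.append(np.abs(y1 - y0).mean())
    return float(np.mean(devs))

flat = input_sensitivity(scale=1.0)
sharp = input_sensitivity(scale=3.0)  # amplifies input perturbations
```

For positive scales a ReLU network satisfies `forward(x, s*W1, s*W2) = s**2 * forward(x, W1, W2)`, so the perturbation response grows quadratically with the weight scale, making the contrast between the two settings unambiguous.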
Furthermore, we develop a method that discovers flatter minima to improve the optimization of DNNs. Although optimizing deep neural networks with stochastic gradient descent has shown strong performance in practice, the rule for setting the step size (i.e., learning rate) of gradient descent is not well studied. Some intriguing learning rate rules such as Adam have since been developed, but they concentrate on improving convergence rather than generalization. Recently, the improved generalization property of flat minima was revisited, and this line of research guides us toward promising solutions to many current optimization problems. We suggest a learning rate rule for escaping sharp regions of the loss surface and propose a learning rate scheduling concept called the peak learning stage. Based on the peak learning stage, we propose an adaptive per-parameter version of learning rate scheduling called Adapeak.
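The intuition behind a peak-style schedule can be sketched as follows. This is a hypothetical illustration of the general idea only; the actual peak learning stage and Adapeak rules are defined later in the paper, and the function name and parameters here are our own. The rate ramps up to a peak, which helps the iterate escape sharp regions, and then decays so the iterate can settle into a flat one.

```python
# Hypothetical sketch of a peak-style learning rate schedule (not the
# paper's Adapeak definition): linearly ramp up to lr_max, then linearly
# decay back to lr_min.
def peak_schedule(step, total_steps, lr_min=1e-4, lr_max=1e-1, peak_frac=0.3):
    peak_step = peak_frac * total_steps
    if step <= peak_step:
        t = step / peak_step                               # warm-up phase
    else:
        t = 1.0 - (step - peak_step) / (total_steps - peak_step)  # decay phase
    return lr_min + t * (lr_max - lr_min)
```

A per-parameter variant would evaluate such a schedule independently for each weight, modulated by a local sharpness estimate; the scalar version above only conveys the shape of the schedule.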
Finally, we demonstrate the capacity of our approach through numerous experiments. To verify our theories experimentally, we performed extensive perturbation analyses on both the input space and the weight space. Because DNNs are extremely high-dimensional models, it is hard to observe the flatness of the weight space directly. We therefore evaluate subspaces of the high-dimensional loss surface and propose effective methods for selecting such subspaces to estimate the generalization capability of a DNN model.
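A common way to probe such a subspace, shown here as a minimal sketch on a toy least-squares problem (our assumption of the general technique, not the paper's exact selection method), is to evaluate the loss along a one-dimensional slice through the current weights.

```python
import numpy as np

# Sketch: evaluate the loss along a random unit direction d around the
# weights w_star, giving a 1-D slice of the high-dimensional loss surface.
rng = np.random.default_rng(0)

def loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))  # toy quadratic loss

X = rng.normal(size=(50, 5))
w_star = rng.normal(size=5)
y = X @ w_star          # w_star is an exact minimizer by construction

d = rng.normal(size=5)
d /= np.linalg.norm(d)  # random unit direction in weight space

alphas = np.linspace(-1.0, 1.0, 21)
profile = [loss(w_star + a * d, X, y) for a in alphas]
# The curvature of this profile around alpha = 0 is a scalar proxy for
# sharpness; flatter profiles suggest better generalization.
```

Averaging such slices over several random directions, or choosing directions aligned with dominant curvature, yields progressively better low-dimensional estimates of flatness.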