Recent work suggests that probabilistic ideas are useful in deep learning. For instance, stochastic gradient descent (SGD) enables a deep neural network to learn a task efficiently, and dropout prevents co-adaptation of neurons by sampling random subnetworks during training. Despite the wide adoption of these techniques, our understanding of their role in high-dimensional parameter spaces remains limited. In this dissertation, we analyze SGD from a geometrical perspective by inspecting the stochasticity of the norms and directions of minibatch gradients. We claim that the directional uniformity of minibatch gradients increases over the course of SGD. Furthermore, we formulate dropout as a regularizer that minimizes the deviation from the origin, with a regularization strength that adapts along the optimization trajectory. Inspired by this theoretical analysis of dropout, we propose "mixout", a new regularization technique for transfer learning. Mixout greatly improves both the finetuning stability and the average performance of pretrained large-scale language models. For training from scratch, we introduce a variant of mixout that prevents generator forgetting in order to avoid mode collapse in GANs.