Deep neural networks are typically trained with backpropagation, a technique in which the input data is processed in a forward pass to compute a loss function and the network weights are then updated according to gradients that flow in the backward direction. Each layer's activations must be kept in memory until the gradient signal arrives before its parameters can be updated, which incurs a substantial memory and latency burden during training. Recently, there has been growing interest in alternative ways of training neural networks. One potential alternative to backpropagation is training neural networks layer-wise using auxiliary loss functions. Although this technique shows competitive results on small datasets with lightweight networks and a small number of decoupled blocks, its performance degrades significantly as the number of decoupled blocks grows. This limited performance is mainly attributed to ineffective information propagation, the shortsightedness of the greedy objective, and information collapse. In this thesis, a new technique for layer-wise training of neural networks is presented that outperforms the current state-of-the-art techniques, especially as the number of decoupled blocks increases. The proposed technique works by periodically distilling the knowledge of the last layer into the auxiliary networks attached to each layer. Thorough experimentation with various networks and configurations demonstrates that periodic knowledge distillation yields a significant increase in the performance of decoupled training of neural networks.
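The scheme described above can be sketched in code. The following is a minimal illustration, not the thesis's actual design: two linear-ReLU blocks, one auxiliary linear head per block, a stop-gradient between blocks, and a distillation step every few iterations that mixes the last head's predictions into the early head's targets with an assumed weight `alpha`. All shapes, hyperparameters, and the toy dataset are illustrative assumptions.

```python
# Sketch: decoupled layer-wise training with periodic knowledge
# distillation from the last auxiliary head to an earlier one.
# Architecture, hyperparameters, and data are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy 3-class dataset with linear structure.
N, D, H, C = 256, 8, 16, 3
X = rng.normal(size=(N, D))
y = np.argmax(X @ rng.normal(size=(D, C)), axis=1)
Y = np.eye(C)[y]                      # one-hot labels

# Two decoupled blocks, each with its own auxiliary classifier head.
W1 = rng.normal(size=(D, H)) * 0.1    # block 1
A1 = rng.normal(size=(H, C)) * 0.1    # auxiliary head on block 1
W2 = rng.normal(size=(H, H)) * 0.1    # block 2 (final)
A2 = rng.normal(size=(H, C)) * 0.1    # head on block 2 (the "last layer")

lr, distill_every, alpha = 0.1, 5, 0.5  # assumed values

for step in range(200):
    # Forward pass; the copy stands in for the stop-gradient
    # between decoupled blocks.
    h1 = np.maximum(X @ W1, 0.0)
    h1_detached = h1.copy()
    h2 = np.maximum(h1_detached @ W2, 0.0)

    p1 = softmax(h1 @ A1)
    p2 = softmax(h2 @ A2)

    # Early head usually trains against the labels; periodically,
    # the last head's predictions are mixed in as soft targets.
    t1 = Y
    if step % distill_every == 0:
        t1 = (1 - alpha) * Y + alpha * p2

    # Greedy local updates (softmax cross-entropy gradients);
    # no gradient crosses the block boundary.
    g1 = (p1 - t1) / N
    g2 = (p2 - Y) / N
    dA1 = h1.T @ g1
    dh1 = (g1 @ A1.T) * (h1 > 0)
    dW1 = X.T @ dh1
    dA2 = h2.T @ g2
    dh2 = (g2 @ A2.T) * (h2 > 0)
    dW2 = h1_detached.T @ dh2

    W1 -= lr * dW1; A1 -= lr * dA1
    W2 -= lr * dW2; A2 -= lr * dA2

# Evaluate the final head on the training data.
h1 = np.maximum(X @ W1, 0.0)
h2 = np.maximum(h1 @ W2, 0.0)
acc = (np.argmax(softmax(h2 @ A2), axis=1) == y).mean()
print(f"final-head train accuracy: {acc:.2f}")
```

The key point of the sketch is that each block updates only from its own auxiliary loss, so activations need not wait for a global backward pass; the periodic distillation step is what reintroduces information from the last layer into the greedy local objectives.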