You might notice that I haven't emphasized the latest benchmark-beating paper. My reason is that a good theory ought to be scalable: it should be capable of explaining why deep models generalise, and we should be able to bootstrap those explanations to more complex models (e.g. sequences of deep models, i.e. RNNs). This is how all good science is done.
Dropout Rademacher Complexity of Deep Neural Networks (Wei Gao 2014)
Distribution-Specific Hardness of Learning Neural Networks (Shamir 2017)
Lessons from the Rademacher Complexity for Deep Learning (Sokolic 2016)
A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction (Wiatowski 2016)
Spectral Representations for Convolutional Neural Networks (Rippel 2015)
Electron-Proton Dynamics in Deep Learning (Zhang 2017)
Principles of Risk Minimization for Learning Theory (Vapnik 1991)
The Loss Surfaces of Multilayer Networks (Choromanska, LeCun et al. 2015)
Understanding Synthetic Gradients and Decoupled Neural Interfaces (W. Czarnecki 2017)
Dataset Shift (Storkey 2013)
Risk vs Uncertainty (I. Osband 2016)
The Loss Surface of Deep and Wide Neural Networks (Q. Nguyen 2017)