- SGD [Book]
- Momentum [Book]
- RMSProp [Book]
- AdaGrad [Link]
- ADAM [Link]
- AdaBound [Link] [GitHub]
- ADAMAX [Link]
- NADAM [Link]
- ADAMW [Link]
- AdaLOMO Link
- A comprehensive list of optimizers: Awesome-Optimizer (a minimal update-rule sketch follows this list)
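For orientation, here is a minimal NumPy sketch of the update rules behind the first four optimizers above. The hyperparameter defaults are common illustrative choices, and refinements such as weight decay, Nesterov momentum, and AMSGrad are omitted.

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    return w - lr * g

def momentum_step(w, g, v, lr=0.1, beta=0.9):
    v = beta * v + g                      # velocity accumulates past gradients
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=1e-3, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g**2        # running average of squared gradients
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # biased first-moment estimate
    v = b2 * v + (1 - b2) * g**2          # biased second-moment estimate
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)  # bias correction (t starts at 1)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize f(w) = ||w||^2 with Adam.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 201):
    g = 2 * w                             # gradient of ||w||^2
    w, m, v = adam_step(w, g, m, v, t)
```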
- BatchNorm [Link]
- Weight Norm [Link]
- Spectral Norm [Link]
- Cosine Normalization [Link]
- L2 Regularization versus Batch and Weight Normalization Link
- WHY GRADIENT CLIPPING ACCELERATES TRAINING: A THEORETICAL JUSTIFICATION FOR ADAPTIVITY Link (a clipping sketch follows this list)
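A minimal sketch of clipping by global norm, the operation analyzed in the gradient-clipping paper above; `grads` stands in for any list of per-parameter gradient arrays.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # shrink only if too large
    return [g * scale for g in grads]
```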
- Convex Neural Networks [Link]
- Breaking the Curse of Dimensionality with Convex Neural Networks [Link]
- UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION [Link]
- Optimal Control Via Neural Networks: A Convex Approach. [Link]
- Input Convex Neural Networks [Link] (see the sketch after this list)
- A New Concept of Convex based Multiple Neural Networks Structure. [Link]
- SGD Converges to Global Minimum in Deep Learning via Star-convex Path [Link]
- A Convergence Theory for Deep Learning via Over-Parameterization Link
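The core constraint behind Input Convex Neural Networks (linked above) fits in a few lines: weights acting on hidden activations are kept nonnegative (here via a softplus reparameterization, one possible choice) while passthrough weights on the input stay unconstrained, so the output is convex in x. The sizes and layer count below are illustrative, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 16  # input dimension and hidden width (illustrative)

params = {
    "W0": rng.normal(size=(h, d)), "b0": np.zeros(h),
    "Wz1": rng.normal(size=(h, h)),   # reparameterized to be nonnegative below
    "Wx1": rng.normal(size=(h, d)), "b1": np.zeros(h),
    "wz2": rng.normal(size=h),        # reparameterized to be nonnegative below
    "wx2": rng.normal(size=d), "b2": 0.0,
}

def softplus(x):
    return np.logaddexp(0.0, x)  # smooth and always positive

def icnn(x, p):
    # Each layer is a nonnegative combination of convex functions of x plus an
    # affine "passthrough" term, so convexity in x is preserved end to end.
    z1 = np.maximum(p["W0"] @ x + p["b0"], 0.0)
    z2 = np.maximum(softplus(p["Wz1"]) @ z1 + p["Wx1"] @ x + p["b1"], 0.0)
    return softplus(p["wz2"]) @ z2 + p["wx2"] @ x + p["b2"]

f = icnn(np.array([0.5, -1.0, 2.0]), params)
```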
- Curriculum Learning [Link] (a pacing-function sketch follows this list)
- SOLVING RUBIK’S CUBE WITH A ROBOT HAND Link
- Noisy Activation Function [Link]
- Mollifying Networks [Link]
- Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks Link Talk
- Automated Curriculum Learning for Neural Networks Link
- On The Power of Curriculum Learning in Training Deep Networks Link
- On-line Adaptative Curriculum Learning for GANs Link
- Parameter Continuation with Secant Approximation for Deep Neural Networks and Step-up GAN Link
- HashNet: Deep Learning to Hash by Continuation. [Link]
- Learning Combinations of Activation Functions. [Link]
- Learning and development in neural networks: The importance of starting small (1993) Link
- Flexible shaping: How learning in small steps helps Link
- Curriculum Labeling: Self-paced Pseudo-Labeling for Semi-Supervised Learning Link
- RETHINKING CURRICULUM LEARNING WITH INCREMENTAL LABELS AND ADAPTIVE COMPENSATION Link
- Parameter Continuation Methods for the Optimization of Deep Neural Networks Link
- Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection [Link](https://www.aclweb.org/anthology/W18-6314.pdf)
- Reinforcement Learning based Curriculum Optimization for Neural Machine Translation Link
- EVOLUTIONARY POPULATION CURRICULUM FOR SCALING MULTI-AGENT REINFORCEMENT LEARNING Link
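As referenced above, a minimal sketch of Bengio-style curriculum pacing: sort examples by a difficulty score and grow the training pool from easy to hard. The scoring function and the linear pacing schedule are assumptions for illustration, not prescribed by any single paper above.

```python
import numpy as np

def curriculum_batches(scores, n_epochs, batch_size, rng):
    """Yield (epoch, batch_indices), growing the training pool from easy to
    hard. `scores` holds one difficulty score per example (lower = easier);
    where the scores come from (a pretrained model's loss, heuristics, ...)
    is the application-specific part."""
    order = np.argsort(scores)  # easy -> hard
    n = len(scores)
    for epoch in range(n_epochs):
        frac = min(1.0, 0.25 + 0.75 * epoch / max(1, n_epochs - 1))  # linear pacing
        pool = order[: max(batch_size, int(frac * n))]
        perm = rng.permutation(pool)
        for i in range(0, len(perm) - batch_size + 1, batch_size):
            yield epoch, perm[i:i + batch_size]

# Toy usage: 1000 examples with random difficulty scores.
rng = np.random.default_rng(0)
for epoch, batch in curriculum_batches(rng.random(1000), n_epochs=3, batch_size=100, rng=rng):
    pass  # a train_step(batch) would go here
```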
- ENTROPY-SGD: BIASING GRADIENT DESCENT INTO WIDE VALLEYS Link
- NEIGHBOURHOOD DISTILLATION: ON THE BENEFITS OF NON END-TO-END DISTILLATION Link
- LEARNING TO EXECUTE Link
- Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing Link
- Data Parameters: A New Family of Parameters for Learning a Differentiable Curriculum Link
- Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search Link
- Continuation Methods and Curriculum Learning for Learning to Rank Link
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Link
- QUALITATIVELY CHARACTERIZING NEURAL NETWORK OPTIMIZATION PROBLEMS [Link]
- The Loss Surfaces of Multilayer Networks [Link]
- Visualizing the Loss Landscape of Neural Nets [Link] (a 1-D slice sketch follows this list)
- The Loss Surface Of Deep Linear Networks Viewed Through The Algebraic Geometry Lens [Link]
- How regularization affects the critical points in linear networks. [Link]
- Local minima in training of neural networks [Link]
- Necessary and Sufficient Geometries for Gradient Methods Link
- Fine-grained Optimization of Deep Neural Networks Link
- SCORE-BASED GENERATIVE MODELING THROUGH STOCHASTIC DIFFERENTIAL EQUATIONS Link
- Deep Equilibrium Models Link
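The 1-D slicing trick behind loss-landscape visualizations (cf. "Visualizing the Loss Landscape of Neural Nets" above) is simple to sketch. The toy loss below is purely illustrative, and the paper's filter-wise normalization of the probe direction is reduced here to a plain L2 normalization.

```python
import numpy as np

def loss_along_direction(loss_fn, w_star, direction, alphas):
    """Evaluate loss_fn(w_star + alpha * direction) for each alpha: the 1-D
    slice used in loss-landscape plots."""
    d = direction / (np.linalg.norm(direction) + 1e-12)  # normalized probe direction
    return np.array([loss_fn(w_star + a * d) for a in alphas])

# Toy usage with an illustrative non-convex loss on a 10-d parameter vector.
rng = np.random.default_rng(0)
w = rng.normal(size=10)
toy_loss = lambda v: float(np.sum(v ** 2) + 0.5 * np.sum(np.sin(3 * v)))
curve = loss_along_direction(toy_loss, w, rng.normal(size=10), np.linspace(-2, 2, 41))
```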
- Bifurcations of Recurrent Neural Networks in Gradient Descent Learning [Link] (a one-unit toy example follows this list)
- On the difficulty of training recurrent neural networks [Link]
- Understanding and Controlling Memory in Recurrent Neural Networks [Link]
- Dynamics and Bifurcation of Neural Networks [Link]
- Context Aware Machine Learning [Link]
- The trade-off between long-term memory and smoothness for recurrent networks [Link]
- Dynamical complexity and computation in recurrent neural networks beyond their fixed point [Link]
- Bifurcations in discrete-time neural networks: controlling complex network behaviour with inputs [Link]
- Interpreting Recurrent Neural Networks Behaviour via Excitable Network Attractors [Link]
- Bifurcation analysis of a neural network model Link
- A Differentiable Physics Engine for Deep Learning in Robotics Link
- Deep learning for universal linear embeddings of nonlinear dynamics Link
- Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Differential Equations Link
- Analysis of gradient descent learning algorithms for multilayer feedforward neural networks Link
- A dynamical model for the analysis and acceleration of learning in feedforward networks Link
- A bio-inspired bistable recurrent cell allows for long-lasting memory Link
- Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation [Link](https://www.frontiersin.org/articles/10.3389/fncom.2017.00024/full)
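A one-unit toy example of the bifurcation phenomena several of the papers above study: iterating h <- tanh(w*h) has a single attractor at 0 for |w| < 1, and two symmetric attractors once w crosses 1 (a pitchfork bifurcation).

```python
import numpy as np

def attractors(w, n_iters=1000, inits=(-2.0, 0.5, 2.0)):
    """Iterate h <- tanh(w * h) from several starting points and report where
    the iterates settle."""
    finals = set()
    for h in inits:
        for _ in range(n_iters):
            h = np.tanh(w * h)
        finals.add(round(float(h), 3))
    return sorted(finals)

for w in (0.5, 0.9, 1.5, 2.0):
    print(f"w = {w}: attractors ~ {attractors(w)}")
# |w| < 1: the only attractor is h = 0.
# w > 1: two symmetric nonzero attractors appear.
```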
- Adding One Neuron Can Eliminate All Bad Local Minima Link
- Deep Learning without Poor Local Minima Link
- Elimination of All Bad Local Minima in Deep Learning Link
- How to escape saddle points efficiently. Link
- Depth with Nonlinearity Creates No Bad Local Minima in ResNets Link
- Sharp Minima Can Generalize For Deep Nets Link (a perturbation-based sharpness sketch follows this list)
- Asymmetric Valleys: Beyond Sharp and Flat Local Minima Link
- A Reparameterization-Invariant Flatness Measure for Deep Neural Networks Link
- A Simple Weight Decay Can Improve Generalization Link
- Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions Link
- Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization Link
- Flatness is a False Friend Link
- Are Saddles Good Enough for Deep Learning? Link
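As referenced above, a crude perturbation-based sharpness proxy: average the loss increase under random parameter perturbations of fixed norm. Note that papers above ("Sharp Minima Can Generalize For Deep Nets", "Flatness is a False Friend") argue such measures are not reparameterization-invariant, so treat this as a diagnostic sketch only.

```python
import numpy as np

def sharpness_estimate(loss_fn, w, radius=0.05, n_samples=20, rng=None):
    """Average loss increase under random perturbations of norm `radius`
    around `w`: a crude, non-reparameterization-invariant flatness proxy."""
    rng = rng or np.random.default_rng(0)
    base = loss_fn(w)
    increases = []
    for _ in range(n_samples):
        d = rng.normal(size=w.shape)
        d *= radius / (np.linalg.norm(d) + 1e-12)   # scale to the probe radius
        increases.append(loss_fn(w + d) - base)
    return float(np.mean(increases))

# Toy usage: a quadratic bowl reads as "sharper" when its curvature is larger.
print(sharpness_estimate(lambda v: float(v @ v), np.zeros(10)))         # ~ radius^2
print(sharpness_estimate(lambda v: float(10 * (v @ v)), np.zeros(10)))  # ~ 10 * radius^2
```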
- Deep learning course notes Link
- On the importance of initialization and momentum in deep learning Link
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks Link
- THE EARLY PHASE OF NEURAL NETWORK TRAINING Link
- One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers Link
- PCA-Initialized Deep Neural Networks Applied To Document Image Analysis Link
- Understanding the difficulty of training deep feedforward neural networks Link (an initialization sketch follows this list)
- Unitary Evolution of RNNs Link
- RETHINKING THE HYPERPARAMETERS FOR FINE-TUNING Link
- Momentum Residual Neural Networks Link
- Smooth momentum: improving Lipschitzness in gradient descent Link
- Momentum-based Weight Interpolation of Strong Zero-Shot Models for Continual Learning Link
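As referenced above, the Glorot/Xavier initialization from "Understanding the difficulty of training deep feedforward neural networks" reduces to one formula; the uniform variant is sketched here, with the (fan_out, fan_in) weight-shape convention as an assumption.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform initialization: the variance is chosen so that
    activations and gradients keep roughly constant scale across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = glorot_uniform(256, 128, np.random.default_rng(0))
```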
- ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA Link
- Revisiting Small Batch Training for Deep Neural Networks Link
- LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS Link
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes Link
- DON’T DECAY THE LEARNING RATE, INCREASE THE BATCH SIZE Link (a schedule sketch follows this list)
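A sketch of the schedule swap proposed in "Don't decay the learning rate, increase the batch size" (linked above): grow the batch size at the milestones where one would otherwise decay the learning rate. Milestones, growth factor, and cap are illustrative assumptions.

```python
def batch_size_schedule(epoch, base_bs=128, max_bs=2048, growth=2, milestones=(30, 60, 80)):
    """Return the batch size for `epoch`: instead of dividing the learning
    rate by `growth` at each milestone, multiply the batch size by it."""
    bs = base_bs
    for m in milestones:
        if epoch >= m:
            bs = min(bs * growth, max_bs)
    return bs

# Epochs 0-29 -> 128, 30-59 -> 256, 60-79 -> 512, 80+ -> 1024.
```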
- Avoiding pathologies in very deep networks Link
- Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice Link
- SKIP CONNECTIONS ELIMINATE SINGULARITIES Link
- How degenerate is the parametrization of neural networks with the ReLU activation function? Link
- Theory of Deep Learning III: explaining the non-overfitting puzzle Link
- Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks Link (an orthogonal-init sketch follows this list)
- Understanding Deep Learning: Expected Spanning Dimension and Controlling the Flexibility of Neural Networks Link
- PYHESSIAN: Neural Networks Through the Lens of the Hessian Link
- A CONVERGENCE ANALYSIS OF GRADIENT DESCENT FOR DEEP LINEAR NEURAL NETWORKS Link
- Convergence Analysis of Homotopy-SGD for Non-Convex Optimization Link
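As referenced above, a sketch of orthogonal initialization (cf. Saxe et al.'s exact solutions and the orthogonal-initialization paper): take the Q factor of a Gaussian matrix, with a sign fix so the result is uniformly distributed over orthogonal matrices.

```python
import numpy as np

def orthogonal_init(fan_in, fan_out, rng, gain=1.0):
    """Draw a (fan_out, fan_in) matrix with orthonormal rows or columns via
    the QR decomposition of a Gaussian matrix."""
    a = rng.normal(size=(max(fan_in, fan_out), min(fan_in, fan_out)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix the QR sign ambiguity
    if q.shape != (fan_out, fan_in):
        q = q.T
    return gain * q

W = orthogonal_init(64, 64, np.random.default_rng(0))
assert np.allclose(W @ W.T, np.eye(64), atol=1e-8)
```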
- Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. Link
- Learning a Multitask Curriculum for Neural Machine Translation. Link
- Self-paced Curriculum Learning. Link
- Curriculum Learning of Multiple Tasks. Link
- A Primal-Dual Formulation for Deep Learning with Constraints Link
- Object-Oriented Curriculum Generation for Reinforcement Learning Link
- Teacher-Student Curriculum Learning Link (a simplified task-picker sketch follows this list)
- Curriculum Learning: A Survey Link
- A Comprehensive Survey on Curriculum Learning Link
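A simplified sketch of the teacher in Teacher-Student Curriculum Learning (linked above): prefer the task whose recent reward curve has the steepest slope (largest absolute learning progress), with epsilon-greedy exploration. The linear-fit progress estimate and the epsilon-greedy choice are simplifications of the paper's algorithms, not a faithful reimplementation.

```python
import numpy as np

def pick_task(recent_rewards, eps=0.1, rng=None):
    """Pick the task with the largest absolute learning progress (slope of a
    linear fit to its recent reward curve), with eps-greedy exploration.
    `recent_rewards` maps task name -> list of recent episode scores."""
    rng = rng or np.random.default_rng(0)
    tasks = list(recent_rewards)
    if rng.random() < eps:
        return tasks[rng.integers(len(tasks))]

    def progress(scores):
        if len(scores) < 2:
            return np.inf          # try unexplored tasks first
        slope = np.polyfit(np.arange(len(scores)), scores, 1)[0]
        return abs(slope)

    return max(tasks, key=lambda t: progress(recent_rewards[t]))

# Toy usage: task "b" is improving fastest, so it is usually selected.
history = {"a": [0.9, 0.9, 0.9], "b": [0.1, 0.3, 0.5], "c": [0.5, 0.45, 0.5]}
print(pick_task(history))
```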
- Off the Convex Path [Blog]: https://www.offconvex.org/
- An overview of gradient descent optimization algorithms [Link]
- Review of second-order optimization techniques in artificial neural networks backpropagation Link
- Linear Algebra and Data Link
- Why Momentum Really Works [Blog]
- Optimization [Book]
- Optimization for deep learning: theory and algorithms Link
- Generalization Error in Deep Learning Link
- Automatic Differentiation in Machine Learning: a Survey Link
- Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey Link
- Automatic Curriculum Learning For Deep RL: A Short Survey Link
- The Generalization Mystery: Sharp vs Flat Minima Link
If you've found any informative resources that you think belong here, be sure to submit a pull request or create an issue!
- Or send me $2-4 on my Venmo account, @HARSHNILESH-PATHAK.