This document attempts to collect the papers which developed important techniques in machine learning. Research is a collaborative process, discoveries are made independently, and the difference between the original version and a precursor can be subtle, but I've done my best to select the papers that I think are novel or significant.
My opinions are by no means the final word on these topics. Please create an issue or pull request if you have a suggestion.
- Landmark Papers in Machine Learning
- Key
- Association Rule Learning
- Datasets
- Decision Trees
- Deep Learning
  - AlexNet (image classification CNN)
  - Convolutional Neural Network
  - DeepFace (facial recognition)
  - Generative Adversarial Network
  - GPT
  - Inception (classification/detection CNN)
  - Long Short-Term Memory (LSTM)
  - Residual Neural Network (ResNet)
  - Transformer (sequence to sequence modeling)
  - U-Net (image segmentation CNN)
  - VGG (image recognition CNN)
- Ensemble Methods
- Games
- Optimization
- Miscellaneous
- Natural Language Processing
- Neural Network Components
- Recommender Systems
- Regression
- Software
- Supervised Learning
- Statistics
- Credits
Icon | Meaning |
---|---|
🔒 | Paper behind paywall. In some cases, I provide an alternative link to the paper if it comes directly from one of the authors. |
🔑 | Freely available version of paywalled paper, directly from the author. |
💽 | Code associated with the paper. |
🏛️ | Precursor or historically relevant paper. This may be a fundamental breakthrough that paved the way for the concept in question to be developed. |
🔬 | Iteration, advancement, elaboration, or major popularization of a technique. |
📔 | Blog post or something other than a formal publication. |
🌐 | Website associated with the paper. |
🎥 | Video associated with the paper. |
📊 | Slides or images associated with the paper. |
Papers preceded by "See also" indicate either additional historical context or major developments, breakthroughs, or applications.
- Mining Association Rules between Sets of Items in Large Databases (1993), Agrawal, Imielinski, and Swami, @CiteSeerX.
- See also: The GUHA method of automatic hypotheses determination (1966), Hájek, Havel, and Chytil, @Springer 🔒 🏛️.
- The Enron Corpus: A New Dataset for Email Classification Research (2004), Klimt and Yang, @Springer 🔒 / @author 🔑.
- See also: Introducing the Enron Corpus (2004), Klimt and Yang, @author.
- ImageNet: A large-scale hierarchical image database (2009), Deng et al., @IEEE 🔒 / @author 🔑.
- See also: ImageNet Large Scale Visual Recognition Challenge (2015), Russakovsky et al., @Springer 🔒 / @arXiv 🔑 + @author 🌐.
- Induction of Decision Trees (1986), Quinlan, @Springer.
- ImageNet Classification with Deep Convolutional Neural Networks (2012), Krizhevsky, Sutskever, and Hinton, @NIPS.
- Gradient-based learning applied to document recognition (1998), LeCun, Bottou, Bengio, and Haffner, @IEEE 🔒 / @author 🔑.
- See also: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position (1980), Fukushima, @Springer 🏛️.
- See also: Phoneme recognition using time-delay neural networks (1989), Waibel, Hanazawa, Hinton, Shikano, and Lang, @IEEE 🏛️.
- See also: Fully Convolutional Networks for Semantic Segmentation (2014), Long, Shelhamer, and Darrell, @arXiv.
- DeepFace: Closing the Gap to Human-Level Performance in Face Verification (2014), Taigman, Yang, Ranzato, and Wolf, Facebook Research.
- Improving Language Understanding by Generative Pre-Training (2018) aka GPT, Radford, Narasimhan, Salimans, and Sutskever, @OpenAI + @Github 💽 + @OpenAI 📔.
- See also: Language Models are Unsupervised Multitask Learners (2019) aka GPT-2, Radford, Wu, Child, Luan, Amodei, and Sutskever, @OpenAI 🔬 + @Github 💽 + @OpenAI 📔.
- See also: Language Models are Few-Shot Learners (2020) aka GPT-3, Brown et al., @arXiv + @OpenAI 📔.
- Going Deeper with Convolutions (2014), Szegedy et al., @ai.google + @Github 💽.
- See also: Rethinking the Inception Architecture for Computer Vision (2016), Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna, @ai.google 🔬.
- See also: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016), Szegedy, Ioffe, Vanhoucke, and Alemi, @ai.google 🔬.
- Long Short-term Memory (1995), Hochreiter and Schmidhuber, @CiteSeerX.
- Deep Residual Learning for Image Recognition (2015), He, Zhang, Ren, and Sun, @arXiv.
- Attention Is All You Need (2017), Vaswani et al., @NIPS.
- U-Net: Convolutional Networks for Biomedical Image Segmentation (2015), Ronneberger, Fischer, and Brox, @Springer 🔒 / @arXiv 🔑.
- Very Deep Convolutional Networks for Large-Scale Image Recognition (2015), Simonyan and Zisserman, @arXiv + @author 🌐 + @ICLR 📊 + @YouTube 🎥.
- A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting (1997; published as abstract in 1995), Freund and Schapire, @CiteSeerX.
- See also: Experiments with a New Boosting Algorithm (1996), Freund and Schapire, @CiteSeerX 🔬.
- Bagging Predictors (1996), Breiman, @Springer.
- Greedy function approximation: A gradient boosting machine (2001), Friedman, @Project Euclid.
- See also: XGBoost: A Scalable Tree Boosting System (2016), Chen and Guestrin, @arXiv 🔬 + @GitHub 💽.
- Random Forests (2001), Breiman, @CiteSeerX.
- Mastering the game of Go with deep neural networks and tree search (2016), Silver et al., @Nature.
- IBM's deep blue chess grandmaster chips (1999), Hsu, @IEEE 🔒.
- See also: Deep Blue (2002), Campbell, Hoane, and Hsu, @ScienceDirect 🔒.
- Adam: A Method for Stochastic Optimization (2015), Kingma and Ba, @arXiv.
- Maximum likelihood from incomplete data via the EM algorithm (1977), Dempster, Laird, and Rubin, @CiteSeerX.
- Stochastic Estimation of the Maximum of a Regression Function (1952), Kiefer and Wolfowitz, @ProjectEuclid.
- See also: A Stochastic Approximation Method (1951), Robbins and Monro, @ProjectEuclid 🏛️.
- Learning the parts of objects by non-negative matrix factorization (1999), Lee and Seung, @Nature 🔒.
- The PageRank Citation Ranking: Bringing Order to the Web (1998), Page, Brin, Motwani, and Winograd, @CiteSeerX.
- Building Watson: An Overview of the DeepQA Project (2010), Ferrucci et al., @AAAI.
- Latent Dirichlet Allocation (2003), Blei, Ng, and Jordan, @JMLR.
- Indexing by latent semantic analysis (1990), Deerwester, Dumais, Furnas, Landauer, and Harshman, @CiteSeerX.
- Efficient Estimation of Word Representations in Vector Space (2013), Mikolov, Chen, Corrado, and Dean, @arXiv + @Google Code 💽.
- Learning representations by back-propagating errors (1986), Rumelhart, Hinton, and Williams, @Nature 🔒.
- See also: Backpropagation Applied to Handwritten Zip Code Recognition (1989), LeCun et al., @IEEE 🔒 🔬 / @author 🔑.
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015), Ioffe and Szegedy, @ICML via PMLR.
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014), Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, @JMLR.
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014), Cho et al., @arXiv.
- The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain (1958), Rosenblatt, @CiteSeerX.
- Using collaborative filtering to weave an information tapestry (1992), Goldberg, Nichols, Oki, and Terry, @CiteSeerX.
- Application of Dimensionality Reduction in Recommender System - A Case Study (2000), Sarwar, Karypis, Konstan, and Riedl, @CiteSeerX.
- See also: Learning Collaborative Information Filters (1998), Billsus and Pazzani, @CiteSeerX ποΈ.
- See also: Netflix Update: Try This at Home (2006), Funk, @author 📔 🔬.
- Collaborative Filtering for Implicit Feedback Datasets (2008), Hu, Koren, and Volinsky, @IEEE 🔒 / @author 🔑.
- Regularization and variable selection via the Elastic Net (2005), Zou and Hastie, @CiteSeer.
- Regression Shrinkage and Selection Via the Lasso (1994), Tibshirani, @CiteSeerX.
- See also: Linear Inversion of Band-Limited Reflection Seismograms (1986), Santosa and Symes, @SIAM 🏛️.
- MapReduce: Simplified Data Processing on Large Clusters (2004), Dean and Ghemawat, @ai.google.
- TensorFlow: A system for large-scale machine learning (2016), Abadi et al., @ai.google + @author 🌐.
- Torch: A Modular Machine Learning Software Library (2002), Collobert, Bengio and Mariéthoz, @Idiap + @author 🌐.
- See also: Automatic differentiation in PyTorch (2017), Paszke et al., @OpenReview 🔬 + @Github 💽.
- Nearest neighbor pattern classification (1967), Cover and Hart, @IEEE 🔒.
- See also: E. Fix and J.L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation (1989), Silverman and Jones, @JSTOR 🔒.
- Support Vector Networks (1995), Cortes and Vapnik, @Springer.
- Bootstrap Methods: Another Look at the Jackknife (1979), Efron, @Project Euclid.
- See also: Problems in Plane Sampling (1949), Quenouille, @Project Euclid 🏛️.
- See also: Notes on Bias in Estimation (1956), Quenouille, @JSTOR 🏛️.
- See also: Bias and Confidence in Not-quite Large Samples (1958), Tukey, @Project Euclid 🔬.
A special thanks to Alexandre Passos for his comment on this Reddit thread, as well as the responders to this Quora post. They provided many great papers to get this list off to a great start.