2020-09-21 |
Administrivia. Introduction to the discipline of Machine Learning. Supervised learning.
|
- Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
- Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8-12.
|
2020-09-22 |
Linear regression as a supervised learning problem. Loss functions for regression. Ordinary least squares and its statistical analysis. Gauss-Markov theorem.
|
|
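As a quick illustration of ordinary least squares, a minimal NumPy sketch that fits a linear model on synthetic data (the data-generating weights `w_true` and the noise level are made up for the example):
```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y = X w* + noise
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Ordinary least squares: minimize ||y - Xw||^2.
# np.linalg.lstsq solves the normal equations (X^T X) w = X^T y in a numerically stable way.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # should be close to w_true
```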
2020-09-25 |
Bias-variance decomposition. Regularization. Ridge regression. Lasso. Regularization paths. Maximum likelihood principle.
|
- B06 3.2; HTF09 3.4, 7.1, 7.2, 7.3, 8.2.2; B12 Ch. 8.
|
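A minimal sketch of the ridge estimator in closed form, refit over a small grid of regularization strengths to hint at the regularization path; the data below is synthetic and the helper name `ridge_fit` is only illustrative:
```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative regularization path: refit for a grid of lambda values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, ridge_fit(X, y, lam))
```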
2020-09-28 |
MLE and KL-Divergence. Very short introduction to Bayesian learning. Ridge regression and Lasso as MAP learning.
|
|
2020-09-28 |
Practice on ridge regression, tuning the ridge regularizer, bias-variance decomposition.
|
|
2020-09-29 |
Classification. Fisher (linear) discriminant analysis. Bayes optimal classifier. Limitations of linear discriminant analysis. Discriminative vs. generative classifiers. Motivation for the logistic function. Logistic regression and log-loss (cross-entropy loss).
|
- HTF09 4.4, 6.6.3; A18 2.2.3; B06 4.1.4, 4.2, 4.3.2
|
2020-10-02 |
Gradient computation for logistic regression. Optimization with Newton's method. Naive Bayes classifier (Bernoulli/Gaussian). Naive Bayes and logistic regression as a generative-discriminative pair. Learning curves.
|
- HTF09 4.4, 6.6.3; A18 2.2.3; B12 10.1, 10.2
|
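The gradient and Hessian derived for the log-loss turn into a compact Newton solver; the sketch below assumes labels in {0, 1} and adds a tiny diagonal term to the Hessian so the solve stays well-posed on (nearly) separable data:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_newton(X, y, n_iter=10):
    """Newton's method for logistic regression with labels y in {0, 1}.
    Gradient of the log-loss: X^T (p - y); Hessian: X^T diag(p (1 - p)) X."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-6 * np.eye(d)
        w -= np.linalg.solve(H, grad)
    return w
```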
2020-10-05 |
Generalized linear models. Logistic regression and least squares regression as special cases. Softmax regression. Gradient calculations.
|
- S18 3.2, 4.4, 7.1; GBC16 6.2.2.3
|
2020-10-05 |
Practice on cross-entropy and learning curves.
|
|
2020-10-06 |
Maximum margin hyperplane as a constrained optimization problem. Ordinary convex problems and Karush-Kuhn-Tucker theory. Dual form for the maximum-margin hyperplane. KKT conditions and support vectors. Soft constraints (support vector machine). Dual SVM problem. Hinge loss.
|
|
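To complement the dual derivation, a minimal sketch of the primal soft-margin view: sub-gradient descent on the L2-regularized hinge loss (labels assumed in {-1, +1}; the step size and regularizer below are illustrative, not tuned):
```python
import numpy as np

def svm_subgradient(X, y, lam=0.01, lr=0.01, n_epochs=100):
    """Sub-gradient descent on the unconstrained primal soft-margin objective
    lam/2 ||w||^2 + (1/n) sum_i max(0, 1 - y_i w^T x_i),  with y_i in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        margins = y * (X @ w)
        viol = margins < 1
        # Sub-gradient of the hinge loss: -y_i x_i on violated margins, 0 otherwise.
        grad = lam * w - (X[viol] * y[viol, None]).sum(axis=0) / n
        w -= lr * grad
    return w
```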
2020-10-09 |
Recap on SVM. Kernel methods. Feature maps. Polynomial and RBF kernels. Mercer's theorem.
|
|
2020-10-12 |
More on kernel methods. Representer theorem. Kernel ridge regression. Support vector regression. Reproducing kernel Hilbert spaces. Multiclass SVM.
|
- Vovk, V. (2013). Kernel Ridge Regression. In Empirical Inference (pp. 105–116). Springer, Berlin, Heidelberg.
- Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199-222.
- Schölkopf, B., & Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press. Chapter 2.
|
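A minimal sketch of kernel ridge regression with an RBF kernel: by the representer theorem the solution lives in the span of the kernel functions at the training points, so fitting reduces to one linear solve for the coefficient vector alpha (the function names and the gamma parameterization are just for the example):
```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=1.0, gamma=1.0):
    """Kernel ridge regression: alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    """Prediction f(x) = sum_i alpha_i k(x, x_i)."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```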
2020-10-13 |
No class today
|
|
2020-10-16 |
Learning theory. PAC learning. Agnostic learning and bounds for the estimation error. Learning with infinite function classes: VC-dimension, VC bounds.
|
|
2020-10-19 |
No class today
|
|
2020-10-20 |
Weak learners. Boosting. Adaboost. Boosting decision stumps. Analysis of Adaboost and exponential loss.
|
- Freund, Y., Schapire, R., & Abe, N. (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771-780.
- Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. of CVPR 2001.
|
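A compact AdaBoost sketch on top of scikit-learn decision stumps (DecisionTreeClassifier with max_depth=1); labels are assumed in {-1, +1}, and the weak-learner choice and number of rounds are illustrative:
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """AdaBoost with decision stumps; labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                       # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight correct examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Sign of the weighted vote of the stumps."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```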
2020-10-23 |
Practice on boosted decision stumps.
|
|
2020-10-23 |
Bagging. Random forests. Out-of-bag estimate of the prediction loss. Attribute relevance. CART. Introduction to additive models and gradient boosting.
|
- HTF09 9.2.2, 10.3-10.10, 10.12.1, 15
- "Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2), 337–407."
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
|
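A short illustration of the out-of-bag estimate with scikit-learn: each example is scored only by the trees whose bootstrap sample did not contain it, which gives an almost-free estimate of the prediction loss; the dataset here is synthetic and the hyperparameters are not tuned:
```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative data; any regression dataset would do.
X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

# oob_score=True evaluates each example only on the trees that did not see it
# during their bootstrap sample.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB R^2:", rf.oob_score_)
print("Attribute relevance:", rf.feature_importances_)
```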
2020-10-26 |
Gradient boosting. Introduction to representations and representation learning.
|
- Olshausen, Bruno A., and David J. Field (1996). Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature 381, no. 6583, 607–609.
- Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Int. Conf. on Machine Learning (pp. 609–616).
- Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on (pp. 263–272). IEEE.
|
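A minimal gradient-boosting sketch for the squared loss: each round fits a small regression tree to the current residuals (the negative gradient of the loss with respect to the model's predictions) and adds it with a shrinkage factor; the function names and hyperparameters are illustrative:
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_ls(X, y, n_rounds=100, lr=0.1, max_depth=2):
    """Gradient boosting for the squared loss as an additive model."""
    f0 = y.mean()
    F = np.full(len(y), f0)                    # initial constant model
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                      # negative gradient of 1/2 (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += lr * tree.predict(X)              # shrunken additive update
        trees.append(tree)
    return f0, trees

def gb_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)
```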
2020-10-31 |
Artificial neurons and their biological inspiration. Expressiveness of shallow and deep neural networks. Universality for Boolean functions. Universal approximation. Activation functions. VC dimension. Defining the optimization problem for learning.
|
- Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (pp. 315–323).
- Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. Journal of Machine Learning Research 20:1-17.
- K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359–366, 1989.
|
2020-11-02 |
Maximum likelihood training of neural networks. Gradient computations. Algorithmic differentiation (forward and reverse mode). Backpropagation. The structure of the optimization problem for neural networks. Saddle points.
|
- GBC16 6.5, 8.1, 8.2, 8.3.1
- Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 27 (pp. 2933–2941).
- "Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. ArXiv Preprint ArXiv:1502.05767."
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
- Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., & Bengio, S. (2010). Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11, 625–660.
|
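A minimal hand-written backpropagation sketch for a one-hidden-layer network (ReLU hidden units, sigmoid output, binary cross-entropy): the forward pass caches intermediate values and the backward pass applies the chain rule in reverse, exactly as reverse-mode algorithmic differentiation would; shapes and names are illustrative:
```python
import numpy as np

def forward_backward(X, y, W1, b1, W2, b2):
    """Loss and gradients for X -> ReLU(X W1 + b1) -> sigmoid(. W2 + b2), cross-entropy."""
    y = y.reshape(-1, 1)
    # Forward pass, caching the intermediate values needed later.
    Z1 = X @ W1 + b1
    A1 = np.maximum(Z1, 0)                       # ReLU
    Z2 = A1 @ W2 + b2
    p = 1.0 / (1.0 + np.exp(-Z2))                # sigmoid
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # Backward pass: chain rule from the loss back to each parameter.
    n = len(y)
    dZ2 = (p - y) / n                            # sigmoid + cross-entropy combined
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0)
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * (Z1 > 0)                         # ReLU gate
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)
```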
2020-11-03 |
No class today
|
|
2020-11-06 |
Tradeoffs of large scale learning. Stochastic gradient descent. Weight initialization for neural networks. Momentum. Nesterov accelerated gradient.
|
- "LeCun, Yann, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 421–436. Springer, 2012."
- "Bottou, Léon. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, 421–436. Springer, 2012."
- Sutskever, Ilya, James Martens, George Dahl, and Geoffrey Hinton (2013). On the Importance of Initialization and Momentum in Deep Learning. ICML-13, 1139–1147.
|
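The momentum and Nesterov updates from this lecture, written out as a small sketch; `grad_fn(w, batch)` is a placeholder for the minibatch gradient of the training loss and `data` for an iterable of minibatches:
```python
import numpy as np

def sgd_momentum(grad_fn, w0, data, lr=0.01, beta=0.9, nesterov=False, n_epochs=10):
    """Minibatch SGD with (optionally Nesterov) momentum."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(n_epochs):
        for batch in data:
            # Nesterov: evaluate the gradient at the look-ahead point w + beta * v.
            g = grad_fn(w + beta * v, batch) if nesterov else grad_fn(w, batch)
            v = beta * v - lr * g
            w = w + v
    return w
```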
2020-11-09 |
No class today
|
|
2020-11-10 |
Adagrad. RMSProp. Adam. Gradient clipping. Batch Normalization. Effects of L2 regularization. Early stopping. Dropout.
|
- GBC16 8.5, 8.7, 7.1, 7.8, 7.12; A18 3.5, 4.4, 4.6
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121-2159.
- "Kingma, Diederik, and Jimmy Ba (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980"
- "Ioffe, Sergey, and Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift."
- "Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958."
- Reddi, S. J., Kale, S., & Kumar, S. (2019). On the Convergence of Adam and Beyond. ArXiv:1904.09237
- Wager, S., Wang, S., & Liang, P. S. (2013). Dropout Training as Adaptive Regularization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 351–359).
|
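A single Adam step written out, to make the exponential moving averages and the bias correction explicit; this is a sketch of the update rule only, not of any particular library's implementation:
```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g (t counts steps from 1)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```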
2020-11-13 |
No class today
|
|
2020-11-16 |
Practice on Tensorflow (v1 and v2) and Keras.
|
|
2020-11-20 |
Convolutional networks. Variants of the convolutional operator. Pooling. Modules (subnetworks). Highway networks. Residual networks. Densely connected networks.
|
- Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proc. AAAI.
- G. Huang, Z. Liu, G. Pleiss, L. Van Der Maaten and K. Weinberger (2019). Convolutional Networks with Dense Connectivity. IEEE Trans. on Pattern Analysis and Machine Intelligence.
- Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. In Proc. ICML 2015 Deep Learning Workshop.
- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015). Deep Residual Learning for Image Recognition.
|
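Since the labs use Keras, here is a rough sketch of a basic residual block in the functional API; the layer sizes and the tiny example model around it are made up, and this only follows the spirit of He et al. (2015) rather than any exact published architecture:
```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Basic residual block: y = relu(F(x) + shortcut(x))."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # 1x1 convolution on the shortcut when the shapes do not match.
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 3))
h = residual_block(inputs, 16)
h = residual_block(h, 32, stride=2)
h = layers.GlobalAveragePooling2D()(h)
outputs = layers.Dense(10, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)
```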
2020-11-23 |
Sequence learning. Overview of problems and methods. Hidden Markov models. Baum-Welch procedure. Viterbi decoding. Brief introduction to conditional random fields.
|
|
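A compact Viterbi decoder in log space, to make the dynamic program explicit; the argument names (initial, transition, and emission log-probabilities, and an integer observation sequence) are illustrative:
```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state sequence for an HMM (all quantities in log space).
    log_pi[k]: initial probs, log_A[i, j]: transition i->j, log_B[k, o]: emission."""
    n_states, T = len(log_pi), len(obs)
    delta = np.empty((T, n_states))
    psi = np.zeros((T, n_states), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack the best path from the last time step.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```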
2020-11-24 |
The expectation-maximization algorithm. Mixture models. Introduction to recurrent neural networks.
|
- HTF09 8.5; B12 11.1, 11.2, 12; GBC16 19.2, 10.1, 10.2
|
2020-11-27 |
Recurrent networks. Vanishing gradients. Gated recurrent units. Attention mechanisms. Recurrent encoder-decoder with attention. Hierarchical attention.
|
- "Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232."
- Graves, Alex. (2012). Sequence transduction with recurrent neural networks. In ICML 2012.
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.
- Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate.
- Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical Attention Networks for Document Classification. Proc. ACL 2016, 1480–1489.
|
2020-12-01 |
Multi-head attention. Transformer networks. Introduction to generative models and autoencoders.
|
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 6000–6010.
- Alexander Rush (2018). The Annotated Transformer.
|
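A NumPy sketch of scaled dot-product attention and its multi-head version, following the formulas in Vaswani et al. (2017); masking, batching, and the learned projection initializations are omitted, and the function and weight names are just for the example:
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project X into n_heads subspaces, attend in each, concatenate, project back."""
    n, d = X.shape
    d_h = d // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [scaled_dot_product_attention(Q[:, i*d_h:(i+1)*d_h],
                                          K[:, i*d_h:(i+1)*d_h],
                                          V[:, i*d_h:(i+1)*d_h]) for i in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ W_o
```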
2020-12-04 |
Variational autoencoders. Brief introduction to generative adversarial networks. Hyperparameter Optimization. Grid and random search. Sequential model-based and Bayesian approaches.
|
- "Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. J. of the American Statistical Association, 112(518), 859–877."
- Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In Proc. ICLR
- Matthias Feurer and Frank Hutter (2019). Hyperparameter Optimization. In F. Hutter et al. Automated Machine Learning, Springer.
|