2021-09-20 |
Administrivia. Introduction to Machine Learning and the fundamental paradigms.
|
|
2021-09-22 |
Linear regression as a supervised learning problem. Loss functions for regression. Ordinary least squares. Unbiasedness. Gauss-Markov theorem.
|
|
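
A minimal NumPy sketch of ordinary least squares on synthetic data (the design matrix X, targets y, and noise level are illustrative assumptions, not course code):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # add intercept column
    w_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=n)                    # noisy linear targets

    # OLS: minimize ||X w - y||^2; lstsq solves the normal equations stably
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w_hat)  # should be close to w_true (unbiased estimator)
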
2021-09-23 |
Bias-variance decomposition. Regularization. Ridge regression.
|
- B06 3.1, 3.2; HTF09 3.4.1, 7.1, 7.2, 7.3
|
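
A short sketch of the ridge estimator in closed form; the function name ridge_fit and the choice to penalize all coefficients, including the intercept, are simplifying assumptions:

    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        """Closed-form ridge estimate: w = (X^T X + lam I)^{-1} X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # larger lam shrinks the weights toward zero, trading extra bias for lower variance
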
2021-09-27 |
Lasso. Regularization paths. Maximum likelihood principle. MLE and KL-Divergence. Very short introduction to Bayesian learning. Ridge regression and Lasso as MAP learning.
|
|
2021-09-29 |
Classification. Fisher (linear) discriminant analysis. Bayes optimal classifier. Limitations of linear discriminant analysis. Discriminative vs. generative classifiers. Motivation for the logistic function.
|
- HTF09 4.4, 6.6.3; A18 2.2.3; B06 4.1.4, 4.2, 4.3.2
|
2021-09-30 |
Logistic regression and log-loss (cross-entropy loss). Gradient computation for logistic regression. Optimization. Newton's method. Discriminative/generative conjugate pairs. Learning curves.
|
- HTF09 4.4, 6.6.3; A18 2.2.3
|
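
A hedged sketch of Newton's method for logistic regression with the log-loss; labels are assumed to be in {0, 1}, and the small diagonal term added to the Hessian is only for numerical safety:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def newton_logistic(X, y, n_steps=10):
        """Newton's method for the log-loss; y in {0, 1}, X is (n, d)."""
        w = np.zeros(X.shape[1])
        for _ in range(n_steps):
            p = sigmoid(X @ w)
            grad = X.T @ (p - y)            # gradient of the negative log-likelihood
            S = p * (1.0 - p)               # per-example curvature
            H = X.T @ (X * S[:, None])      # Hessian
            w -= np.linalg.solve(H + 1e-8 * np.eye(len(w)), grad)
        return w
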
2021-10-04 |
Generalized linear models. Logistic regression and least squares regression as special cases. Softmax regression.
|
|
2021-10-06 |
Gradient calculations for softmax regression. Poisson regression. Maximum margin hyperplane. Dual form of the maximum-margin hyperplane optimization problem. KKT conditions and support vectors. Soft constraints (support vector machine). Dual SVM problem. Hinge loss.
|
|
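
A rough sketch of training a linear soft-margin SVM by subgradient descent on the regularized hinge loss (the primal view of the problem above); labels are assumed to be in {-1, +1}, and the step size and regularization constant are placeholders:

    import numpy as np

    def svm_hinge_sgd(X, y, lam=0.01, lr=0.1, epochs=20):
        """Subgradient descent on lam/2 * ||w||^2 + mean(max(0, 1 - y * (X w)))."""
        w = np.zeros(X.shape[1])
        n = len(y)
        for _ in range(epochs):
            margins = y * (X @ w)
            active = margins < 1                              # examples violating the margin
            subgrad = lam * w - (X[active].T @ y[active]) / n
            w -= lr * subgrad
        return w
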
2021-10-07 |
Kernel methods. Feature maps. Polynomial and RBF kernels. Mercer's theorem. Closure properties of kernels. Representer theorem. Kernel ridge regression.
|
- Schölkopf, B., & Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press. Chapters 1, 2, 7.
- Schölkopf, B., Herbrich, R., & Smola, A. (2001). A Generalized Representer Theorem. COLT/EuroCOLT.
- Shervashidze, N., Schweitzer, P., van Leeuwen, E.J., Mehlhorn, K., & Borgwardt, K.M. (2011). Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res., 12, 2539-2561.
- Vovk, V. (2013). Kernel Ridge Regression. In Empirical Inference (pp. 105–116). Springer, Berlin, Heidelberg.
|
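
A small sketch of kernel ridge regression with an RBF kernel, following the representer theorem (predictions are expansions over the training points); the bandwidth gamma and penalty lam are illustrative:

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)

    def kernel_ridge_fit(X, y, lam=1.0, gamma=1.0):
        """Dual coefficients alpha = (K + lam I)^{-1} y."""
        K = rbf_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(y)), y)

    def kernel_ridge_predict(X_train, alpha, X_test, gamma=1.0):
        return rbf_kernel(X_test, X_train, gamma) @ alpha
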
2021-10-11 |
Multiclass SVM. Support vector regression. Multi-instance learning.
|
- Rifkin, R.M., & Klautau, A. (2004). In Defense of One-Vs-All Classification. J. Mach. Learn. Res., 5, 101-141.
- Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.
- Andrews, S., Tsochantaridis, I., & Hofmann, T. (2002). Support vector machines for multiple-instance learning. NIPS 01 (pp. 561–568).
|
2021-10-13 |
Kernel density estimation. Novelty detection and the ν-trick. Learning curves. Qualitative behavior of the excess error and its decomposition.
|
- Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7), 1443–1471.
- Tax, D.M., & Duin, R.P. (2004). Support Vector Data Description. Machine Learning, 54, 45-66.
- Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to Statistical Learning Theory. In Advanced Lectures on Machine Learning. Springer.
|
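
A sketch of a one-dimensional Gaussian kernel density estimator with bandwidth h; the bandwidth value is illustrative (in practice it is set by a rule of thumb or cross-validation):

    import numpy as np

    def kde_gaussian(x_query, x_train, h=0.5):
        """Kernel density estimate: average of Gaussian bumps centered at the data."""
        diffs = (x_query[:, None] - x_train[None, :]) / h
        kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
        return kernels.mean(axis=1) / h
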
2021-10-14 |
Classic learning theory. PAC learning. Agnostic learning and bounds for the estimation error. Learning with infinite function classes: VC-dimension, VC bounds.
|
|
2021-10-18 |
No free lunch theorem. Weak learners. Boosting. AdaBoost.
|
|
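
A compact, unoptimized sketch of AdaBoost with axis-aligned decision stumps, assuming labels in {-1, +1}; the exhaustive stump search is for clarity rather than efficiency:

    import numpy as np

    def adaboost_stumps(X, y, n_rounds=50):
        """AdaBoost with decision stumps; y in {-1, +1}."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)                          # example weights
        ensemble = []                                    # (feature, threshold, sign, alpha)
        for _ in range(n_rounds):
            best = None
            for j in range(d):
                for t in np.unique(X[:, j]):
                    for s in (1, -1):
                        pred = s * np.where(X[:, j] <= t, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, t, s, pred)
            err, j, t, s, pred = best
            err = np.clip(err, 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)        # weight of the weak learner
            w *= np.exp(-alpha * y * pred)               # up-weight misclassified examples
            w /= w.sum()
            ensemble.append((j, t, s, alpha))
        return ensemble

    def adaboost_predict(ensemble, X):
        score = sum(a * s * np.where(X[:, j] <= t, 1, -1) for j, t, s, a in ensemble)
        return np.sign(score)
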
2021-10-20 |
Analysis of AdaBoost and exponential loss. Boosting decision stumps. Example: face detection. Bagging. Random forests.
|
|
2021-10-21 |
Gradient boosting. Introduction to representations and representation learning.
|
- HTF09 10; GBC16 5.11, 13.4
- Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
- Olshausen, B. A., & Field, D. J. (1996). Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature, 381(6583), 607–609.
- Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Int. Conf. on Machine Learning (pp. 609–616).
- Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on (pp. 263–272). IEEE.
|
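
A sketch of gradient boosting for the squared loss, where each round fits a small regression tree to the current residuals (the negative gradient); it assumes scikit-learn's DecisionTreeRegressor as the base learner and a fixed learning rate:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_fit(X, y, n_rounds=100, lr=0.1, max_depth=2):
        """Gradient boosting for squared loss: each tree fits the current residuals."""
        f0 = y.mean()
        pred = np.full_like(y, f0, dtype=float)
        trees = []
        for _ in range(n_rounds):
            residuals = y - pred                          # negative gradient of 1/2 (y - f)^2
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            pred += lr * tree.predict(X)
            trees.append(tree)
        return f0, trees

    def gradient_boost_predict(f0, trees, X, lr=0.1):
        return f0 + lr * sum(t.predict(X) for t in trees)
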
2021-10-28 |
Artificial neurons and their biological inspiration. Activation functions. Expressiveness of shallow and deep neural networks. Universality for Boolean functions. Universal approximation.
|
- GBC16 6.1, 6.3, 6.4; A18 1.
- Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (pp. 315–323).
- Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
|
2021-11-04 |
VC-dimension of neural networks. Training of neural networks: losses and objective functions. Gradient computations. Algorithmic differentiation (forward mode).
|
- GBC16 6.5, 8.1, 8.2, 8.3.1
- Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. Journal of Machine Learning Research 20:1-17.
- Gebremedhin, A.H. & Walther, A. (2020) An introduction to algorithmic differentiation. WIREs Data Mining and Knowledge Discovery, 10 (1): e1334. https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1334
- "Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. ArXiv Preprint ArXiv:1502.05767."
|
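
A toy sketch of forward-mode algorithmic differentiation with dual numbers, supporting only addition and multiplication; a full implementation would overload the remaining operations:

    class Dual:
        """Dual number (value, derivative) for forward-mode algorithmic differentiation."""
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.dot + other.dot)
        __radd__ = __add__
        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val * other.val,
                        self.dot * other.val + self.val * other.dot)   # product rule
        __rmul__ = __mul__

    def f(x):
        return x * x * x + 2 * x    # f(x) = x^3 + 2x, so f'(x) = 3x^2 + 2

    x = Dual(2.0, 1.0)              # seed the input's derivative with 1
    print(f(x).val, f(x).dot)       # 12.0 and 14.0
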
2021-11-10 |
Reverse mode AD and Backpropagation. Gradient descent and stochastic gradient descent. Tradeoffs under a time-budget constraint.
|
- "LeCun, Yann, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 421–436. Springer, 2012."
- "Bottou, L., Curtis, F.E. & Nocedal, J. (2018) Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60 (2): 223–311"
|
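
A minimal sketch of minibatch stochastic gradient descent; grad_fn is a hypothetical callback that returns the average batch gradient, e.g. as computed by reverse-mode AD / backpropagation:

    import numpy as np

    def sgd(grad_fn, w0, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
        """Minibatch SGD on an arbitrary differentiable loss."""
        rng = np.random.default_rng(seed)
        w = w0.copy()
        n = len(y)
        for _ in range(epochs):
            perm = rng.permutation(n)                # reshuffle each epoch
            for start in range(0, n, batch_size):
                idx = perm[start:start + batch_size]
                w -= lr * grad_fn(w, X[idx], y[idx])
        return w
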
2021-11-11 |
Momentum. Adagrad. RMSProp. Adam. Gradient clipping. Global minima and non-convexity of overparameterized systems.
|
- GBC16 8.5, 10.11.1; A18 3.5
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 12, 2121–2159.
- "Kingma, Diederik, and Jimmy Ba (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980"
- Cooper, Y. (2021) Global Minima of Overparameterized Neural Networks. SIAM Journal on Mathematics of Data Science, 3 (2): 676–691.
- Liu, C., Zhu, L. & Belkin, M. (2021) Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. arXiv:2003.00307
|
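
A sketch of a single Adam update with bias-corrected first and second moment estimates; the hyperparameter defaults follow the common choices in Kingma & Ba (2014):

    import numpy as np

    def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        """One Adam update given parameters w, gradient g, and moment estimates m, v."""
        m = b1 * m + (1 - b1) * g                 # first moment (momentum-like average)
        v = b2 * v + (1 - b2) * g * g             # second moment (adaptive scaling)
        m_hat = m / (1 - b1 ** t)                 # bias correction (t starts at 1)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v
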
2021-11-15 |
Overparameterized systems. Polyak-Łojasiewicz condition and tangent kernel. Convergence of GD. Implicit regularization.
|
- Liu, C., Zhu, L. & Belkin, M. (2021) Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. arXiv:2003.00307
- Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(1), 2822–2878.
- Hardt, M. (2021). Generalization in Overparameterized Models. In T. Roughgarden (Ed.), Beyond the Worst-Case Analysis of Algorithms (pp. 486–505). Cambridge University Press.
|
2021-11-18 |
Regularizers (weight decay, dropout, data augmentation). Batch normalization. Convolutions.
|
- Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. International Conference on Learning Representations.
- "Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958."
- "Ioffe, Sergey, and Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift."
|
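
Two short sketches for the regularizers above: inverted dropout (train-time rescaling so the expected activation is unchanged) and a training-time batch-normalization forward pass; the inference-time running statistics are omitted:

    import numpy as np

    def dropout(h, p=0.5, training=True, rng=None):
        """Inverted dropout: zero units with probability p at train time and rescale
        by 1/(1-p) so the expected activation matches the test-time behavior."""
        if not training or p == 0.0:
            return h
        rng = rng if rng is not None else np.random.default_rng()
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)

    def batchnorm_forward(h, gamma, beta, eps=1e-5):
        """Training-time batch normalization over the batch dimension (axis 0)."""
        mu = h.mean(axis=0)
        var = h.var(axis=0)
        return gamma * (h - mu) / np.sqrt(var + eps) + beta
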
2021-11-22 |
Convolutional networks. N-d signals. Variants of the convolutional operator (strides, dilation, transposed). Pooling. Modules (subnetworks). Highway and residual networks. U-net.
|
- Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proc. AAAI.
- "Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. In N. Navab, J. Hornegger, W. M. Wells, & A. F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (pp. 234–241)"
- Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive into Deep Learning. ArXiv:2106.11342. Chapters 6 and 7.
- Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. In Proc. ICML 2015 Deep Learning Workshop.
- He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.
|
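
A plain-NumPy sketch of a single-channel 2-D convolution (cross-correlation) with stride and dilation, matching the convolution variants listed above; a residual block would then simply add the input back to the module's output, y = x + F(x):

    import numpy as np

    def conv2d(x, k, stride=1, dilation=1):
        """Valid 2-D cross-correlation of a single-channel image x with kernel k."""
        kh, kw = k.shape
        eff_h = dilation * (kh - 1) + 1            # effective (dilated) kernel size
        eff_w = dilation * (kw - 1) + 1
        out_h = (x.shape[0] - eff_h) // stride + 1
        out_w = (x.shape[1] - eff_w) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i * stride : i * stride + eff_h : dilation,
                          j * stride : j * stride + eff_w : dilation]
                out[i, j] = (patch * k).sum()
        return out
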
2021-11-25 |
Sequence learning. Overview of problems and methods. Recurrent networks. Latching. Vanishing gradients. Gated recurrent units. Long short-term memory networks.
|
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning and Representation Learning Workshop.
- "Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232."
|
2021-11-29 |
Temporal convolutional networks. Time-delay neural networks. Attention mechanisms. Recurrent encoder-decoder with attention.
|
- Bai, S., Kolter, J. Z., & Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling.
- Graves, A. (2012). Sequence transduction with recurrent neural networks. In ICML 2012.
- Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. 3rd International Conference on Learning Representations, ICLR 2015.
|
2021-12-06 |
Attention and self-attention layers. Multi-headed attention. Transformers. Introduction to hyperparameter optimization. Grid and random search. Sequential model-based optimization.
|
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 6000–6010.
- Alexander Rush (2018). The Annotated Transformer.
- Matthias Feurer and Frank Hutter (2019). Hyperparameter Optimization. In F. Hutter et al. Automated Machine Learning, Springer.
|
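
A sketch of scaled dot-product attention as in Vaswani et al. (2017), for a single head and without masking; multi-headed attention applies this in parallel to learned projections of Q, K, V:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # (n_queries, n_keys) similarity matrix
        return softmax(scores, axis=-1) @ V        # convex combination of the value rows
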
2021-12-13 |
Bayesian optimization for hyperparameter optimization. Gaussian Processes. Expected improvement. Examples. Supervised learning on graphs. Graph convolutional networks.
|
- C.E. Rasmussen (2003). Gaussian Processes in Machine Learning.
- Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., & Cox, D. D. (2015). Hyperopt: A Python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1), 014008.
- Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In Proc. of the 5th Int. Conf. on Learning Representations.
- You, J., Ying, Z., & Leskovec, J. (2020). Design Space for Graph Neural Networks. Advances in Neural Information Processing Systems, 33, 17009–17021.
|
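
A sketch of the expected-improvement acquisition function for minimization, given the Gaussian-process posterior mean and standard deviation at candidate points; the exploration parameter xi is an illustrative default:

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, best_so_far, xi=0.01):
        """EI for minimization: how much we expect to improve on the best observed value."""
        sigma = np.maximum(sigma, 1e-12)
        improvement = best_so_far - mu - xi
        z = improvement / sigma
        return improvement * norm.cdf(z) + sigma * norm.pdf(z)
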