2021-09-20 |
Administrivia. Introduction to Machine Learning and the fundamental paradigms.
|
|
2021-09-22 |
Linear regression as a supervised learning problem. Loss functions for regression. Ordinary least squares. Unbiasedness. Gauss-Markov theorem.
|
|
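
A minimal NumPy sketch of ordinary least squares on synthetic data (the design matrix X, targets y, and noise level are illustrative assumptions, not course code):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # add intercept column
    w_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=n)                    # noisy linear targets

    # OLS: minimize ||X w - y||^2; lstsq solves the normal equations stably
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w_hat)  # should be close to w_true (unbiased estimator)
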
2021-09-23 |
Bias-variance decomposition. Regularization. Ridge regression.
|
- B06 3.1, 3.2; HTF09 3.4.1, 7.1, 7.2, 7.3
|
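
A short sketch of the ridge estimator in closed form; the function name ridge_fit and the choice to penalize all coefficients, including the intercept, are simplifying assumptions:

    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        """Closed-form ridge estimate: w = (X^T X + lam I)^{-1} X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # larger lam shrinks the weights toward zero, trading extra bias for lower variance
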
2021-09-27 |
Lasso. Regularization paths. Maximum likelihood principle. MLE and KL-Divergence. Very short introduction to Bayesian learning. Ridge regression and Lasso as MAP learning.
|
|
2021-09-29 |
Classification. Fisher (linear) discriminant analysis. Bayes optimal classifier. Limitations of linear discriminant analysis. Discriminative vs. generative classifiers. Motivation for the logistic function.
|
- HTF09 4.4, 6.6.3; A18 2.2.3; B06 4.1.4, 4.2, 4.3.2
|
2021-09-30 |
Logistic regression and log-loss (cross-entropy loss). Gradient computation for logistic regression. Optimization. Newton's method. Discriminative/generative conjugate pairs. Learning curves.
|
- HTF09 4.4, 6.6.3; A18 2.2.3
|
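
A hedged sketch of Newton's method for logistic regression with the log-loss; labels are assumed to be in {0, 1}, and the small diagonal term added to the Hessian is only for numerical safety:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def newton_logistic(X, y, n_steps=10):
        """Newton's method for the log-loss; y in {0, 1}, X is (n, d)."""
        w = np.zeros(X.shape[1])
        for _ in range(n_steps):
            p = sigmoid(X @ w)
            grad = X.T @ (p - y)            # gradient of the negative log-likelihood
            S = p * (1.0 - p)               # per-example curvature
            H = X.T @ (X * S[:, None])      # Hessian
            w -= np.linalg.solve(H + 1e-8 * np.eye(len(w)), grad)
        return w
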
2021-10-04 |
Generalized linear models. Logistic regression and least squares regression as special cases. Softmax regression.
|
|
2021-10-06 |
Gradient calculations for softmax regression. Poisson regression. Maximum margin hyperplane. Dual form of the maximum-margin hyperplane optimization problem. KKT conditions and support vectors. Soft constraints (support vector machine). Dual SVM problem. Hinge loss.
|
|
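
A rough sketch of training a linear soft-margin SVM by subgradient descent on the regularized hinge loss (the primal view of the problem above); labels are assumed to be in {-1, +1}, and the step size and regularization constant are placeholders:

    import numpy as np

    def svm_hinge_sgd(X, y, lam=0.01, lr=0.1, epochs=20):
        """Subgradient descent on lam/2 * ||w||^2 + mean(max(0, 1 - y * (X w)))."""
        w = np.zeros(X.shape[1])
        n = len(y)
        for _ in range(epochs):
            margins = y * (X @ w)
            active = margins < 1                              # examples violating the margin
            subgrad = lam * w - (X[active].T @ y[active]) / n
            w -= lr * subgrad
        return w
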
2021-10-07 |
Kernel methods. Feature maps. Polynomial and RBF kernels. Mercer's theorem. Closure properties of kernels. Representer theorem. Kernel ridge regression.
|
- Schölkopf, B., & Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press. Chapters 1, 2, 7.
- Schölkopf, B., Herbrich, R., & Smola, A. (2001). A Generalized Representer Theorem. COLT/EuroCOLT.
- Shervashidze, N., Schweitzer, P., van Leeuwen, E.J., Mehlhorn, K., & Borgwardt, K.M. (2011). Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res., 12, 2539-2561.
- Vovk, V. (2013). Kernel Ridge Regression. In Empirical Inference (pp. 105–116). Springer, Berlin, Heidelberg.
|
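
A small sketch of kernel ridge regression with an RBF kernel, following the representer theorem (predictions are expansions over the training points); the bandwidth gamma and penalty lam are illustrative:

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)

    def kernel_ridge_fit(X, y, lam=1.0, gamma=1.0):
        """Dual coefficients alpha = (K + lam I)^{-1} y."""
        K = rbf_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(y)), y)

    def kernel_ridge_predict(X_train, alpha, X_test, gamma=1.0):
        return rbf_kernel(X_test, X_train, gamma) @ alpha
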
2021-10-11 |
Multiclass SVM. Support vector regression. Multi-instance learning.
|
- Rifkin, R.M., & Klautau, A. (2004). In Defense of One-Vs-All Classification. J. Mach. Learn. Res., 5, 101-141.
- Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.
- Andrews, S., Tsochantaridis, I., & Hofmann, T. (2002). Support vector machines for multiple-instance learning. NIPS 01 (pp. 561–568).
|
2021-10-13 |
Kernel density estimation. Novelty detection and the ν-trick. Learning curves. Qualitative behavior of the excess error and its decomposition.
|
- Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7), 1443–1471.
- Tax, D.M., & Duin, R.P. (2004). Support Vector Data Description. Machine Learning, 54, 45-66.
- Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to Statistical Learning Theory. In Advanced Lectures on Machine Learning. Springer.
|
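
A sketch of a one-dimensional Gaussian kernel density estimator with bandwidth h; the bandwidth value is illustrative (in practice it is set by a rule of thumb or cross-validation):

    import numpy as np

    def kde_gaussian(x_query, x_train, h=0.5):
        """Kernel density estimate: average of Gaussian bumps centered at the data."""
        diffs = (x_query[:, None] - x_train[None, :]) / h
        kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
        return kernels.mean(axis=1) / h
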
2021-10-14 |
Classic learning theory. PAC learning. Agnostic learning and bounds for the estimation error. Learning with infinite function classes: VC-dimension, VC bounds.
|
|
2021-10-18 |
No free lunch theorem. Weak learners. Boosting. AdaBoost.
|
|
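
A compact, unoptimized sketch of AdaBoost with axis-aligned decision stumps, assuming labels in {-1, +1}; the exhaustive stump search is for clarity rather than efficiency:

    import numpy as np

    def adaboost_stumps(X, y, n_rounds=50):
        """AdaBoost with decision stumps; y in {-1, +1}."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)                          # example weights
        ensemble = []                                    # (feature, threshold, sign, alpha)
        for _ in range(n_rounds):
            best = None
            for j in range(d):
                for t in np.unique(X[:, j]):
                    for s in (1, -1):
                        pred = s * np.where(X[:, j] <= t, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, t, s, pred)
            err, j, t, s, pred = best
            err = np.clip(err, 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)        # weight of the weak learner
            w *= np.exp(-alpha * y * pred)               # up-weight misclassified examples
            w /= w.sum()
            ensemble.append((j, t, s, alpha))
        return ensemble

    def adaboost_predict(ensemble, X):
        score = sum(a * s * np.where(X[:, j] <= t, 1, -1) for j, t, s, a in ensemble)
        return np.sign(score)
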
2021-10-20 |
Analysis of AdaBoost and exponential loss. Boosting decision stumps. Example: face detection. Bagging. Random forests.
|
|
2021-10-21 |
Gradient boosting. Introduction to representations and representation learning.
|
- HTF09 10; GBC16 5.11, 13.4
- Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
- Olshausen, B. A., & Field, D. J. (1996). Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature, 381(6583), 607–609.
- Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Int. Conf. on Machine Learning (pp. 609–616).
- Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on (pp. 263–272). IEEE.
|
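
A sketch of gradient boosting for the squared loss, where each round fits a small regression tree to the current residuals (the negative gradient); it assumes scikit-learn's DecisionTreeRegressor as the base learner and a fixed learning rate:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_fit(X, y, n_rounds=100, lr=0.1, max_depth=2):
        """Gradient boosting for squared loss: each tree fits the current residuals."""
        f0 = y.mean()
        pred = np.full_like(y, f0, dtype=float)
        trees = []
        for _ in range(n_rounds):
            residuals = y - pred                          # negative gradient of 1/2 (y - f)^2
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            pred += lr * tree.predict(X)
            trees.append(tree)
        return f0, trees

    def gradient_boost_predict(f0, trees, X, lr=0.1):
        return f0 + lr * sum(t.predict(X) for t in trees)
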
2021-10-28 |
Artificial neurons and their biological inspiration. Activation functions. Expressiveness of shallow and deep neural networks. Universality for Boolean functions. Universal approximation.
|
- GBC16 6.1, 6.3, 6.4; A18 1.
- Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (pp. 315–323).
- Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
|
2021-11-04 |
VC-dimension of neural networks. Training of neural networks: losses and objective functions. Gradient computations. Algorithmic differentiation (forward mode).
|
- GBC16 6.5, 8.1, 8.2, 8.3.1
- Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. Journal of Machine Learning Research 20:1-17.
- Gebremedhin, A.H. & Walther, A. (2020) An introduction to algorithmic differentiation. WIREs Data Mining and Knowledge Discovery, 10 (1): e1334. https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1334
- "Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. ArXiv Preprint ArXiv:1502.05767."
|
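
A toy sketch of forward-mode algorithmic differentiation with dual numbers, supporting only addition and multiplication; a full implementation would overload the remaining operations:

    class Dual:
        """Dual number (value, derivative) for forward-mode algorithmic differentiation."""
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.dot + other.dot)
        __radd__ = __add__
        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val * other.val,
                        self.dot * other.val + self.val * other.dot)   # product rule
        __rmul__ = __mul__

    def f(x):
        return x * x * x + 2 * x    # f(x) = x^3 + 2x, so f'(x) = 3x^2 + 2

    x = Dual(2.0, 1.0)              # seed the input's derivative with 1
    print(f(x).val, f(x).dot)       # 12.0 and 14.0
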
2021-11-10 |
Reverse mode AD and Backpropagation. Gradient descent and stochastic gradient descent. Tradeoffs under a time-budget constraint.
|
- "LeCun, Yann, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 421–436. Springer, 2012."
- "Bottou, L., Curtis, F.E. & Nocedal, J. (2018) Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60 (2): 223–311"
|
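
A minimal sketch of minibatch stochastic gradient descent; grad_fn is a hypothetical callback that returns the average batch gradient, e.g. as computed by reverse-mode AD / backpropagation:

    import numpy as np

    def sgd(grad_fn, w0, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
        """Minibatch SGD on an arbitrary differentiable loss."""
        rng = np.random.default_rng(seed)
        w = w0.copy()
        n = len(y)
        for _ in range(epochs):
            perm = rng.permutation(n)                # reshuffle each epoch
            for start in range(0, n, batch_size):
                idx = perm[start:start + batch_size]
                w -= lr * grad_fn(w, X[idx], y[idx])
        return w
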
2021-11-11 |
Momentum. Adagrad. RMSProp. Adam. Gradient clipping. Global minima and non-convexity of overparameterized systems.
|
- GBC16 8.5, 10.11.1; A18 3.5
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 12, 2121–2159.
- "Kingma, Diederik, and Jimmy Ba (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980"
- Cooper, Y. (2021) Global Minima of Overparameterized Neural Networks. SIAM Journal on Mathematics of Data Science, 3 (2): 676–691.
- Liu, C., Zhu, L. & Belkin, M. (2021) Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. arXiv:2003.00307
|
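
A sketch of a single Adam update with bias-corrected first and second moment estimates; the hyperparameter defaults follow the common choices in Kingma & Ba (2014):

    import numpy as np

    def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        """One Adam update given parameters w, gradient g, and moment estimates m, v."""
        m = b1 * m + (1 - b1) * g                 # first moment (momentum-like average)
        v = b2 * v + (1 - b2) * g * g             # second moment (adaptive scaling)
        m_hat = m / (1 - b1 ** t)                 # bias correction (t starts at 1)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v
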
2021-11-15 |
Overparameterized systems. Polyak-Łojasiewicz condition and tangent kernel. Convergence of GD. Implicit regularization.
|
- Liu, C., Zhu, L. & Belkin, M. (2021) Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. arXiv:2003.00307
- Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(1), 2822–2878.
- Hardt, M. (2021). Generalization in Overparameterized Models. In T. Roughgarden (Ed.), Beyond the Worst-Case Analysis of Algorithms (pp. 486–505). Cambridge University Press.
|
2021-11-18 |
Regularizers (weight decay, dropout, data augmentation). Batch normalization. Convolutions.
|
- Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. International Conference on Learning Representations.
- "Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958."
- "Ioffe, Sergey, and Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift."
|
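
Two short sketches for the regularizers above: inverted dropout (train-time rescaling so the expected activation is unchanged) and a training-time batch-normalization forward pass; the inference-time running statistics are omitted:

    import numpy as np

    def dropout(h, p=0.5, training=True, rng=None):
        """Inverted dropout: zero units with probability p at train time and rescale
        by 1/(1-p) so the expected activation matches the test-time behavior."""
        if not training or p == 0.0:
            return h
        rng = rng if rng is not None else np.random.default_rng()
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)

    def batchnorm_forward(h, gamma, beta, eps=1e-5):
        """Training-time batch normalization over the batch dimension (axis 0)."""
        mu = h.mean(axis=0)
        var = h.var(axis=0)
        return gamma * (h - mu) / np.sqrt(var + eps) + beta
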
2021-11-22 |
Convolutional networks. N-d signals. Variants of the convolutional operator (strides, dilation, transposed). Pooling. Modules (subnetworks). Highway and residual networks. U-net.
|
- Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proc. AAAI.
- "Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. In N. Navab, J. Hornegger, W. M. Wells, & A. F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (pp. 234–241)"
- Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive into Deep Learning. ArXiv:2106.11342. Chapters 6 and 7.
- Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. In Proc. ICML 2015 Deep Learning Workshop.
- He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.
|
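
A plain-NumPy sketch of a single-channel 2-D convolution (cross-correlation) with stride and dilation, matching the convolution variants listed above; a residual block would then simply add the input back to the module's output, y = x + F(x):

    import numpy as np

    def conv2d(x, k, stride=1, dilation=1):
        """Valid 2-D cross-correlation of a single-channel image x with kernel k."""
        kh, kw = k.shape
        eff_h = dilation * (kh - 1) + 1            # effective (dilated) kernel size
        eff_w = dilation * (kw - 1) + 1
        out_h = (x.shape[0] - eff_h) // stride + 1
        out_w = (x.shape[1] - eff_w) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i * stride : i * stride + eff_h : dilation,
                          j * stride : j * stride + eff_w : dilation]
                out[i, j] = (patch * k).sum()
        return out
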
2021-11-25 |
Sequence learning. Overview of problems and methods. Recurrent networks. Latching. Vanishing gradients. Gated recurrent units. Long short-term memory networks.
|
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning and Representation Learning Workshop.
- "Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232."
|
2021-11-29 |
Temporal convolutional networks. Time-delay neural networks. Attention mechanisms. Recurrent encoder-decoder with attention.
|
- Bai, S., Kolter, J. Z., & Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling.
- Graves, A. (2012). Sequence transduction with recurrent neural networks. In ICML 2012.
- Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. 3rd International Conference on Learning Representations, ICLR 2015.
|
2021-12-06 |
Attention and self-attention layers. Multi-headed attention. Transformers. Introduction to hyperparameter optimization. Grid and random search. Sequential model-based optimization.
|
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 6000–6010.
- Alexander Rush (2018). The Annotated Transformer.
- Matthias Feurer and Frank Hutter (2019). Hyperparameter Optimization. In F. Hutter et al. Automated Machine Learning, Springer.
|
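
A sketch of scaled dot-product attention as in Vaswani et al. (2017), for a single head and without masking; multi-headed attention applies this in parallel to learned projections of Q, K, V:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # (n_queries, n_keys) similarity matrix
        return softmax(scores, axis=-1) @ V        # convex combination of the value rows
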
2021-12-13 |
Bayesian optimization for hyperparameter optimization. Gaussian Processes. Expected improvement. Examples. Supervised learning on graphs. Graph convolutional networks.
|
- C.E. Rasmussen (2003). Gaussian Processes in Machine Learning.
- Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., & Cox, D. D. (2015). Hyperopt: A Python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1), 014008.
- Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In Proc. of the 5th Int. Conf. on Learning Representations.
- You, J., Ying, Z., & Leskovec, J. (2020). Design Space for Graph Neural Networks. Advances in Neural Information Processing Systems, 33, 17009–17021.
|
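
A sketch of the expected-improvement acquisition function for minimization, given the Gaussian-process posterior mean and standard deviation at candidate points; the exploration parameter xi is an illustrative default:

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, best_so_far, xi=0.01):
        """EI for minimization: how much we expect to improve on the best observed value."""
        sigma = np.maximum(sigma, 1e-12)
        improvement = best_so_far - mu - xi
        z = improvement / sigma
        return improvement * norm.cdf(z) + sigma * norm.pdf(z)
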