2020-09-21 |
Administrivia. Introduction to the discipline of Machine Learning. Supervised learning.
|
- Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
- Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8-12.
|
2020-09-22 |
Linear regression as a supervised learning problem. Loss functions for regression. Ordinary least squares and its statistical analysis. Gauss-Markov theorem.
|
|
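As a quick illustration of ordinary least squares, a minimal NumPy sketch that fits a linear model on synthetic data (the data-generating weights `w_true` and the noise level are made up for the example):
```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y = X w* + noise
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Ordinary least squares: minimize ||y - Xw||^2.
# np.linalg.lstsq solves the normal equations (X^T X) w = X^T y in a numerically stable way.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # should be close to w_true
```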
2020-09-25 |
Bias-variance decomposition. Regularization. Ridge regression. Lasso. Regularization paths. Maximum likelihood principle.
|
- B06 3.2; HTF09 3.4, 7.1, 7.2, 7.3, 8.2.2; B12 Ch. 8.
|
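A minimal sketch of the ridge estimator in closed form, refit over a small grid of regularization strengths to hint at the regularization path; the data below is synthetic and the helper name `ridge_fit` is only illustrative:
```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative regularization path: refit for a grid of lambda values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, ridge_fit(X, y, lam))
```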
2020-09-28 |
MLE and KL-Divergence. Very short introduction to Bayesian learning. Ridge regression and Lasso as MAP learning.
|
|
2020-09-28 |
Practice on ridge regression, tuning the ridge regularizer, bias-variance decomposition.
|
|
2020-09-29 |
Classification. Fisher (linear) discriminant analysis. Bayes optimal classifier. Limitations of linear discriminant analysis. Discriminative vs. generative classifiers. Motivation for the logistic function. Logistic regression and log-loss (cross-entropy loss).
|
- HTF09 4.4, 6.6.3; A18 2.2.3; B06 4.1.4, 4.2, 4.3.2
|
2020-10-02 |
Gradient computation for logistic regression. Optimization with Newton's method. Naive Bayes classifier (Bernoulli/Gaussian). Naive Bayes and logistic regression as a generative-discriminative pair. Learning curves.
|
- HTF09 4.4, 6.6.3; A18 2.2.3; B12 10.1, 10.2
|
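The gradient and Hessian derived for the log-loss turn into a compact Newton solver; the sketch below assumes labels in {0, 1} and adds a tiny diagonal term to the Hessian so the solve stays well-posed on (nearly) separable data:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_newton(X, y, n_iter=10):
    """Newton's method for logistic regression with labels y in {0, 1}.
    Gradient of the log-loss: X^T (p - y); Hessian: X^T diag(p (1 - p)) X."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-6 * np.eye(d)
        w -= np.linalg.solve(H, grad)
    return w
```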
2020-10-05 |
Generalized linear models. Logistic regression and least squares regression as special cases. Softmax regression. Gradient calculations.
|
- S18 3.2, 4.4, 7.1; GBC16 6.2.2.3
|
2020-10-05 |
Practice on cross-entropy and learning curves.
|
|
2020-10-06 |
Maximum margin hyperplane as a constrained optimization problem. Ordinary convex problems and Karush-Kuhn-Tucker theory. Dual form for the maximum-margin hyperplane. KKT conditions and support vectors. Soft constraints (support vector machine). Dual SVM problem. Hinge loss.
|
|
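To complement the dual derivation, a minimal sketch of the primal soft-margin view: sub-gradient descent on the L2-regularized hinge loss (labels assumed in {-1, +1}; the step size and regularizer below are illustrative, not tuned):
```python
import numpy as np

def svm_subgradient(X, y, lam=0.01, lr=0.01, n_epochs=100):
    """Sub-gradient descent on the unconstrained primal soft-margin objective
    lam/2 ||w||^2 + (1/n) sum_i max(0, 1 - y_i w^T x_i),  with y_i in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        margins = y * (X @ w)
        viol = margins < 1
        # Sub-gradient of the hinge loss: -y_i x_i on violated margins, 0 otherwise.
        grad = lam * w - (X[viol] * y[viol, None]).sum(axis=0) / n
        w -= lr * grad
    return w
```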
2020-10-09 |
Recap on SVM. Kernel methods. Feature maps. Polynomial and RBF kernels. Mercer's theorem.
|
|
2020-10-12 |
More on kernel methods. Representer theorem. Kernel ridge regression. Support vector regression. Reproducing kernel Hilbert spaces. Multiclass SVM.
|
- Vovk, V. (2013). Kernel Ridge Regression. In Empirical Inference (pp. 105–116). Springer, Berlin, Heidelberg.
- Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199-222.
- Schölkopf, B., & Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press. Chapter 2.
|
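A minimal sketch of kernel ridge regression with an RBF kernel: by the representer theorem the solution lives in the span of the kernel functions at the training points, so fitting reduces to one linear solve for the coefficient vector alpha (the function names and the gamma parameterization are just for the example):
```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=1.0, gamma=1.0):
    """Kernel ridge regression: alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    """Prediction f(x) = sum_i alpha_i k(x, x_i)."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```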
2020-10-13 |
No class today
|
|
2020-10-16 |
Learning theory. PAC learning. Agnostic learning and bounds for the estimation error. Learning with infinite function classes: VC-dimension, VC bounds.
|
|
2020-10-19 |
No class today
|
|
2020-10-20 |
Weak learners. Boosting. Adaboost. Boosting decision stumps. Analysis of Adaboost and exponential loss.
|
- Freund, Y., Schapire, R., & Abe, N. (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771-780.
- Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. of CVPR 2001.
|
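A compact AdaBoost sketch on top of scikit-learn decision stumps (DecisionTreeClassifier with max_depth=1); labels are assumed in {-1, +1}, and the weak-learner choice and number of rounds are illustrative:
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """AdaBoost with decision stumps; labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                       # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight correct examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Sign of the weighted vote of the stumps."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```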
2020-10-23 |
Practice on boosted decision stumps.
|
|
2020-10-23 |
Bagging. Random forests. Out-of-bag estimate of the prediction loss. Attribute relevance. CART. Introduction to additive models and gradient boosting.
|
- HTF09 9.2.2, 10.3-10.10, 10.12.1, 15
- "Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2), 337–407."
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
|
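A short illustration of the out-of-bag estimate with scikit-learn: each example is scored only by the trees whose bootstrap sample did not contain it, which gives an almost-free estimate of the prediction loss; the dataset here is synthetic and the hyperparameters are not tuned:
```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative data; any regression dataset would do.
X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

# oob_score=True evaluates each example only on the trees that did not see it
# during their bootstrap sample.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB R^2:", rf.oob_score_)
print("Attribute relevance:", rf.feature_importances_)
```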
2020-10-26 |
Gradient boosting. Introduction to representations and representation learning.
|
- Olshausen, Bruno A., and David J. Field (1996). Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature 381, no. 6583, 607–609.
- Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Int. Conf. on Machine Learning (pp. 609–616).
- Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on (pp. 263–272). IEEE.
|
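A minimal gradient-boosting sketch for the squared loss: each round fits a small regression tree to the current residuals (the negative gradient of the loss with respect to the model's predictions) and adds it with a shrinkage factor; the function names and hyperparameters are illustrative:
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_ls(X, y, n_rounds=100, lr=0.1, max_depth=2):
    """Gradient boosting for the squared loss as an additive model."""
    f0 = y.mean()
    F = np.full(len(y), f0)                    # initial constant model
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                      # negative gradient of 1/2 (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += lr * tree.predict(X)              # shrunken additive update
        trees.append(tree)
    return f0, trees

def gb_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)
```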
2020-10-31 |
Artificial neurons and their biological inspiration. Expressiveness of shallow and deep neural networks. Universality for Boolean functions. Universal approximation. Activation functions. VC dimension. Defining the optimization problem for learning.
|
- Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (pp. 315–323).
- Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. Journal of Machine Learning Research 20:1-17.
- K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359–366, 1989.
|
2020-11-02 |
Maximum likelihood training of neural networks. Gradient computations. Algorithmic differentiation (forward and reverse mode). Backpropagation. The structure of the optimization problem for neural networks. Saddle points.
|
- GBC16 6.5, 8.1, 8.2, 8.3.1
- Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 27 (pp. 2933–2941).
- "Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. ArXiv Preprint ArXiv:1502.05767."
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
- Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., & Bengio, S. (2010). Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11, 625–660.
|
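A minimal hand-written backpropagation sketch for a one-hidden-layer network (ReLU hidden units, sigmoid output, binary cross-entropy): the forward pass caches intermediate values and the backward pass applies the chain rule in reverse, exactly as reverse-mode algorithmic differentiation would; shapes and names are illustrative:
```python
import numpy as np

def forward_backward(X, y, W1, b1, W2, b2):
    """Loss and gradients for X -> ReLU(X W1 + b1) -> sigmoid(. W2 + b2), cross-entropy."""
    y = y.reshape(-1, 1)
    # Forward pass, caching the intermediate values needed later.
    Z1 = X @ W1 + b1
    A1 = np.maximum(Z1, 0)                       # ReLU
    Z2 = A1 @ W2 + b2
    p = 1.0 / (1.0 + np.exp(-Z2))                # sigmoid
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # Backward pass: chain rule from the loss back to each parameter.
    n = len(y)
    dZ2 = (p - y) / n                            # sigmoid + cross-entropy combined
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0)
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * (Z1 > 0)                         # ReLU gate
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)
```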
2020-11-03 |
No class today
|
|
2020-11-06 |
Tradeoffs of large scale learning. Stochastic gradient descent. Weight initialization for neural networks. Momentum. Nesterov accelerated gradient.
|
- "LeCun, Yann, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 421–436. Springer, 2012."
- "Bottou, Léon. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, 421–436. Springer, 2012."
- Sutskever, Ilya, James Martens, George Dahl, and Geoffrey Hinton (2013). On the Importance of Initialization and Momentum in Deep Learning. ICML-13, 1139–1147.
|
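The momentum and Nesterov updates from this lecture, written out as a small sketch; `grad_fn(w, batch)` is a placeholder for the minibatch gradient of the training loss and `data` for an iterable of minibatches:
```python
import numpy as np

def sgd_momentum(grad_fn, w0, data, lr=0.01, beta=0.9, nesterov=False, n_epochs=10):
    """Minibatch SGD with (optionally Nesterov) momentum."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(n_epochs):
        for batch in data:
            # Nesterov: evaluate the gradient at the look-ahead point w + beta * v.
            g = grad_fn(w + beta * v, batch) if nesterov else grad_fn(w, batch)
            v = beta * v - lr * g
            w = w + v
    return w
```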
2020-11-09 |
No class today
|
|
2020-11-10 |
Adagrad. RMSProp. Adam. Gradient clipping. Batch Normalization. Effects of L2 regularization. Early stopping. Dropout.
|
- GBC16 8.5, 8.7, 7.1, 7.8, 7.12; A18 3.5, 4.4, 4.6
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121-2159.
- "Kingma, Diederik, and Jimmy Ba (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980"
- "Ioffe, Sergey, and Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift."
- "Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958."
- Reddi, S. J., Kale, S., & Kumar, S. (2019). On the Convergence of Adam and Beyond. ArXiv:1904.09237
- Wager, S., Wang, S., & Liang, P. S. (2013). Dropout Training as Adaptive Regularization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 351–359).
|
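A single Adam step written out, to make the exponential moving averages and the bias correction explicit; this is a sketch of the update rule only, not of any particular library's implementation:
```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g (t counts steps from 1)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```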
2020-11-13 |
No class today
|
|
2020-11-16 |
Practice on Tensorflow (v1 and v2) and Keras.
|
|
2020-11-20 |
Convolutional networks. Variants of the convolutional operator. Pooling. Modules (subnetworks). Highway networks. Residual networks. Densely connected networks.
|
- Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proc. AAAI.
- G. Huang, Z. Liu, G. Pleiss, L. Van Der Maaten and K. Weinberger (2019). Convolutional Networks with Dense Connectivity. IEEE Trans. on Pattern Analysis and Machine Intelligence.
- Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. In Proc. ICML 2015 Deep Learning Workshop.
- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015). Deep Residual Learning for Image Recognition.
|
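Since the labs use Keras, here is a rough sketch of a basic residual block in the functional API; the layer sizes and the tiny example model around it are made up, and this only follows the spirit of He et al. (2015) rather than any exact published architecture:
```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Basic residual block: y = relu(F(x) + shortcut(x))."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # 1x1 convolution on the shortcut when the shapes do not match.
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 3))
h = residual_block(inputs, 16)
h = residual_block(h, 32, stride=2)
h = layers.GlobalAveragePooling2D()(h)
outputs = layers.Dense(10, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)
```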
2020-11-23 |
Sequence learning. Overview of problems and methods. Hidden Markov models. Baum-Welch procedure. Viterbi decoding. Brief introduction to conditional random fields.
|
|
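A compact Viterbi decoder in log space, to make the dynamic program explicit; the argument names (initial, transition, and emission log-probabilities, and an integer observation sequence) are illustrative:
```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state sequence for an HMM (all quantities in log space).
    log_pi[k]: initial probs, log_A[i, j]: transition i->j, log_B[k, o]: emission."""
    n_states, T = len(log_pi), len(obs)
    delta = np.empty((T, n_states))
    psi = np.zeros((T, n_states), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack the best path from the last time step.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```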
2020-11-24 |
The expectation-maximization algorithm. Mixture models. Introduction to recurrent neural networks.
|
- HTF09 8.5; B12 11.1, 11.2, 12; GBC16 19.2, 10.1, 10.2
|
2020-11-27 |
Recurrent networks. Vanishing gradients. Gated recurrent units. Attention mechanisms. Recurrent encoder-decoder with attention. Hierarchical attention.
|
- "Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232."
- Graves, Alex. (2012). Sequence transduction with recurrent neural networks. In ICML 2012.
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.
- Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate.
- Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical Attention Networks for Document Classification. Proc. ACL 2016, 1480–1489.
|
2020-12-01 |
Multi-head attention. Transformer networks. Introduction to generative models and autoencoders.
|
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 6000–6010.
- Alexander Rush (2018). The Annotated Transformer.
|
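A NumPy sketch of scaled dot-product attention and its multi-head version, following the formulas in Vaswani et al. (2017); masking, batching, and the learned projection initializations are omitted, and the function and weight names are just for the example:
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project X into n_heads subspaces, attend in each, concatenate, project back."""
    n, d = X.shape
    d_h = d // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [scaled_dot_product_attention(Q[:, i*d_h:(i+1)*d_h],
                                          K[:, i*d_h:(i+1)*d_h],
                                          V[:, i*d_h:(i+1)*d_h]) for i in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ W_o
```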
2020-12-04 |
Variational autoencoders. Brief introduction to generative adversarial networks. Hyperparameter Optimization. Grid and random search. Sequential model-based and Bayesian approaches.
|
- "Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. J. of the American Statistical Association, 112(518), 859–877."
- Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In Proc. ICLR
- Matthias Feurer and Frank Hutter (2019). Hyperparameter Optimization. In F. Hutter et al. Automated Machine Learning, Springer.
|