B024317 - Machine Learning - Fall 2018

Learning Objectives

In this class you will learn several fundamental and some advanced algorithms for statistical learning, acquire the basics of computational learning theory, and become able to design state-of-the-art solutions to application problems. Broad topics covered include generalized linear models, kernel methods, learning in graphical models, ensemble techniques and boosting, unsupervised learning, and deep learning.

Prerequisites

A good knowledge of a programming language and a solid background in mathematics (calculus, linear algebra, and probability theory) are necessary prerequisites for this course. Previous knowledge of optimization techniques and statistics is useful but not strictly necessary.

Suggested readings

Textbooks:

[GBC16] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2016.

[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. 2nd edition. Springer, 2009.

[B12] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

[S18] O. Simeone. A Brief Introduction to Machine Learning for Engineers. Foundations and Trends in Signal Processing, Vol. 12, No. 3-4, pp. 200-431, 2018. DOI: 10.1561/2000000102.

Other texts:

[RN10] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 3rd revised edition. Prentice Hall, 2010.

[STC00] J. Shawe-Taylor and N. Cristianini. Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

Assessment

9 credits

There is a single oral final exam. You can choose the exam topic, but you are strongly advised to discuss it with me before you begin working on it. Typically, you will be assigned a set of papers to read and will be asked to reproduce some experimental results. You will be required to give a short (30 min) presentation during the exam. Please ensure that your presentation includes an introduction to the problem being addressed, a brief review of the relevant literature, a technical derivation of the methods, and, if appropriate, a detailed description of the experimental work. You may use multimedia tools to prepare your presentation. You are responsible for understanding all the relevant concepts and the underlying theory.

You can work in groups of two to carry out the experimental work (groups of three are exceptional and must be clearly motivated). If you do so, please ensure that individual contributions to the overall work are clearly identifiable.

6 credits

There is a single oral final exam on a subset of topics (e.g. supervised learning, learning theory, graphical models).

Office Hours

Tuesday, 10:45-12:45

Please do not email me about office hours; just check the School of Engineering website for (unlikely) changes.

Schedule and reading materials

Date Topics Readings/Handouts
2018-09-25 Administrivia. Introduction to the discipline of Machine Learning. Supervised learning. HTF09 1, 2.1, 2.2.
2018-09-28 Linear regression as a supervised learning problem. Loss functions for regression. Ordinary least squares and its statistical analysis. Gauss-Markov theorem. Bias-variance decomposition. HTF09 3.1, 3.2, 3.4.1.
2018-10-02 No class today
2018-10-05 Regularization. Ridge regression. Lasso. Regularization paths. The maximum likelihood principle. Examples (Normal data). Kullback-Leibler divergence. Consistency and efficiency of maximum likelihood estimates. HTF09 3.4, 7.1, 7.2, 7.3, 8.2.2. B12 Ch. 8.
2018-10-09 Consistency vs. unbiasedness. MLE and KL divergence. Introduction to Bayesian learning. Ridge regression and Lasso as MAP learning. Classification. Fisher (linear) discriminant analysis. Bayes optimal classifier. B12 Ch. 8, HTF09 4.1-4.4
2018-10-12 Linear discriminant analysis and its limitations. Discriminative vs. generative classifiers. Motivation for the logistic function. Introduction to logistic regression. Logistic regression and log-loss. Gradient computation. Optimization (Newton, gradient descent). Naive Bayes classifier. Relationship between naive Bayes and logistic regression. HTF09 4.4, 6.6.3
2018-10-16 Learning curves for naive Bayes and logistic regression. Laplace smoothing. Bayesian conjugates. Beta distribution. Laplace smoothing as a regularizer. Generalized linear models. Logistic regression and least squares regression as special cases. Softmax regression. S18 3.2, 4.4, 7.1; GBC16 6.2.2.3
2018-10-19 Details on softmax regression and gradient calculations. Maximum margin hyperplane as a constrained optimization problem. Ordinary convex problems and Karush-Kuhn-Tucker theory. Dual form for the MMH. KKT conditions and support vectors. Soft constraints. GBC16 6.2.2.3; HTF09 4.5
2018-10-23 Support vector machines. Dual problem. Hinge loss. Sequential minimal optimization. Kernel methods. Feature maps. Polynomial and RBF kernels. HTF09 5.8.
2018-10-26 Kernelized perceptron. Valid kernels. Mercer's theorem. Representer theorem. Kernel ridge regression. Support vector regression. HTF09 5.8
2018-10-30 Learning theory. No free lunch theorems. PAC learning. Agnostic learning. Bounds for the estimation error. Learning with infinite function classes. VC-dimension. Sauer's lemma. VC bounds.
2018-11-02 No class today (university closed)
2018-11-06 More on VC-dimension. Boosting. AdaBoost and its analysis. Exponential loss. HTF09 10.1, 10.4
2018-11-09 Boosting decision stumps. AdaBoost for face detection. Classification and regression trees (CART). Additive models and gradient boosting. HTF09 9.2.2, 10.3-10.10, 10.12.1
2018-11-13 Bagging. Random forests. Out-of-bag estimate of the prediction loss. Attribute relevance. HTF09 15
2018-11-13 Introduction to representations and representation learning. GBC16 5.11, 13.4
2018-11-16 Artificial neurons and their biological inspiration. Deep and shallow neural networks. Expressiveness. Remarks on numerical computation of the loss. Algorithmic differentiation (forward mode). GBC16 Ch. 6
2018-11-20 Reverse-mode algorithmic differentiation (backpropagation). Greedy layerwise pretraining. Denoising autoencoders. The structure of the optimization problem for neural networks. Tradeoffs of large-scale learning. Stochastic gradient descent. GBC16 6.5, 8.1, 8.2, 8.3.1
2018-11-23 Optimization for deep learning. Weight initialization. Momentum. Nesterov accelerated gradient. Adagrad. RMSProp. Adam. Gradient clipping. Batch normalization. GBC16 8
2018-11-26 Effects of L2 regularization. Early stopping. Dropout. Convolutional networks. GBC16 7, 9.1-9.3, 9.5
2018-11-30 Convolutional networks for image recognition. Main ideas in standard architectures such as AlexNet, VGG, Inception (depth, filter size, modules, bottlenecks, output average pooling, crops). Residual and highway networks.
2018-11-30 Problems in unsupervised learning. Kernel density estimation. Novelty detection. HTF09 6.1, 6.2, 6.6.1
2018-12-04 One-class SVM. Expectation-maximization algorithm and its application to mixture modeling. HTF09 8.5, 14.3.7; GBC16 19.2
2018-12-07 Multidimensional scaling and Isomap. t-SNE. Variational autoencoders. HTF09 14.8, 14.9; GBC16 19.1, 19.4, 20.10.3
2018-12-11 Sequence learning. Hidden Markov models. Baum-Welch procedure. Viterbi decoding. Conditional random fields. B12 11.1, 11.2
2018-12-14 Structural SVM. Recurrent networks. Gated recurrent units. Neural Turing machines. GBC16 10.1, 10.2, 10.3, 10.10, 10.12
2018-12-18 Relational learning. Convolution kernels. Graph kernels (kernels based on paths, Weisfeiler-Lehman). Recursive neural networks. Convolutional networks for graphs. GBC16 10.6
2018-12-21 Hyperparameter optimization. Grid and random search. Sequential model-based and Bayesian approaches. Gradient-based approaches. GBC16 11.4

Note

Full text of linked papers is normally accessible when connecting from a UNIFI IP address. Use the proxy proxy-auth.unifi.it:8888 (with your credentials) if you are connecting from outside the campus network.
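
For example, here is a minimal sketch of fetching a paper through that proxy from Python. It assumes the third-party requests library and that the proxy accepts HTTP basic authentication in the URL; USER, PASSWORD, and the target URL are placeholders, not real values.

    # Minimal sketch: route a download through the UNIFI proxy mentioned above.
    import requests

    # USER and PASSWORD are placeholders for your own credentials.
    proxy = "http://USER:PASSWORD@proxy-auth.unifi.it:8888"
    proxies = {"http": proxy, "https": proxy}

    # Illustrative target: the DOI link for [S18] from the reading list.
    response = requests.get("https://doi.org/10.1561/2000000102", proxies=proxies, timeout=30)
    print(response.status_code)

For interactive browsing, configuring the same proxy address in your browser's network settings achieves the same effect.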