In this class you will learn several fundamental and some advanced algorithms for statistical learning, acquire the basics of computational learning theory, and become able to design state-of-the-art solutions to application problems. Broad topics covered include: generalized linear models, kernel methods, learning in graphical models, ensemble techniques and boosting, unsupervised learning, and deep learning.
A good knowledge of a programming language and a solid background in mathematics (calculus, linear algebra, and probability theory) are necessary prerequisites for this course. Previous knowledge of optimization techniques and statistics is useful but not strictly necessary.
[GBC16] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2016.
[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. 2nd edition. Springer, 2009.
[B12] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[S18] O. Simeone. A Brief Introduction to Machine Learning for Engineers. Foundations and Trends® in Signal Processing, Vol. 12, No. 3–4, pp. 200–431, 2018. DOI: 10.1561/2000000102.
[RN10] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 3rd edition. Prentice Hall, 2010.
[STC00] J. Shawe-Taylor and N. Cristianini. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
There is a single oral final exam. You can choose the exam topic, but you are strongly advised to discuss it with me before you begin working on it. Typically, you will be assigned a set of papers to read and will be asked to reproduce some experimental results. You will be required to give a short (30-minute) presentation during the exam. Please ensure that your presentation includes an introduction to the problem being addressed, a brief review of the relevant literature, a technical derivation of the methods, and, if appropriate, a detailed description of the experimental work. You may use multimedia tools to prepare your presentation. You are responsible for understanding all the relevant concepts and the underlying theory.
You can work in groups of two to carry out the experimental work (groups of three are exceptional and must be clearly motivated). If you do so, please ensure that each person's contribution to the overall work is clearly identifiable.
There is a single oral final exam on a subset of topics (e.g. supervised learning, learning theory, graphical models).
Please do not email me about office hours; just check the School of Engineering website for (unlikely) changes.
|2018-09-25||Administrivia. Introduction to the discipline of Machine Learning. Supervised learning.||HTF09 1, 2.1, 2.2.|
|2018-09-28||Linear regression as a supervised learning problem. Loss functions for regression. Ordinary least squares and its statistical analysis. Gauss-Markov theorem. Bias-variance decomposition.||HTF09 3.1, 3.2, 3.4.1.|
|2018-10-02||No class today|
|2018-10-05||Regularization. Ridge regression. Lasso. Regularization paths. The maximum likelihood principle. Examples (Normal data). Kullback-Leibler divergence. Consistency and efficiency of maximum likelihood estimates.||HTF09 3.4, 7.1, 7.2, 7.3, 8.2.2. B12 Ch. 8.|
|2018-10-09||Consistency vs. unbiasedness. MLE and KL-Divergence. Introduction to Bayesian learning. Ridge regression and Lasso as MAP learning. Classification. Fisher (linear) discriminant analysis. Bayes optimal classifier.||B12 Ch. 8, HTF09 4.1-4.4|
|2018-10-12||Linear discriminant analysis and its limitations. Discriminant vs. generative classifiers. Motivation for the logistic function. Introduction to logistic regression. Logistic regression and Log-loss. Gradient computation. Optimization (Newton, gradient descent). Naive Bayes classifier. Relationship between Naive Bayes and logistic regression.||HTF09 4.4, 6.6.3|
|2018-10-16||Learning curves for naive Bayes and logistic regression. Laplace smoothing. Bayesian conjugates. Beta distribution. Laplace smoothing as a regularizer. Generalized linear models. Logistic regression and least squares regression as special cases. Softmax regression.||S18 3.2, 4.4, 7.1; GBC16 6.2.2.3|
|2018-10-19||Details on softmax regression and gradient calculations. Maximum margin hyperplane as a constrained optimization problem. Ordinary convex problems and Karush-Kuhn-Tucker theory. Dual form for the MMH. KKT conditions and support vectors. Soft constraints.||GBC16 22.214.171.124; HTF09 4.5|
|2018-10-23||Support vector machines. Dual problem. Hinge-loss. Sequential minimal optimization. Kernel methods. Feature maps. Polynomial and RBF kernels.||HTF09 5.8.|
|2018-10-26||Kernelized perceptron. Valid kernels. Mercer's theorem. Representer theorem. Kernel ridge regression. Support vector regression.||HTF09 5.8|
|2018-10-30||Learning theory. No free lunch theorems. PAC learning. Agnostic learning. Bounds for the estimation error. Learning with infinite function classes. VC-dimension. Sauer's lemma. VC bounds.|
|2018-11-02||No class today (university closed)|
|2018-11-06||More on VC-dimension. Boosting. Adaboost and its analysis. Exponential loss.||HTF09 10.1, 10.4|
|2018-11-09||Boosting decision stumps. Adaboost for face detection. Classification and regression trees (CART). Additive models and gradient boosting.||HTF09 9.2.2, 10.3-10.10, 10.12.1|
|2018-11-13||Bagging. Random forests. Out-of-bag estimate of the prediction loss. Attribute relevance.||HTF09 15|
|2018-11-13||Introduction to representations and representation learning.||GBC16 5.11, 13.4|
|2018-11-16||Artificial neurons and their biological inspiration. Deep and shallow neural networks. Expressiveness. Remarks on numerical computation of the loss. Algorithmic differentiation (forward mode).||GBC16 Ch. 6|
|2018-11-20||Reverse mode algorithmic differentiation (backpropagation). Greedy layerwise pretraining. Denoising autoencoders. The structure of the optimization problem for neural networks. Tradeoffs of large scale learning. Stochastic gradient descent.||GBC16 6.5, 8.1, 8.2, 8.3.1|
|2018-11-23||Optimization for deep learning. Weight initialization. Momentum. Nesterov accelerated gradient. Adagrad. RMSProp. Adam. Gradient clipping. Batch Normalization.||GBC16 8|
|2018-11-26||Effects of L2 regularization. Early stopping. Dropout. Convolutional networks.||GBC16 7, 9.1-9.3,9.5|
|2018-11-30||Convolutional networks for image recognition. Main ideas in some standard architectures like Alexnet, VGG, Inception (depth, filter size, modules, bottlenecks, output average pooling, crops). Residual and highway networks.|
|2018-11-30||Problems in unsupervised learning. Kernel density estimation. Novelty detection.||HTF09 6.1, 6.2, 6.6.1|
|2018-12-04||One-class SVM. Expectation-maximization algorithm and its application to mixture modeling.||HTF09 8.5, 14.3.7; GBC16 19.2|
|2018-12-07||Multidimensional scaling and Isomap. t-SNE. Variational autoencoders.||HTF09 14.8, 14.9; GBC16 19.1, 19.4, 20.10.3|
|2018-12-11||Sequence learning. Hidden Markov models. Baum-Welch procedure. Viterbi decoding. Conditional random fields.||B12 11.1, 11.2|
|2018-12-14||Structural SVM. Recurrent networks. Gated recurrent units. Neural Turing machines.||GBC16 10.1,10.2,10.3,10.10,10.12|
|2018-12-18||Relational learning. Convolution kernels. Graph kernels (kernels based on paths, Weisfeiler-Lehman). Recursive neural networks. Convolutional networks for graphs.||GBC16 10.6|
|2018-12-21||Hyperparameter Optimization. Grid and random search. Sequential model-based and Bayesian approaches. Gradient based approaches.||GBC16 11.4|
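Many of the techniques in the schedule above are easy to prototype. As a first illustrative sketch (not part of the official course material; NumPy assumed), ordinary least squares and ridge regression via the closed form w = (X'X + lam*I)^{-1} X'y:

```python
import numpy as np

def fit_ridge(X, y, lam=0.0):
    """Closed-form ridge estimate w = (X'X + lam*I)^{-1} X'y.
    With lam = 0 this is ordinary least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Noiseless linear data: OLS recovers the true weights exactly,
# while ridge shrinks the estimate towards zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_ols = fit_ridge(X, y)
w_ridge = fit_ridge(X, y, lam=10.0)
```

Varying lam traces out the regularization path discussed in the 2018-10-05 lecture: the coefficient norm shrinks as lam grows.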
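The logistic regression lectures (log-loss, gradient computation, gradient descent) can be sketched in the same spirit; this toy fragment assumes a bias column already folded into X and labels in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Batch gradient descent on the average log-loss.
    The gradient is X' (sigmoid(Xw) - y) / n."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / n
    return w

# Toy 1-D problem: first column is the bias term.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(float)
```

Replacing the gradient step with a Newton step (see the 2018-10-12 entry) yields iteratively reweighted least squares.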
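For the kernel lectures, a minimal kernel ridge regression sketch (RBF kernel; by the representer theorem the solution has the form f(x) = sum_i alpha_i k(x_i, x)). Illustrative only:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_kernel_ridge(X, y, lam=1e-3, gamma=1.0):
    """Dual coefficients alpha = (K + lam*I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

# Fit a smooth nonlinear function that no linear model can represent.
X = np.linspace(-3.0, 3.0, 40).reshape(-1, 1)
y = np.sin(X).ravel()
alpha = fit_kernel_ridge(X, y)
y_hat = rbf_kernel(X, X) @ alpha
```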
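The boosting lectures (Adaboost on decision stumps) can be illustrated on a 1-D dataset that no single stump classifies correctly; a sketch with labels in {-1, +1}:

```python
import numpy as np

def adaboost_stumps(X, y, rounds=10):
    """AdaBoost with exhaustive threshold stumps on a 1-D feature.
    Returns a list of (threshold, sign, alpha) weak learners."""
    n = len(X)
    w = np.full(n, 1.0 / n)
    model = []
    for _ in range(rounds):
        best = None
        for t in np.unique(X):                    # candidate thresholds
            for s in (1.0, -1.0):                 # stump orientation
                pred = s * np.where(X >= t, 1.0, -1.0)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, t, s, pred)
        err, t, s, pred = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)            # reweight the examples
        w /= w.sum()
        model.append((t, s, alpha))
    return model

def predict(model, X):
    score = sum(a * s * np.where(X >= t, 1.0, -1.0) for t, s, a in model)
    return np.sign(score)

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-1.0, -1.0, 1.0, 1.0, -1.0, -1.0])  # an interval: needs >1 stump
model = adaboost_stumps(X, y)
preds = predict(model, X)
```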
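The expectation-maximization lecture can be illustrated with a deliberately simplified 1-D Gaussian mixture: two components with unit variances and equal mixing weights held fixed, so only the means are learned. A sketch, not the general algorithm from the references:

```python
import numpy as np

def em_two_means(x, steps=100):
    """EM for a two-component 1-D Gaussian mixture with unit variances
    and equal mixing weights; only the two means are estimated."""
    mu = np.array([x.min(), x.max()])             # crude initialisation
    for _ in range(steps):
        # E-step: responsibilities (softmax of the log-densities).
        logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means.
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return np.sort(mu)

# Two well-separated clusters around -4 and +4.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4.0, 1.0, 300), rng.normal(4.0, 1.0, 300)])
mu = em_two_means(x)
```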
Full text of linked papers is normally accessible when connecting from a UNIFI IP address.
Use the proxy proxy-auth.unifi.it:8888 (with your credentials) if you are connecting from outside the UNIFI network.
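For example, with the Python standard library the proxy can be configured as follows (USERNAME and PASSWORD are placeholders for your own credentials; the exact setup depends on your client):

```python
import urllib.request

# Route HTTP(S) traffic through the departmental proxy.
handler = urllib.request.ProxyHandler({
    "http": "http://USERNAME:PASSWORD@proxy-auth.unifi.it:8888",
    "https": "http://USERNAME:PASSWORD@proxy-auth.unifi.it:8888",
})
opener = urllib.request.build_opener(handler)
# opener.open(paper_url)  # use instead of a plain urlopen
```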