In this class you will learn several fundamental and some advanced algorithms for statistical learning, acquire the basics of computational learning theory, and become able to design state-of-the-art solutions to application problems. Broad topics covered include: generalized linear models, kernel methods, learning in graphical models, ensemble techniques and boosting, unsupervised learning, and deep learning.
Good knowledge of a programming language and a solid background in mathematics (calculus, linear algebra, and probability theory) are necessary prerequisites for this course. Previous knowledge of optimization techniques and statistics is useful but not strictly necessary.
[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. 2nd edition. Springer, 2009.
[GBC16] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2016.
[B12] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[RN10] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 3rd edition. Prentice Hall, 2010.
[STC00] J. Shawe-Taylor and N. Cristianini. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
There is a single oral final exam. You can choose the exam topic, but you are strongly advised to discuss it with me before you begin working on it. Typically, you will be assigned a set of papers to read and asked to reproduce some experimental results. You will be required to give a short (30 min) presentation during the exam. Please ensure that your presentation includes an introduction to the problem being addressed, a brief review of the relevant literature, a technical derivation of the methods, and, if appropriate, a detailed description of the experimental work. You may use multimedia tools to prepare your presentation. You are responsible for understanding all the relevant concepts and the underlying theory.
You can work in groups of two to carry out the experimental work (groups of three are exceptional and must be clearly motivated). If you do so, please ensure that each member's contribution to the overall work is clearly identifiable.
The oral final exam covers a subset of topics (e.g. supervised learning, learning theory, graphical models).
|2016-09-26||Administrivia. Introduction to the discipline of Machine Learning. Supervised learning.||HTF09 1, 2.1, 2.2.|
|2016-09-30||Linear regression and ordinary least squares. Statistical analysis. Gauss-Markov theorem. Bias-variance decomposition. Regularization, ridge regression.||HTF09 3.1, 3.2, 3.4.1, 7.3|
|2016-10-03||Ridge regression. Geometric interpretation. Lasso. Regularization paths. The maximum likelihood principle. Examples (Bernoulli data, Normal data). Consistency of maximum likelihood estimates. Entropy and Kullback-Leibler divergence. OLS as MLE.||HTF09 3.4, 7.1, 7.2, 7.3, 8.2.2. B12 Ch. 8.|
|2016-10-07||Introduction to Bayesian learning. Ridge regression and Lasso as MAP learning.||B12 Ch. 8.|
|2016-10-07||Classification. Empirical risk minimization. Bayes optimal classifier. Linear and quadratic discriminant analysis.||HTF09 4.1-4.3|
|2016-10-10||Motivation for the logistic function. Logistic regression. Consistency. Optimization (Newton, stochastic gradient descent). Introduction to Naive Bayes.||HTF09 4.4|
|2016-10-14||Comparison between logistic regression and naive Bayes. Laplace smoothing. Multinomial Naive Bayes. Generalized linear models. Softmax regression.||HTF09 6.6.3|
|2016-10-17||Maximal-margin hyperplane as a constrained optimization problem. Dual form. KKT conditions and support vectors. Soft constraints and support vector machines. Hinge-loss.||HTF09 4.5|
|2016-10-21||Kernel methods. Polynomial and RBF kernels. Kernel target alignment. Hyperparameter searching by cross-validation. Valid kernels. Mercer's theorem. Representer theorem.||HTF09 5.8|
|2016-10-24||Kernelizing algorithms. Kernel ridge regression. Perceptron and kernel perceptron. Voted perceptron. Reproducing kernel Hilbert spaces. Closure properties of kernels. Structured learning and introduction to convolution kernels.||HTF09 5.8|
|2016-10-26||Convolution kernels. Graph kernels (kernels based on paths, Weisfeiler-Lehman, NSPDK). Multiclass SVM.||HTF09 18.3.3|
|2016-11-04||Estimating conditional probabilities with SVM. Support vector regression. Multiple instance learning.||HTF09 12.3.6|
|2016-11-07||Learning theory. No free lunch theorems. PAC learning. Agnostic learning. Bounds for the estimation error.|
|2016-11-11||Learning with infinite function classes. Countable case. VC-dimension. Sauer's lemma and VC bounds. Boosting. Introduction to Adaboost.||HTF09 7.9, 10.1|
|2016-11-14||Analysis of Adaboost. Exponential loss. Learning decision stumps. Decision trees.||HTF09 10.1, 10.4, 9.2|
|2016-11-18||Bagging and random forests. Kernel density estimation. Novelty detection with one-class SVM.||HTF09 6.1, 6.2, 6.6.1, 15|
|2016-11-21||Learning in graphical models. Parameter learning in directed graphs. Maximum likelihood and the Bayesian approach.||B12 Ch. 9|
|2016-11-25||Missing data. The expectation-maximization algorithm. Application to mixture modeling. Gaussian mixture models. k-means as hard EM. Semisupervised learning.||HTF09 8.5, B12 11.1, 11.2|
|2016-11-28||Hidden Markov models. Baum-Welch procedure. Viterbi decoding. Application to speech recognition.||B12 Ch. 23|
|2016-12-02||Supervised sequence learning. Conditional random fields. Structured output SVM. Principal component analysis.||HTF09 14.5.1|
|2016-12-05||PCA and singular value decomposition. Kernel PCA. Manifold learning. Multidimensional scaling and Isomap. Sparse coding.||HTF09 14.5, 14.8, 14.9|
|2016-12-12||Deep and shallow neural networks. Expressiveness. Greedy unsupervised pretraining. Denoising autoencoders. Backpropagation. Computation graphs. Activation units. The structure of the optimization problem.|
|2016-12-16||Optimization. Stochastic gradient descents. Tradeoffs of large scale learning. Minibatches. Weight initialization. Momentum. Nesterov accelerated gradient. Adagrad. RMSProp. Adam. Gradient clipping.||GBC16 8.3, 8.4, 8.5|
|2016-12-19||Regularization. Effects of L2 regularization. Early stopping. Batch normalization. Convolutional networks. Residual networks. Brief overview of recurrent networks. Word vectors.||GBC16 7, 9.1, 9.2, 9.3, 10.1, 10.2|
The full text of linked papers is normally accessible when connecting from a UNIFI IP address.
Use the proxy proxy-auth.unifi.it:8888 (with your credentials) if you are connecting from outside the UNIFI network.