B024317 - Machine Learning - Fall 2016

Learning Objectives

In this class you will learn several fundamental and some advanced algorithms for statistical learning, acquire the basics of computational learning theory, and become able to design state-of-the-art solutions to application problems. Broad topics covered include: generalized linear models, kernel methods, learning in graphical models, ensemble techniques and boosting, unsupervised learning, and deep learning.

Prerequisites

A good knowledge of a programming language and a solid background in mathematics (calculus, linear algebra, and probability theory) are necessary prerequisites for this course. Previous knowledge of optimization techniques and statistics is useful but not strictly necessary.

Suggested readings

Textbooks:

[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. 2nd edition. Springer, 2009.

[GBC16] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2016.

[B12] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

Other texts:

[RN10] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 3rd revised edition. Prentice Hall, 2010.

[STC00] J. Shawe-Taylor and N. Cristianini. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

Assessment

9 credits

There is a single oral final exam. You can choose the exam topic, but you are strongly advised to discuss it with me before you begin working on it. Typically, you will be assigned a set of papers to read and will be asked to reproduce some experimental results. You will be required to give a short (30 min) presentation during the exam. Please ensure that your presentation includes an introduction to the problem being addressed, a brief review of the relevant literature, a technical derivation of the methods, and, if appropriate, a detailed description of the experimental work. You may use multimedia tools to prepare your presentation. You are responsible for understanding all the relevant concepts and the underlying theory.

You can work in groups of two to carry out the experimental work (groups of three are exceptional and must be clearly motivated). If you do so, please ensure that individual contributions to the overall work are clearly identifiable.

6 credits

There is a single oral final exam on a subset of the topics (e.g. supervised learning, learning theory, graphical models).

Schedule and reading materials

Date Topics Readings/Handouts
2016-09-26  Administrivia. Introduction to the discipline of Machine Learning. Supervised learning. HTF09 1, 2.1, 2.2.
2016-09-30  Linear regression and ordinary least squares. Statistical analysis. Gauss-Markov theorem. Bias-variance decomposition. Regularization, ridge regression. HTF09 3.1, 3.2, 3.4.1, 7.3.
2016-10-03  Ridge regression. Geometric interpretation. Lasso. Regularization paths. The maximum likelihood principle. Examples (Bernoulli data, Normal data). Consistency of maximum likelihood estimates. Entropy and Kullback-Leibler divergence. OLS as MLE. HTF09 3.4, 7.1, 7.2, 7.3, 8.2.2. B12 ch. 8.
2016-10-07  Introduction to Bayesian learning. Ridge regression and Lasso as MAP learning. B12 ch. 8.
2016-10-07  Classification. Empirical risk minimization. Bayes optimal classifier. Linear and quadratic discriminant analysis. HTF09 4.1-4.3.
2016-10-10  Motivation for the logistic function. Logistic regression. Consistency. Optimization (Newton, stochastic gradient descent). Introduction to naive Bayes. HTF09 4.4.
2016-10-14  Comparison between logistic regression and naive Bayes. Laplace smoothing. Multinomial naive Bayes. Generalized linear models. Softmax regression. HTF09 6.6.3.
2016-10-17  Maximal-margin hyperplane as a constrained optimization problem. Dual form. KKT conditions and support vectors. Soft constraints and support vector machines. Hinge loss. HTF09 4.5.
2016-10-21  Kernel methods. Polynomial and RBF kernels. Kernel target alignment. Hyperparameter search by cross-validation. Valid kernels. Mercer's theorem. Representer theorem. HTF09 5.8.
2016-10-24  Kernelizing algorithms. Kernel ridge regression. Perceptron and kernel perceptron. Voted perceptron. Reproducing kernel Hilbert spaces. Closure properties of kernels. Structured learning and introduction to convolution kernels. HTF09 5.8.
2016-10-26  Convolution kernels. Graph kernels (kernels based on paths, Weisfeiler-Lehman, NSPDK). Multiclass SVM. HTF09 18.3.3.
2016-11-04  Estimating conditional probabilities with SVM. Support vector regression. Multiple instance learning. HTF09 12.3.6.
2016-11-07  Learning theory. No free lunch theorems. PAC learning. Agnostic learning. Bounds for the estimation error.
2016-11-11  Learning with infinite function classes. Countable case. VC dimension. Sauer's lemma and VC bounds. Boosting. Introduction to AdaBoost. HTF09 7.9, 10.1.
2016-11-14  Analysis of AdaBoost. Exponential loss. Learning decision stumps. Decision trees. HTF09 10.1, 10.4, 9.2.
2016-11-18  Bagging and random forests. Kernel density estimation. Novelty detection with one-class SVM. HTF09 6.1, 6.2, 6.6.1, 15.
2016-11-21  Learning in graphical models. Parameter learning in directed graphs. Maximum likelihood and the Bayesian approach. B12 ch. 9.
2016-11-25  Missing data. The expectation-maximization algorithm. Application to mixture modeling. Gaussian mixture models. k-means as hard EM. Semi-supervised learning. HTF09 8.5. B12 11.1, 11.2.
2016-11-28  Hidden Markov models. Baum-Welch procedure. Viterbi decoding. Application to speech recognition. B12 ch. 23.
2016-12-02  Supervised sequence learning. Conditional random fields. Structured output SVM. Principal component analysis. HTF09 14.5.1.
2016-12-01  PCA and singular value decomposition. Kernel PCA. Manifold learning. Multidimensional scaling and Isomap. Sparse coding. HTF09 14.5, 14.8, 14.9.
2016-12-12  Deep and shallow neural networks. Expressiveness. Greedy unsupervised pretraining. Denoising autoencoders. Backpropagation. Computation graphs. Activation units. The structure of the optimization problem.
2016-12-16  Optimization. Stochastic gradient descent. Tradeoffs of large-scale learning. Minibatches. Weight initialization. Momentum. Nesterov accelerated gradient. Adagrad. RMSProp. Adam. Gradient clipping. GBC16 8.3, 8.4, 8.5.
2016-12-19  Regularization. Effects of L2 regularization. Early stopping. Batch normalization. Convolutional networks. Residual networks. Brief overview of recurrent networks. Word vectors. GBC16 7, 9.1, 9.2, 9.3, 10.1, 10.2.

Note

Full text of linked papers is normally accessible when connecting from a UNIFI IP address. Use the proxy proxy-auth.unifi.it:8888 (with your credentials) if you are connecting from outside the campus network.
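If you fetch papers with a script rather than a browser, most HTTP clients can be pointed at the proxy directly. Below is a minimal sketch using Python and the requests library; the paper URL is a placeholder, and USER/PASS stand for your own UNIFI credentials.

    import requests

    # Route HTTP(S) traffic through the UNIFI authenticated proxy.
    PROXY = "http://USER:PASS@proxy-auth.unifi.it:8888"
    proxies = {"http": PROXY, "https": PROXY}

    # Placeholder URL: substitute the actual paper link from the schedule.
    url = "https://example.org/paper.pdf"
    response = requests.get(url, proxies=proxies)
    response.raise_for_status()
    with open("paper.pdf", "wb") as f:
        f.write(response.content)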