In this course you will learn several fundamental and some advanced algorithms for statistical learning, acquire the basics of computational learning theory, and become able to design state-of-the-art solutions to application problems. Broad topics covered include: generalized linear models, kernel methods, learning in graphical models, ensemble techniques and boosting, unsupervised learning, and deep learning.
A good knowledge of a programming language and a solid background in mathematics (calculus, linear algebra, and probability theory) are necessary prerequisites for this course. Previous knowledge of optimization techniques and statistics is useful but not strictly necessary.
[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer, 2009.
[GBC16] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2016.
[B12] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[RN10] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 3rd edition. Prentice Hall, 2010.
[STC00] J. Shawe-Taylor and N. Cristianini. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
There is a single oral final exam. You can choose the exam topic, but you are strongly advised to discuss it with me before you begin working on it. Typically, you will be assigned a set of papers to read and will be asked to reproduce some experimental results. You will be required to give a short (30 min) presentation during the exam. Please ensure that your presentation includes an introduction to the problem being addressed, a brief review of the relevant literature, a technical derivation of the methods, and, if appropriate, a detailed description of the experimental work. You are allowed to use multimedia tools to prepare your presentation. You are responsible for understanding all the relevant concepts and the underlying theory.
You can work in groups of two to carry out the experimental work (groups of three are exceptional and must be clearly motivated). If you do so, please ensure that the personal contributions to the overall work are clearly identifiable.
There is a single oral final exam on a subset of topics (e.g. supervised learning, learning theory, graphical models).
|2017-09-26||Administrivia. Introduction to the discipline of Machine Learning. Supervised learning.||HTF09 1, 2.1, 2.2.|
|2017-09-29||Linear regression as a supervised learning problem. Loss functions for regression. Ordinary least squares and its statistical analysis. Gauss-Markov theorem. Bias-variance decomposition.||HTF09 3.1, 3.2, 3.4.1|
|2017-10-03||Regularization. Ridge regression. Lasso. Regularization paths. The maximum likelihood principle. Examples (Bernoulli data, Normal data). Kullback-Leibler divergence. Consistency of maximum likelihood estimates. Consistency vs. unbiasedness.||HTF09 3.4, 7.1, 7.2, 7.3, 8.2.2. B12 Ch. 8.|
|2017-10-06||MLE and KL-Divergence. Introduction to Bayesian learning. Ridge regression and Lasso as MAP learning. Classification. Bayes optimal classifier. Generative modeling. Linear discriminant analysis and its limitations. Discriminative modeling. Motivation for the logistic function. Introduction to logistic regression.||B12 Ch. 8, HTF09 4.1-4.4|
|2017-10-10||Logistic regression and Log-loss. Gradient computation. Optimization (Newton, stochastic gradient descent). Naive Bayes classifier. Relationship between Naive Bayes and logistic regression. Discretization of continuous attributes. Laplace smoothing.||HTF09 4.4, 6.6.3|
|2017-10-13||Generalized linear models. Logistic regression and least squares regression as special cases. Softmax regression. Ordinary convex problems and Karush-Kuhn-Tucker theory. Maximum margin hyperplane as a constrained optimization problem.|
|2017-10-17||Maximal-margin hyperplane as a constrained optimization problem. Dual form. KKT conditions and support vectors. Soft constraints and support vector machines. Hinge-loss.||HTF09 4.5|
|2017-10-20||Kernel methods. Feature maps. Advantages and disadvantages of kernels. Polynomial and RBF kernels. Valid kernels. Mercer's theorem. Representer theorem.||HTF09 5.8|
|2017-10-24||Kernel ridge regression. Reproducing kernel Hilbert spaces. Closure properties of kernels. Structured learning and convolution kernels. Examples of decompositions (biological sequences, trees, graphs).||HTF09 5.8|
|2017-10-27||Graph kernels (kernels based on paths, Weisfeiler-Lehman, NSPDK). Hash kernels for multitask learning.|
|2017-10-31||Learning theory. No free lunch theorems. PAC learning. Agnostic learning. Bounds for the estimation error. Learning with infinite function classes. VC-dimension. Sauer's lemma and VC bounds.|
|2017-11-03||Boosting. AdaBoost and its analysis. Exponential loss. Learning decision stumps.||HTF09 10.1, 10.4|
|2017-11-07||Bagging and random forests. Multiclass classification with binary classifiers. Support vector regression. Introduction to multiple instance learning.||HTF09 6.1, 6.2, 6.6.1, 15|
|2017-11-10||Multi-instance learning with mi-SVM and MI-SVM. Estimating conditional probabilities with SVM. Kernel density estimation. Novelty detection with one-class SVM.||HTF09 6.1, 6.2, 6.6.1|
|2017-11-14||Refresher on graphical models. Semantics of directed, undirected, and factor graphs. Families of inference algorithms. Parameter learning in directed graphs with complete data (maximum likelihood and Bayesian approach). Missing data and expectation-maximization.||B12 Ch. 5-6|
|2017-11-17||Convergence of the EM algorithm. Application of EM to mixture modeling. Gaussian mixture models. k-means as hard EM. Hidden Markov models. Baum-Welch procedure. Viterbi decoding. Application to speech recognition.||HTF09 8.5, B12 11.1, 11.2|
|2017-11-21||Dimensionality reduction. PCA, linear autoencoders. Kernel PCA. Multidimensional scaling and Isomap. t-SNE.||HTF09 14.5, 14.8, 14.9|
|2017-11-24||Sparse coding. Artificial neurons and biological inspiration. Activation units. Deep and shallow neural networks. Expressiveness.||GBC16 Ch. 1, 5.12, 6.1, 6.2, 6.3.1|
|2017-11-28||Supervised learning for neural networks (classification, regression, multitask). Multitask regularizers. Computing derivatives. Forward and reverse mode algorithmic differentiation. Backpropagation. Greedy layerwise pretraining. Denoising autoencoders.||GBC16 6.3, 6.4|
|2017-12-01||Optimization for deep learning. Stochastic gradient descent. Tradeoffs of large scale learning. Minibatches. Weight initialization. Momentum. Nesterov accelerated gradient. Adagrad. RMSProp. Adam. Gradient clipping.||GBC16 8|
|2017-12-05||Batch normalization. Effects of L2 regularization. Early stopping. Dropout.||GBC16 8.7.1, 7|
|2017-12-12||Intel AI Academy Seminar on Machine Learning and Deep Learning Fundamentals. Room 111 Santa Marta, 14:30|
|2017-12-15||Convolutional networks. Recurrent networks.||GBC16 9.1, 9.2, 9.3, 10.1, 10.2|
|2017-12-19||Long short-term memory networks.||GBC16 10.10|
|2017-12-19||Introduction to Tensorflow, Keras, Chainer.|
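As a small illustration of one of the techniques in the schedule above (ridge regression, 2017-10-03), here is a minimal NumPy sketch of the closed-form ridge estimator. The toy data, the regularization strength `lam`, and the function name `ridge_fit` are illustrative choices, not part of the course material.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solve the linear system instead of inverting A explicitly
    return np.linalg.solve(A, X.T @ y)

# Toy regression problem: noisy linear data with known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_hat = ridge_fit(X, y, lam=0.1)
```

With a small `lam`, the estimate `w_hat` should be close to `w_true`; increasing `lam` shrinks the weights toward zero, trading variance for bias as discussed in the bias-variance lecture.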
Full text of linked papers is normally accessible when connecting from a UNIFI IP address.
Use the proxy proxy-auth.unifi.it:8888 (with your credentials) if you are connecting from outside the UNIFI network.