B031278 - Deep Learning

Instructor

Paolo Frasconi, DINFO, via di S. Marta 3, 50139 Firenze

email: .

Office Hours

Wednesday, 14:45-16:45.

Description

The course aims to provide an overview of classic and some current deep learning methodologies. It will tentatively cover the following aspects:

Fundamentals: representation learning (sparse coding, restricted Boltzmann machines, deep autoencoders), computational graphs, expressive power, loss functions, regularizers, optimizers, normalizers, software frameworks.
Deep architectures: convolutional, recurrent, attention and transformers, graph neural networks.
Beyond single task: ~~domain adaptation and generalization~~, multi-task-, transfer-, ~~self-supervised-, meta-, lifelong~~-learning.
Theory: optimization and generalization in overparameterized systems.
Empirical deep learning: experimental design and reproducibility, hyperparameter optimization, ~~neural architecture search~~.
Additional topics: Depending on time available, the course will cover a selection of further topics such as explainable AI, model compression, novelty detection, adversarial learning, fairness.

Learning objectives

You will be able to understand and apply state-of-the-art algorithms and architectures, to understand the relevant methodological details, and to operate according to the current practices. Deep learning is a fast moving field. To be successful in your future career you will need to develop sufficient skills to competently read and understand a large fraction of the current and future literature (yes, a form of meta-learning). Thus, after succesfully completing this course, you should be able to understand, reimplement and evaluate on your own many novel algorithms, with limited help or guidance from a supervisor.

Prerequisites

B031297 - Foundations of Statistical Modeling/Foundations of Statistical Learning is a formal prerequisite. Proficiency on scientific computing with a modern programming language (e.g., NumPy with Python) and familiarity with linear algebra, multivariate calculus, and elementary optimization will be useful.

Assessment

There is a single oral final exam with an associated project. You can choose topic of your project but you should discuss it with me during office hours and I will give you the details of what should be done.

Typically, you will be assigned one or more papers to read and will be asked to work at home to reproduce some (simplified) experimental results or to apply the same method to different data or in a slightly different setting. You are responsible to study the relevant methodological and theoretical prerequisites of these papers (in some cases, studing the references covered in class may be sufficient but in other cases, especially when dealing with the details of the experimental procedures, readings other ancestors in the citation graph may be necessary).

There is no need to submit a report for your work, but you are asked to share with me the code (not the data! --- for that a links is sufficient) you have developed with some short instructions for reproducing your results. Small zip files can be shared by email (please send a https://0x0.st/ link if the zip is over a MByte) but if you prefer to share a git repository, please create a private one on https://codeberg.org/ and share it with me by inviting the user dl.unifi as a member.

You will be required to give a short presentation during the exam. Please ensure that during your presentation you introduce and motivate the problem being addressed in the context of the relevant literature, explain the technical derivation of the methods, and describe in detail the experimental work and the results. You are allowed (but not required) to use multimedia tools to prepare your presentation. You should be prepared to answer general questions about the background literature supporting your paper(s) (for example if the method uses an optimizer, which happens with overwhelming probability, you are supposed to know how it works) and about the details of your experimental work.

You can work in groups of two to carry out the experimental work (three is an exceptional number that you must motivate clearly). If you do so, please ensure that personal contributions to the overall work are clearly identifiable. In any case, during the exam you will have to answer questions individually.

Schedule and reading materials

Relevant papers and/or sections of the textbook(s) are listed on the right side. [Sections of] papers in the "required" list have been covered in class and should be studied while preparing for the exam. Papers listed as "optional" may be useful to get a better picture of the class topic but you do not need to study them, unless they are directly related to the topic of your project.

Date	Topics	Readings/Handouts
2022-09-21	Administrivia. Overview of the course.	GBC16 Chapter 1
2022-09-22	Limitations of feature engineering. Representation Learning. Biologically inspired features. Sparse coding and self-taught learning. Deep belief networks.	Optional: GBC16 9.10, 20.2, 20.3 Required additional readings: (Bengio et al. 2013); Optional additional readings: (Dalal & Triggs, 2005); (Olshausen & Field, 1996); (Hinton & Salakhutdinov 2006);
2022-09-28	Multilayered perceptrons. Computational graphs. Activation functions. Stacked RBMs and denoising autoencoders.	GBC16 6.1, 14.2.2, 14.5; ZLLS22 5.1, 5.2 Required additional readings: (Bengio et al. 2013); (Nair & Hinton 2010); (Glorot et al. 2011); Optional additional readings: (Hahnloser et al. 2010); (Dahl et al. 2013); (Maas et al. 2013); (He et al. 2015);
2022-09-29	Expressive power of MLP. Boolean functions. Arbitrary functions. Benefits of depth.	GBC16 6.4 Optional additional readings: (Pascanu et al. 2013); (Cybenko 1989);
2022-09-29	Loss functions, empirical error, maximum likelihood.	GBC16 6.2, ZLLS22 3 Optional additional readings: (Huber 1963);
2022-10-05	Canonical form of exponential family distributions. Canonical links, and response functions, loss functions, gradients. Examples: Gaussian, Bernoulli, Poisson, Categorical.	GBC16 6.2; ZLLS22 4.1 Optional additional readings: (Myers & Montgomery, 1997);
2022-10-06	Categorical cross-entropy and the log-sum-exp trick. Automatic differentiation in forward and reverse mode.	GBC16 6.5; ZLLS22 4.4, 5.3 Required additional readings: (Baydin et al. 2017); Optional additional readings: (Rumelhart et al., 1986);
2022-10-12	Optimization for deep learning. Stochastic gradient descent and the tradeoffs of large scale learning.	GBC16 8.1; ZLLS22 12.3, 12.4, 12.5 Required additional readings: (Bottou & Bousquet, 2007); Optional additional readings: (Bottou et al. 2018); (Shallue et al. 2019);
2022-10-13	Weight initialization. Local minima and saddle points. Momentum and Nesterov accelerated gradient. Adagrad. RMSProp.	GBC16 8.2.2, 8.2.3, 8.3, 8.5.1, 8.5.2; ZLLS22 12.6, 12.7, 12.8 Required additional readings: (Dauphin et al. 2014); (Hinton 2012); Optional additional readings: (Poston et al. 1991); (Duchi et al. 2011);
2022-10-19	Adam and AMSGrad. Explicit regularization by penalties. Effects of ridge and L1 regularizers. Early stopping compared to L2.	GBC16 8.5.3, 7.1, 7.2, 7.8, 7.9; ZLLS22 12.10, 5.5 Required additional readings: (Kingma & Ba, 2014); Optional additional readings: (Deddi et al. 2019);
2022-10-20	Noise injection. Dropout. Adaptive regularization effect. More activation functions (SiLU, Swish, GELU). Data augmentation.	GBC16 7.5, 7.12, 7.4; ZLLS22 5.6, 14.4. Optional additional readings: (Wager et al. 2013); (Ramachandran et al., 2018); (Ramachandran et al., 2018); (Hendrycks & Gimpel 2020);
2022-10-26	Normalizers. Data standardization. Batch normalization and re-normalization. Weight normalization. Self-normalizing networks.	GBC16 8.7.1; ZLLS22 8.5 Required additional readings: (Ioffe & Szegedy 2015); (Salimans et al. 2016); Optional additional readings: (Klambauer et al. 2017); (Ioffe 2017);
2022-10-27	Convolutions and convolutional networks for Nd signals. Basic concepts and some variants.	GBC16 9.1, 9.2, 9.5 Required additional readings: (Dumoulin & Visin 2016); Optional additional readings: (Ngiam et al. 2010); (Yu & Koltun 2015);
2022-11-02	Computing convolutions. Pooling and strides. Bottlenecks (1x1 convolutions). Basic blocks in VGG and Inception. Transposed convolutions. Fully convolutional networks, U-net.	GBC16 9.3, 9.5, 9.6; ZLLS22 8.2.1, 8.3.1, 8.4.1 Optional additional readings: (Szegedy et al. 2014); (Simonyan & Zisserman 2015); (Ronneberger et al. 2015);
2022-11-03	Normalization for CNNs (batch, layer, instance, group). Gates. Mixtures of experts. Skip connections: Highway and residual networks.	Required additional readings: (Srivastava et al. 2018); Optional additional readings: (Wu & He 2018); (Jacobs et al. 1991); (He et al. 2015);
2022-11-03	Building and training models in Tensorflow and Keras.	Optional: ZLLS22 13 Required additional readings: (Agrawal et al. 2019); Optional additional readings: (Abadi et al. 2016); (Fragments of demo code);
2022-11-09	Sequence learning tasks with examples. Recurrent neural networks as dynamical systems. Main architecture and bidirectional RNN.	GBC16 10.1, 10.2, 10.3; ZLLS22 10.4
2022-11-10	Vanishing gradients in RNNs. Architectural variants and stacking RNN layers. RNNs with gates (gated recurrent units, long short-term memories).	GBC16 2.2.2, 10.5, 10.7, 10.9, 10.10, 10.11; ZLLS22 9.7, 10.1, 10.2, 10.3 Required additional readings: (Chung et al. 2014); Optional additional readings: (Hochreiter et al. 2001); (Hochreiter & Schmidhuber 1997);
2022-11-16	Language models: from probabilistic models to neural networks. Optimal decoding with beam search.	GBC16 12.4.2, Required additional readings: (Jurafsky & Martins, 2022, Sections 7.5, 8.4, 10.5);
2022-11-17	Recurrent language models. Encoder-decoder architecture for sequence-to-sequence learning. Neural Turing machines. Attention mechanisms for neural machine translation.	GBC16 12.4 Required additional readings: (Jurafsky & Martins, 2022, Sections 9.1, 9.3, 10.2, 10.3); (Bahdanau et al. 2014); Optional additional readings: (Sutskever et al. 2014); (Cho et al. 2014); (Graves et al. 2014);
2022-11-23	Attention and hierarchical attention for sequence classification. End-to-end memory networks and its application to question answering.	Required additional readings: (Yang et al. 2016); (Sukhbaatar et al. 2015); Optional additional readings: (Wiegreffe & Pinter 2019);
2022-11-23	Transformers. Graph neural networks, graph convolutional networks, graph attention networks	GBC16 10.6 Required additional readings: (Vaswani et al. 2017); (Kipf & Welling 2017); (Veličković et al. 2018); Optional additional readings: (Scarselli et al. 2009); Online resources: (Huang et al, 2022);
2022-11-30	Hyperparameter optimization. Definitions and elementary algorithms.	Required additional readings: (Bergstra & Bengio 2012); Optional additional readings: (Yogatama et al. 2015); (Haibe-Kains et al. 2020); (Bergstra et al. 2013);
2022-12-01	Building and training models in Pytorch (CNN, LSTM, attention)	Required additional readings: (Paszke et al. 2019); Optional additional readings: (Fragments of demo code);
2022-12-01	Bayesian (model-based) optimization for tuning hyperparameter.	Required additional readings: (Snoek et al. 2012); (Frazier 2018); Optional additional readings: (Jones et al. 1998);
2022-12-07	Multi-fidelity approaches to hyperparameter optimization. Successive halving. Hyperband. Gradient-based approaches.	Required additional readings: (Li et al. 2018); Optional additional readings: (Jamieson & Talwalkar 2016); (Li et al. 2020); (Franceschi et al. 2017); Online resources: (Tune); (Optuna); (Maher et al. 2022);
2022-12-14	The inadequacy of classic learning theory to understand deep learning. Loss surface in overparameterized systems. Double descent.	Required additional readings: (Zhang et al. 2021); (Belkin et al. 2019); Optional additional readings: (Cooper 2001); (Nakkiran et al. 2019); (Liu et al. 2022);
2022-12-15	Implicit bias of gradient descent. Neural Collapse. Some settings beyond single task: multi-task, transfer, meta, self-supervised, contrastive learning. Some multi-task architectures. Transfer via fine-tuning: ImageNet, BERT, T5.	Required additional readings: (Ma et al. 2018); (Devlin et al. 2018); Optional additional readings: (Soudry et al. 2018); (Papyan et al. 2020); (Raghu et al. 2019); (Misra et al. 2016); (Raffel et al. 2020);

Note

Full text of linked papers is normally accessible when connecting from a UNIFI IP address. Use proxy-auth.unifi.it:8888 (with your credentials) if you are connecting from outside the campus network.

B031278 - Deep Learning. Fall 2022

MSc degree in Artificial Intelligence, University of Florence