B031278 - Deep Learning

Instructor

Paolo Frasconi, DINFO, via di S. Marta 3, 50139 Firenze

email: .

Office Hours

~~Friday 11:15-13:15~~ Monday 10:45-12:45 (S. Marta). Check here for any variations

Description

The course aims to provide an overview of classic and some current deep learning methodologies. It will tentatively cover the following aspects:

Fundamentals: Supervised and unsupervised learning problems. Supervised learning in single-layer networks. Loss functions, optimizers. Deep networks. Computational graphs, expressiveness, regularizers, optimizers, normalizers.
Deep architectures: convolutional, recurrent, attention and transformers, graph neural networks.
Software frameworks: Tensor operations, automatic differentiation, manipulating data, implementing deep architectures (see blended teaching below).
Beyond single task: domain adaptation and generalization, multi-task learning, transfer learning.
Theory: optimization and generalization in overparameterized systems.
Empirical deep learning: experimental design and reproducibility, hyperparameter optimization, neural architecture search.
Unsupervised deep learning: Autoencoders, anomaly detection, feature extraction, self-supervised learning.

Blended teaching

There will be six online hours (less than one credit) consisting of six 30-minute video lectures, mainly covering practical deep learning programming aspects, with a focus on PyTorch and TensorFlow. There will be a question-and-answer session in class one week after each video lecture is posted. Please email me in advance with the questions that you would like to discuss in class.

Learning objectives

You will be able to understand and apply state-of-the-art algorithms and architectures, to understand the relevant methodological details, and to operate according to current practices. Deep learning is a fast-moving field. To be successful in your future career you will need to develop sufficient skills to competently read and understand a large fraction of the current and future literature (yes, a form of meta-learning). Thus, after successfully completing this course, you should be able to understand, reimplement and evaluate on your own many novel algorithms, with limited help or guidance from a supervisor.

Prerequisites

Multivariate calculus and linear algebra are essential. Elementary numerical optimization, algorithms and data structures, and proficiency in scientific computing with a modern programming language (e.g., NumPy with Python) will be useful.

Assessment

There is a single oral final exam with an associated project. You can choose the topic of your project, but you should discuss it with me during office hours and I will give you the details of what should be done.

Typically, you will be assigned one or more papers to read and will be asked to work at home to reproduce some (simplified) experimental results or to apply the same method to different data or in a slightly different setting. You are responsible for studying the relevant methodological and theoretical prerequisites of these papers (in some cases, studying the references covered in class may be sufficient, but in other cases, especially when dealing with the details of the experimental procedures, reading other ancestors in the citation graph may be necessary).

There is no need to submit a report for your work, but you are asked to share with me the code you have developed (not the data: for that, a link is sufficient), together with some short instructions for reproducing your results. Small zip files can be shared by email (please send a https://0x0.st/ link if the zip is over a MByte), but if you prefer to share a git repository, please create a private one on https://codeberg.org/ and share it with me by inviting the user dl.unifi as a member.
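As a hedged illustration of what "instructions for reproducing your results" can rely on, the sketch below fixes the random seed and records the configuration next to the result. It is only a minimal example under stated assumptions: `run_experiment`, its parameters and the configuration values are hypothetical placeholders, not a required project layout. Frameworks have their own seeding calls (for example `numpy.random.default_rng(seed)` and `torch.manual_seed(seed)`), and full determinism on GPUs may need further settings.

```python
import json
import random


def run_experiment(seed, n_samples=100):
    """Hypothetical stand-in for a training run.

    The returned value depends only on the arguments, because all
    randomness comes from a generator seeded explicitly here.
    """
    rng = random.Random(seed)  # local generator: no hidden global state
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(data) / len(data)


# Illustrative configuration; in a real project this might come from a file.
config = {"seed": 0, "n_samples": 100}
result = run_experiment(**config)

# Store the configuration together with the result so the run can be repeated.
print(json.dumps({"config": config, "result": result}))
```

Running the script twice with the same configuration should print the same line, which is the property that short reproduction instructions can point to.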

You will be required to give a short (30 minutes including time for questions) presentation during the exam. Please ensure that during your presentation you introduce and motivate the problem being addressed in the context of the relevant literature, explain the technical derivation of the methods, and describe in detail the experimental work and the results. You are allowed (but not required) to use multimedia tools to prepare your presentation. You should be prepared to answer general questions about the background literature supporting your paper(s) (for example if the method uses an optimizer, which happens with overwhelming probability, you are supposed to know how it works) and about the details of your experimental work.

You can work in groups of two to carry out the experimental work (three is an exceptional number that you must motivate clearly). If you do so, please ensure that personal contributions to the overall work are clearly identifiable. In any case, during the exam you will have to answer questions individually.

Computational resources

A limited amount of computational resources is available for exam projects. Requests should be submitted by filling in this form.

Schedule and reading materials

Relevant papers and/or sections of the textbook(s) are listed on the right side. [Sections of] papers in the "required" list have been covered in class and should be studied while preparing for the exam. Papers listed as "optional" may be useful to get a better picture of the class topic but you do not need to study them, unless they are directly related to the topic of your project.

Date	Topics	Readings/Handouts
2025-09-17	Administrivia. Most common forms of learning: supervised, unsupervised, reinforcement. Outline of the course. Historical remarks.	BB24 Chapter 1, GBC16 5.1 Optional additional readings: (Schmidhuber 2015);
2025-09-19	Supervised learning and empirical risk minimization. Optimal (binary) classifier.	BB24 4.1, 4.2, 5.2, 5.3, GBC16 5.5. Optional additional readings: (Ng & Jordan 2001);
2025-09-24	The generative direction. Linear discriminant analysis. Maximum likelihood estimation (MLE). Optimality of hyperplanes. The discriminative direction. Single-layer networks. Role of the logistic function. Linking (conditional) MLE and ERM.	BB24 3.2, 5.1, GBC16 5.5, 5.7. Optional additional readings: (Li et al. 2023); (Clark & Jaini, 2023);
2025-09-26	Loss functions for regression (L2, L1, Huber). The least squares problem and its solution. Loss functions for classification (0-1, hinge, log). Introduction to generalized linear models. Bernoulli distribution in the exponential family. Canonical response and link functions.	BB24 4.1, 5.4 Optional additional readings: (Girshick 2015); (Huber 1963);
2025-10-01	Generalized linear models: Logistic regression. Minimizers of the log-loss are maximizers of the conditional likelihood. Brief sketch of two more cases: Gaussian and Poisson. Multiclass classification and softmax regression.	BB24 5.4, 5.1; ZLLS23 4
2025-10-03	Geometric interpretation of softmax regression. Numerical issues and the log-sum-exp trick. Gradient descent and stochastic gradient descent.	BB24 7.1, 7.2; ZLLS23 4, 12 Required additional readings: (Bottou & Bousquet, 2007); Optional additional readings: (Shalev-Shwartz & Ben-David 2014); (Bottou et al. 2018);
2025-10-08	Comparing GD and SGD. Convergence rates. The tradeoffs of large scale learning. Optimization, estimation, and approximation errors. Minibatches. Effects of batch size. SGD with momentum. Adaptive methods (Adagrad, RMSProp, Adam).	BB24 7.2, 7.3 Required additional readings: (Bottou & Bousquet, 2007); (Hinton 2012); (Kingma & Ba, 2014); Optional additional readings: (Shallue et al. 2019); (Duchi et al. 2011);
2025-10-10	No class today.
2025-10-15	Biologically inspired features. Feature learning and end-to-end learning. Compositionality and deep representations. Layerwise training of deep networks. Denoising autoencoders. Multilayered perceptrons. Rectifiers.	BB24 6, GBC16 6, 6.1, 6.3, 6.4, 14.2 Required additional readings: (Bengio et al. 2013); Optional additional readings: (Lee et al. 2009); (Ringach 2002); (Hinton & Salakhutdinov 2006); (Serre et al. 2005); (Hinton et al. 2006); (Vincent et al. 2008); (Glorot et al. 2011); (Nair & Hinton 2010);
2025-10-16	Video lecture: Introduction to deep learning frameworks. Setting up a development environment. Tensors. Working remotely.	ZLLS23 2.1-2.3 Required additional readings: (Video lecture 1);
2025-10-17	Expressiveness of feedforward networks. Basis functions. RBF networks. More activation functions: Leaky ReLU, parametric ReLU. MLPs of ReLUs are piecewise linear functions. Computational graphs and introduction to automatic differentiation.	BB24 8, 9.1 Required additional readings: (Baydin et al. 2017); Optional additional readings: (Poggio & Girosi 1990); (Maas et al. 2013); (He et al. 2015); (Montufar et al. 2014);
2025-10-22	Automatic differentiation in forward and reverse mode. Weight initialization (LeCun, Glorot, He). Regularization. Explicit regularization by penalties. Ridge regularizer.	BB24 8, 7.2.5, 9.1; GBC16 6.5, 7.1, 8.4; ZLLS23 2.5, 5.4.2, 5.5 Required additional readings: (Gebremedhin & Walther, 2020); Optional additional readings: (Rumelhart et al., 1986); (Belkin et al. 2019);
2025-10-23	Logistic regression in pure TensorFlow. TensorBoard. Logistic regression in PyTorch. Video lecture.	ZLLS23 4 Required additional readings: (Video lecture 2); (Video lecture 2 src);
2025-10-24	Interpretations of ridge (L2) regularizers. L2 shrinks more "noisy" directions. Bias-variance tradeoff and double descent. Bayesian interpretation and max-a-posteriori. L1, lasso and elastic net (L1+L2).	BB24 9.1, 9.2, 9.3.2; GBC16 7.1, 7.2, 7.8, 7.9 Optional: BB24 11 if you are unfamiliar with graphical models
2025-10-25	Weight decay. The AdamW optimizer. Weight sharing. Early stopping. Dropout. GELU units and their relationship to dropout.	BB24 7.2.5, 7.4, 9.1.3; GBC16 7.4, 8.4, 8.7.1 Required additional readings: (Srivastava et al. 2014); Optional additional readings: (Loshchilov & Hutter 2018); (Baldi & Sadowski 2014); (Wager et al. 2013); (Hendrycks & Gimpel 2020);
2025-10-31	Data augmentation. Batch and layer normalization. Convolutional networks for Nd signals and their inductive bias. Basic concepts and some variants.	BB24 9.1.3, 7.4, 10.1. GBC16 9.1, 9.2 Required additional readings: (Ioffe & Szegedy 2015); Optional additional readings: (Ba et al. 2016); (Salimans et al. 2016); (Ioffe 2017);
2025-11-04	Automatic differentiation in TensorFlow and in PyTorch. The multilayered perceptron in TensorFlow/Keras. Video lecture.	ZLLS23 2.5, 5 Required additional readings: (Video lecture 3); (Video lecture 3 src);
2025-11-05	Translational equivariance. Convolution on multidimensional signals. Channels. Stacking convolutional layers. Adding a classification head. Strides and pooling. Dilated and transposed convolutions.	BB24 10.1, 10.2; GBC16 9.3-9.5; ZLLS23 7.2-7.6. Required additional readings: (Dumoulin & Visin 2016); Optional additional readings: (Ngiam et al. 2010); (Yu & Koltun 2015); (Radford et al. 2015);
2025-11-07	Bottlenecks (1x1 convolutions). Normalization for CNNs: batch, layer, instance, group. Gates. Mixtures of experts. Skip connections: Highway and residual networks. EfficientNet. Semantic segmentation and U-nets. Fully convolutional structures. Class imbalance in segmentation problems and Dice loss.	BB24 10.4-10.5; ZLLS23 9.4 Required additional readings: (Wu & He 2018); (Ronneberger et al. 2015); Optional additional readings: (He et al. 2015); (Tan & Le 2019); (Jacobs et al. 1991); (Srivastava et al. 2018); (Milletari et al. 2016);
2025-11-11	Dataloaders in PyTorch. Convolutional networks and DenseNet in PyTorch. Video lecture.	ZLLS23 7 Required additional readings: (Video lecture 4); (Video lecture 4 src); (Huang et al. 2017); Optional additional readings: (Walmsley et al. 2021); (Hui et al. 2022);
2025-11-12	Sequence processing problems and some real-world examples. Basic tasks in natural language processing. Embedding layers. Recurrent neural networks and their computational graph. Vanishing/exploding gradients when storing long-term information.	BB24 12.2; JM24 9; GBC16 10.1, 10.2 Optional additional readings: (Hochreiter et al. 2001);
2025-11-14	Stacking RNN layers. Bidirectional RNNs. More on vanishing gradients. Gates in RNNs: LSTM and GRU. The general sequence-to-sequence learning problem. Encoder-decoder architectures. Decoding: Greedy; optimal decoding with Viterbi; beam search; sampling.	BB24 11.3; 12.2; ZLLS23 10; JM24 8.4, 9.1 Required additional readings: (Chung et al. 2014); (Hochreiter & Schmidhuber 1997); (Sutskever et al. 2014);
2025-11-19	The original attention mechanism. Soft dictionaries. Recurrent language models with attention. Machine translation. Attention for sequence classification. Introduction to transformers.	BB24 12.2 Required additional readings: (Bahdanau et al. 2014); Optional additional readings: (Graves et al. 2014); (Yang et al. 2016); (Sukhbaatar et al. 2015);
2025-11-21	Self-attention. Parameterization. Complexity. Multihead self-attention. Handling minibatches: Masking and batch matrix multiplication. Transformers.	BB24 12.1 Required additional readings: (Vaswani et al. 2017); (Rogozhnikov 2022); Optional additional readings: (Dangel 2024);
2025-11-26	Positional encoding. Vision Transformers. Transfer learning. The pretraining/fine-tuning strategy. Fine-tuning. Self-supervised learning: Pretext tasks.	BB24 12.1-12.4, 6.3 Required additional readings: (Vaswani et al. 2017); (Dosovitskiy et al. 2020); Optional additional readings: (Devlin et al. 2018); (Gidaris et al. 2018); (Hun et al. 2016); (Raghu et al. 2019);
2025-12-02	The hyperparameter optimization problem. Elementary algorithm. An introduction to Gaussian processes. Video lecture.	B06 2.3, 3.3; RW06 1,2,4 Required additional readings: (Video lecture 5); Optional additional readings: (Franceschi et al. 2025); Online resources: (GPyTorch); (GPFlow);
2025-11-28	No class today.
2025-12-03	No class today.
2025-12-05	Pretraining and fine-tuning in BERT. Triplet loss for ranking and identification. Contrastive learning. SimCLR. Basic ideas behind domain adaptation. Model-based hyperparameter optimization.	JM24 Ch. 11 Required additional readings: (Devlin et al. 2018); (Van den Oord et al. 2019); (Schroff et al. 2015); (Chen et al. 2020); Optional additional readings: (Hospedales et al. 2021); (Rozantsev et al. 2019); (Tzeng et al. 2017); (Li et al. 2017); (Doersch et al. 2016); (He et al. 2022); (He et al. 2020); (Sohn 2016);
2025-12-10	Model-based hyperparameter optimization, expected improvement. Multi-fidelity approaches to hyperparameter optimization. Successive halving. Hyperband. Brief mention of ASHA and gradient-based approaches. Meta-learning. Algorithms for domain adaptation. Reweighting examples. Mixup.	Required additional readings: (Snoek et al. 2012); (Bergstra et al. 2013); (Li et al. 2018); Optional additional readings: (Jamieson & Talwalkar 2016); (Li et al. 2020); (Yogatama et al. 2015); (Jones et al. 1998); (Frazier 2018); (Finn et al. 2017); (Franceschi et al. 2018); Online resources: (GPyOpt); (Optuna); (Syne Tune);
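
To give a flavour of the numerical issues listed for 2025-10-03 in the schedule above, here is a minimal sketch of the log-sum-exp trick applied to softmax, written with the Python standard library only. The function names and the example logits are illustrative assumptions, not material taken from the lectures or the textbooks.

```python
import math


def logsumexp(scores):
    """Return log(sum(exp(s) for s in scores)) without overflow."""
    m = max(scores)
    # After subtracting the maximum, the largest exponent is exp(0) = 1,
    # so no term can overflow; the shift is added back outside the log.
    return m + math.log(sum(math.exp(s - m) for s in scores))


def log_softmax(scores):
    """Log-probabilities of softmax, computed stably."""
    lse = logsumexp(scores)
    return [s - lse for s in scores]


def softmax(scores):
    """Softmax probabilities obtained by exponentiating the stable log-softmax."""
    return [math.exp(v) for v in log_softmax(scores)]


# With logits this large, a direct math.exp(1000.0) raises OverflowError.
logits = [1000.0, 1001.0, 1002.0]
probs = softmax(logits)
print(probs)
```

Because softmax is invariant to adding a constant to all scores, the result for these logits should agree with the one for `[0.0, 1.0, 2.0]`, which is exactly why the shift by the maximum is harmless.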

Note

Full text of linked papers is normally accessible when connecting from a UNIFI IP address. Use proxy-auth.unifi.it:8888 (with your credentials) if you are connecting from outside the campus network.

B031278 - Deep Learning. Fall 2025

MSc degree in Artificial Intelligence, University of Florence