The "Large Scale Deep Learning Models" course focuses on the methodologies and techniques used to train large models on extensive datasets across various data domains, including images, text, and audio. The course provides in-depth coverage of self-supervised learning approaches, which have become crucial for leveraging vast amounts of unlabeled data. Topics include data preprocessing and augmentation for different modalities, architectural considerations for scaling deep learning models, and strategies for distributed and parallel training.
Instructors: Alexander Shabalin, Ildus Sadrtdinov, Dmitry Kropotov
Classes: Mondays, in person, in classroom EH-4, in two time slots: 14:15 - 15:30 and 15:45 - 17:00
Telegram chat for questions and discussion: link
Practical assignments: all assignments are released and graded in the corresponding Teams space. If you don't have access to the Teams space, please write directly to one of the instructors or in the course chat.
Assessment Component 1: written examination, Duration: 60 min, Weight: 50 %
Assessment Component 2: programming assignments, Weight: 50 %
Completion: To pass this module, the examination of each module component must be passed with at least 45%.
Date | Number | Topic | Materials |
---|---|---|---|
09.09.24 | 01 | Introduction to the course. Large models, large datasets and self-supervised learning. What to do with a pretrained model? Linear probing, Fine-tuning, in-distribution (ID) and out-of-distribution (OOD) performance. CLIP model, Zero-shot and WiSE-FT (robust weights ensemble). | Fine-tuning distorts features, Comparing pre-training algorithms, CLIP, WiSE-FT, Do ImageNet Classifiers Generalize to ImageNet? |
23.09.24 | 02 | Classical pretext tasks for images: inpainting, colorization, jigsaw puzzles | Exemplar, Context Prediction, Inpainting, Jigsaw Puzzles, Colorization, Rotations, Damaged Jigsaw Puzzles, Task Ensemble |
30.09.24 | 03 | Modern architectures for images: ViT, DeiT, MLP Mixer, Swin, ConvNeXt, Neighborhood Attention Transformer (NAT). Efficient training & inference: Automatic Mixed Precision (AMP), Data-Parallel and Model-Parallel training | Big Transfer, ViT, DeiT, MLP Mixer, Swin, ConvNeXt, NAT |
07.10.24 | 04 | Contrastive learning for images. Mutual information, SimCLR, MoCo, BYOL, SimSiam, DeepCluster, SwAV. Deriving contrastive loss | SimCLR, MoCo, BYOL, SimSiam, DeepCluster, SwAV |
14.10.24 | 05 | Self-supervised learning for ViT. Masked image modeling. DINO, BEiT, MAE, MaskFeat. Different approaches for improving contrastive learning. | DINO, BEiT, MAE, MaskFeat, Dense CL, Supervised CL, DiLo, LooC |
21.10.24 | 06 | Mode connectivity and Linear mode connectivity (LMC). Ensembling: Deep Ensemble (DE), SSE, FGE, cSGLD, KFAC-Laplace, SWAG, SPRO, StarSSE. Model averaging: SWA, model soups. Weight averaging in optimization: Lookahead, Lookaround, WATT | LMC, LMC in transfer learning, DE, DE and loss landscape, DE and distribution shifts, SSE, FGE, cSGLD, KFAC-Laplace, SWAG, SPRO, DE Equivalent, StarSSE, SWA, model soups, Lookahead, Lookaround, WATT |
28.10.24 | 08 | Modern architectures for texts. Recap of transformers. Modern architectures. Transformer training tricks. | Flash attention, FA blogpost, KV-caching, Multi-Query attention, Relative Position Encodings, RoPE, ALiBi, GLU, Mixture of Experts, Pre-normalization, RMSNorm |
04.11.24 | 07 | Pruning, Quantization, Distillation | Pruning, Quantization 1, Quantization 2, Distillation, DistilBERT |
11.11.24 | 09 | Parameter-Efficient Fine-tuning. GPT zero-shot, Prompt Tuning, Adapters, LoRA, BitFit | GPT-3, Prompt Tuning, P-Tuning, Adapters, LoRA, BitFit |
18.11.24 | 10 | Retrieval Augmented Generation | |
25.11.24 | 11 | Text Diffusion Models | |
02.12.24 | 12 | Introduction to audio processing. Text-to-speech (TTS): WaveNet, Tacotron 2, WaveGlow, HiFi-GAN. Automatic Speech Recognition (ASR): CTC Loss, Jasper, Whisper. Self-supervised learning for audio: CPC, Wav2Vec 2.0, HuBERT, Multi-format contrastive learning, BYOL-A, CLAP | WaveNet, Tacotron 2, WaveGlow, HiFi-GAN, CTC Loss, Jasper, Whisper, CPC, Wav2Vec 2.0, HuBERT, Multi-format CL, BYOL-A, CLAP |
Number | Release date | Deadline | Topic |
---|---|---|---|
01 | 15.09.24 | 01.10.24 23:59 | Robust fine-tuning of CLIP |
02 | 01.10.24 | 18.10.24 23:59 | Classical pretext tasks |
03 | 18.10.24 | 06.11.24 23:59 | Contrastive learning |