This lecture series runs for 11 weeks with two classes per week: an introductory lecture and an application lecture. The introductory lectures aim to put every participant on the same page regarding the prerequisites of the following application lecture, while the application lectures present and apply an approach built on that material to deepen the understanding of Deep Learning methods. In the final week, an additional panel featuring previous lecturers will discuss the frontiers of this new field and wrap up the content covered over the course of the event.
All meetings will be streamed and the links will be posted on this page. Don't worry if you miss a meeting: the recordings will be available on the Data ICMC YouTube channel!
Apr 19, 2021 06:00 PM (GMT -3:00)
Opening and introductory lecture 1: Introduction to Deep Learning (Portuguese) - Recording link
Leo Sampaio (Universidade de São Paulo)
Introduction to the basic principles of Machine Learning (ML); discussion of the importance of data and its representations for learning; introduction to the first neural network algorithms in the area; explanation of the concept of convolution and its application in convolutional neural networks; presentation of different deep network architectures (Deep Learning); introduction to the traditional concept of generalization and the importance of representations and kernels.
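To make the convolution operation mentioned above concrete, here is a minimal sketch in Python (numpy only, single channel, "valid" padding; strictly speaking it computes the cross-correlation, as convolutional layers do). The image and filter are arbitrary placeholders.

```python
# Minimal 2D "valid" convolution sketch: slide a small filter over the image
# and take a weighted sum of the overlapping patch at every position.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # placeholder 5x5 "image"
kernel = np.array([[1.0, 0.0, -1.0]] * 3)          # simple vertical-edge filter
print(conv2d(image, kernel))                       # 3x3 feature map
```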
Apr 23, 2021 02:00 PM (GMT -3:00)
Lecture 1: Understanding Generalization Requires Rethinking Deep Learning (English) - Recording link
Boaz Barak and Gal Kaplun (Harvard)
The generalization gap of a learning algorithm is the expected difference between its performance on the training data and its performance on fresh, unseen test samples. Modern deep learning algorithms typically have large generalization gaps, as they use more parameters than the size of their training set. Moreover, the best known rigorous bounds on their generalization gap are often vacuous. In this talk we will see a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a complex representation of the (label-free) training data, and then fitting a simple (e.g., linear) classifier to the labels. Such classifiers have become increasingly popular in recent years, as they offer several practical advantages and have been shown to approach state-of-the-art results. We show that (under the assumptions described below) the generalization gap of such classifiers tends to zero as long as the complexity of the simple classifier is asymptotically smaller than the number of training samples. We stress that our bound is independent of the complexity of the representation, which can use an arbitrarily large number of parameters. Our bound holds assuming that the learning algorithm satisfies certain noise-robustness (adding a small amount of label noise causes only a small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) properties. These conditions widely (and sometimes provably) hold across many standard architectures. We complement this result with an empirical study, demonstrating that our bound is non-vacuous for many popular representation-learning based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and BigBiGAN. The talk will not assume any specific background in machine learning, and should be accessible to a general mathematical audience. Joint work with Yamini Bansal.
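To make the setting concrete, here is a minimal sketch (not the authors' experimental setup; assumes numpy and scikit-learn) of the kind of classifier the bound covers: a frozen, label-free representation with a simple linear head fit on top, whose train-test accuracy difference is the generalization gap. The random-projection "encoder" and synthetic data below are placeholders standing in for a real self-supervised representation such as SimCLR features.

```python
# Sketch: generalization gap of a simple classifier on a frozen representation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(1000, 32)), rng.normal(size=(1000, 32))
y_train = (X_train[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)
y_test = (X_test[:, 0] > 0).astype(int)

W = rng.normal(size=(32, 512))        # frozen "representation", built without labels
encode = lambda X: np.tanh(X @ W)

head = LogisticRegression(max_iter=1000).fit(encode(X_train), y_train)  # simple classifier
gap = head.score(encode(X_train), y_train) - head.score(encode(X_test), y_test)
print(f"generalization gap ≈ {gap:.3f}")
```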
Apr 26, 2021 01:00 PM (GMT -3:00)
Introductory lecture 2: From parametric models to Gaussian Processes (English) - Recording link / Slides
Yingzhen Li (Imperial College London)
Apr 30, 2021 02:00 PM (GMT -3:00)
Lecture 2: The Wide Limit of Neural Networks: NNGP and NTK (English) - Recording link
Jascha Sohl-Dickstein (Google Brain)
As neural networks become wider, their accuracy improves and their behavior becomes easier to analyze theoretically. I will give an introduction to a rapidly growing field -- closely connected to statistical physics -- which examines the learning dynamics and prior over functions induced by infinitely wide, randomly initialized neural networks. Core results that I will discuss include: that the distribution over functions computed by a wide neural network often corresponds to a Gaussian process with a particular compositional kernel, both before and after training; that the predictions of wide neural networks are linear in their parameters throughout training; and that this perspective enables analytic predictions for how the trainability of finite-width networks depends on hyperparameters and architecture. These results provide surprising capabilities -- for instance, evaluating the test set predictions of an infinitely wide trained neural network without ever instantiating a neural network, or rapidly training 10,000+ layer convolutional networks. I will argue that this growing understanding of neural networks in the limit of infinite width is foundational for future theoretical and practical understanding of deep learning. Neural Tangents (software library for working with infinite-width networks)
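As a pointer to the Neural Tangents library mentioned at the end of the abstract, here is a minimal sketch (assuming jax and neural-tangents are installed; the architecture and inputs are placeholders) that builds a wide fully connected network and evaluates its infinite-width NNGP and NTK kernels directly, without instantiating any finite network.

```python
# Sketch: infinite-width NNGP and NTK kernels of a fully connected network.
import jax
from neural_tangents import stax

# The same definition yields a finite-network initializer/apply function and
# the corresponding infinite-width kernel function.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

key1, key2 = jax.random.split(jax.random.PRNGKey(0))
x1 = jax.random.normal(key1, (3, 10))   # placeholder inputs
x2 = jax.random.normal(key2, (4, 10))

kernels = kernel_fn(x1, x2, ('nngp', 'ntk'))
print(kernels.nngp.shape, kernels.ntk.shape)   # (3, 4) (3, 4)
```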
May 3, 2021 05:30 PM (GMT -3:00)
Introductory lecture 3: Dynamical Systems (Portuguese) - Recording link |
Tiago Pereira (Universidade de São Paulo) |
May 7, 2021 03:00 PM (GMT -3:00)
Lecture 3: The Catapult Phase of Neural Networks (English) - Recording link
Guy Gur-Ari (Google)
Why do large learning rates often produce better results? Why do “infinitely wide” networks trained using kernel methods tend to underperform ordinary networks? In the talk I will argue that these questions are related. Existing kernel-based theory can explain the dynamics of networks trained with small learning rates. However, optimal performance is often achieved at large learning rates, where we find qualitatively different dynamics that converge to flat minima. The distinction between the small and large learning rate phases becomes sharp at infinite width, and is reminiscent of nonperturbative phase transitions that appear in physical systems.
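As a rough quantitative anchor for where the two phases separate (a standard fact about quadratic losses, used here only as an approximation of the regime the talk discusses): in the kernel description the loss is effectively quadratic, and gradient descent with learning rate \(\eta\) on a quadratic loss whose largest curvature (top NTK/Hessian eigenvalue) is \(\lambda_{\max}\) converges monotonically only when

\[ \eta < \frac{2}{\lambda_{\max}}. \]

The catapult regime discussed in the talk concerns learning rates above this threshold, where the loss initially grows, the curvature itself shrinks, and training then settles into a flatter minimum instead of diverging.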
May 14, 2021 03:00 PM (GMT -3:00)
Lecture 4: Signal Propagation and Dynamical Isometry in Deep Neural Networks (English) - Recording link
Samuel S. Schoenholz (Google Brain)
Random neural networks converge to Gaussian Processes in the limit of infinite width. In this talk we will study how signals propagate through these wide and random networks. At large depth, we will show that a phase diagram naturally emerges between an “ordered” phase, where all pairs of inputs converge to the same output, and a “chaotic” phase, where nearby inputs become increasingly dissimilar with depth. We will then consider fluctuations of gradients as they are backpropagated through the network. We will show that the distribution of gradient fluctuations can be controlled via the random distribution of weights used for initialization. We will discuss empirical observations about the relationship between this prior and training dynamics. In his talk, Lechao Xiao will elaborate on this relationship via the Neural Tangent Kernel.
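A minimal numerical sketch of the ordered/chaotic picture (numpy only; the tanh activation, zero biases, and the particular weight scales below are illustrative choices, not the talk's exact setup): propagate two nearby inputs through a deep random network and track their cosine similarity with depth.

```python
# Sketch: signal propagation through a deep random tanh network. Depending on
# the weight scale sigma_w, two nearby inputs either converge toward the same
# direction ("ordered") or decorrelate with depth ("chaotic").
import numpy as np

def correlation_vs_depth(sigma_w, depth=50, width=2000, seed=0):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=width)
    x2 = x1 + 0.1 * rng.normal(size=width)          # nearby second input
    correlations = []
    for _ in range(depth):
        W = rng.normal(scale=sigma_w / np.sqrt(width), size=(width, width))
        x1, x2 = np.tanh(W @ x1), np.tanh(W @ x2)
        correlations.append(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))
    return correlations

print(correlation_vs_depth(sigma_w=0.9)[-1])   # ordered: correlation approaches 1
print(correlation_vs_depth(sigma_w=2.5)[-1])   # chaotic: correlation stays well below 1
```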
May 21, 2021 05:00 PM (GMT -3:00)
Lecture 5: Neural Network Loss Landscape in High Dimensions (English) - Recording link
Stanislav Fort (Stanford / Google AI)
Deep neural networks trained with gradient descent have been extremely successful at learning solutions to a broad suite of difficult problems across a wide range of domains such as vision, gameplay, and natural language, many of which had previously been considered to require intelligence. Despite their tremendous success, we still do not have a detailed, predictive understanding of how these systems work. In my talk, I will focus on recent efforts to understand the structure of deep neural network loss landscapes and how gradient descent navigates them during training. I will discuss how we can use tools from high-dimensional geometry to build a phenomenological model of their large-scale structure, the role of their non-linear nature in the early phases of training, and its effects on ensembling, calibration, and out-of-distribution behavior.
May 28, 2021 04:00 PM (GMT -3:00)
Lecture 6: Disentangling Trainability and Generalization in Deep Neural Networks (English) - Recording link
Lechao Xiao (Google Brain)
A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks, the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. A thorough empirical investigation of these theoretical results shows excellent agreement on real datasets.
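For readers who want to poke at the spectral quantities described above, here is a minimal sketch (assuming the Neural Tangents library from Lecture 2; the architecture and data are placeholders): evaluate the infinite-width NTK on a batch and inspect the spread of its eigenvalues, since a poorly conditioned kernel makes gradient-descent training slow.

```python
# Sketch: eigenvalue spread of the infinite-width NTK on a placeholder batch.
import jax
import numpy as np
from neural_tangents import stax

_, _, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x = jax.random.normal(jax.random.PRNGKey(0), (64, 16))
ntk = np.array(kernel_fn(x, x, 'ntk'))     # 64 x 64 kernel matrix
eigs = np.linalg.eigvalsh(ntk)             # eigenvalues in ascending order
print("lambda_max / lambda_min ≈", eigs[-1] / eigs[0])
```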
May 31, 2021 06:00 PM (GMT -3:00)
Introductory lecture 7: Introduction to Statistical Mechanics (Portuguese) - Recording link |
Tereza Cristina da Rocha Mendes (Universidade de São Paulo) |
Jun 4, 2021 10:00 AM (GMT -3:00)
Lecture 7: Explaining Neural Scaling Laws (English) - Recording link
Jaehoon Lee (Google Brain)
For a large variety of models and datasets, neural network performance has been empirically observed to scale as a power law with model size and dataset size. We would like to understand why these power laws emerge, and what features of the data and models determine the values of the power-law exponents. Since these exponents determine how quickly performance improves with more data and larger models, they are of great importance when considering whether to scale up existing models. In this talk, we’ll survey some of the well-known power-law scaling behavior observed in deep neural networks. Drawing intuition from statistical physics, we observe that a simplifying limit arises as one scales up deep learning models. We’ll talk about a theoretical framework that explains and connects various scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes.
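For concreteness, the power laws in question have the form \(L(N) \approx a\,N^{-\alpha}\) (plus an irreducible offset, ignored here for simplicity), and the exponent is typically read off as a slope in log-log space. A minimal sketch with synthetic losses standing in for real measurements:

```python
# Sketch: estimating a scaling-law exponent alpha from (dataset size, loss) pairs.
import numpy as np

rng = np.random.default_rng(0)
sizes = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5])
true_alpha = 0.35
losses = 5.0 * sizes ** (-true_alpha) * np.exp(0.02 * rng.normal(size=sizes.size))

# Linear fit in log-log space: log L = log a - alpha * log N.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
print(f"estimated alpha ≈ {-slope:.3f} (true value {true_alpha})")
```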
Jun 7, 2021 06:00 PM (GMT -3:00)
Introductory lecture 8: Introduction to Information Theory (Portuguese) - Recording link |
Eduarda Chagas (Universidade Federal de Minas Gerais) |
Jun 11, 2021 02:00 PM (GMT -3:00)
Lecture 8: Progress Towards Understanding Generalization in Deep Learning (English) - Recording link
Gintare Karolina Dziugaite (Element AI)
There is, as yet, no satisfying theory explaining why common learning algorithms, like those based on stochastic gradient descent, generalize in practice on overparameterized neural networks. I will discuss various approaches that have been taken to explaining generalization in deep learning, and identify some of the barriers these approaches faced. I will then discuss my recent work on information-theoretic and PAC-Bayesian approaches to understanding generalization in noisy variants of SGD. In particular, I will highlight how we can take advantage of conditioning to obtain sharper data- and distribution-dependent generalization measures. I will also briefly touch upon my work on properties of the optimization landscape and some of the challenges we face incorporating these insights into the theory of generalization.
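One concrete instance of the information-theoretic approach mentioned above, stated here only for orientation (it is the classical bound of Xu and Raginsky; the conditional variants discussed in the talk sharpen it): if the loss is \(\sigma\)-sub-Gaussian, then the expected generalization gap of an algorithm that outputs weights \(W\) from a training sample \(S\) of \(n\) i.i.d. points satisfies

\[ \bigl|\mathbb{E}[\mathrm{gen}(W,S)]\bigr| \le \sqrt{\frac{2\sigma^2\, I(W;S)}{n}}, \]

where \(I(W;S)\) is the mutual information between the learned weights and the training data.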
Jun 18, 2021 01:30 PM (GMT -3:00)
Lecture 9: Information-Theoretic Generalization Bounds for Stochastic Gradient Descent (English) - Recording link
Gergely Neu (Universitat Pompeu Fabra)
We study the generalization properties of the popular stochastic gradient descent method for optimizing general non-convex loss functions. Our main contribution is providing upper bounds on the generalization error that depend on local statistics of the stochastic gradients evaluated along the path of iterates calculated by SGD. The key factors our bounds depend on are the variance of the gradients (with respect to the data distribution), the local smoothness of the objective function along the SGD path, and the sensitivity of the loss function to perturbations of the final output. Our key technical tool is a combination of the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.
Jun 21, 2021 04:00 PM (GMT -3:00)
Introductory lecture 10: Introduction to Category Theory: Up to Monoidal Categories (Portuguese) - Recording link |
Jose Vitor Paiva Miranda Siqueira (University of Cambridge) |
Jun 25, 2021 04:00 PM (GMT -3:00)
Lecture 10: Backprop as a Functor (English) - Recording link
Brendan Fong (MIT / Topos Institute)
A supervised learning algorithm searches over a set of functions \(A→B\) parametrised by a space \(P\) to find the best approximation to some ideal function \(f:A→B\). It does this by taking examples \((a,f(a))∈A×B\), and updating the parameter according to some rule. We define a category where these update rules may be composed, and show that gradient descent --- with respect to a fixed step size and an error function satisfying a certain property --- defines a monoidal functor from a category of parametrised functions to this category of update rules. This provides a structural perspective on backpropagation, as well as a broad generalisation of neural networks.
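A minimal Python sketch of the compositional structure described above (the precise form is a paraphrase of the paper's definition, so treat it as an assumption): a learner from \(A\) to \(B\) bundles a parameter with implement, update, and request maps, and two learners compose by feeding the downstream learner's request back as the upstream learner's training target. The scalar learners at the end are simple gradient-descent-style examples, not the paper's exact construction.

```python
# Sketch: composable "learners" in the spirit of Backprop as a Functor.
# A learner A -> B carries: a parameter p, implement(p, a) -> b,
# update(p, a, b_target) -> new p, and request(p, a, b_target) -> a_target,
# i.e. the training signal it passes back upstream.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Learner:
    p: Any
    implement: Callable[[Any, Any], Any]
    update: Callable[[Any, Any, Any], Any]
    request: Callable[[Any, Any, Any], Any]

def compose(f: "Learner", g: "Learner") -> "Learner":
    """Sequential composite g . f : A -> C with parameter space P x Q."""
    def implement(pq, a):
        p, q = pq
        return g.implement(q, f.implement(p, a))
    def update(pq, a, c):
        p, q = pq
        b = f.implement(p, a)
        return (f.update(p, a, g.request(q, b, c)), g.update(q, b, c))
    def request(pq, a, c):
        p, q = pq
        b = f.implement(p, a)
        return f.request(p, a, g.request(q, b, c))
    return Learner((f.p, g.p), implement, update, request)

# Tiny example: two scalar affine learners trained by one gradient-descent step.
lr = 0.1
scale = Learner(
    p=2.0,
    implement=lambda p, a: p * a,
    update=lambda p, a, t: p - lr * (p * a - t) * a,   # d/dp of 0.5*(p*a - t)^2
    request=lambda p, a, t: a - lr * (p * a - t) * p,  # signal passed upstream
)
shift = Learner(
    p=1.0,
    implement=lambda p, a: p + a,
    update=lambda p, a, t: p - lr * (p + a - t),
    request=lambda p, a, t: a - lr * (p + a - t),
)
model = compose(scale, shift)          # computes shift(scale(a)) = 2*a + 1
print(model.implement(model.p, 3.0))   # 7.0
```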
Jul 2, 2021 02:00 PM (GMT -3:00)
Lecture 11: Learning Functors using Gradient Descent (English) - Recording link
Bruno Gavranović (University of Strathclyde)
CycleGAN is a general approach to unpaired image-to-image translation that has been getting attention in recent years. Inspired by categorical database systems, we show that CycleGAN is a "schema", i.e. a specific category presented by generators and relations, whose specific parameter instantiations are just set-valued functors on this schema. We show that enforcing cycle-consistencies amounts to enforcing composition invariants in this category. We generalize the learning procedure to arbitrary such categories and show that a special class of functors, rather than functions, can be learned using gradient descent. Using this framework we design a novel neural network system capable of learning to insert and delete objects from images without paired data. We qualitatively evaluate the system on three different datasets and obtain promising results.
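For reference, the cycle-consistency constraint being recast categorically is the standard CycleGAN term: for translators \(G : X \to Y\) and \(F : Y \to X\),

\[ \mathcal{L}_{\mathrm{cyc}}(G,F) = \mathbb{E}_{x}\bigl[\lVert F(G(x)) - x \rVert_1\bigr] + \mathbb{E}_{y}\bigl[\lVert G(F(y)) - y \rVert_1\bigr], \]

which, in the talk's reading, approximately enforces the composition identities \(F \circ G = \mathrm{id}_X\) and \(G \circ F = \mathrm{id}_Y\) in the schema category.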
Jul 09, 2021 11:00 AM (GMT -3:00)
Panel: The Many Paths to Understanding Deep Learning (English) - Recording link
Brendan Fong (MIT / Topos Institute), Gintare Karolina Dziugaite (Element AI), Oriol Vinyals (DeepMind), Yasaman Bahri (Google Research)