Program

This lecture series runs for 11 weeks, with two classes each week: an introductory lecture and an application lecture. The introductory lectures aim to put every participant "on the same page" regarding the prerequisites of the following application lecture, while the application lectures present and apply an approach built on that material to deepen the understanding of Deep Learning methods. In the final week, we will also host a panel with previous lecturers to discuss the frontiers of this new field and wrap up the content covered throughout the event.

All meetings will be streamed, and the links will be posted on this page. Don't worry if you miss a meeting: the recordings will be available on the Data ICMC YouTube channel!


Week 1
Apr 19, 2021
6:00 PM (GMT -3:00)
Opening and introductory lecture 1: Introduction to Deep Learning (Portuguese) - Recording link
Leo Sampaio (Universidade de São Paulo)
Introduction to the basic principles of Machine Learning (ML); discussion of the importance of data and its representations for learning; introduction to the first Neural Network algorithms in the area; explanation of the concept of convolution and its application in convolutional neural networks; presentation of different deep network architectures (Deep Learning); introduction of the traditional concept of generalization, importance of representations and kernels.
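For readers new to the topic, the short sketch below (a generic illustration assuming JAX is installed; the image and filter values are arbitrary placeholders) shows the sliding-window operation behind convolutional layers: a small kernel is swept over an image and a weighted sum is taken at each position.

```python
# Minimal sketch of the 2D convolution used in convolutional layers.
# Assumes JAX is installed; the image and kernel values are illustrative only.
import jax.numpy as jnp

def conv2d_valid(image, kernel):
    """Sliding-window cross-correlation ('valid' padding), as in a conv layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    rows = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            patch = image[i:i + kh, j:j + kw]
            row.append(jnp.sum(patch * kernel))   # weighted sum at this position
        rows.append(jnp.stack(row))
    return jnp.stack(rows)

image = jnp.arange(25.0).reshape(5, 5)            # toy 5x5 "image"
edge_kernel = jnp.array([[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]])          # Sobel-like edge filter
print(conv2d_valid(image, edge_kernel))           # 3x3 feature map
```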
Apr 23, 2021
2:00 PM (GMT -3:00)
Lecture 1: Understanding Generalization Requires Rethinking Deep Learning (English) - Recording link
Boaz Barak and Gal Kaplun (Harvard)
The generalization gap of a learning algorithm is the expected difference between its performance on the training data and its performance on fresh unseen test samples. Modern deep learning algorithms typically have large generalization gaps, as they use more parameters than the size of their training set. Moreover, the best known rigorous bounds on their generalization gap are often vacuous. In this talk we will see a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a complex representation of the (label-free) training data, and then fitting a simple (e.g., linear) classifier to the labels. Such classifiers have become increasingly popular in recent years, as they offer several practical advantages and have been shown to approach state-of-the-art results. We show that (under the assumptions described below) the generalization gap of such classifiers tends to zero as long as the complexity of the simple classifier is asymptotically smaller than the number of training samples. We stress that our bound is independent of the complexity of the representation, which can use an arbitrarily large number of parameters. Our bound holds assuming that the learning algorithm satisfies certain noise-robustness (adding a small amount of label noise causes only a small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) properties. These conditions widely (and sometimes provably) hold across many standard architectures. We complement this result with an empirical study, demonstrating that our bound is non-vacuous for many popular representation-learning-based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and BigBiGAN. The talk will not assume any specific background in machine learning, and should be accessible to a general mathematical audience. Joint work with Yamini Bansal.
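For reference, the generalization gap described in the first sentence of the abstract can be written as follows (generic notation chosen here for illustration, not taken from the talk):

```latex
% Generalization gap of a learning algorithm A trained on S = {(x_i, y_i)}_{i=1}^n
% drawn i.i.d. from a distribution D, with loss function \ell:
\mathrm{gap}(A) \;=\;
\mathbb{E}_{S \sim \mathcal{D}^n}\!\Big[
  \underbrace{\mathbb{E}_{(x,y) \sim \mathcal{D}}\,\ell\big(A(S)(x),\, y\big)}_{\text{test performance}}
  \;-\;
  \underbrace{\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\ell\big(A(S)(x_i),\, y_i\big)}_{\text{training performance}}
\Big]
```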
Week 2
Apr 26, 2021
1:00 PM (GMT -3:00)
Introductory lecture 2: From parametric models to Gaussian Processes (English) - Recording link / Slides
Yingzhen Li (Imperial College London)
Apr 30, 2021
2:00 PM (GMT -3:00)
Lecture 2: The Wide limit of Neural Networks: NNGP and NTK (English) - Recording link
Jascha Sohl-Dickstein (Google Brain)
As neural networks become wider, their accuracy improves, and their behavior becomes easier to analyze theoretically. I will give an introduction to a rapidly growing field -- closely connected to statistical physics -- which examines the learning dynamics and prior over functions induced by infinitely wide, randomly initialized neural networks. Core results that I will discuss include: that the distribution over functions computed by a wide neural network often corresponds to a Gaussian process with a particular compositional kernel, both before and after training; that the predictions of wide neural networks are linear in their parameters throughout training; and that this perspective enables analytic predictions for how the trainability of finite-width networks depends on hyperparameters and architecture. These results provide surprising capabilities -- for instance, the evaluation of test set predictions which would come from an infinitely wide trained neural network without ever instantiating a neural network, or the rapid training of 10,000+ layer convolutional networks. I will argue that this growing understanding of neural networks in the limit of infinite width is foundational for future theoretical and practical understanding of deep learning. Neural Tangents: a software library for working with infinite-width networks.
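A minimal sketch of the kind of computation described in the abstract, using the Neural Tangents library mentioned above (the architecture, data, and shapes below are illustrative assumptions, not material from the talk): it evaluates the closed-form NNGP and NTK test-set predictions of an infinitely wide fully connected network without ever instantiating a finite network.

```python
# Sketch (illustrative assumptions): exact infinite-width NNGP / NTK predictions
# with the Neural Tangents library -- no finite network is ever instantiated.
import jax.numpy as jnp
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# Infinitely wide 2-hidden-layer fully connected ReLU network.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x_train = random.normal(random.PRNGKey(0), (20, 8))   # toy inputs
y_train = jnp.sin(x_train[:, :1])                     # toy regression targets
x_test = random.normal(random.PRNGKey(1), (5, 8))

# Closed-form predictions of the infinite-width network: after gradient-descent
# training on the MSE loss (NTK), or as the Bayesian posterior mean (NNGP).
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)
ntk_mean = predict_fn(x_test=x_test, get='ntk')
nngp_mean = predict_fn(x_test=x_test, get='nngp')
print(ntk_mean.shape, nngp_mean.shape)                # (5, 1) each
```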
Week 3
May 3, 2021
5:30 PM (GMT -3:00)
Introductory lecture 3: Dynamical Systems (Portuguese) - Recording link
Tiago Pereira (Universidade de São Paulo)
May 7, 2021
3:00 PM (GMT -3:00)
Lecture 3: The Catapult phase of Neural Networks (English) - Streaming link
Guy Gur-Ari (Google)
Why do large learning rates often produce better results? Why do “infinitely wide” networks trained using kernel methods tend to underperform ordinary networks? In the talk I will argue that these questions are related. Existing kernel-based theory can explain the dynamics of networks trained with small learning rates. However, optimal performance is often achieved at large learning rates, where we find qualitatively different dynamics that converge to flat minima. The distinction between the small and large learning rate phases becomes sharp at infinite width, and is reminiscent of nonperturbative phase transitions that appear in physical systems.
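To make the small versus large learning rate distinction concrete, the toy sketch below (a generic illustration, not material from the talk) runs gradient descent on a one-dimensional quadratic loss with curvature lambda: it converges for learning rates below 2/lambda and diverges above it. The catapult phase discussed in the talk concerns wide nonlinear networks pushed past this kernel-regime threshold, where training can instead converge to flat minima rather than simply diverging.

```python
# Generic illustration (not code from the talk): gradient descent on the
# quadratic loss L(w) = 0.5 * lam * w**2 converges iff learning_rate < 2 / lam.
# In the linearized (kernel) picture of wide networks, this threshold separates
# the "small" and "large" learning rate regimes contrasted in the abstract.

def run_gd(learning_rate, lam=1.0, w0=1.0, steps=50):
    w = w0
    for _ in range(steps):
        w = w - learning_rate * lam * w   # gradient of 0.5 * lam * w**2 is lam * w
    return w

lam = 1.0
for lr in (0.5 / lam, 1.9 / lam, 2.1 / lam):
    print(f"lr = {lr:.2f}: final |w| = {abs(run_gd(lr, lam)):.3e}")
# lr < 2/lam shrinks |w| toward the minimum; lr > 2/lam blows it up.
```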
Week 4
May 14, 2021
3:00 PM (GMT -3:00)
Lecture 4: Signal Propagation and Dynamical Isometry in Deep Neural Networks
Samuel S Schoenholz (Google Brain)
Random neural networks converge to Gaussian Processes in the limit of infinite width. In this talk we will study how signals propagate through these wide and random networks. At large depth, we will show that a phase diagram naturally emerges between an “ordered” phase where all pairs of inputs converge to the same output and a “chaotic” phase where nearby inputs become increasingly dissimilar with depth. We will then consider fluctuations of gradients as they are backpropagated through the network. We will show that the distribution of gradient fluctuations can be controlled via the random distribution of weights used for initialization. We will discuss empirical observations about the relationship between this prior and training dynamics. In his talk, Lechao Xiao will elaborate on this relationship via the Neural Tangent Kernel.
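The ordered versus chaotic picture can be seen in a few lines of code. The toy sketch below (the weight variances, width, and depth are illustrative assumptions, not values from the talk) pushes two nearby inputs through a deep random tanh network and tracks their cosine similarity: with a small weight variance the two signals stay aligned, while with a large weight variance they decorrelate with depth.

```python
# Toy illustration (values are assumptions): track the cosine similarity of two
# nearby inputs as they pass through many random tanh layers whose weights have
# variance sigma_w**2 / width.
import jax.numpy as jnp
from jax import random

def cosine(a, b):
    return jnp.dot(a, b) / (jnp.linalg.norm(a) * jnp.linalg.norm(b))

def propagate(sigma_w, depth=50, width=512, seed=0):
    key = random.PRNGKey(seed)
    x1 = random.normal(random.PRNGKey(100), (width,))
    x2 = x1 + 0.1 * random.normal(random.PRNGKey(101), (width,))  # nearby input
    for _ in range(depth):
        key, sub = random.split(key)
        w = sigma_w / jnp.sqrt(width) * random.normal(sub, (width, width))
        x1, x2 = jnp.tanh(w @ x1), jnp.tanh(w @ x2)
    return cosine(x1, x2)

print("small weight variance:", propagate(sigma_w=0.5))  # stays near 1 (ordered side)
print("large weight variance:", propagate(sigma_w=2.5))  # decays well below 1 (chaotic side)
```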
Week 5
May 17, 2021
Time yet to be defined
Introductory lecture 5: (Portuguese)
To be defined
May 21, 2021
Time yet to be defined
Lecture 5: Neural Network Loss Landscape in High Dimensions (English)
Stanislav Fort (Stanford / Google AI)
Deep neural networks trained with gradient descent have been extremely successful at learning solutions to a broad suite of difficult problems across a wide range of domains such as vision, gameplay, and natural language, many of which had previously been considered to require intelligence. Despite their tremendous success, we still do not have a detailed, predictive understanding of how these systems work. In my talk, I will focus on recent efforts to understand the structure of deep neural network loss landscapes and how gradient descent navigates them during training. I will discuss how we can use tools from high-dimensional geometry to build a phenomenological model of their large-scale structure, the role of their non-linear nature in the early phases of training, and its effects on ensembling, calibration, and out-of-distribution behavior.
Week 6
May 24, 2021
Time yet to be defined
Introductory lecture 6: (Portuguese)
To be defined
May 28, 2021
Time yet to be defined
Lecture 6: Disentangling Trainability and Generalization in Deep Neural Networks (English)
Lechao Xiao (Google Brain)
A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks, the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. A thorough empirical investigation of these theoretical results shows excellent agreement on real datasets.
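As a small companion to the spectrum analysis described above, the sketch below (toy data and an arbitrary fully connected architecture, chosen here as assumptions) uses the Neural Tangents library to compute the infinite-width NTK on a handful of inputs and inspect its eigenvalues and condition number, the kind of quantities the talk's trainability and generalization conditions are phrased in terms of.

```python
# Toy sketch (architecture and data are illustrative assumptions): compute the
# infinite-width NTK of a fully connected network on a few inputs and inspect
# its spectrum.
import jax.numpy as jnp
from jax import random
from neural_tangents import stax

_, _, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x = random.normal(random.PRNGKey(0), (32, 16))   # 32 toy inputs of dimension 16
ntk = kernel_fn(x, x, 'ntk')                     # 32 x 32 NTK matrix
eigvals = jnp.linalg.eigvalsh(ntk)
print("largest eigenvalue :", eigvals[-1])
print("smallest eigenvalue:", eigvals[0])
print("condition number   :", eigvals[-1] / eigvals[0])
```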
Week 7
May 31, 2021
Time yet to be defined
Introductory lecture 7: Introduction to Statistical Mechanics (Portuguese)
To be defined
Jun 4, 2021
Time yet to be defined
Lecture 7: Explaining Scaling Laws (English)
Jaehoon Lee (Google Brain)
Week 8
Jun 7, 2021
Time yet to be defined
Introductory lecture 8: Introduction to Information Theory (Portuguese)
Jun 11, 2021
Time yet to be defined
Lecture 8: Information Theory of Deep Learning (English)
Gintare Karolina Dziugaite (Element AI)
Week 9
Jun 14, 2021
Time yet to be defined
Introductory lecture 9: (Portuguese)
To be defined
Jun 18, 2021
Time yet to be defined
Lecture 9: Information-Theoretic Generalization Bounds for Stochastic Gradient Descent (English)
Gergely Neu (Universitat Pompeu Fabra)
Week 10
Jun 21, 2021
Time yet to be defined
Introductory lecture 10: Introduction to Category Theory: Up to Monoidal Categories (Portuguese)
To be defined
Jun 25, 2021
4:00 PM (GMT -3:00)
Lecture 10: Backprop as a Functor (English)
Brendan Fong (MIT / Topos Institute)
Week 11
Jun 28, 2021
Time yet to be defined
Introductory lecture 11: (Portuguese)
To be defined
Jul 2, 2021
Time yet to be defined
Lecture 11: Compositional Deep Learning (English)
Bruno Gavranović (University of Strathclyde)
Time yet to be defined
Panel: The Many Paths to Understanding Deep Learning (English)
Speakers still to be defined