Program

This lecture series consists of 11 weeks with two classes each: an introductory lecture and an application lecture. The introductory lectures aim to put every participant "on the same page" regarding the prerequisites of the following application lecture, while the application lectures present and apply an approach built on that material to deepen the understanding of Deep Learning methods. In the final week, we will also hold a panel with previous lecturers to discuss the frontiers of this new field and wrap up the content covered throughout the event.

All meetings will be streamed and the links will be posted on this page. Don't worry if you miss a meeting: the recordings will be available on the Data ICMC YouTube Channel!


Week 1
Apr 19, 2021
06:00 PM (GMT -3:00)
Opening and introductory lecture 1: Introduction to Deep Learning (Portuguese) - Recording link
Leo Sampaio (Universidade de São Paulo)
Introduction to the basic principles of Machine Learning (ML); discussion of the importance of data and its representations for learning; introduction to the first neural network algorithms in the area; explanation of the concept of convolution and its application in convolutional neural networks; presentation of different Deep Learning architectures; and introduction to the traditional concept of generalization and the importance of representations and kernels.
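To make the convolution idea concrete, here is a minimal NumPy sketch of the operation a convolutional layer applies (a plain 2-D cross-correlation with a hand-picked filter; not code from the lecture):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: the basic operation inside a convolutional layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output pixel is a weighted sum of a local patch of the input
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[1, 0, -1]] * 3)   # a hand-crafted vertical-edge detector
image = np.random.rand(8, 8)
print(conv2d(image, edge_filter).shape)    # (6, 6) feature map
```

A convolutional network learns the entries of many such filters from data instead of hand-crafting them.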
Apr 23, 2021
02:00 PM (GMT -3:00)
Lecture 1: Understanding Generalization Requires Rethinking Deep Learning (English) - Recording link
Boaz Barak and Gal Kaplun (Harvard)
The generalization gap of a learning algorithm is the expected difference between its performance on the training data and its performance on fresh unseen test samples. Modern deep learning algorithms typically have large generalization gaps, as they use more parameters than the size of their training set. Moreover, the best known rigorous bounds on their generalization gap are often vacuous. In this talk we will see a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a complex representation of the (label-free) training data, and then fitting a simple (e.g., linear) classifier to the labels. Such classifiers have become increasingly popular in recent years, as they offer several practical advantages and have been shown to approach state-of-the-art results. We show that (under the assumptions described below) the generalization gap of such classifiers tends to zero as long as the complexity of the simple classifier is asymptotically smaller than the number of training samples. We stress that our bound is independent of the complexity of the representation, which can use an arbitrarily large number of parameters. Our bound holds assuming that the learning algorithm satisfies certain noise-robustness (adding a small amount of label noise causes a small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) properties. These conditions widely (and sometimes provably) hold across many standard architectures. We complement this result with an empirical study, demonstrating that our bound is non-vacuous for many popular representation-learning-based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and BigBiGAN. The talk will not assume any specific background in machine learning, and should be accessible to a general mathematical audience. Joint work with Yamini Bansal.
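In symbols (a sketch of the setup; the notation is ours, not the talk's): writing \(g \circ r\) for a classifier built from a learned representation \(r\) and a simple classifier \(g\) fitted on \(n\) labelled samples, the generalization gap is

\[
\mathrm{gap}(g \circ r) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell\big(g(r(x)),\,y\big)\big] \;-\; \frac{1}{n}\sum_{i=1}^{n}\ell\big(g(r(x_i)),\,y_i\big),
\]

and the result above says that, under the noise-robustness and rationality assumptions, this gap tends to zero whenever the complexity of the simple classifier \(g\) grows asymptotically slower than \(n\), no matter how many parameters the representation \(r\) uses.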
Week 2
Apr 26, 2021
01:00 PM (GMT -3:00)
Introductory lecture 2: From parametric models to Gaussian Processes (English) - Recording link / Slides
Yingzhen Li (Imperial College London)
Apr 30, 2021
02:00 PM (GMT -3:00)
Lecture 2: The Wide limit of Neural Networks: NNGP and NTK (English) - Recording link
Jascha Sohl-Dickstein (Google Brain)
As neural networks become wider, their accuracy improves and their behavior becomes easier to analyze theoretically. I will give an introduction to a rapidly growing field -- closely connected to statistical physics -- which examines the learning dynamics and prior over functions induced by infinitely wide, randomly initialized neural networks. Core results that I will discuss include: that the distribution over functions computed by a wide neural network often corresponds to a Gaussian process with a particular compositional kernel, both before and after training; that the predictions of wide neural networks are linear in their parameters throughout training; and that this perspective enables analytic predictions for how the trainability of finite-width networks depends on hyperparameters and architecture. These results enable surprising capabilities -- for instance, evaluating the test set predictions that would come from an infinitely wide trained neural network without ever instantiating a neural network, or rapidly training 10,000+ layer convolutional networks. I will argue that this growing understanding of neural networks in the limit of infinite width is foundational for future theoretical and practical understanding of deep learning. Neural Tangents (a software library for working with infinite-width networks)
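As a taste of what the library linked above makes possible, here is a minimal sketch (placeholder architecture and random data; assumes the current Neural Tangents API) that evaluates the test predictions of an infinitely wide, fully trained network in closed form, without ever instantiating a finite network:

```python
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# placeholder architecture: kernel_fn encodes its infinite-width NNGP/NTK kernels
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(), stax.Dense(512), stax.Relu(), stax.Dense(1)
)

k1, k2, k3 = random.split(random.PRNGKey(0), 3)
x_train = random.normal(k1, (20, 10))
y_train = random.normal(k2, (20, 1))
x_test = random.normal(k3, (5, 10))

# closed-form mean predictions of the infinite-width network, as a GP prior (NNGP)
# and after gradient-descent training on the MSE loss (NTK)
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train, diag_reg=1e-4)
y_nngp, y_ntk = predict_fn(x_test=x_test, get=('nngp', 'ntk'))
```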
Week 3
May 3, 2021
05:30 PM (GMT -3:00)
Introductory lecture 3: Dynamical Systems (Portuguese) - Recording link
Tiago Pereira (Universidade de São Paulo)
May 7, 2021
03:00 PM (GMT -3:00)
Lecture 3: The Catapult phase of Neural Networks (English) - Recording link
Guy Gur-Ari (Google)
Why do large learning rates often produce better results? Why do “infinitely wide” networks trained using kernel methods tend to underperform ordinary networks? In the talk I will argue that these questions are related. Existing kernel-based theory can explain the dynamics of networks trained with small learning rates. However, optimal performance is often achieved at large learning rates, where we find qualitatively different dynamics that converge to flat minima. The distinction between the small and large learning rate phases becomes sharp at infinite width, and is reminiscent of nonperturbative phase transitions that appear in physical systems.
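As a reference point (a standard optimization fact, not a result specific to the talk): for a quadratic loss, and for a wide network in the kernel regime, gradient descent with step size \(\eta\) is stable only below a threshold set by the largest curvature/NTK eigenvalue \(\lambda_{\max}\),

\[
\eta \;<\; \frac{2}{\lambda_{\max}}.
\]

The "catapult" dynamics discussed in the talk concern step sizes beyond this kernel-stability threshold, where, as argued above, the dynamics leave the kernel description and end up in flatter minima.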
Week 4
May 14, 2021
03:00 PM (GMT -3:00)
Lecture 4: Signal Propagation and Dynamical Isometry in Deep Neural Networks (English) - Recording link
Samuel S Schoenholz (Google Brain)
Random neural networks converge to Gaussian Processes in the limit of infinite width. In this talk we will study how signals propagate through these wide and random networks. At large depth, we will show that a phase diagram naturally emerges, with an “ordered” phase where all pairs of inputs converge to the same output and a “chaotic” phase where nearby inputs become increasingly dissimilar with depth. We will then consider fluctuations of gradients as they are backpropagated through the network. We will show that the distribution of gradient fluctuations can be controlled via the random distribution of weights used for initialization. We will discuss empirical observations about the relationship between this prior and training dynamics. In his talk, Lechao Xiao will elaborate on this relationship via the Neural Tangent Kernel.
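A back-of-the-envelope Monte Carlo version of the variance/correlation recursion behind that phase diagram, for a plain tanh network (the particular \(\sigma_w^2, \sigma_b^2\) values below are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
u1, u2 = rng.standard_normal((2, 200_000))
phi = np.tanh

def final_correlation(sw2, sb2, layers=50):
    """Iterate the mean-field variance and correlation maps of a deep tanh network
    and return the correlation between two distinct inputs after `layers` layers."""
    q = 1.0
    for _ in range(layers):  # variance map: drive q to its fixed point q*
        q = sw2 * np.mean(phi(np.sqrt(q) * u1) ** 2) + sb2
    c = 0.5
    for _ in range(layers):  # correlation map at fixed q*
        z1 = np.sqrt(q) * u1
        z2 = np.sqrt(q) * (c * u1 + np.sqrt(1.0 - c ** 2) * u2)
        c = min((sw2 * np.mean(phi(z1) * phi(z2)) + sb2) / q, 1.0)
    return c

print(final_correlation(sw2=1.0, sb2=0.1))  # smaller weight variance: "ordered", correlation driven towards 1
print(final_correlation(sw2=4.0, sb2=0.1))  # larger weight variance: "chaotic", correlation settles below 1
```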
Week 5
May 21, 2021
05:00 PM (GMT -3:00)
Lecture 5: Neural Network Loss Landscape in High Dimensions (English) - Recording link
Stanislav Fort (Stanford / Google AI)
Deep neural networks trained with gradient descent have been extremely successful at learning solutions to a broad suite of difficult problems across a wide range of domains such as vision, gameplay, and natural language, many of which had previously been considered to require intelligence. Despite their tremendous success, we still do not have a detailed, predictive understanding of how these systems work. In my talk, I will focus on recent efforts to understand the structure of deep neural network loss landscapes and how gradient descent navigates them during training. I will discuss how we can use tools from high-dimensional geometry to build a phenomenological model of their large-scale structure, the role of their non-linear nature in the early phases of training, and its effects on ensembling, calibration, and out-of-distribution behavior.
Week 6
May 28, 2021
04:00 PM (GMT -3:00)
Lecture 6: Disentangling Trainability and Generalization in Deep Neural Networks (English) - Recording link
Lechao Xiao (Google Brain)
A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks, the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. A thorough empirical investigation of these theoretical results shows excellent agreement on real datasets.
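A small sketch of the sort of spectrum analysis referred to above, using the Neural Tangents library from Lecture 2 (the architecture and data are placeholders, not the paper's experiments):

```python
from jax import random
import jax.numpy as jnp
from neural_tangents import stax

# placeholder: a 10-layer fully-connected ReLU network and random inputs
_, _, kernel_fn = stax.serial(*([stax.Dense(512), stax.Relu()] * 10), stax.Dense(1))
x = random.normal(random.PRNGKey(0), (64, 32))

ntk = kernel_fn(x, x, 'ntk')       # infinite-width NTK Gram matrix on this data
eigs = jnp.linalg.eigvalsh(ntk)    # eigenvalues in ascending order
print(eigs[0], eigs[-1])           # a wide spread slows gradient descent along small-eigenvalue directions
```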
Week 7
May 31, 2021
06:00 PM (GMT -3:00)
Introductory lecture 7: Introduction to Statistical Mechanics (Portuguese) - Recording link
Tereza Cristina da Rocha Mendes (Universidade de São Paulo)
Jun 4, 2021
10:00 AM (GMT -3:00)
Lecture 7: Explaining Neural Scaling Laws (English) - Recording link
Jaehoon Lee (Google Brain)
For a large variety of models and datasets, neural network performance has been empirically observed to scale as a power-law with model size and dataset size. We would like to understand why these power laws emerge, and what features of the data and models determine the values of the power-law exponents. Since these exponents determine how quickly performance improves with more data and larger models, they are of great importance when considering whether to scale up existing models. In this talk, we’ll survey some of the well-known power-law scaling behavior observed in deep neural networks. Drawing intuition from statistical physics, we observe that a simplifying limit arises as one scales up deep learning models. We’ll talk about a theoretical framework that explains and connects various scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes.
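A toy illustration of reading off a power-law exponent from a loss-versus-dataset-size curve (the numbers below are made up for the example, not taken from the talk):

```python
import numpy as np

# hypothetical dataset sizes and test losses following L(D) = c * D**(-alpha) + L_inf
D = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
L_inf = 0.05                                  # assumed irreducible loss floor
L = 2.0 * D ** (-0.35) + L_inf

# a power law is a straight line in log-log coordinates; its slope is -alpha
slope, intercept = np.polyfit(np.log(D), np.log(L - L_inf), 1)
print(f"estimated exponent: {-slope:.2f}")    # recovers ~0.35
```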
Week 8
Jun 7, 2021
06:00 PM (GMT -3:00)
Introductory lecture 8: Introduction to Information Theory (Portuguese) - Recording link
Eduarda Chagas (Universidade Federal de Minas Gerais)
Jun 11, 2021
02:00 PM (GMT -3:00)
Lecture 8: Progress Towards Understanding Generalization in Deep Learning (English) - Recording link
Gintare Karolina Dziugaite (Element AI)
There is, as yet, no satisfying theory explaining why common learning algorithms, like those based on stochastic gradient descent, generalize in practice on overparameterized neural networks. I will discuss various approaches that have been taken to explaining generalization in deep learning, and identify some of the barriers these approaches faced. I will then discuss my recent work on information-theoretic and PAC-Bayesian approaches to understanding generalization in noisy variants of SGD. In particular, I will highlight how we can take advantage of conditioning to obtain sharper data- and distribution-dependent generalization measures. I will also briefly touch upon my work on properties of the optimization landscape and some of the challenges we face incorporating these insights into the theory of generalization.
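For orientation, one textbook PAC-Bayesian bound of the kind this line of work builds on and sharpens (a standard form, not necessarily the exact bound used in the talk): for any prior \(P\) fixed before seeing the \(n\) training samples, with probability at least \(1-\delta\) every posterior \(Q\) satisfies

\[
\mathbb{E}_{h\sim Q}\big[L(h)\big] \;\le\; \mathbb{E}_{h\sim Q}\big[\hat{L}_S(h)\big] \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
\]

where \(L\) and \(\hat{L}_S\) denote the population and empirical losses; the conditioning mentioned above tightens the \(\mathrm{KL}\) term by letting the prior depend on part of the data.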
Week 9
Jun 18, 2021
01:30 PM (GMT -3:00)
Lecture 9: Information-Theoretic Generalization Bounds for Stochastic Gradient Descent (English) - Recording link
Gergely Neu (Universitat Pompeu Fabra)
We study the generalization properties of the popular stochastic gradient descent method for optimizing general non-convex loss functions. Our main contribution is providing upper bounds on the generalization error that depend on local statistics of the stochastic gradients evaluated along the path of iterates calculated by SGD. The key factors our bounds depend on are the variance of the gradients (with respect to the data distribution), the local smoothness of the objective function along the SGD path, and the sensitivity of the loss function to perturbations of the final output. Our key technical tool is a combination of the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.
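For context, the classical information-theoretic bound that this style of analysis refines (a standard result, stated in our notation): if the loss is \(\sigma\)-sub-Gaussian, the expected generalization error of an algorithm that produces weights \(W\) from a training sample \(S\) of \(n\) points satisfies

\[
\big|\,\mathbb{E}[\mathrm{gen}(W,S)]\,\big| \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W;S)},
\]

where \(I(W;S)\) is the mutual information between the learned weights and the sample; the bounds described above control this quantity along the SGD trajectory through the gradient variance and local smoothness terms.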
Week 10
Jun 21, 2021
04:00 PM (GMT -3:00)
Introductory lecture 10: Introduction to Category Theory: Up to Monoidal Categories (Portuguese) - Recording link
Jose Vitor Paiva Miranda Siqueira (University of Cambridge)
Jun 25, 2021
04:00 PM (GMT -3:00)
Lecture 10: Backprop as a Functor (English) - Recording link
Brendan Fong (MIT / Topos Institute)
A supervised learning algorithm searches over a set of functions \(A \to B\) parametrised by a space \(P\) to find the best approximation to some ideal function \(f : A \to B\). It does this by taking examples \((a, f(a)) \in A \times B\), and updating the parameter according to some rule. We define a category where these update rules may be composed, and show that gradient descent --- with respect to a fixed step size and an error function satisfying a certain property --- defines a monoidal functor from a category of parametrised functions to this category of update rules. This provides a structural perspective on backpropagation, as well as a broad generalisation of neural networks.
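A minimal sketch of the compositional structure described above (the names and signatures are ours, not the paper's notation): a learner packages an implementation, an update rule, and a "request" map that sends information backwards, and two learners compose by using the outer learner's request as the training signal for the inner one.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Learner:
    implement: Callable  # (p, a) -> b : the parametrised function
    update: Callable     # (p, a, b) -> p' : how to change the parameter given an example
    request: Callable    # (p, a, b) -> a' : the input the layer "asks for" (backpropagated signal)

def compose(g: Learner, f: Learner) -> Learner:
    """Composite learner g . f (with f : A -> B and g : B -> C); its parameter is the pair (p, q)."""
    def implement(pq, a):
        p, q = pq
        return g.implement(q, f.implement(p, a))
    def update(pq, a, c):
        p, q = pq
        b = f.implement(p, a)
        # the outer learner's request plays the role of the target for the inner learner
        return (f.update(p, a, g.request(q, b, c)), g.update(q, b, c))
    def request(pq, a, c):
        p, q = pq
        b = f.implement(p, a)
        return f.request(p, a, g.request(q, b, c))
    return Learner(implement, update, request)
```

Gradient descent with a fixed step size and a suitable error function picks out particular update and request maps, and the talk's result is that this assignment respects composition, i.e. it defines a monoidal functor.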
Week 11
Jul 2, 2021
02:00 PM (GMT -3:00)
Lecture 11: Learning Functors using Gradient Descent (English) - Recording link
Bruno Gavranović (University of Strathclyde)
CycleGAN is a general approach to unpaired image-to-image translation that has been getting attention in recent years. Inspired by categorical database systems, we show that CycleGAN is a "schema", i.e. a specific category presented by generators and relations, whose particular parameter instantiations are just set-valued functors on this schema. We show that enforcing cycle-consistencies amounts to enforcing composition invariants in this category. We generalize the learning procedure to arbitrary such categories and show that a special class of functors, rather than functions, can be learned using gradient descent. Using this framework, we design a novel neural network system capable of learning to insert and delete objects from images without paired data. We qualitatively evaluate the system on three different datasets and obtain promising results.
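For reference, the cycle-consistency term of CycleGAN that the talk recasts as a composition invariant (the standard formulation, with translators \(G : X \to Y\) and \(F : Y \to X\)):

\[
\mathcal{L}_{\mathrm{cyc}}(G,F) \;=\; \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_{1}\big] \;+\; \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_{1}\big],
\]

which, read categorically, asks the functor interpreting the schema to send the composites \(F \circ G\) and \(G \circ F\) (approximately) to identity morphisms.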
Jul 09, 2021
11:00 AM (GMT -3:00)
Panel: The Many Paths to Understanding Deep Learning (English) - Recording link
Brendan Fong (MIT / Topos Institute), Gintare Karolina Dziugaite (Element AI), Oriol Vinyals (DeepMind), Yasaman Bahri (Google Research)