## Speakers

Lecture 1: introduction to RKHS. This lecture covers the definition of a kernel as a dot product between features. Features might be explicitly constructed for domain-specific learning problems (e.g. custom kernels for text or image classification), or chosen so that functions built with these features are smooth. I will show how to combine simpler kernels to make new kernels, and describe how to interpret such combinations. Finally, I will describe the reproducing property and the kernel trick.
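
As a minimal illustration (the function names here are mine, not from the lecture), kernels can be evaluated directly, and sums and products of kernels are again kernels, which shows up as the Gram matrix staying positive semi-definite:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): a dot product between
    # (infinite-dimensional) features of x and y
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

def linear_kernel(x, y):
    return float(np.dot(x, y))

def gram(k, X):
    # Gram matrix K[i, j] = k(X[i], X[j])
    return np.array([[k(x, y) for y in X] for x in X])

X = [np.array([0.0]), np.array([1.0]), np.array([2.5])]

# A sum and a product of valid kernels: the combination is still a kernel,
# so its Gram matrix is symmetric positive semi-definite.
combined = lambda x, y: gaussian_kernel(x, y) + linear_kernel(x, y) * gaussian_kernel(x, y)
K = gram(combined, X)
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-10  # PSD up to numerical error
```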

Lecture 2: distribution embeddings, maximum mean discrepancy, and hypothesis tests. The second lecture covers mappings of probabilities to reproducing kernel Hilbert spaces. The distance between these mappings is known as the maximum mean discrepancy (MMD), and has two interpretations: most straightforwardly, as a distance between expected features, but also as an integral probability metric (a "witness" function is sought which seeks out areas of large difference in probability mass). I will describe conditions on kernels that ensure distribution embeddings are unique, meaning that the distance between embeddings can be used to distinguish the underlying distributions. Such kernels are known as characteristic kernels. Finally, I will describe a hypothesis test, which allows us to determine whether an empirical difference between two distributions is statistically significant.
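
A sketch of the standard unbiased estimate of MMD^2 between two samples (kernel choice and names are illustrative, not from the lecture):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return float(np.exp(-(x - y) ** 2 / (2 * sigma ** 2)))

def mmd2_unbiased(X, Y, k=rbf):
    """Unbiased U-statistic estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    m, n = len(X), len(Y)
    kxx = sum(k(X[i], X[j]) for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    kyy = sum(k(Y[i], Y[j]) for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    kxy = sum(k(x, y) for x in X for y in Y) / (m * n)
    return kxx + kyy - 2 * kxy

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(0, 1, 50), rng.normal(0, 1, 50))
diff = mmd2_unbiased(rng.normal(0, 1, 50), rng.normal(3, 1, 50))
assert diff > same  # distinct distributions give a markedly larger statistic
```

In the hypothesis test of the lecture, this statistic would be compared against a null distribution (e.g. obtained by permuting the pooled sample) to assess significance.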

Lecture 3: advanced topics in kernels and distribution embeddings. The third lecture will cover a variety of advanced topics. These will include: testing independence and higher-order interactions using kernel embeddings of distributions, the choice of kernels to optimise test power, and the use of distribution embeddings to perform regression when the inputs are distributions (for example, to regress from samples of aerosol data to air pollution levels, or to speed up expectation propagation by using regression to "cache" the EP updates).

Structure in the input has been crucial for the success of neural networks. I will briefly review convolutional networks, which depend on the translation and scaling structure of the grid, and recurrent networks, which depend on the translation and causal structure of sequences. Then I will discuss some approaches for when the input does not have these structures, for example when it is a set or a graph.
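
One common way to handle set-valued inputs, sketched below with arbitrary random weights, is a sum-decomposition: apply the same map to each element, pool by summation, then read out. Pooling makes the output invariant to the ordering of the elements:

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(4, 3))   # per-element feature map phi (illustrative)
W_rho = rng.normal(size=(1, 4))   # readout rho applied to the pooled vector

def set_network(X):
    """rho(sum_i phi(x_i)): sum-pooling makes the output order-independent."""
    H = np.maximum(W_phi @ X.T, 0.0)        # phi on each element, with ReLU
    return (W_rho @ H.sum(axis=1)).item()   # pool over the set, then read out

X = rng.normal(size=(5, 3))                 # a "set" of 5 elements in R^3
perm = rng.permutation(5)
assert np.isclose(set_network(X), set_network(X[perm]))  # permutation invariant
```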

##### Slides:

##### Videos:

Through a series of examples, this practical will explore the theory and practice of using randomness to extract intelligence from data. From a practical point of view, we will implement randomized methods to i) accelerate linear algebra routines and ii) scale kernel methods to big data. From a theoretical standpoint, we will prove bounds on i) the error of linear random projections, and ii) the spectral norm of the difference between sums of random matrices and their expectation. We will be using a Python 2 notebook, along with the sklearn-0.17 package.
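
As a taste of the randomized kernel methods covered here, the sketch below (written with NumPy for readability, rather than in the Python 2 notebook used in the practical) approximates a Gaussian kernel with random Fourier features:

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, sigma = 5, 2000, 1.0

# Random Fourier features: z(x)^T z(y) ≈ exp(-||x - y||^2 / (2 sigma^2)).
# W is sampled from the Fourier (spectral) density of the Gaussian kernel.
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))
approx = float(z(x) @ z(y))
assert abs(exact - approx) < 0.1  # error shrinks like O(1/sqrt(D))
```

The payoff is scalability: a kernel machine can then work with the explicit D-dimensional features instead of an n x n Gram matrix.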

##### Slides:

##### Videos:

Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data: there are many observations ("large n") and each of these is large ("large p"). In this setting, online algorithms such as stochastic gradient descent, which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. Given n observations/iterations, the optimal convergence rate of these algorithms is O(1/n^(1/2)) for general convex functions and O(1/n) for strongly convex functions.
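
A sketch of single-pass averaged stochastic gradient descent on a streaming least-squares problem, with a 1/sqrt(t) step size (all constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20000
w_star = rng.normal(size=d)   # unknown parameter to recover

w = np.zeros(d)
w_bar = np.zeros(d)
for t in range(1, n + 1):
    x = rng.normal(size=d)                 # one streaming observation
    y = x @ w_star + 0.1 * rng.normal()    # noisy response
    grad = (x @ w - y) * x                 # gradient of 0.5 * (x^T w - y)^2
    w -= 0.1 / np.sqrt(t) * grad           # step size ~ 1/sqrt(t)
    w_bar += (w - w_bar) / t               # Polyak-Ruppert averaging

assert np.linalg.norm(w_bar - w_star) < 0.2  # close after a single pass
```

Averaging the iterates is the standard trick that lets this robust step size still achieve fast rates on smooth problems.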

In this tutorial, I will first present the classical results in stochastic approximation and relate them to classical optimization and statistics results. I will then show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt).

##### Slides:

##### Videos:

In recent years, there has been an increasing effort to develop realistic models, as well as learning algorithms, to understand and predict different phenomena taking place on the Web and in social networks and media, such as information diffusion and online social activity. In this practical, we will talk about temporal point processes, which have become popular in recent years for modeling, among other online activities, both information diffusion and social activity. In particular, we will focus on simulation and parameter estimation for two different applications.
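
As a sketch of the simulation side, Ogata's thinning algorithm for a self-exciting (Hawkes) temporal point process might look like this (parameter values are illustrative, not from the practical):

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, T, seed=0):
    """Ogata's thinning: simulate a Hawkes process with intensity
    lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while t < T:
        # Current intensity upper-bounds the intensity until the next event,
        # because the exponential excitation only decays between events.
        lam_bar = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)       # candidate from the bounding process
        if t >= T:
            break
        lam_t = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:  # accept with prob lambda(t)/lam_bar
            events.append(t)
    return events

ev = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, T=200.0)
# long-run event rate is mu / (1 - alpha/beta); here 1.5 events per unit time
assert ev == sorted(ev) and all(0.0 < t < 200.0 for t in ev)
```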

##### Slides:

Reinforcement learning studies decision making and control, and how a decision-making agent can learn to act optimally in a previously unknown environment. Deep reinforcement learning studies how neural networks can be used in reinforcement learning algorithms, making it possible to learn the mapping from raw sensory inputs to raw motor outputs and removing the need to hand-engineer this pipeline. The aim of the tutorial is to introduce you to the most important techniques in deep reinforcement learning. This course will include hands-on labs, where you will implement the algorithms discussed in the lectures.

For the labs, you should have a working installation of OpenAI Gym. A Python-based autodiff library such as Theano or TensorFlow is highly recommended.
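
As a self-contained warm-up before the Gym-based labs, the sketch below runs tabular Q-learning on a toy 5-state chain MDP (the environment and constants are illustrative, not from the course):

```python
import random

# Toy chain MDP: action 1 moves right, action 0 moves left;
# reaching the right end gives reward 1 and ends the episode.
N, ACTIONS = 5, (0, 1)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    done = s2 == N - 1
    return s2, (1.0 if done else 0.0), done

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N)]
alpha, gamma = 0.5, 0.9

for _ in range(500):                 # episodes
    s, done = 0, False
    while not done:
        a = rng.choice(ACTIONS)      # Q-learning is off-policy: explore at random
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N - 1)]
assert greedy == [1, 1, 1, 1]        # learned: always move right
```

Deep RL replaces the table Q with a neural network; the update rule has the same shape.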

##### Slides:

##### Videos:

In the field of causality we are interested in answering questions about how a system reacts under interventions (e.g. in gene knock-out experiments). These questions go beyond statistical dependencies and can therefore not be answered by standard regression or classification techniques. While humans are very efficient at learning causal relations between a few random variables, we require automated procedures in situations where large and/or high-dimensional data sets are available.

Part I: We introduce structural equation models and formalize interventional distributions. We define causal effects and show how to compute them if the causal structure is known.
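A small simulation sketch of the difference between observing and intervening in a linear structural equation model with a confounder (coefficients chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# SEM with a confounder Z:  Z := N_z,  X := Z + N_x,  Y := X + 2 Z + N_y
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = X + 2 * Z + rng.normal(size=n)

# Observational (regression) slope: biased upward by the confounder.
obs_slope = np.cov(X, Y)[0, 1] / np.var(X)              # ≈ 2.0

# Interventional slope: do(X = x) cuts the Z -> X edge, so X is exogenous.
X_do = rng.normal(size=n)
Y_do = X_do + 2 * Z + rng.normal(size=n)
causal_slope = np.cov(X_do, Y_do)[0, 1] / np.var(X_do)  # ≈ 1.0, the true effect

assert abs(obs_slope - 2.0) < 0.05 and abs(causal_slope - 1.0) < 0.05
```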

Part II: We present three ideas that can be used to infer causal structure from data: (1) finding (conditional) independences in the data, (2) restricting structural equation models and (3) exploiting the fact that causal models remain invariant in different environments.

Part III: If time allows, we show how causal concepts can be applied in the field of machine learning.

##### Slides:

##### Videos:

Nowadays, large-scale human activity data from online social platforms, such as Twitter, Facebook, Reddit, Stackoverflow, Wikipedia and Yelp, are becoming increasingly available, at ever finer spatial and temporal resolutions. Such data provide great opportunities for understanding and modeling both macroscopic (network-level) and microscopic (node-level) patterns in human dynamics. They have also fueled increasing efforts to develop realistic representations and models, as well as learning, inference and control algorithms, to understand, predict, control and distill knowledge from these dynamic processes over networks.

It has emerged as a trend to take a bottom-up approach which starts by considering the stochastic mechanism driving the behavior of each node in a network to later produce global, macroscopic patterns at a network level. However, this bottom-up approach also raises significant modeling, algorithmic and computational challenges which require leveraging methods from machine learning, temporal point process theory, probabilistic modeling and optimization. In the first part of the tutorial, we will explain how to utilize the theory of temporal point processes to create realistic representations and models for a wide variety of processes in networks, and then elaborate on efficient learning algorithms to fit the parameters of these models from fine-grained social activity data. In the second part, we will present several canonical inference and control problems in the context of processes over networks and introduce efficient machine learning algorithms to solve some of these problems.
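As a minimal sketch of the likelihood machinery used to fit temporal point process models, consider the simplest case of a homogeneous Poisson process (parameters are illustrative):

```python
import math
import random

rng = random.Random(0)
lam_true, T = 2.0, 500.0

# Simulate a homogeneous Poisson process on [0, T] via exponential gaps.
events, t = [], 0.0
while True:
    t += rng.expovariate(lam_true)
    if t > T:
        break
    events.append(t)

# Point-process log-likelihood: sum_i log lambda(t_i) - int_0^T lambda(s) ds.
def loglik(lam):
    return len(events) * math.log(lam) - lam * T

# For constant intensity the MLE is N / T; check it maximizes the likelihood.
lam_hat = len(events) / T
assert abs(lam_hat - lam_true) < 0.2
assert loglik(lam_hat) >= max(loglik(0.9 * lam_hat), loglik(1.1 * lam_hat))
```

Fitting richer models (e.g. Hawkes processes) maximizes the same form of log-likelihood with a history-dependent intensity.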

##### Slides:

##### Videos:

Representations are pivotal to learning, inference and computation. Inductive priors and regularization can be phrased very naturally by the choice of representation, and learning can be described in terms of representational changes. I will use the example of visual perception to motivate and discuss important topics in machine learning such as unsupervised and supervised representation learning, generative modeling and model comparison. I will provide some background on understanding visual representations in neuroscience along the way and I will focus particularly on the breakthrough of deep convolutional neural networks in computer vision. Finally, I will discuss important open problems in building robust perceptual systems.

##### Slides:

##### Videos:

The study of brain function requires collecting and analyzing highly complex and multivariate datasets. Modern machine learning techniques are useful at several stages of the analysis. First, unsupervised learning techniques based on tools such as non-negative matrix factorization help identify the relevant features and underlying structure of the data. Second, statistical analysis based on kernel embeddings of distributions helps identify complex interactions between different aspects of neural activity. Finally, causal inference allows estimating the directionality of information transfer across brain networks. During this tutorial, we will implement and use some of these tools to analyze intercortical recordings and explain how they help neuroscientists understand brain function.
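
A sketch of the first of these tools, non-negative matrix factorization via the Lee-Seung multiplicative updates, on synthetic data standing in for recordings (dimensions and iteration counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "recordings": a nonnegative mixture of r latent components.
n, m, r = 30, 40, 3
V = rng.random((n, r)) @ rng.random((r, m))

# Lee-Seung multiplicative updates for V ≈ W H under the Frobenius loss;
# the updates keep both factors nonnegative by construction.
W = rng.random((n, r)) + 0.1
H = rng.random((r, m)) + 0.1
eps = 1e-12
for _ in range(1000):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
assert err < 0.05 and (W >= 0).all() and (H >= 0).all()
```

The rows of H then play the role of interpretable, parts-based features of the data.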

##### Slides:

##### Videos:

Machine learning is a core technique for many companies. However, despite the increasing ties between industry and academia, there are still misconceptions about the important aspects of machine learning in the industry. We will see how machine learning can help create new models, but also impact the processes and the decisions. Through a concrete example, we will explore the interplay between the scientific aspects and the technical constraints of a production environment. Finally, we will end on some challenges for the years to come.

##### Slides:

##### Videos:

ML is making inroads into a variety of applications in engineering and the sciences, unraveling complex patterns in increasingly rich data. Much of this success is due to procedures such as SVMs, random forests, boosting, neural nets and Gaussian processes, all nonparametric in essence.

These lectures will attempt, in 3 hours, to cover the building blocks of nonparametric (minimax) analysis. In particular, as time permits, we’ll try to shed light on aspects of data that make nonparametric problems difficult or easy, and we’ll cover some of the essential insights in understanding the behavior of nonparametric procedures. Finally, we’ll discuss some of the essential research directions, e.g. semi-parametric models, nonparametrics on structured data, and real-world time and space constraints.

Many problems in machine learning that involve discrete structures or subset selection may be phrased in the language of submodular set functions. The property of submodularity, also referred to as a 'discrete analog of convexity', expresses the notion of diminishing marginal returns, and captures combinatorial versions of rank and dependence. Submodular functions occur in a variety of areas including graph theory, information theory, combinatorial optimization, stochastic processes and game theory. In machine learning, they emerge in different forms as the potential functions of graphical models, as utility functions in active learning and sensing, in models of diversity, in structured sparse estimation or network inference. The lectures will give an introduction to the theory of submodular functions, example applications in machine learning, and algorithms for optimization with submodular functions.
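
A sketch of the classical greedy algorithm for maximizing a monotone submodular function, here a coverage function (the sets are a made-up example); for such functions the greedy solution is guaranteed to be within a factor 1 - 1/e of optimal:

```python
# Greedy maximization of a monotone submodular function: max coverage.
universe = set(range(10))
sets = {
    "a": {0, 1, 2, 3},
    "b": {3, 4, 5},
    "c": {5, 6, 7, 8},
    "d": {8, 9},
    "e": {0, 4, 8},
}

def coverage(chosen):
    covered = set()
    for s in chosen:
        covered |= sets[s]
    return len(covered)

def greedy(k):
    chosen = []
    for _ in range(k):
        # Pick the set with the largest marginal gain; submodularity means
        # these gains can only shrink as the solution grows.
        best = max(sets, key=lambda s: coverage(chosen + [s]) - coverage(chosen))
        chosen.append(best)
    return chosen

sol = greedy(3)
assert coverage(sol) >= 9  # three greedy picks cover almost everything
```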

##### Videos:

Classification and regression from data require approximating functions in high-dimensional spaces. Avoiding the curse of dimensionality raises issues in many branches of mathematics, including statistics, probability, harmonic analysis and geometry. Recently, deep convolutional networks have obtained spectacular results for image understanding, audio recognition, natural language analysis and all kinds of data analysis problems.

We shall review deep network architectures and analyze their mathematical properties, many of which remain open questions. These architectures implement non-linear multiscale contractions and sparse separations. We introduce mathematical tools that play an important role in understanding their properties, including wavelet transforms and the action of groups of symmetries. The course will emphasize open mathematical mysteries.
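
As a small illustration of the multiscale tools mentioned above, a Haar wavelet decomposition splits a signal into coarse averages and details at every scale; being orthonormal, it preserves energy (Parseval):

```python
import numpy as np

def haar_step(x):
    """One level of the Haar transform: local averages and differences."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # coarse approximation (low-pass)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients (high-pass)
    return a, d

def haar(x):
    """Full multiscale decomposition of a length-2^J signal."""
    coeffs = []
    while len(x) > 1:
        x, d = haar_step(x)
        coeffs.append(d)
    coeffs.append(x)                        # final coarse average
    return coeffs

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
coeffs = haar(x)
energy = sum(float(np.sum(c ** 2)) for c in coeffs)
assert np.isclose(energy, np.sum(x ** 2))   # orthonormal: energy preserved
```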

We will describe applications to image and audio classification, but also to statistical physics models and to the regression of molecule energies in quantum chemistry.

##### Slides:

##### Videos:

In recent years the multi-armed bandit problem has attracted a lot of attention in the theoretical learning community. This growing interest is a consequence of the large number of problems that can be modeled as a multi-armed bandit: ad placement, website optimization, packet routing, etc. Furthermore, the bandit methodology is also used as a building block for more complicated scenarios such as reinforcement learning, model selection in statistics, or computer game-playing. While the basic stochastic multi-armed bandit can be traced back to Thompson (1933) and Robbins (1952), it is only very recently that we obtained an (almost) complete understanding of this simple model. Moreover, many extensions of the original problem have been proposed in the past fifteen years, such as bandits without a stochastic assumption (the so-called adversarial model), or bandits with a very large (but structured) set of arms.

This tutorial will cover in detail the state of the art for the basic multi-armed bandit problem (both stochastic and adversarial), and the information-theoretic analysis of Bayesian bandit problems. We will also touch upon contextual bandits, as well as the case of very large (possibly infinite) sets of arms with linear/convex/Lipschitz losses.
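
A sketch of the classical UCB1 strategy for the basic stochastic bandit (Bernoulli arms; constants are illustrative): pull the arm with the highest empirical mean plus a confidence bonus.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """UCB1 on Bernoulli arms: optimism in the face of uncertainty."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1            # initialize: pull each arm once
        else:
            # empirical mean + confidence bonus shrinking with pull count
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += r
    return counts

counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
assert counts[2] > counts[0] and counts[2] > counts[1]  # best arm pulled most
```

Suboptimal arms end up pulled only O(log n) times, which is the source of the logarithmic regret bounds discussed in the tutorial.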

Nonparametric Bayesian methods make use of infinite-dimensional mathematical structures to allow the practitioner to learn more from their data as the size of their data set grows. What does that mean, and how does it work in practice? In this tutorial, we'll cover why machine learning and statistics need more than just parametric Bayesian inference. We'll introduce such foundational nonparametric Bayesian models as the Dirichlet process and Chinese restaurant process and touch on the wide variety of models available in nonparametric Bayes. Along the way, we'll see what exactly nonparametric Bayesian methods are and what they accomplish.
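
A sketch of sampling table assignments from the Chinese restaurant process (variable names are mine): each customer joins an existing table with probability proportional to its size, or starts a new one with probability proportional to alpha.

```python
import random

def chinese_restaurant_process(n, alpha, seed=0):
    """Seat n customers; the table sizes give one draw from the CRP(alpha)."""
    rng = random.Random(seed)
    tables = []                       # number of customers at each table
    for i in range(n):                # customer i arrives with i already seated
        r = rng.random() * (i + alpha)
        acc = 0.0
        for t, size in enumerate(tables):
            acc += size
            if r < acc:
                tables[t] += 1        # join table t w.p. size / (i + alpha)
                break
        else:
            tables.append(1)          # new table w.p. alpha / (i + alpha)
    return tables

tables = chinese_restaurant_process(n=1000, alpha=3.0)
assert sum(tables) == 1000
# The number of occupied tables grows like alpha * log(n): the model's
# complexity adapts to the amount of data, the hallmark of nonparametric Bayes.
```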

##### Slides:

##### Videos:
