Speakers

Neil Lawrence received his bachelor's degree in Mechanical Engineering from the University of Southampton in 1994. Following a period as a field engineer on oil rigs in the North Sea, he returned to academia to complete his PhD in 2000 at the Computer Laboratory, University of Cambridge. He spent a year at Microsoft Research in Cambridge before leaving to take up a Lectureship at the University of Sheffield, where he was subsequently appointed Senior Lecturer in 2005. In January 2007 he took up a post as Senior Research Fellow at the School of Computer Science at the University of Manchester, where he worked in the Machine Learning and Optimisation research group. In August 2010 he returned to Sheffield to take up a collaborative Chair in Neuroscience and Computer Science.

Neil's main research interest is machine learning through probabilistic models. He focuses on both the algorithmic side of these models and their application. He has a particular focus on applications in personalized health and computational biology, but happily dabbles in other areas such as speech, vision and graphics.

Neil was Associate Editor-in-Chief of IEEE Transactions on Pattern Analysis and Machine Intelligence (2011-2013) and is an Action Editor for the Journal of Machine Learning Research. He was the founding editor of the JMLR Workshop and Conference Proceedings (2006) and is currently its series editor. He was an area chair for the NIPS conference in 2005, 2006, 2012 and 2013, Workshops Chair in 2010 and Tutorials Chair in 2013. He was General Chair of AISTATS in 2010 and AISTATS Programme Chair in 2012. He was Program Chair of NIPS in 2014 and General Chair in 2015.

In this first session we will introduce Gaussian process models: nonparametric Bayesian models that allow principled propagation of uncertainty in regression analysis. We will assume a background in parametric models, linear algebra and probability.
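The core computation behind Gaussian process regression can be sketched in a few lines of numpy. This is an illustrative toy, not the session's materials; the RBF kernel, the toy sine data and the noise level are my own choices:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Toy training data (illustrative values only)
X = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
y = np.sin(X)
noise = 1e-2

# GP posterior at test inputs: mean = K_s K^-1 y, var = diag(K_ss - K_s K^-1 K_s^T)
Xs = np.linspace(-5, 5, 100)
K = rbf(X, X) + noise * np.eye(len(X))
Ks = rbf(Xs, X)
Kss = rbf(Xs, Xs)

L = np.linalg.cholesky(K)                       # stable solve via Cholesky
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = Ks @ alpha
v = np.linalg.solve(L, Ks.T)
var = np.diag(Kss) - np.sum(v ** 2, axis=0)     # predictive variance per test point
```

The predictive variance shrinks near the training inputs and grows away from them, which is the "principled propagation of uncertainty" the session refers to.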

In the second session we will look at how Gaussian process models are related to Kalman filters and how they may be extended to deal with multiple outputs and mechanistic models.

In the third session we will look at latent variable models from a Gaussian process perspective with a particular focus on dimensionality reduction.

Arthur Gretton is a Reader (Associate Professor) with the Gatsby Computational Neuroscience Unit, CSML, UCL, which he joined in 2010. He received degrees in physics and systems engineering from the Australian National University, and a PhD with Microsoft Research and the Signal Processing and Communications Laboratory at the University of Cambridge. He worked from 2002-2012 at the MPI for Biological Cybernetics, and from 2009-2010 at the Machine Learning Department, Carnegie Mellon University.

Arthur's research interests include machine learning, kernel methods, statistical learning theory, nonparametric hypothesis testing, blind source separation, Gaussian processes, and non-parametric techniques for neural data analysis. He has been an associate editor at IEEE Transactions on Pattern Analysis and Machine Intelligence from 2009 to 2013, an Action Editor for JMLR since April 2013, a member of the NIPS Program Committee in 2008 and 2009, an Area Chair for ICML in 2011 and 2012, and a member of the COLT Program Committee in 2013. Arthur was co-chair of AISTATS in 2016 (with Christian Robert).

This lecture covers the definition of a kernel as a dot product between features. Features might be constructed explicitly for domain-specific learning problems (e.g. custom kernels for text or image classification), or more generically, so that functions built from these features are smooth. I will show how to combine simpler kernels to make new kernels, and describe how to interpret such combinations. I will then construct infinite-dimensional feature spaces, and show how to enforce smoothness for functions of these (infinitely many) features. Finally, I will describe the reproducing property and the kernel trick, and cover some simple kernel algorithms.
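The "kernel as a dot product between features" idea can be checked numerically. Below is a small sketch (my own example, not the lecture's) using the degree-2 homogeneous polynomial kernel, whose explicit feature map in two dimensions is known in closed form:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2-D degree-2 homogeneous polynomial kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def k(x, y):
    """Degree-2 polynomial kernel (x . y)^2, evaluated without building features."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# The kernel computes the feature-space dot product implicitly (the kernel trick).
assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))
```

Sums and products of kernels are again kernels, which is what makes the "combine simpler kernels" part of the lecture possible.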

The second lecture covers mappings of probabilities to reproducing kernel Hilbert spaces. The distance between these mappings is known as the maximum mean discrepancy (MMD), and has two interpretations: most straightforwardly, as a distance between expected features, but also as an integral probability metric (a "witness" function is sought which reveals areas of large difference in probability mass). I will describe conditions on kernels which ensure that distribution embeddings are unique, meaning that the distance between embeddings can be used to distinguish the underlying distributions. Such kernels are known as characteristic kernels. Finally, I will describe a hypothesis test which allows us to determine whether an empirical difference between two distributions is statistically significant. Applications include distinguishing neural recordings in the presence or absence of spike bursts, and distinguishing amplitude-modulated audio from different recordings.
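A minimal sketch of the empirical MMD, assuming a Gaussian kernel (a common but not mandated choice) and the simple biased estimator; the data and bandwidth are illustrative:

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y, Gaussian kernel."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    # ||mean feature of X - mean feature of Y||^2 in the RKHS
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd2_biased(rng.normal(0, 1, (200, 1)), rng.normal(0, 1, (200, 1)))
diff = mmd2_biased(rng.normal(0, 1, (200, 1)), rng.normal(2, 1, (200, 1)))
```

Samples from the same distribution give a much smaller MMD than samples from different ones; the hypothesis test in the lecture asks when such a gap is statistically significant.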

The third lecture will cover a variety of advanced topics. These will include: testing independence and higher-order interactions using kernel embeddings of distributions (for instance, whether text in two different languages is dependent, even if we don't know how to translate between them), interactions between three variables (e.g. when two variables jointly cause a third), choice of kernels to optimise test power, and use of distribution embeddings to perform regression when the inputs are distributions (for example, to regress from samples of aerosol data to air pollution levels, or to speed up expectation propagation by using regression to "cache" the EP updates).

Jon Shlens has been a research scientist at Google since 2010. Prior to joining Google Research he was a research fellow at the Howard Hughes Medical Institute and a Miller Fellow at UC Berkeley. His research interests include machine perception, statistical signal processing, machine learning and biological neuroscience. During his time at Google, Jon has been a core contributor to TensorFlow, an open-source machine learning system, leading the effort to develop the first large-scale distributed vision systems. In addition, Jon has focused his research efforts on developing new techniques for training machine learning systems and developing architectures for artificial vision systems.

Deep learning has profoundly changed the field of computer vision in the last few years. Many computer vision problems have been recast with techniques from deep learning and in turn achieved state-of-the-art results and become industry standards. Over the course of several lectures I will provide an overview of the central ideas of deep learning as applied to computer vision. I will build on these ideas to describe the current state of research in artificial vision, focusing on the topics of image recognition, object localization and image synthesis. As part of these lectures I will also describe several of the tools available to a machine learning practitioner and provide some introductory tutorial material. The goal of these lectures is to teach the core ideas, provide a high-level overview of how deep learning has influenced computer vision and, finally, provide a series of hands-on tools so that students may get started applying these ideas.

Chris Wiggins is an associate professor of applied mathematics at Columbia University and the Chief Data Scientist at The New York Times. At Columbia he is a founding member of the executive committee of the Data Science Institute and of the Department of Systems Biology, and is affiliated faculty in Statistics. He is a co-founder and co-organizer of hackNY (http://hackNY.org), a nonprofit which since 2010 has organized once-a-semester student hackathons and the hackNY Fellows Program, a structured summer internship at NYC startups. Prior to joining the faculty at Columbia he was a Courant Instructor at NYU (1998-2001) and earned his PhD at Princeton University (1993-1998) in theoretical physics. In 2014 he was elected Fellow of the American Physical Society and is a recipient of Columbia's Avanessians Diversity Award.

The Data Science group at The New York Times develops and deploys machine learning solutions to newsroom and business problems. Re-framing real-world questions as machine learning tasks requires not only adapting and extending models and algorithms to new or special cases but also sufficient breadth to know the right method for the right challenge. I'll first outline how unsupervised, supervised, and reinforcement learning methods are increasingly used in human applications for description, prediction, and prescription, respectively. I'll then focus on the 'prescriptive' cases, showing how methods from the reinforcement learning and causal inference literatures can be of direct impact in engineering, business, and decision-making more generally.

Yoram Singer leads a small research group at Google which focuses on machine learning principles. Prior to his position at Google he was an associate professor at the Hebrew University of Jerusalem, Israel. He was the co-chair of COLT in 2004 and NIPS in 2007. His work in machine learning received a few minor awards.

Online and stochastic optimization are well-established tools for machine learning problems, with both theoretical and practical appeal. The tutorial starts with a simple example of predicting the next element of a binary sequence. We then formally introduce the basic definitions. Next we describe the problem of prediction with expert advice by analyzing a few algorithms and contrasting them with an impossibility result. This basic setting is then re-examined in the context of online learning of general convex functions. We then introduce the stochastic optimization view and discuss the stochastic gradient method (SGM) while relating it to online learning. We next build on SGM and describe proximal methods, namely mirror descent and dual averaging. We show how to modify proximal methods in order to adapt to the unknown geometry of a learning problem via the AdaGrad algorithm. We conclude by discussing analyses and modifications of stochastic optimization for non-convex problems in deep learning. The tutorial will include theoretical and coding questions.
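The per-coordinate adaptation that AdaGrad performs can be sketched in a few lines. This is a minimal illustration (objective, step size and iteration count are my own choices, not the tutorial's): each coordinate's step size is scaled by the accumulated squared gradients, which makes the method insensitive to badly scaled coordinates.

```python
import numpy as np

def adagrad(grad, x0, eta=0.5, eps=1e-8, steps=200):
    """Minimal AdaGrad: per-coordinate steps scaled by accumulated squared gradients."""
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)                    # running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        G += g ** 2
        x -= eta * g / (np.sqrt(G) + eps)   # adaptive per-coordinate step size
    return x

# Illustrative convex objective f(x) = x1^2 + 10*x2^2 with badly scaled coordinates.
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x_star = adagrad(grad, [5.0, 5.0])
```

Note that both coordinates follow the same trajectory despite the 10x difference in curvature: the accumulated-gradient normalization cancels the per-coordinate scale.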

Francisco is the CEO of BigML, Inc., where he helps conceptualize, design, architect, and implement BigML's distributed Machine Learning platform. Formerly, Francisco founded and led Strands, Inc., a company that pioneered behavior-based recommender systems. Previously, he founded and led Intelligent Software Components, SA (iSOCO), the first spin-off of the Spanish National Research Council (CSIC). He holds a 5-year degree in Computer Science and a Ph.D. in Artificial Intelligence, and completed a post-doc in Machine Learning. He is the holder of 16 patents in the areas of Recommender Systems and Distributed Machine Learning.

This class will address how to use single decision trees and different ensemble strategies to solve classification and regression problems. You will learn how to create, evaluate, and compare the performance of different predictive models without writing a single line of code using a web-based dashboard and how to automate basic workflows using a Machine Learning API. No previous understanding or experience with Machine Learning is required. Basic Python programming is desirable.

This class will address how to perform cluster analysis using K-means and G-means and how to perform anomaly detection using Isolation Forests. You will learn how to create clusters and anomaly detectors without writing a line of code using a web-based dashboard and how to automate basic workflows using a Machine Learning API. No previous understanding or experience with Machine Learning is required. Basic Python programming is desirable.
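For intuition about what the dashboard does under the hood, here is a minimal numpy implementation of Lloyd's K-means algorithm. This is not the BigML dashboard or API; the data and parameters are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain K-means (Lloyd's algorithm): alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # init from data points
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                            # assign to nearest center
        centers = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels, centers

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
labels, centers = kmeans(X, 2)
```

G-means extends this by growing k automatically, splitting any cluster whose points fail a Gaussianity test; isolation forests instead score anomalies by how few random splits are needed to isolate a point.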

In this class, we will teach you how to fully automate Machine Learning workflows: from very basic workflows that automate repetitive tasks to advanced workflows that automate feature selection or hyperparameter optimization. Basic Machine Learning understanding and basic functional programming are desirable.

Cedric Archambeau is a Senior Machine Learning Scientist with Amazon, Berlin. He manages the algorithms team and has served as a technical advisor to Sebastian Gunningham, Amazon Senior Vice President of Seller Services. Recently, his team delivered the learning algorithms offered in Amazon Machine Learning (aws.amazon.com/machine-learning). He is interested in large-scale probabilistic inference and Bayesian optimization. He holds a visiting position in the Centre for Computational Statistics and Machine Learning at University College London. Prior to joining Amazon, he led the Machine Learning and Mechanism Design area at Xerox Research Centre Europe, Grenoble.

Applying complex predictive systems, such as machine learning-based systems, in the wild requires manually tuning and adjusting knobs, broadly referred to as system parameters or hyperparameters. To democratize machine learning and reduce the maintenance cost of such systems, it is essential to automate this process. Black-box optimization, and in particular Bayesian optimization, provides a natural framework for addressing this problem by taking the human expert out of the fine-tuning loop. In this talk I will introduce the building blocks of Bayesian optimization, discuss some open problems, and present two applications.
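The building blocks mentioned above are a probabilistic surrogate (commonly a GP) and an acquisition function that decides where to evaluate next. Below is a compact sketch of one common instantiation, expected improvement over a 1-D grid; the objective, kernel, lengthscale and grid are all my own illustrative choices:

```python
import math
import numpy as np

def rbf(A, B, ls=0.3):
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

def expected_improvement(mu, sigma, best):
    """EI acquisition for minimization: expected gain over current best value."""
    sigma = np.maximum(sigma, 1e-12)
    z = (best - mu) / sigma
    Phi = 0.5 * (1 + np.array([math.erf(v / math.sqrt(2)) for v in z]))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return sigma * (z * Phi + phi)

f = lambda x: (x - 0.7) ** 2              # hypothetical expensive black-box objective
grid = np.linspace(0, 1, 200)
X, y = list(np.array([0.1, 0.9])), list(f(np.array([0.1, 0.9])))

for _ in range(10):                        # BO loop: fit GP, maximize EI, evaluate
    Xa, ya = np.array(X), np.array(y)
    K = rbf(Xa, Xa) + 1e-6 * np.eye(len(Xa))
    Ks = rbf(grid, Xa)
    mu = Ks @ np.linalg.solve(K, ya)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    ei = expected_improvement(mu, np.sqrt(np.maximum(var, 0)), min(y))
    x_next = grid[int(np.argmax(ei))]
    X.append(x_next)
    y.append(f(x_next))

best_x = X[int(np.argmin(y))]
```

The acquisition function balances exploring regions of high posterior uncertainty against exploiting regions with low predicted objective, which is what lets the loop replace manual knob tuning.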

Within Amazon, a company with over 200 million active customers, over 2 million active seller accounts and over 180,000 employees, there are hundreds of problems which can be tackled with Machine Learning. In this talk, I will give an overview of a number of Machine Learning applications. I will explain how they fit within the Amazon ecosystem, the challenges we are facing and how they help us scale. While Machine Learning is routinely used in recommendation, fraud detection and ad allocation, it also plays a key role in devices such as the Kindle and the Echo, as well as in the automation of Kiva-enabled fulfilment centres, statistical machine translation and automated Fresh produce inspection.

Finale Doshi-Velez is an assistant professor in Computer Science at Harvard. She completed her Master's at the University of Cambridge, her PhD at MIT, and her postdoc at Harvard Medical School. She is interested in intersections of machine learning and medicine.

Reinforcement learning (RL) is a framework for learning from experience to solve problems involving a sequence of decisions with uncertain outcomes. This setting is particularly challenging, as actions may have long-term, overlapping effects; however, it is representative of many real-world problems. RL has been used in applications ranging from game-playing agents to operations research and robotics. This short course will focus on understanding the core foundations and questions in sequential decision-making. We will discuss Markov Decision Processes; classic learning and planning methods; algorithms to manage the exploration-exploitation trade-off; standard approximation methods; and the core ideas behind recent advances in deep RL and Bayesian RL. In addition to lectures, this short course will include hands-on programming and conceptual exercises.
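As a taste of the classic learning methods the course covers, here is tabular Q-learning on a toy chain MDP (my own example, not the course's exercises). The epsilon-greedy rule and optimistic initialization are two simple ways of managing the exploration-exploitation trade-off mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny chain MDP: states 0..4, actions 0 = left, 1 = right; reward 1 on reaching state 4.
n_states, n_actions, goal = 5, 2, 4

def step(s, a):
    s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == goal)

# Optimistic initialization encourages systematic exploration of untried actions.
Q = np.ones((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(500):
    s = int(rng.integers(goal))            # random non-goal start state
    for _ in range(20):
        # epsilon-greedy action selection on the current Q estimates
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        target = r if s2 == goal else r + gamma * Q[s2].max()   # goal is terminal
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
        if s == goal:
            break

policy = Q.argmax(axis=1)                  # greedy policy from the learned values
```

After training, the greedy policy moves right in every non-goal state, and the learned values decay geometrically with distance from the reward, reflecting the discount factor.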

Sinead Williamson is an Assistant Professor of Statistics at the University of Texas at Austin, jointly appointed between the Statistics and Data Science department and the McCombs School of Business. Her research focuses on Bayesian nonparametric methods, in particular the construction of new nonparametric models for machine learning applications and the development of scalable inference algorithms for Bayesian nonparametrics. She is currently working on applications including modeling social interactions and predicting the deterioration level of road surfaces.

Bayesian nonparametric methods allow us to extend models such as a Bayesian mixture of Gaussians to have infinitely many parameters a priori – for example infinitely many Gaussian mixture components. When modeling a given data set, we use a finite – but random – subset of these infinitely many parameters, effectively inferring the appropriate dimensionality. In this series of lectures, we will discuss the Bayesian nonparametric paradigm, and explore some of the key methods in Bayesian nonparametrics: The Dirichlet process, the Indian buffet process, and the hierarchical Dirichlet process. We will explore how we can use these models in a machine learning setting, and derive MCMC inference algorithms for the basic models.
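The Dirichlet process's "finite but random subset of infinitely many parameters" behaviour is easiest to see through its Chinese restaurant process representation. A minimal sampling sketch (concentration parameter and sample size are illustrative):

```python
import numpy as np

def crp(n, alpha, rng):
    """Sample table assignments for n customers from a CRP with concentration alpha."""
    assignments = [0]
    counts = [1]                      # customers per table
    for i in range(1, n):
        # join table k with prob counts[k]/(i+alpha); open a new table with prob alpha/(i+alpha)
        p = np.array(counts + [alpha], dtype=float)
        k = int(rng.choice(len(p), p=p / p.sum()))
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

rng = np.random.default_rng(0)
assignments, counts = crp(1000, alpha=2.0, rng=rng)
```

Although infinitely many tables (mixture components) are available a priori, the number actually used grows only logarithmically with the number of customers, which is how the model infers an appropriate dimensionality from data.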

Dilan Gorur is a senior machine learning scientist at Microsoft, working on Bing. Prior to joining Microsoft, she worked as a machine learning scientist at Yahoo! Labs and as a postdoc at the University of California, Irvine and at the Gatsby Unit, UCL. Her research interests lie in the theory and applications of machine learning, focusing on probabilistic models, Bayesian inference and large-scale applications. She served as an area chair for NIPS 2013, NIPS 2014 and AISTATS 2016.

I aim to give a pragmatic view of machine learning while giving a high level view of web search. In this series of talks, I will introduce practical aspects of machine learning in industry, starting from experiment design to modeling and evaluation. Information retrieval and ranking are two widely known techniques that are essential for web search, but there is more to it. I will use example applications from lesser known aspects of constructing the search results page (SERP). Specifically, I will focus on two machine learning applications: whole page optimization for SERP layout and counterfactual reasoning for tuning marketplace operating points.

Fernando Pérez-Cruz (IEEE Senior Member) was born in Sevilla, Spain, in 1973. He received a PhD in Electrical Engineering in 2000 from the Technical University of Madrid and an MSc/BSc in Electrical Engineering from the University of Sevilla in 1996. He is a member of the technical staff at Bell Labs and an Associate Professor with the Department of Signal Theory and Communication at University Carlos III in Madrid. He has been a visiting professor at Princeton University under a Marie Curie Fellowship and a Research Scientist at Amazon. He has also held positions at the Gatsby Unit (London), the Max Planck Institute for Biological Cybernetics (Tuebingen), BioWulf Technologies (New York), the Technical University of Madrid and Alcala University (Madrid). His current research interest lies in machine learning and information theory and their application to signal processing and communications. Fernando is the General Chair of AISTATS 2016 and of the Machine Learning Summer School 2016 in Arequipa, and has organized many machine learning (NIPS) and information theory (IEEE ITW) conferences. Fernando has supervised 7 PhD students and numerous MSc students, as well as one junior and one senior Marie Curie Fellow. Fernando has published over 40 papers in leading academic journals, as well as over 60 peer-reviewed conference papers. A detailed CV and list of publications can be accessed at http://www.tsc.uc3m.es/~fernando.

In machine learning, there are two general approaches for learning from data: discriminative and generative modeling. Discriminative approaches solve a clear supervised learning task, in which from a labeled dataset we want to predict the output given any potential input. The solutions to this problem range from Fisher's Linear Discriminant Analysis to Hinton's Deep Belief Networks. Generative approaches build a probability density model for the data. They do not have a clear task or metric at hand, so potentially they could solve any task. What attracts me toward generative models is their ability to organize the messy data that we have nowadays. In this sense, the generative model has to be human-interpretable and actionable, and/or point towards causal interactions.

Discriminative learning is a one-way conversation: given the labeled data and the metric, build a machine that is as accurate as possible. Generative modeling is a two-way conversation in which the model evolves between the owner of the data (the expert in the field) and the machine-learning practitioner who builds the model. The information has to flow in both directions: the model has to include all the prior, intangible information from the owner of the data, and the model has to be understandable by the data owner, because here the metric is learning and understanding the data, not reducing some error measure.

In this talk, we cover how Bayesian nonparametric (BNP) models can be used to find hidden patterns in data. BNPs return latent variable models that, following de Finetti's theorem, find latent variables given which the observed data are conditionally independent. From the analysis of this latent variable model, we can gain knowledge about the system that generated the data and draw actionable conclusions, or propose tests that would verify the discovered connections.

In the final part of the presentation we focus on its application to modeling psychiatric disorders. In the latent variable representation of the data, the first conclusions are, as expected, the known disorders. But the latent variable model returns much more information and subtleties that show the interactions between the different disorders and the questions asked to uncover them.

A system that provides accurate wide-area indoor-and-outdoor localization will revolutionize the world, because a universal localization service has potential game-changing applications in many industries, including health, security, commerce, gaming, transportation, planning and smart cities, among others. Geo-localization, whether from received power or time-of-arrival (ToA) information, is a machine-learning problem, because we need to learn the environment to achieve the necessary accuracy. In this talk, we propose a generative algorithm for ToA geo-localization that builds a probabilistic model for all the sources of error. Specifically, it models the power-delay-profile bias and the errors in the access points to provide meter-level accuracy. We illustrate our algorithm using an ad-hoc network and a live LTE network, which shows that it can be deployed in large metropolitan networks.