Research

...

Reaping the benefits of machine learning in scientific and high-stakes applications imposes three primary demands on ML methods: (1) accuracy and expressive power; (2) reliability; and (3) computational and statistical efficiency.
Attaining all three goals relies on a common core principle: the method must capture all relevant structure in the data – e.g., (combinatorial) dependencies, constraints, invariances – while ignoring unimportant variations. But how can we efficiently build such desiderata into an ML method? And what is this relevant structure? To answer these questions, my research develops representations that leverage various forms of mathematical structure: combinatorial, geometric, algebraic and computational.
Below is a summary of a selection of research projects that my group has been working on. A full list of publications is here.

  • 1. Modeling and representing structured data.

    Application domains generate data in various forms, e.g., networks, molecules, configurations, subsets or images. Moreover, data often has specific geometric properties. Suitably representing such data is a central topic in ML.
  • 2. Learning with limited supervision.

    High-quality labels increase accuracy but are expensive. We improve accuracy and efficiency of, e.g., sequential label queries (Bayesian black-box optimization) and self-supervised learning, while advancing their theoretical understanding.
  • 3. Understanding robustness.

    We identify new methods for perturbation-robust learning and for understanding ML methods’ behavior under shifts of the data distribution.

  • 1. Modeling and representing structured data.

    Graph representation learning (Graph Neural Networks) and learning combinatorial algorithms

    Learning problems on graphs are manifold: predicting from molecules for drug development or materials science; forecasting physical systems; improving discrete optimization solvers and chip design; powering recommender systems; and many others. Graph Neural Networks (GNNs) have become the standard for such applications, and have gained tremendous interest in ML. But despite their wide empirical success, the theoretical foundations of GNNs are only beginning to be understood. Learning in neural networks depends on the model architecture, the learning task, the data distribution and the learning algorithm, and this rich interplay makes the analysis challenging. Our results characterize (1) the representational power and (2) the learning behavior (generalization within and out of distribution) of GNNs. Set functions, invariances and combinatorial structure play a key role in these works.

    Representational Power of GNNs

    Which graph functions can GNNs approximate? First, we frame this question from the viewpoint of graph isomorphism, and show that GNNs' discriminative power is upper bounded by the Weisfeiler-Leman algorithm. A sufficient condition for achieving this power is to have injective aggregation and readout functions, as in our GIN (Graph Isomorphism Network). Second, from a graph-theoretic viewpoint, we show that message passing GNNs (and variations) cannot represent many structural graph properties, such as diameter, girth, or the presence of specific cycles. Third, looking at the computational structure, we relate GNNs to dynamic programming.
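
    To make the injectivity condition concrete, here is a minimal numpy sketch of a GIN-style update: neighbor features are aggregated by a sum (which, with an MLP on top, can be made injective on multisets of features), followed by a sum readout over nodes. The toy graph, weights and dimensions are illustrative placeholders, not taken from any of the papers.

```python
import numpy as np

def gin_layer(H, A, W1, b1, W2, b2, eps=0.0):
    """One GIN-style update: (1 + eps) * h_v + sum of neighbor features,
    followed by a two-layer MLP. H: (n, d) node features, A: (n, n) adjacency."""
    agg = (1.0 + eps) * H + A @ H          # injective sum aggregation over neighbors
    Z = np.maximum(agg @ W1 + b1, 0.0)     # MLP, first layer with ReLU
    return Z @ W2 + b2

# Toy usage on a 4-node path graph with random features and weights.
rng = np.random.default_rng(0)
n, d = 4, 8
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
H = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)
H1 = gin_layer(H, A, W1, b1, W2, b2)
graph_embedding = H1.sum(axis=0)           # sum readout over nodes
```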

    K. Xu, J. Li, M. Zhang, S. Du, K. Kawarabayashi, S. Jegelka. What Can Neural Networks Reason About? ICLR 2020. Spotlight
    V. K. Garg, S. Jegelka, T. Jaakkola. Generalization and Representational Limits of Graph Neural Networks. ICML 2020.
    K. Xu, W. Hu, J. Leskovec and S. Jegelka. How Powerful are Graph Neural Networks? ICLR 2019. Oral Presentation

    Learning, Generalization and Extrapolation

    What kinds of functions do GNNs learn well? We analyze generalization within and out of (the training) distribution. First, we analyze sample complexity via a Rademacher bound. Second, for algorithmic reasoning tasks and learning combinatorial algorithms, we quantify the relation between the computational structure of the task and the neural network via the notion of algorithmic alignment, and show its implications for generalization. Third, we extend this analysis to the out-of-distribution behavior of ReLU GNNs trained with gradient descent, and observe that nonlinearities play a decisive role. This implies that task-specific nonlinearities must be encoded in the architecture or input representation. Fourth, we analyze the optimization dynamics of (linear) GNNs.
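
    As a toy illustration of the role of nonlinearities (a sketch of the phenomenon, not an experiment from the papers): a ReLU network is piecewise linear, so far outside its training range it typically stops curving, unless the relevant nonlinearity is encoded in the architecture or the input features.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Fit a ReLU MLP to y = x^2 on [-1, 1], then query it far outside that range.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(512, 1))
y_train = x_train[:, 0] ** 2
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                   max_iter=5000, random_state=0).fit(x_train, y_train)

x_far = np.array([[4.0], [5.0], [6.0], [7.0]])
pred = mlp.predict(x_far)
# Far from the data, the second differences of a ReLU network's predictions
# are typically near zero (linear extrapolation), while the target keeps curving.
print(np.diff(pred, n=2))            # close to 0
print(np.diff(x_far[:, 0] ** 2, 2))  # constant 2: curvature of x^2
```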

    K. Xu, M. Zhang, S. Jegelka, K. Kawaguchi. Optimization of Graph Neural Networks: Implicit Acceleration by Skip Connections and More Depth. ICML 2021.
    A. Liao, H. Zhao, K. Xu, T. Jaakkola, G. Gordon, S. Jegelka, R. Salakhutdinov. Information Obfuscation of Graph Neural Networks. ICML 2021.
    K. Xu, M. Zhang, J. Li, S. S. Du, K. Kawarabayashi, S. Jegelka. How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks. ICLR 2021. Oral Presentation
    V. K. Garg, S. Jegelka, T. Jaakkola. Generalization and Representational Limits of Graph Neural Networks. ICML 2020.
    K. Xu, J. Li, M. Zhang, S. Du, K. Kawarabayashi, S. Jegelka. What Can Neural Networks Reason About? ICLR 2020. Spotlight




    Other deep learning theory

    We are interested in understanding (deep) representation learning more generally, too. In one recent work, we show that the deep ResNet architecture is a universal approximator even when every hidden layer has only one neuron. This is in contrast to fully connected networks, which need a width on the order of the input dimension.
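
    The architecture in question is easy to write down; below is a minimal numpy sketch of a forward pass through such a network, where each residual block contains a single hidden ReLU unit. The weights are random placeholders for illustration only.

```python
import numpy as np

def one_neuron_resnet(x, params):
    """Forward pass of a ResNet whose residual blocks each contain a single
    hidden ReLU unit: h <- h + v * relu(u . h + b).  x: (d,) input."""
    h = x.copy()
    for u, b, v in params:                 # u, v: (d,) vectors, b: scalar
        h = h + v * max(float(np.dot(u, h) + b), 0.0)
    return h

rng = np.random.default_rng(0)
d, depth = 3, 10
params = [(rng.normal(size=d), rng.normal(), rng.normal(size=d)) for _ in range(depth)]
print(one_neuron_resnet(rng.normal(size=d), params))
```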

    H. Lin, S. Jegelka. ResNet with one-neuron hidden layers is a Universal Approximator. NeurIPS 2018. Spotlight




    Clustering and graph partitioning


    LP Stability for graph partitioning problems

    Data in machine learning often arises from noisy measurements. When such data is used in an optimization problem, it is beneficial to know the stability of the optimal solution to perturbations in the data. We show a method for analyzing this stability for LP relaxations of graph partitioning problems. The method can handle the exponential number of constraints and applies to problems such as correlation clustering, clustering aggregation or modularity clustering.

    S. Nowozin and S. Jegelka. Solution stability in linear programming relaxations: graph partitioning and unsupervised learning. ICML 2009.

    Separating distributions by higher-order moments

    Many clustering criteria aim for clusters that are spatially separated. For example, the popular k-means criterion seeks clusters whose means are maximally far apart. If the data is assumed to be sampled from a mixture of distributions and we want to recover the underlying distributions, then spatial separation may not be the ideal criterion, e.g., for two Gaussians with the same mean but different variance. Using a kernel criterion, however, we can separate distributions by higher-order moments. This observation also explains the ability of the kernel k-means algorithm, for example, to separate distributions by moments other than the mean.
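
    A toy illustration of the point (not the kernel algorithm from the paper): two zero-mean Gaussians that differ only in variance cannot be told apart by clustering on location, but adding a second-moment feature, which a suitable kernel supplies implicitly, lets a mean-based criterion separate them.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
narrow = rng.normal(0.0, 1.0, size=500)     # component 1: small variance
wide = rng.normal(0.0, 5.0, size=500)       # component 2: same mean, large variance
x = np.concatenate([narrow, wide]).reshape(-1, 1)
truth = np.array([0] * 500 + [1] * 500)

def agreement(pred, truth):
    # account for arbitrary label switching
    return max(np.mean(pred == truth), np.mean(pred == 1 - truth))

# k-means on the raw samples splits by location (sign of x) and mixes the components.
km_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)
# Adding x^2 as a feature exposes the variance difference to a mean-based criterion.
km_moment = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.hstack([x, x ** 2]))

print("raw feature: ", agreement(km_raw, truth))
print("with x^2:    ", agreement(km_moment, truth))
```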

    S. Jegelka, A. Gretton, B. Schoelkopf, B.K Sriperumbudur and U. von Luxburg. Generalized clustering via kernel embeddings. KI 2009.

    Tensor Clustering

    The tensor clustering problem generalizes co-clustering (also called biclustering) from matrices to tensors. Informally, we aim to partition a given tensor into homogeneous "cubes". Formally, we seek the closest low-rank factorization of a particular form. We show the first approximation bounds for tensor clustering with metrics and Bregman divergences. This work also illustrates the limits of ignoring the "co" in co-clustering.

    S. Jegelka, S. Sra and A. Banerjee. Approximation algorithms for tensor clustering. ALT 2009.

    Statistically consistent clustering

    Most clustering problems correspond to NP-hard optimization problems. Furthermore, even if we could find the optimal solution, the procedure may fail to be statistically consistent. Therefore, we relax computationally hard clustering problems (such as k-means or normalized cut) to formulations that can be solved exactly in polynomial time, are statistically consistent, and converge to the solution of the given objective as the number of sample points grows.

    U. von Luxburg, S. Bubeck, S. Jegelka and M. Kaufmann. Consistent minimization of clustering objective functions. NIPS 2007




    Determinantal Point Processes, Strongly Rayleigh Measures, Diversity and Sampling

    From probabilistic modeling to randomized algorithms, probability distributions over discrete variables play a major role in ML. But computational efficiency remains challenging. As a partial exception, Determinantal Point Processes (DPPs) are elegant probabilistic models of diversity: these probability distributions over subsets prefer sets that are diverse, and, conveniently, many inference computations reduce to linear algebra. DPPs belong to a larger class of distributions that are characterized by specific negative correlations and real stable polynomials: Strongly Rayleigh (SR) measures. These arise in areas from combinatorics to random matrices, and recently gained attention for breakthroughs in graph algorithms and the Kadison-Singer problem.
    In machine learning, they are key, for instance, to modeling repulsion and diversity (from vision to recommender systems), and to compactly approximating data and/or models for faster learning by capturing large parts of the information or variance. We show new results for fast sampling from DPPs, SR and related measures, and applications to matrix approximation and kernel methods. Our results include practical algorithms, theoretical bounds on mixing times, fast lazy evaluation schemes that exploit quadrature, and empirical results that reflect the theory. We also show the first results -- hardness and an algorithm -- for distribution testing of DPPs, i.e., how many samples are needed to decide whether the underlying distribution is a DPP (or a log-submodular or SR measure).
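
    For reference, here is a compact sketch of the standard eigendecomposition-based exact DPP sampler (in the style of Hough et al. / Kulesza-Taskar), not the faster MCMC or quadrature-based methods from the papers above. The RBF kernel over points on a line is an arbitrary toy choice.

```python
import numpy as np

def sample_dpp(L, rng):
    """Exact sampling from the DPP with L-ensemble kernel L (symmetric PSD)."""
    lam, V = np.linalg.eigh(L)
    # Phase 1: keep each eigenvector independently with probability lam / (lam + 1).
    keep = rng.random(len(lam)) < lam / (lam + 1.0)
    V = V[:, keep]
    Y = []
    # Phase 2: sample one item per remaining eigenvector, then project it out.
    while V.shape[1] > 0:
        probs = (V ** 2).sum(axis=1)
        probs /= probs.sum()
        i = rng.choice(len(probs), p=probs)
        Y.append(i)
        j = np.argmax(np.abs(V[i, :]))            # column with nonzero entry at row i
        Vj = V[:, j]
        V = V - np.outer(Vj / Vj[i], V[i, :])      # eliminate the e_i component
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)                 # re-orthonormalize the basis
    return sorted(Y)

# Toy usage: an RBF similarity kernel over points on a line prefers spread-out subsets.
rng = np.random.default_rng(0)
pts = np.linspace(0, 1, 10)
L = np.exp(-(pts[:, None] - pts[None, :]) ** 2 / 0.02)
print(sample_dpp(L, rng))
```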

    K. Gatmiry, M. Aliakbarpour, S. Jegelka. Testing Determinantal Point Processes. NeurIPS 2020. Spotlight
    Z. Mariet, J. Robinson, J. Smith, S. Sra, S. Jegelka. Optimal Batch Variance with Second-Order Marginals. ICML workshop on Real World Experiment Design and Active Learning, 2020.
    J. Robinson, S. Sra, S. Jegelka. Flexible Modeling of Diversity with Strongly Log-Concave Distributions. NeurIPS 2019.
    Z. Mariet, S. Sra, S. Jegelka. Exponentiated Strongly Rayleigh Distributions. NeurIPS 2018.
    C. Li, S. Jegelka, S. Sra. Column Subset Selection via Polynomial Time Dual Volume Sampling. NIPS 2017
    C. Li, S. Sra, S. Jegelka. Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling. NIPS, 2016
    C. Li, S. Sra, S. Jegelka. Gaussian quadrature for matrix inverse forms with applications. ICML, 2016
    C. Li, S. Jegelka, S. Sra. Fast DPP Sampling for Nyström with Application to Kernel Methods. ICML, 2016
    C. Li, S. Jegelka, S. Sra. Efficient Sampling for k-Determinantal Point Processes. AISTATS, 2016.

    Talk: UAI keynote by Stefanie




    Submodularity and submodular optimization in machine learning

    Submodular set functions are characterized by an intuitive diminishing marginal costs property. They have long been important in combinatorics and many other areas, and it turns out that they can help model interesting complex interactions in machine learning problems too. Luckily, their structure admits efficient (approximate) optimization algorithms in many settings.
    We are interested in how we can efficiently use submodularity in machine learning, what new interesting (combinatorial) models are possible, and how we can design better optimization algorithms.
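
    For concreteness, a tiny, purely illustrative sketch of the diminishing-returns form of this property on a canonical submodular function, a coverage function:

```python
def coverage(S, sets):
    """Number of ground elements covered by the chosen sets -- a classic submodular function."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}}
A, B, e = {0}, {0, 1}, 2                     # A is a subset of B, e lies outside B
gain_A = coverage(A | {e}, sets) - coverage(A, sets)
gain_B = coverage(B | {e}, sets) - coverage(B, sets)
assert gain_A >= gain_B                      # diminishing marginal gains
```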

    On faster (approximate/parallel) submodular minimization

    It is known that submodular set functions can be minimized in polynomial time, but many algorithms are not very practical for large real-world data. We address this problem with two very different approaches: (1) We write the minimization of a decomposable submodular function as a "Best Approximation" problem and apply operator splitting methods. The result is an easy, parallelizable and efficient algorithm for decomposable submodular functions that needs no parameter tuning. (2) Graph cuts have been a popular tool for representing set functions (and thereby offer an efficient route to their optimization). Unfortunately, this is not possible for all submodular functions. However, every submodular function can be represented as a cooperative graph cut, and this insight leads to practical approximate algorithms.

    S. Jegelka, F. Bach and S. Sra. Reflection methods for user-friendly submodular optimization . NIPS 2013.
    R. Nishihara, S. Jegelka and M.I. Jordan. On the linear convergence rate of decomposable submodular function minimization. NIPS 2014.
    S. Jegelka, H. Lin and J. Bilmes. On fast approximate submodular minimization. NIPS 2011.

    Minimizing approximately submodular functions

    Not all set functions are submodular, but, for approximately submodular functions, submodular optimization techniques may still work in practice. While this has been studied for submodular maximization, we show the first results for minimization problems, which use very different algorithmic tools. We show an efficient algorithm with approximation guarantees, and corresponding lower bounds.

    M. El Halabi, S. Jegelka. Optimal approximation for unconstrained non-submodular minimization. ICML 2020.

    Cooperative Graph cuts in Computer Vision

    Graph cuts have been widely used as a generic tool in combinatorial optimization. Replacing the common sum of edge weights by a submodular function enhances the representational capabilities of graph cuts. For example, graph cuts have been popular for MAP inference in pairwise Markov Random Fields used for image segmentation. These have some well-known shortcomings: the optimal segmentations tend to short-cut fine object boundaries, in particular when the contrast is low. Cooperative cuts enable us to introduce complex long-range dependencies between variables (high-order potentials), such as incorporating global information about the object boundary, and thereby lead to much better segmentations.
    Cooperative cuts indeed unify and generalize a number of higher-order energy functions that have been used in Computer Vision.

    P. Kohli, A. Osokin and S. Jegelka. A principled deep random field for image segmentation. CVPR 2013
    S. Jegelka and J. Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. CVPR 2011. (see also the description and demo by Evan Shelhamer here)
    S. Jegelka and J. Bilmes. Multi-label Cooperative Cuts. CVPR 2011 Workshop on Inference in Graphical Models with Structured Potentials.

    Probabilistic inference, submodularity and submodular edge weights

    Submodular functions over pairs of variables (edges) provide structure not only for optimization, but also for full approximate probabilistic inference. We provide a framework for efficient inference in such models that exploits both the structure of an underlying graph and the polyhedral structure of submodularity to compute lower and upper bounds on the partition function of high-treewidth probabilistic models, as well as approximate marginal probabilities. In other work, we show how to accelerate sampling in discrete probabilistic models via tools from submodular optimization.

    A. Gotovos, H. Hassani, A. Krause, S. Jegelka. Discrete Sampling using Semigradient-based Product Mixtures. UAI 2018. Oral presentation
    J. Djolonga, S. Jegelka, A. Krause. Provable Variational Inference for Constrained Log-Submodular Models. NeurIPS, 2018.
    J. Djolonga, S. Jegelka, S. Tschiatschek, A. Krause. Cooperative Graphical Models. NIPS, 2016.

    Submodular edge weights in combinatorial problems - theory and algorithms

    Starting with (a generalized) minimum cut, we study combinatorial problems where instead of a sum of edge weights, we have a submodular function on the edges. In their most general form, such problems are usually very hard, with polynomial lower bounds on the approximation factor (as several recent works show). But with some assumptions, efficient algorithms can give very decent results.
    Motivated by good empirical results, we continued to study properties of functions that affect the complexity of submodular problems: the curvature of the function is a general complexity measure for minimization, maximization, approximation and learning.
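
    To indicate what this measure looks like, here is a small sketch of the standard notion of total curvature for a monotone set function; the coverage-style example function is a toy, not taken from the papers.

```python
def total_curvature(f, ground):
    """Total curvature kappa = 1 - min_e [f(V) - f(V - {e})] / f({e})
    of a monotone set function f with f({e}) > 0; kappa = 0 means f is modular."""
    V = set(ground)
    return 1.0 - min((f(V) - f(V - {e})) / f({e}) for e in V)

# Toy coverage function over three sets.
sets = {0: {1, 2}, 1: {2, 3}, 2: {4}}
f = lambda S: len(set().union(*(sets[i] for i in S))) if S else 0
print(total_curvature(f, sets.keys()))       # 0.5 for this example
```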

    D. Alvarez-Melis, T.S. Jaakkola and S. Jegelka. Structured Optimal Transport. AISTATS, 2018.
    S. Jegelka and J. Bilmes. Graph Cuts with Interacting Edge Costs - Examples, Approximations, and Algorithms. Mathematical Programming Ser. A, pp. 1-42, 2016. (arXiv version)
    R. Iyer, S. Jegelka and J. Bilmes. Monotone Closure of Relaxed Constraints in Submodular Optimization: Connections Between Minimization and Maximization. UAI 2014.
    R. Iyer, S. Jegelka and J. Bilmes. Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions. NIPS 2013.
    R. Iyer, S. Jegelka and J. Bilmes. Fast Semidifferential-based Submodular Function Optimization. ICML 2013
    S. Jegelka and J. Bilmes. Approximation bounds for inference using cooperative cut. ICML 2011.

    Sequential combinatorial problems with submodular costs

    Sequential decision problems require repeatedly solving an optimization problem with an unknown, changing cost function. A decision for the current step must be made based on the costs observed at previous steps, but not the current one. Such problems become challenging when the optimization problem is combinatorial and the decision space is therefore exponentially large. We address sequential combinatorial problems and derive the first algorithms that handle nonlinear, submodular cost functions.

    S. Jegelka and J. Bilmes. Online submodular minimization for combinatorial structures. ICML 2011.

    Finding Diverse Subsets in Exponentially-Large Structured Item Sets

    To cope with the high level of ambiguity faced in domains such as Computer Vision or Natural Language Processing, robust prediction methods often search for a diverse set of high-quality candidate solutions or proposals. In structured prediction problems, this becomes a daunting task, as the solution space is exponentially large - e.g., all possible labelings of an image, or all possible parse trees. We study greedy algorithms for finding a diverse subset of solutions in such combinatorial spaces by drawing new connections between submodular functions over combinatorial item sets and the High-Order Potentials (HOPs) studied for graphical models. Specifically, we show via examples that when the marginal gains of submodular diversity functions admit structured representations, this enables efficient (sub-linear time) approximate maximization by reducing the greedy augmentation step to inference in a factor graph with appropriately constructed HOPs.
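
    The starting point of this line of work is the plain greedy algorithm for maximizing a set function under a cardinality constraint, sketched below on a toy coverage-style diversity objective; the papers' contribution, speeding up the greedy augmentation step via factor-graph inference with HOPs, is not shown here.

```python
def greedy_max(f, ground, k):
    """Classic greedy maximization of a set function f under |S| <= k."""
    S = set()
    for _ in range(k):
        candidates = ground - S
        if not candidates:
            break
        gains = {e: f(S | {e}) - f(S) for e in candidates}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        S.add(best)
    return S

# Toy diversity objective: how many "topics" a set of candidate solutions covers.
topics = {"a": {1, 2}, "b": {2, 3}, "c": {4, 5}, "d": {1, 5}}
f = lambda S: len(set().union(*(topics[i] for i in S))) if S else 0
print(greedy_max(f, set(topics), 2))         # e.g. {'b', 'c'}, covering four topics
```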

    A. Prasad, S. Jegelka and D. Batra. Submodular meets Structured: Finding Diverse Subsets in Exponentially-Large Structured Item Sets. NIPS 2014.




    Geometry and comparing probability distributions

    Metrics between probability distributions are of central importance in machine learning, arising in problems from estimation to distributed learning and clustering. In our work, we have studied two promising routes: (1) Hilbert space embeddings of probability distributions, which offer measures for testing independence as well as metrics between distributions; and (2) optimal transport, which takes into account the geometry of the underlying space. In both cases, we have developed new applications and faster algorithms. Hilbert space embeddings allow us to find independent sources within a signal, and to cluster data where each cluster is treated as a distribution and higher-order moments (e.g., variance and beyond) are taken into account (as opposed to, e.g., k-means). Optimal-transport-based Wasserstein barycenters are a base routine for Bayesian inference: they allow merging posteriors estimated on different data subsets while preserving the shapes of the distributions. Our algorithm offers a much faster, streaming method for merging that can even adapt to slowly shifting distributions (e.g., in sensor fusion). We also show a principled initialization for Wasserstein k-means, with an application to climate data analysis.
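
    The first route is illustrated below with a minimal numpy estimator of the (biased) squared Maximum Mean Discrepancy under an RBF kernel; the bandwidth and toy data are arbitrary, and the Wasserstein-barycenter work is not reflected in this sketch.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X (n, d) and Y (m, d)
    under the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
X2 = rng.normal(0.0, 1.0, size=(200, 2))    # same distribution as X
Y = rng.normal(0.5, 1.0, size=(200, 2))     # shifted distribution
print(mmd2(X, X2), mmd2(X, Y))              # near zero vs. clearly positive
```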

    M. Staib, S. Jegelka. Distributionally Robust Optimization and Generalization in Kernel Methods. NeurIPS 2019.
    C. Bunne, D. Alvarez Melis, A. Krause, S. Jegelka. Learning Generative Models across Incomparable Spaces. ICML 2019.
    D. Alvarez-Melis, S. Jegelka and T. Jaakkola. Towards Optimal Transport with Global Invariances. AISTATS 2019.
    D. Alvarez-Melis, T.S. Jaakkola and S. Jegelka. Structured Optimal Transport. AISTATS, 2018.
    M. Staib and S. Jegelka. Distributionally Robust Deep Learning as a Generalization of Adversarial Training. NIPS Machine Learning and Computer Security Workshop, 2017.
    M. Staib and S. Jegelka. Wasserstein k-means++ for Cloud Regime Histogram Clustering. Climate Informatics, 2017.
    M. Staib, S. Claici, J. Solomon, S. Jegelka. Parallel Streaming Wasserstein Barycenters, 2017.
    H. Shen, S. Jegelka and A. Gretton. Fast Kernel-based Independent Component Analysis. IEEE Transactions on Signal Processing 57(9), pp. 3498-3511, 2009.
    S. Jegelka, A. Gretton, B. Schoelkopf, B.K. Sriperumbudur and U. von Luxburg. Generalized Clustering via Kernel Embeddings. KI 2009: Advances in Artificial Intelligence, 2009.




    Kernel independent component analysis

    In Independent Component Analysis (ICA), we observe a linear mixture of signals from independent source distributions and aim to recover the unknown sources. Kernel dependence measures have proved particularly useful for ICA. However, optimizing such a kernel criterion over the special orthogonal group is a difficult optimization problem, and can quickly become inefficient as the kernel matrices grow large. We therefore derive an approximate Newton method that handles these problems more efficiently. Empirically, the method compares favorably to state-of-the-art ICA methods.
    In earlier work, we explored the effectiveness of different factorizations to approximate large kernel matrices.

    H. Shen, S. Jegelka and A. Gretton. Fast Kernel-based Independent Component Analysis. IEEE Transactions on Signal Processing 57(9), pp. 3498-3511, 2009.
    H. Shen, S. Jegelka and A. Gretton. Fast Kernel ICA using an Approximate Newton Method. AISTATS, 2007.
    S. Jegelka and A. Gretton. Brisk Kernel Independent Component Analysis. In L. Bottou, O. Chapelle, D. DeCoste, J. Weston, editors. Large Scale Kernel Machines, pp. 225-250. MIT Press, 2007.




    Concurrency control for machine learning


    Many machine learning algorithms iteratively transform some global state (e.g., model parameters or variable assignments), giving the illusion of serial dependencies between operations. However, due to sparsity, exchangeability, and other symmetries, it is often the case that many, but not all, of the state-transforming operations can be computed concurrently while still preserving serializability: the equivalence to some serial execution in which individual operations have been reordered.
    This opportunity for serializable concurrency forms the foundation of distributed database systems. In this project, we implement updates in machine learning algorithms as concurrent transactions in a distributed database. As a result, we achieve high scalability while maintaining the semantics and theoretical properties of the original serial algorithm.
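
    A toy, shared-memory analogue of the idea (the papers work with transactions in a distributed system): each worker computes its update from a snapshot, and the commit is validated against the version it read, retrying when another worker committed first.

```python
import threading

class OptimisticState:
    """Minimal optimistic concurrency control around a single value."""
    def __init__(self, value=0):
        self.value, self.version = value, 0
        self._lock = threading.Lock()

    def snapshot(self):
        with self._lock:
            return self.value, self.version

    def commit(self, new_value, read_version):
        with self._lock:
            if self.version != read_version:      # conflict: someone committed first
                return False
            self.value, self.version = new_value, self.version + 1
            return True

def worker(state, n_updates):
    for _ in range(n_updates):
        while True:
            v, ver = state.snapshot()             # optimistic: compute without holding locks
            if state.commit(v + 1, ver):          # validate, then commit or retry
                break

state = OptimisticState()
threads = [threading.Thread(target=worker, args=(state, 1000)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(state.value)                                # 4000: equivalent to a serial execution
```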

    X. Pan, S. Jegelka, J. Gonzalez, J. Bradley and M.I. Jordan. Parallel Double Greedy Submodular Maximization. NIPS 2014.
    X. Pan, J. Gonzalez, S. Jegelka, T. Broderick and M.I. Jordan. Optimistic Concurrency Control for Distributed Unsupervised Learning. NIPS 2013




    2. Learning with limited supervision


    Self-supervised learning: theory and algorithms

    A major challenge for deep learning is the scarcity of high-quality labels. Common approaches therefore pre-train a network on a large pretext task and transfer the learned representations to the actual target task of interest. Theoretically, we quantify how pre-training can accelerate learning for the target task, up to a fast rate of O(1/n) in the number n of data points. Recently, self-supervised learning, where the pretext task is generated automatically, has enabled central results in NLP and vision. A widely successful self-supervision method, contrastive learning, aims to map semantically similar data points close together and push dissimilar data apart. We address important shortcomings in the treatment of dissimilar pairs, which are usually drawn uniformly, ignoring semantics, and thereby impairing performance. First, by drawing connections to positive-unlabeled learning, we identify an easy-to-implement, unsupervised correction for this common practice. Second, we devise techniques for focusing on informative pairs and, third, for automatically discouraging shortcuts, i.e., the risky tendency of neural networks to rely on simple, sometimes spurious correlations.
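
    For context, the vanilla contrastive (InfoNCE) objective that these works build on is sketched below; the debiased and hard-negative variants above change how the negative term is formed, which this sketch does not implement. All embeddings and the temperature are illustrative.

```python
import numpy as np

def info_nce(anchor, positive, negatives, t=0.5):
    """Vanilla InfoNCE loss for one anchor with L2-normalized embeddings:
    -log exp(a.p / t) / (exp(a.p / t) + sum_j exp(a.n_j / t))."""
    pos = np.exp(anchor @ positive / t)
    neg = np.exp(negatives @ anchor / t).sum()
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
a = unit(rng.normal(size=16))                     # anchor embedding
p = unit(a + 0.1 * rng.normal(size=16))           # augmented / semantically similar view
negs = np.stack([unit(rng.normal(size=16)) for _ in range(8)])  # "uniformly" drawn negatives
print(info_nce(a, p, negs))
```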

    J. Robinson, L. Sun, K. Yu, K. Batmanghelich, S. Jegelka, S. Sra. Can contrastive learning avoid shortcut solutions? ICML workshop on Self-Supervised Learning for Reasoning and Perception, 2021.
    J. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka. Contrastive Learning with Hard Negative Samples. ICLR 2021.
    C.-Y. Chuang, J. Robinson, L. Yen-Chen, A. Torralba, S. Jegelka. Debiased Contrastive Learning. NeurIPS 2020. Spotlight
    J. Robinson, S. Jegelka, S. Sra. Strength from Weakness: Fast Learning Using Weak Supervision. ICML 2020.

    Deep metric learning

    H. Song, S. Jegelka, V. Rathod and K. Murphy. Deep Metric Learning via Facility Location. International Conference on Computer Vision and Pattern Recognition (CVPR), 2017. Spotlight
    H. Song, Y. Xiang, S. Jegelka and S. Savarese. Deep Metric Learning via Lifted Structured Feature Embedding. International Conference on Computer Vision and Pattern Recognition (CVPR), 2016. Spotlight




    Weakly supervised object detection


    Learning to localize objects in images is a fundamental problem in computer vision. For this problem (as for many others), we are increasingly faced with the issue that accurately labeled training data is expensive and hence scarce. Therefore, we desire algorithms that are robust to weak labelings, i.e., image-level labels of the form "the object is present" (instead of object locations). We address this problem via a combination of combinatorial and convex optimization: a discriminative submodular cover problem and a smoothed SVM formulation.

    H. Song, Y.J. Lee, S. Jegelka and T. Darrell. Weakly-supervised Discovery of Visual Pattern Configurations. NIPS 2014.
    H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui and T. Darrell. On learning to localize objects with minimal supervision. ICML 2014




    Efficiently modeling uncertainty and Bayesian Black-Box Optimization

    When important decisions are based on a machine learning method, it can be extremely beneficial to obtain uncertainty estimates along with the prediction. Uncertainties can also help judge where more observations are needed, and are exploited in Bayesian black-box optimization, the optimization of an unknown function via queries. Bayesian optimization has applications from robotics to parameter tuning and experiment design. Yet, common methods based on Gaussian Processes are computationally very expensive in high dimensions and when a lot of data is needed. We substantially improve the computational and sample complexity, especially in high dimensions, without losing performance, by using a different estimate of information and by learning adaptive distributions over partitions along multiple axes. Theoretically, we provide the first regret bounds for an instance of the popular family of "entropy search" criteria.
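
    For orientation, here is a bare-bones Bayesian optimization loop with a GP surrogate and a classic UCB acquisition, maximized over a random 1-D candidate set using scikit-learn; the entropy-search and partition-based methods in the papers above replace the acquisition and the model, which this sketch does not attempt.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, bounds, n_init=5, n_iter=20, beta=2.0, seed=0):
    """Minimal BO loop: fit a GP to the evaluations so far, then pick the
    candidate maximizing the upper confidence bound mu + beta * sigma."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(lo, hi, size=(256, 1))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + beta * sigma)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    best = np.argmax(y)
    return X[best, 0], y[best]

x_best, y_best = bayes_opt(lambda x: np.sin(3 * x) - (x - 1.0) ** 2, (0.0, 2.0))
print(x_best, y_best)
```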

    J. Kirschner, I. Bogunovic, S. Jegelka, A. Krause. Distributionally Robust Bayesian Optimization. AISTATS 2020.
    I. Bogunovic, J. Scarlett, S. Jegelka, V. Cevher. Adversarially Robust Optimization with Gaussian Processes. NeurIPS 2018. Spotlight
    Z. Wang, C. Gehring, P. Kohli, S. Jegelka. Batched Large-scale Bayesian Optimization in High-dimensional Spaces. AISTATS 2018.
    Z. Wang, S. Jegelka. Max-value Entropy Search for Efficient Bayesian Optimization. ICML, 2017.
    Z. Wang, C. Li, S. Jegelka, P. Kohli. Batched High-dimensional Bayesian Optimization via Structural Kernel Learning. ICML, 2017.
    Z. Wang, S. Jegelka, L. P. Kaelbling, T. Lozano-Perez. Focused Model-Learning and Planning for Non-Gaussian Continuous State-Action Systems. ICRA, 2017.
    Z. Wang, B. Zhou, S. Jegelka. Optimization as Estimation with Gaussian Processes in Bandit Settings. AISTATS 2016. (code)




    3. Understanding robustness


    Non-convex robust optimization

    Robustness to perturbations and distribution shifts is a prevalent challenge in AI safety. The framework of Robust Optimization allows for uncertainty in the objective function or constraints – in ML, this often means performing well even under data perturbations. The resulting reformulation into an adversarial min-max problem, however, can lead to much harder optimization problems. For example, robust versions of bidding, budget allocation and bipartite influence maximization are nonconvex saddle point problems. Yet, we obtained an optimization algorithm with theoretical guarantees. Perhaps surprisingly, it solves nonconvex optimization via ideas from discrete optimization! To do so, we leveraged a tool that was then practically unused in machine learning: the theory of continuous submodularity.
    In other work, we initiate the study of adversarial Bayesian optimization, i.e., a robust version of the widely used black-box optimization framework. Our algorithm combines upper and lower confidence bounds with robust optimization, and enables robustness under data perturbations and various other robustness settings. In certain cases, its regret bounds are nearly optimal. Our other works address distributionally robust submodular maximization, distributionally robust Bayesian Optimization, and stochastic optimization of the CVaR with connections to negative dependence.
    Robust learning is closely tied to generalization, but its concrete implications depend on how the uncertainty in the problem is defined. In this context, we reveal connections between using MMD, a popular distance between probability distributions, and classic results in statistical learning, including a new generalization proof for kernel ridge regression.
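
    A minimal sketch of the min-max template in its simplest form: l_inf-bounded perturbations of the inputs of a linear regression, trained by alternating inner gradient ascent and outer gradient descent. This is a generic illustration under toy assumptions, not the continuous-submodular or GP-based algorithms from the papers above.

```python
import numpy as np

def robust_fit(X, y, eps=0.1, lr=0.05, inner_steps=10, outer_steps=200, seed=0):
    """Toy min-max training of a linear model: the inner loop runs projected
    gradient ascent over an l_inf-bounded perturbation delta of the inputs,
    the outer loop runs gradient descent on the worst-case squared loss."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(outer_steps):
        delta = np.zeros_like(X)
        for _ in range(inner_steps):                  # maximize the loss over delta
            resid = (X + delta) @ w - y
            delta = np.clip(delta + lr * resid[:, None] * w[None, :], -eps, eps)
        resid = (X + delta) @ w - y                   # minimize the worst-case loss over w
        w -= lr * (X + delta).T @ resid / len(y)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(robust_fit(X, y))                               # close to the true coefficients (slightly shrunk)
```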

    S. Curi, K.Y. Levy, S. Jegelka, A. Krause. Adaptive Sampling for Stochastic Risk-Averse Learning. NeurIPS 2020.
    J. Kirschner, I. Bogunovic, S. Jegelka, A. Krause. Distributionally Robust Bayesian Optimization. AISTATS 2020.
    M. Staib, S. Jegelka. Distributionally Robust Optimization and Generalization in Kernel Methods. NeurIPS 2019.
    M. Staib, B. Wilder and S. Jegelka. Distributionally Robust Submodular Maximization. AISTATS 2019.
    I. Bogunovic, J. Scarlett, S. Jegelka, V. Cevher. Adversarially Robust Optimization with Gaussian Processes. NeurIPS 2018. Spotlight
    M. Staib and S. Jegelka. Distributionally Robust Deep Learning as a Generalization of Adversarial Training. NIPS Machine Learning and Computer Security Workshop, 2017.
    M. Staib, S. Jegelka. Robust Budget Allocation via Continuous Submodular Functions. ICML, 2017.




    Learning and distribution shifts

    While standard generalization in ML assumes that the training data distribution is the same as the data distribution where the model is deployed, this is often not the case in practice. Hence, the behavior of ML methods under distribution shifts, and their extrapolation beyond the training data, are important to understand. For graph neural networks, we analyze their behavior outside the training distribution, and show how the computational structure of the task and the nonlinearities in the architecture determine extrapolation behavior to larger graphs, or graphs with different degree distributions. Indeed, one may expect a GNN that has "learned" an algorithm on small graphs to perform well on larger graphs, too. But this is not always true. Our results apply especially to GNNs for learning combinatorial optimization algorithms, and to neural symbolic reasoning, inferring physical relations in cosmology, and learning simulations.
    In other work, we devise a generic method for estimating the performance of a given, trained classifier on a new unlabeled dataset. Our method outperforms the state of the art and relies on proxy models that use domain adaptation, together with an understanding of the generalization behavior of these models.
    Our work on self-supervised pretraining (above) is also related.

    K. Xu, M. Zhang, J. Li, S. S. Du, K. Kawarabayashi, S. Jegelka. How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks. ICLR 2021. Oral Presentation.
    C.-Y. Chuang, A. Torralba, S. Jegelka. Estimating Generalization under Distribution Shifts via Domain-Invariant Representations. ICML, 2020.
