Statistics Seminar

Generative Data Mining with Longtail-Guided Diffusion

It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model.

Bio

David Hayden leads Perception AI Research at Cruise, where he focuses on generative and world models, foundation model alignment and guidance, longtail robustness, uncertainty quantification, and synthetic data. He has consulted on machine learning and computer vision for diverse industries including pharmaceuticals, retail, and competitive sports. His work has shipped to hundreds of driverless cars, run live in stadiums of 40,000 people, supported seed and Series A rounds, and been published in top conferences and journals including ICML, CVPR, NeurIPS, and Nature. He previously founded Essistive Technologies, where he developed and licensed discreet note-taking tech for individuals with limited vision. David received a PhD at MIT working on interpretable machine learning and computer vision, with emphasis on behavior analysis, multi-object tracking, Bayesian nonparametrics for time-series, distributions on manifolds, and uncertainty to guide decision making.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

A Computational Theory for Black-Box Variational Inference

Variational inference with stochastic gradients, commonly called black-box variational inference (BBVI) or stochastic gradient variational inference, is the workhorse of probabilistic inference in the large data, large model regime. For a decade, however, the computational properties of VI have largely been unknown. For instance, under what conditions is BBVI guaranteed to converge, and is it provably efficient? In this talk, I will present recent theoretical results on VI in the form of quantitative non-asymptotic convergence guarantees for obtaining a variational posterior. Following this, I will demonstrate the usefulness of the theoretical framework by investigating the theoretical properties of various design choices and algorithmic modifications, such as parametrizations of the variational approximation, variance-reduced gradient estimators such as sticking-the-landing, structured variational families, and beyond.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

First passage time distributions for jump-diffusion processes and flexible boundaries

To join this seminar virtually: Please request Zoom connection details from ea@stat.ubc.ca

Abstract: The first passage time (FPT) is a useful tool in stochastic modeling of many biological, physical, social and economic processes evolving with time. It refers to the time when a random process first passes a threshold, e.g., when the population of an endangered species reaches a certain critical level, or when the number of infected individuals with a disease reaches a limit. Other examples include the survival time of a cancer patient, the failure time of a mechanical system, and the default time of a business.

We study the boundary crossing problem for jump-diffusion processes over a discontinuous boundary and provide a complete characterization of the FPT distributions. We derive new formulas for piecewise linear boundary crossing probabilities and densities of Brownian motion with general random jumps. These formulas can be used to approximate the boundary crossing distributions for general nonlinear boundaries. The method can be extended to more general diffusion processes such as geometric Brownian motion and Ornstein-Uhlenbeck processes with jumps. The numerical computation can be done by Monte Carlo integration, which is straightforward and easy to implement. Some numerical examples are presented for illustration.
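To make the Monte Carlo idea concrete, here is a minimal sketch for a much simpler setting than the one in the abstract: standard Brownian motion without jumps and a continuous piecewise linear boundary. It is not the speaker's formula. It relies on the standard Brownian bridge fact that, given the path values at two consecutive nodes (both below the boundary), the probability of crossing a linear segment in between is exp(-2 (c0 - x0)(c1 - x1) / dt). The function names, parameters, and the closed-form check for a single straight line are assumptions introduced for illustration.

```python
import math
import random


def noncrossing_prob_mc(times, boundary, n_samples=20000, seed=0):
    """Estimate P(W_t < c(t) for all t in [0, T]) by Monte Carlo integration.

    times:    increasing nodes 0 = t_0 < t_1 < ... < t_n = T
    boundary: values c(t_0), ..., c(t_n) of a continuous piecewise linear
              boundary, with c(t_0) > 0 (illustrative assumption)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_prev = 0.0   # W_0 = 0
        weight = 1.0   # product of per-segment non-crossing probabilities
        for i in range(1, len(times)):
            dt = times[i] - times[i - 1]
            # Sample the Brownian motion at the next node.
            x = x_prev + rng.gauss(0.0, math.sqrt(dt))
            if x >= boundary[i]:
                weight = 0.0  # already above the boundary at a node
                break
            # Brownian bridge non-crossing probability on this linear segment.
            gap = (boundary[i - 1] - x_prev) * (boundary[i] - x)
            weight *= 1.0 - math.exp(-2.0 * gap / dt)
            x_prev = x
        total += weight
    return total / n_samples


def linear_crossing_prob(a, b, T):
    """Closed-form P(W_t >= a + b*t for some t <= T) for a > 0."""
    def phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    s = math.sqrt(T)
    return 1.0 - phi((a + b * T) / s) + math.exp(-2.0 * a * b) * phi((b * T - a) / s)


if __name__ == "__main__":
    # A straight line a + b*t split into two segments, so the estimate can be
    # compared against the closed form.
    est = noncrossing_prob_mc([0.0, 0.5, 1.0], [1.0, 1.25, 1.5])
    print(est, 1.0 - linear_crossing_prob(1.0, 0.5, 1.0))
```

Averaging the product of segment probabilities, rather than simulating fine-grained paths, avoids discretization bias between nodes. Handling jumps or a discontinuous boundary would need the additional formulas described in the talk.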

Ensembles in the Age of Overparameterization: Promises and Pathologies

To join this seminar virtually: please click here.

Abstract: Ensemble methods have historically used either high-bias base learners (e.g. through boosting) or high-variance base learners (e.g. through bagging). Modern neural networks cannot be understood through this classic bias-variance tradeoff, yet "deep ensembles" are pervasive in safety-critical and high-uncertainty application domains. This talk will cover surprising and counterintuitive phenomena that emerge when ensembling overparameterized base models like neural networks. While deep ensembles improve generalization in a simple and cost-effective manner, their accuracy and robustness are often outperformed by single (but larger) models. Furthermore, discouraging diversity amongst component models often improves the ensemble's predictive performance, counter to classic intuitions underpinning bagging and feature subsetting techniques. I will connect these empirical findings with new theoretical characterizations of overparameterized ensembles, and I will conclude with implications for uncertainty quantification, robustness, and decision making.

Causal Inference with Cocycles

To join this seminar virtually: Please request Zoom connection details from ea@stat.ubc.ca.

Abstract: Many interventions in causal inference can be represented as transformations of the variables of interest. Abstracting interventions in this way allows us to identify a local symmetry property exhibited by many causal models under interventions. Where present, this symmetry can be characterized by a type of map called a cocycle, an object that is central to dynamical systems theory. We show that such cocycles exist under general conditions and are sufficient to identify interventional distributions and, under suitable assumptions, counterfactual distributions. We use these results to derive cocycle-based estimators for causal estimands and show that they achieve semiparametric efficiency under standard conditions. Since entire families of distributions can share the same cocycle, these estimators can make causal inference robust to mis-specification by sidestepping superfluous modelling assumptions. We demonstrate both robustness and state-of-the-art performance in several simulations, and apply our method to estimate the effects of 401(k) pension plan eligibility on asset accumulation using a real dataset.

Joint work with Hugh Dance (UCL/Gatsby Unit): https://arxiv.org/abs/2405.13844

Online Kernel-Based Mode Learning

To join this seminar virtually: Please request Zoom connection details from ea@stat.ubc.ca.

Abstract: The presence of big data, characterized by exceptionally large sample sizes, often brings the challenge of outliers and data distributions that exhibit heavy tails. An online estimation method that is resistant to outliers and does not rely on historical data is therefore urgently required to achieve robust and efficient estimators. In this talk, we introduce an innovative online learning approach based on a mode kernel-based objective function, specifically designed to address outliers and heavy-tailed distributions in the context of big data. The developed approach leverages mode regression within an online learning framework that operates on data subsets, which enables the continuous updating of historical data using pertinent information extracted from a new data subset. We demonstrate that the resulting estimator is asymptotically equivalent to the mode estimator calculated using the entire dataset. Monte Carlo simulations and an empirical study are presented to illustrate the finite sample performance of the proposed estimator.

van Eeden seminar: The four pillars of machine learning

Registration

To join this seminar, please register via Zoom. Once your registration is approved, you'll receive an email with details on how to join the meeting.

If you have any questions about your registration or the seminar, please contact headsec@stat.ubc.ca.

Title

The four pillars of machine learning

Abstract

I will present a unified perspective on the field of machine learning research, following the structure of my recent book, "Probabilistic Machine Learning: Advanced Topics" (https://probml.github.io/book2). In particular, I will discuss various models and algorithms for tackling the following four key tasks, which I call the "pillars of ML": prediction, control, discovery and generation. For each of these tasks, I will also briefly summarize a few of my own contributions, including methods for robust prediction under distribution shift, statistically efficient online decision making, discovering hidden regimes in high-dimensional time series data, and generating high-resolution images.

van Eeden speakers

Dr. Kevin Patrick Murphy has been invited by our department's graduate students to be this year's van Eeden speaker. A van Eeden speaker is a prominent statistician who is chosen by our graduate students each year to give a lecture, supported by the Constance van Eeden Fund.

This seminar is sponsored by Canadian Statistical Sciences Institute (CANSSI).

Meta-Analytic Inference for the COVID-19 Infection Fatality Rate

To join this seminar via Zoom, please request Zoom connection details from pims@uvic.ca.

Title: Meta-Analytic Inference for the COVID-19 Infection Fatality Rate

Abstract: Estimating the COVID-19 infection fatality rate (IFR) has proven to be challenging, since data on deaths and data on the number of infections are subject to various biases. I will describe some joint work with Harlan Campbell and others on both methodological and applied aspects of meeting this challenge, in a meta-analytic framework of combining data from different populations. I will start with the easier case when the infection data are obtained via random sampling. Then I will discuss drawing in additional infection data obtained in a decidedly non-random manner.

Adjusting for Bias Induced by Informative Dose Selection Procedures

Many fields, such as acute toxicity studies, Phase I cancer trials, sensory studies and psychometric testing, use informative dose allocation procedures. In this talk, we explain how such adaptive designs induce bias, and in the context of dose-finding designs we show how to modify frequency data to adjust for this bias.

To provide context, we start the talk with a general discussion of issues in inference following adaptive designs. Then, we assume a binary response Y has a monotone positive response probability to a stimulus or treatment X, and we consider designs that sequentially select X values for new subjects in a way that concentrates treatments in a certain region of interest under the dose-response curve. We discuss how data analysis at the end of a study is affected by choosing the stimulus value for each subject sequentially according to some informative sampling rule.

Without loss of generality, we call a positive response a toxicity and the stimulus a dose. For simplicity, we restrict this talk to the case of a univariate treatment X and binary Y, and further assume that treatments are limited to a finite set {d1, d2, ..., dM} of M values we call doses. Now suppose n subjects receive treatments that were sequentially selected (according to some rule using data from prior subjects) from the restricted set of M doses. Let Nm and Tm denote the number of subjects receiving treatment dm and the number of toxicities observed on treatment dm, respectively. Define Fm = P{Y = 1 | X = dm} = E[Y | X = dm].

Then it is often said that the distribution of Tm given Nm is Binomial with parameters (Fm, Nm). But taking Nm as fixed is not the same as conditioning on this random variable, and conditioning on informative dose assignments is not the same as conditioning on summary dose frequencies. Indeed, it is easy to show that the observed dose-specific toxicity rate, Tm/Nm, is biased for Fm. From first principles, we obtain

E[Tm / Nm] = Fm - Cov[Tm/Nm, Nm] / E[Nm]

The observed toxicity rate is biased for Fm because adaptive allocations, by design, induce a correlation between toxicity rates and allocation frequencies.
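The identity above can be recovered in two short steps. The following is a sketch rather than the speaker's derivation. It assumes that Nm > 0 (or the convention Tm/Nm = 0 when Nm = 0) and that, given the assigned dose, each response does not depend further on the past. The subject index i, the responses Y_i, the assigned doses X_i, and the indicator notation are introduced here for illustration.

```latex
\begin{align*}
T_m &= \sum_{i=1}^{n} Y_i \,\mathbf{1}\{X_i = d_m\}, \qquad
N_m = \sum_{i=1}^{n} \mathbf{1}\{X_i = d_m\}, \\
\mathbb{E}[T_m]
  &= \sum_{i=1}^{n} \mathbb{E}\bigl[\mathbf{1}\{X_i = d_m\}\,
     \mathbb{E}[Y_i \mid X_i, \text{past}]\bigr]
   = F_m\,\mathbb{E}[N_m], \\
\mathbb{E}[T_m]
  &= \mathbb{E}\!\left[\frac{T_m}{N_m}\,N_m\right]
   = \mathbb{E}\!\left[\frac{T_m}{N_m}\right]\mathbb{E}[N_m]
     + \operatorname{Cov}\!\left[\frac{T_m}{N_m},\,N_m\right].
\end{align*}
```

Equating the two expressions for E[Tm] and dividing by E[Nm] gives E[Tm/Nm] = Fm - Cov[Tm/Nm, Nm] / E[Nm], so the bias vanishes exactly when the observed rate and the allocation frequency are uncorrelated.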

This bias impacts inference procedures: Isotonic regression methods use dose-specific toxicity rates directly. Standard likelihood-based methods mask the bias by providing first-order linear approximations. We illustrate these biases using isotonic and likelihood-based regression methods in some well known (small sample size) adaptive methods including selected up-and-down designs, interval designs, and the continual reassessment method. Then we propose a bias adjustment inspired by Firth (1993).

[Nancy Flournoy; University of Missouri – http://web.missouri.edu/flournoyn/]

[flournoyn@missouri.edu – https://en.wikipedia.org/wiki/Nancy_Flournoy]

Methods for Preferential Sampling in Geostatistics

Preferential sampling in geostatistics refers to the instance in which the process that determines the sampling locations may depend on the spatial process that is being modelled. If ignored, this dependency can result in biased parameter estimates and may affect the resulting spatial prediction. Recent research on correcting for preferential sampling bias has been limited to stationary sampling locations, such as air-quality monitoring sites. We propose a flexible framework for inference on preferentially sampled fields, which can be used to expand preferential sampling methodology to the case in which the preferentially sampled locations are obtained from a process moving in space and time. An example of such data, the preferential sampling of ocean temperature by tagged marine mammals, is presented.