Statistics Seminar

Making a Training Dataset from Multiple Data Distributions

Over time we might accumulate large amounts of data from several different populations: e.g., the spread of a virus across different countries. Yet what we wish to model is often not any one of these populations. One might want a model of the virus's spread that is robust across countries, or that is predictive for a new location with only limited data. We overview and formalize the objectives these goals present for mixing different distributions into a training dataset, objectives that have historically been hard to optimize. We show that, by assuming models are trained to near-"optimal" on the training distribution, these objectives simplify to convex ones, and we provide methods to optimize these reduced objectives. Experimental results show improvements across language modeling, bio-assay, and census data tasks.
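As a toy illustration of the kind of reduced convex problem involved (the setup and numbers below are our assumptions, not the speaker's method), choosing mixture weights to minimize the worst-case group loss becomes a small linear program once each group's loss is approximated as linear in the weights:

    import numpy as np
    from scipy.optimize import linprog

    # Toy estimates: losses[k, j] = loss of group k when training only on source j.
    losses = np.array([[0.2, 0.9, 0.7],
                       [0.8, 0.3, 0.6],
                       [0.5, 0.7, 0.2]])
    K = losses.shape[1]
    # Variables (w_1..w_K, t): minimize t subject to losses @ w <= t, sum(w) = 1, w >= 0.
    c = np.r_[np.zeros(K), 1.0]
    A_ub = np.c_[losses, -np.ones(losses.shape[0])]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(losses.shape[0]),
                  A_eq=[np.r_[np.ones(K), 0.0]], b_eq=[1.0],
                  bounds=[(0, None)] * K + [(None, None)])
    weights = res.x[:K]   # robust mixing proportions over the K sources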

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Bayesian Modeling for Functional Neuroimaging Data

Functional neuroimaging data, such as electroencephalography (EEG) and functional magnetic resonance imaging (fMRI), often exhibit rich temporal, spatial, and spectral structure, posing unique challenges for statistical modeling. This talk presents Bayesian modeling approaches for functional neuroimaging data, focusing on time-frequency representations of EEG signals from multi-condition experiments. In such experiments, brain activity is recorded as subjects engage in various tasks or are exposed to different stimuli. The resulting data often exhibit smooth variation across time and frequency and can be naturally represented as two-way functional data, with conditions nested within subjects. To jointly account for the data’s multilevel structure, functional nature, and subject-level covariates, we propose a Bayesian mixed-effects model incorporating covariate-dependent fixed effects and multilevel random effects. For interpretability and parsimony, we introduce a novel decomposition of the fixed effects with marginally interpretable time and frequency patterns, along with a sparsity-inducing prior for rank selection. The proposed method is evaluated through extensive simulations and applied to EEG data collected to investigate the effects of alcoholism on cognitive processing in response to visual stimuli.  Extensions to modeling dynamic functional connectivity and other Bayesian methods developed for fMRI data will also be discussed. 
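As a rough schematic of the model class (notation ours, not necessarily the speaker's): for subject i and condition j, with time-frequency surface Y_{ij}(t, \omega) and covariates x_{ij},

    Y_{ij}(t, \omega) = \sum_{r=1}^{R} \big( x_{ij}^\top \beta_r \big)\, \phi_r(t)\, \psi_r(\omega) + u_i(t, \omega) + v_{ij}(t, \omega) + \varepsilon_{ij}(t, \omega),

where the rank-R sum encodes covariate-dependent fixed effects with marginally interpretable time patterns \phi_r and frequency patterns \psi_r (the sparsity-inducing prior selects the effective rank R), and u_i and v_{ij} are subject-level and condition-within-subject random effects.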

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

A Practical Introduction to LLMs, Chatbots, and Dashboards

LLMs have a lot of hype around them these days. Let’s demystify how they work and see how we can put them in context for data science use. As data scientists, we want to make sure our results are inspectable, reliable, reproducible, and replicable. We already have many tools to help us on this front. However, LLMs pose a new challenge: we may not always get the same results back from a query. This means identifying the areas where LLMs excel and using those behaviours in our data science artifacts. This talk will introduce you to LLMs and to the ellmer and chatlas packages for R and Python, and show how they can be integrated into a Shiny app to create an AI-powered dashboard. We’ll see how we can leverage the tasks LLMs are good at to better our data science products.
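As a flavour of the tooling (a minimal sketch assuming an OpenAI API key in the environment; the model name and prompts are illustrative, not from the talk), querying an LLM from Python with chatlas looks like:

    from chatlas import ChatOpenAI

    chat = ChatOpenAI(
        model="gpt-4o-mini",  # illustrative model choice
        system_prompt="You are a concise assistant embedded in a data dashboard.",
    )
    chat.chat("Summarize the key trend in this quarter's sales data.")

ellmer provides the analogous interface in R, and a chat object like this can back a chat component inside a Shiny app to drive an AI-powered dashboard.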

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Copula-based Non-Gaussian Time Series Models

There are many non-Gaussian time series models available in the literature. Copula-based time series models are particularly relevant as they can handle serial tail dependence or the clustering of extreme observations. To date, mainly copula-based Markov time series models that extend the autoregressive time series model have been studied and applied. In this talk, I will consider non-Markovian copula-based time series models that can be viewed as an extension of Gaussian autoregressive moving average (ARMA) models. I derive distributional properties and discuss conditions for stationarity, as well as the asymptotic properties of the maximum-likelihood estimators. Finally, the probabilistic forecasting performance is evaluated.
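As a concrete special case (a minimal simulation sketch, not from the talk), the Markov, AR(1)-type member of this family with a Gaussian copula and Student-t margins can be simulated as follows; the non-Markovian ARMA-type extensions discussed in the talk generalize the latent dependence structure:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, rho, df = 1000, 0.7, 4        # series length, copula autocorrelation, t margins

    z = np.empty(n)                  # latent Gaussian AR(1) carrying the serial dependence
    z[0] = rng.standard_normal()
    for t in range(1, n):
        z[t] = rho * z[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

    u = stats.norm.cdf(z)            # uniform scores (the copula layer)
    x = stats.t.ppf(u, df)           # heavy-tailed Student-t margins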

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Generative Data Mining with Longtail-Guided Diffusion

It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model.
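As a rough sketch of the flavour of these ingredients (all function names and the guidance rule below are illustrative assumptions, not the authors' implementation), a frozen classifier can supply a single-forward-pass rarity signal whose gradient steers the sampler:

    import torch

    def longtail_signal(classifier, x, temperature=1.0):
        """Higher = rarer/harder to the frozen classifier; one forward pass."""
        logits = classifier(x)
        return -temperature * torch.logsumexp(logits / temperature, dim=-1)

    def guided_step(denoiser, classifier, x_t, t, scale=1.0):
        """One denoising step nudged toward high-signal (longtail) inputs.
        The classifier only ever sees the predicted clean image, never x_t."""
        x_t = x_t.detach().requires_grad_(True)
        x0_hat = denoiser(x_t, t)                       # predicted clean image
        g = torch.autograd.grad(longtail_signal(classifier, x0_hat).sum(), x_t)[0]
        return (x0_hat + scale * g).detach()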

Bio

David Hayden leads Perception AI Research at Cruise, where he focuses on generative and world models, foundation model alignment and guidance, longtail robustness, uncertainty quantification, and synthetic data. He has consulted on machine learning and computer vision for diverse industries including pharmaceuticals, retail, and competitive sports. His work has shipped to hundreds of driverless cars, run live in stadiums of 40,000 people, supported seed and Series A rounds, and been published in top conferences and journals including ICML, CVPR, NeurIPS, and Nature. He previously founded Essistive Technologies, where he developed and licensed discreet note-taking technology for individuals with limited vision. David received a PhD from MIT working on interpretable machine learning and computer vision, with emphasis on behavior analysis, multi-object tracking, Bayesian nonparametrics for time series, distributions on manifolds, and uncertainty to guide decision making.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

A Computational Theory for Black-Box Variational Inference

Variational inference with stochastic gradients, commonly called black-box variational inference (BBVI) or stochastic gradient variational inference, is the workhorse of probabilistic inference in the large-data, large-model regime. For a decade, however, the computational properties of VI have remained largely unknown. For instance, under what conditions is BBVI guaranteed to converge, and is it provably efficient? In this talk, I will present recent theoretical results on VI in the form of quantitative non-asymptotic convergence guarantees for obtaining a variational posterior. Following this, I will demonstrate the usefulness of the theoretical framework by investigating the theoretical properties of various design choices and algorithmic modifications, such as parametrizations of the variational approximation, variance-reduced gradient estimators such as sticking-the-landing, structured variational families, and beyond.
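For concreteness, a minimal BBVI sketch with a mean-field Gaussian family and reparameterization gradients (a toy setup, not the speaker's code); the design choices studied in the talk are variations on exactly these pieces:

    import math
    import torch

    def elbo(log_joint, mu, log_sigma, n_samples=8):
        """Reparameterized ELBO estimate for a mean-field Gaussian q."""
        z = mu + log_sigma.exp() * torch.randn(n_samples, mu.shape[0])
        entropy = (log_sigma + 0.5 * (math.log(2 * math.pi) + 1.0)).sum()
        return log_joint(z).mean() + entropy

    d = 2
    mu = torch.zeros(d, requires_grad=True)
    log_sigma = torch.zeros(d, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
    log_joint = lambda z: -0.5 * ((z - 1.0) ** 2).sum(-1)   # toy unnormalized target: N(1, I)
    for _ in range(500):                                    # stochastic gradient ascent on the ELBO
        opt.zero_grad()
        (-elbo(log_joint, mu, log_sigma)).backward()
        opt.step()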

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

First Passage Time Distributions for Jump-Diffusion Processes and Flexible Boundaries

To join this seminar virtually: Please request Zoom connection details from ea@stat.ubc.ca 

Abstract: The first passage time (FPT) is a useful tool in the stochastic modeling of many biological, physical, social, and economic processes evolving over time. It refers to the time when a random process first passes a threshold: e.g., when the population of an endangered species reaches a certain critical level, or when the number of individuals infected with a disease reaches a limit. Other examples include the survival time of a cancer patient, the failure time of a mechanical system, and the default time of a business.

We study the boundary crossing problem for jump-diffusion processes over a discontinuous boundary and provide a complete characterization of the FPT distributions. We derive new formulas for the piecewise-linear boundary crossing probabilities and densities of Brownian motion with general random jumps. These formulas can be used to approximate the boundary crossing distributions for general nonlinear boundaries. The method extends to more general diffusion processes, such as geometric Brownian motion and Ornstein-Uhlenbeck processes with jumps. The numerical computation can be done by Monte Carlo integration, which is straightforward and easy to implement. Some numerical examples are presented for illustration.
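As an illustrative sketch of the Monte Carlo approach (parameters and helper names are our assumptions, not the speaker's code), one can condition on the simulated jumps and use the exact crossing probability of a Brownian bridge over a linear boundary segment:

    import numpy as np

    rng = np.random.default_rng(1)

    def bridge_cross_prob(x0, x1, b0, b1, dt):
        """P(Brownian bridge from x0 to x1 over [0, dt] crosses the line b0 -> b1)."""
        if x0 >= b0 or x1 >= b1:
            return 1.0                            # an endpoint is already at/above the boundary
        return np.exp(-2.0 * (b0 - x0) * (b1 - x1) / dt)

    def no_cross_prob(T=1.0, lam=2.0, jump_sd=0.5, b=lambda t: 1.0 + 0.5 * t, n=20000):
        """Monte Carlo estimate of P(no boundary crossing on [0, T])."""
        est = 0.0
        for _ in range(n):
            t, x, p = 0.0, 0.0, 1.0
            times = np.sort(rng.uniform(0.0, T, rng.poisson(lam * T)))  # jump times
            for s in np.append(times, T):
                dt = s - t
                x_new = x + rng.normal(0.0, np.sqrt(dt))                # diffusion increment
                p *= 1.0 - bridge_cross_prob(x, x_new, b(t), b(s), dt)  # survive this segment
                x = x_new + (rng.normal(0.0, jump_sd) if s < T else 0.0)  # jump at s
                t = s
            est += p
        return est / n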

Ensembles in the Age of Overparameterization: Promises and Pathologies

To join this seminar virtually: please request Zoom connection details from ea@stat.ubc.ca.
Abstract: Ensemble methods have historically used either high-bias base learners (e.g. through boosting) or high-variance base learners (e.g. through bagging). Modern neural networks cannot be understood through this classic bias-variance tradeoff, yet "deep ensembles" are pervasive in safety-critical and high-uncertainty application domains. This talk will cover surprising and counterintuitive phenomena that emerge when ensembling overparameterized base models like neural networks. While deep ensembles improve generalization in a simple and cost-effective manner, their accuracy and robustness are often outperformed by single (but larger) models. Furthermore, discouraging diversity amongst component models often improves the ensemble's predictive performance, counter to classic intuitions underpinning bagging and feature subsetting techniques. I will connect these empirical findings with new theoretical characterizations of overparameterized ensembles, and I will conclude with implications for uncertainty quantification, robustness, and decision making.
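For concreteness, a minimal sketch of a deep ensemble (illustrative, not the speaker's code): independently initialized networks trained on the same data, with predictions averaged in probability space:

    import torch
    import torch.nn as nn

    def make_net():
        return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))

    nets = [make_net() for _ in range(5)]   # random inits are the only source of diversity
    # ... each net would be trained independently on the same dataset ...
    x = torch.randn(8, 20)
    probs = torch.stack([net(x).softmax(-1) for net in nets]).mean(0)  # ensemble prediction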

Causal Inference with Cocycles

To join this seminar virtually: Please request Zoom connection details from ea@stat.ubc.ca.

Abstract: Many interventions in causal inference can be represented as transformations of the variables of interest. Abstracting interventions in this way allows us to identify a local symmetry property exhibited by many causal models under interventions. Where present, this symmetry can be characterized by a type of map called a cocycle, an object that is central to dynamical systems theory. We show that such cocycles exist under general conditions and are sufficient to identify interventional distributions and, under suitable assumptions, counterfactual distributions. We use these results to derive cocycle-based estimators for causal estimands and show that they achieve semiparametric efficiency under standard conditions. Since entire families of distributions can share the same cocycle, these estimators can make causal inference robust to mis-specification by sidestepping superfluous modelling assumptions. We demonstrate both robustness and state-of-the-art performance in several simulations, and apply our method to estimate the effects of 401(k) pension plan eligibility on asset accumulation using a real dataset.
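For reference, the algebraic identity behind the name (notation ours, following dynamical-systems usage; see the linked paper for the precise causal formulation): a cocycle assigns to each pair of points x, x' a transformation c(x', x) satisfying

    c(x, x) = \mathrm{id}, \qquad c(x'', x') \circ c(x', x) = c(x'', x),

so that transformations between interventional settings compose consistently.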

Joint work with Hugh Dance (UCL/Gatsby Unit): https://arxiv.org/abs/2405.13844


Online Kernel-Based Mode Learning

To join this seminar virtually: Please request Zoom connection details from ea@stat.ubc.ca.

Abstract: Big data, characterized by exceptionally large sample sizes, often bring the challenge of outliers and heavy-tailed distributions. An online learning estimator that is robust to outliers while not requiring access to the full historical data is therefore urgently needed. In this talk, we introduce an innovative online learning approach based on a mode kernel-based objective function, specifically designed to address outliers and heavy-tailed distributions in the context of big data. The approach embeds mode regression in an online learning framework that operates on data subsets, continuously updating the historical estimate with pertinent information extracted from each new data subset. We show that the resulting estimator is asymptotically equivalent to the mode estimator computed on the entire dataset. Monte Carlo simulations and an empirical study illustrate the finite-sample performance of the proposed estimator.
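As a rough sketch of the idea (step sizes, kernel, and data are illustrative assumptions, not the authors' algorithm), each arriving data subset contributes a stochastic gradient step on a Gaussian-kernel objective whose maximizer targets the conditional mode:

    import numpy as np

    def online_mode_update(beta, X, y, h=0.5, lr=0.1):
        """One update from a new data subset; beta is carried across subsets."""
        r = y - X @ beta                                # residuals on the new subset
        w = np.exp(-0.5 * (r / h) ** 2)                 # kernel weights: outliers get ~0 weight
        grad = (X * (w * r)[:, None]).sum(0) / (len(y) * h**3 * np.sqrt(2 * np.pi))
        return beta + lr * grad                         # ascend the kernel objective

    rng = np.random.default_rng(0)
    beta = np.zeros(2)
    for _ in range(200):                                # stream of arriving data subsets
        X = np.column_stack([np.ones(50), rng.normal(size=50)])
        y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=2, size=50)   # heavy-tailed noise
        beta = online_mode_update(beta, X, y)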
