Over time we might accumulate lots of data from several different populations: e.g., the spread of a virus across different countries. Yet what we wish to model is not any one of these populations. One might want a model for the spread of the virus that is robust across the different countries, or that is predictive for a new location for which we have only limited data. We overview and formalize the objectives these goals present for mixing different distributions into a training dataset, objectives which have historically been hard to optimize. We show that, under the assumption that models are trained to be near-optimal for their training distribution, these objectives simplify to convex objectives, and we provide methods to optimize these reduced objectives. Experimental results show improvements across language modeling, bio-assays, and census data tasks.
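To give a flavour of the kind of objective involved, here is a minimal toy sketch (not the speaker's method): if we assume the loss a near-optimal model attains on each population is approximately linear in the mixture weights, then minimizing the worst-case loss across populations becomes a convex problem that can be written as a small linear program. The loss matrix below is made up, and the linearity assumption is ours for illustration only.

```python
# Toy sketch: pick data-mixture weights minimizing the worst-case loss across
# populations, assuming (for illustration) that the loss on each population is
# linear in the mixture weights. Under that assumption the minimax objective
# is convex and reduces to a linear program.
import numpy as np
from scipy.optimize import linprog

# Hypothetical loss matrix: L[g, k] = loss on population g when training data
# is drawn entirely from population k (all numbers are invented).
L = np.array([
    [0.20, 0.90, 0.80],
    [0.70, 0.30, 0.90],
    [0.80, 0.80, 0.25],
])
n_pops = L.shape[1]

# Variables: mixture weights w (n_pops entries) and a scalar t bounding the
# worst-case loss. Minimize t subject to L @ w <= t, w on the simplex.
c = np.concatenate([np.zeros(n_pops), [1.0]])          # objective: minimize t
A_ub = np.hstack([L, -np.ones((L.shape[0], 1))])        # L @ w - t <= 0
b_ub = np.zeros(L.shape[0])
A_eq = np.concatenate([np.ones(n_pops), [0.0]])[None]   # sum(w) == 1
b_eq = np.array([1.0])
bounds = [(0, None)] * n_pops + [(None, None)]           # w >= 0, t free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
weights, worst_case = res.x[:n_pops], res.x[-1]
print("mixture weights:", np.round(weights, 3))
print("worst-case loss:", round(worst_case, 3))
```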
To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.
Speaker's page: Anvith Thudi
Location: ESB 4192 / Zoom
Event date: -
Speaker: Anvith Thudi, Ph.D. student, Department of Computer Science, University of Toronto