AstroAI Workshop 2025

David Alvarez-Melis

Data First: How Composition, Labels, and Adaptation Shape Model Behavior

Presenter: David Alvarez-Melis (Harvard University)

Title: Data First: How Composition, Labels, and Adaptation Shape Model Behavior

Date/Time: Tuesday, July 8th, 9:30 AM - 11:00 AM

Abstract: As machine learning systems are increasingly deployed in high-stakes scientific and user-facing settings, a fundamental shift is underway: from model-centric to data-centric AI. This talk explores how the structure, composition, and semantics of training data—rather than model architecture alone—profoundly influence generalization, control, and performance. Drawing from three recent papers, I will first show how the presence of complex syntactic structures in language data can induce hierarchical generalization in neural models, but also lead to instability across random seeds when such patterns compete with simpler heuristics. I will then turn to the challenge of controllable generation in large language models, introducing a continuous interpolation framework where models are dynamically constructed at inference time from data-driven adaptations to multiple stylistic or domain preferences. Finally, I will examine the surprising power of soft labels in dataset distillation, showing that structured label information—not image content—is often the main driver of data efficiency. Together, these findings highlight emerging principles in data-centric ML and suggest new levers for steering model behavior via thoughtful data design, composition, and representation.

Biography: David Alvarez-Melis is an Assistant Professor of Computer Science at Harvard SEAS, where he leads the Data-Centric Machine Learning (DCML) group. He is also Associate Faculty at the Kempner Institute and has affiliations with the Center for Research on Computation and Society and the Harvard Data Science Initiative. In addition, he is a researcher at Microsoft Research New England.

David’s research seeks to make machine learning more broadly applicable (especially to data-poor applications) and trustworthy (e.g., robust and interpretable). He is particularly interested in the implications of these two directions for applications in the natural and medical sciences. David’s approach to the first of these goals draws on ideas from statistics, optimization, and applied mathematics, especially optimal transport, which he has used to develop methods that mitigate data scarcity through various types of geometric dataset manipulations: alignment, comparison, generation, and transformation. As for trustworthy machine learning, David has worked on methods for explaining predictions of black-box models, shown their lack of robustness, proposed methods to robustify them, and sought inspiration in the social sciences to make them human-centered. In the past, David has worked on various aspects of learning with highly structured data such as text or graphs, ranging from learning representations of structured objects, to generating them, to interpreting models that operate on them.
