AstroAI Workshop 2026
Steven Dillmann
Terminal-Bench-Science: Evaluating AI Agents on Computational Workflows in the Natural Sciences
Presenter: Steven Dillmann (Stanford University)
Title: Terminal-Bench-Science: Evaluating AI Agents on Computational Workflows in the Natural Sciences
Date/Time: Monday, June 15, 11:30 AM - 12:30 PM
Abstract: Terminal-Bench-Science (TB-Science) is a benchmark for evaluating AI agents on the complex real-world computational workflows that natural scientists run in their research labs. It builds on the success of Terminal-Bench, adopted by frontier labs such as OpenAI, Anthropic, and Google DeepMind, which helped drive rapid progress in AI coding agents by defining what leading labs measure and optimize for. No equivalent exists for science — until now.
Current “AI for Science” benchmarks test textbook knowledge or abstract capabilities like hypothesis generation. They do not measure whether an AI system can execute the end-to-end computational workflows that drive modern research in the natural sciences. TB-Science will close this gap by porting real workflows from leading research labs into executable benchmark tasks, evaluated in containerized environments with deterministic, programmatic verification.
Our goal is to catalyze a “Claude Code for Science” moment by giving natural scientists a direct voice in shaping AI progress. Domain experts contribute real workflows, frontier labs optimize against them, and the resulting advances flow back as more capable AI tools for scientific discovery. This creates a virtuous cycle between the scientists who know what matters and the labs building the next generation of AI.