AstroAI Workshop 2025
Kiera McCormick
Evaluating Large Language Models in Astronomy Research
Presenter: Kiera McCormick
Title: Evaluating Large Language Models in Astronomy Research
Date/Time: Monday, July 7th, 3:30 - 5:00 PM
Abstract: Advances in generative technologies, specifically Large Language Models (LLMs), have led to significant breakthroughs in data analysis, generation, and prediction across many fields. Scientific research is shifting as LLMs become relevant and reliable enough to handle nuanced and complex topics. It is therefore imperative to understand how researchers interact with these models and how scientific sub-communities such as astronomy might benefit from them. To study this, we built and deployed a Slack chatbot that answers user queries via Retrieval-Augmented Generation (RAG), referencing and citing astronomy papers from arXiv. The chatbot was deployed for four weeks, and the interaction data were anonymized and collected for evaluation. Users were also prompted to leave feedback and ratings on the responses, which were used to identify flaws in the initial chatbot. Following the analysis of this trial, we improved the chatbot to better demonstrate the promise of LLMs in astronomy research, resulting in an updated prototype that generates more reliable responses to astronomical queries. We also developed methods for evaluating the quality of the generated responses using a dataset of “Gold Answers” curated by expert astronomers. Robust evaluation studies like this case study will help ensure that LLMs are useful for assisting scientific research, in astronomy and in many other academic fields.
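Illustrative sketch (not the presenter's implementation): the retrieve-then-cite step of a RAG pipeline can be approximated in a few lines. The example below assumes a tiny in-memory corpus with placeholder paper IDs and uses simple bag-of-words cosine similarity in place of a real embedding model; the names PAPERS, retrieve, and build_prompt are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: IDs and texts are placeholders, not real arXiv entries.
PAPERS = {
    "paper-A": "Type Ia supernovae are used as standard candles to measure cosmic distances.",
    "paper-B": "Exoplanet transits cause periodic dips in the brightness of the host star.",
    "paper-C": "Galaxy rotation curves provide evidence for dark matter halos.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, papers, k=2):
    """Return up to k paper IDs with non-zero similarity to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), pid) for pid, text in papers.items()]
    scored.sort(reverse=True)
    return [pid for score, pid in scored[:k] if score > 0]


def build_prompt(query, papers, k=2):
    """Assemble an LLM prompt that includes the retrieved sources, tagged by ID."""
    ids = retrieve(query, papers, k)
    context = "\n".join(f"[{pid}] {papers[pid]}" for pid in ids)
    prompt = (
        "Answer using only the sources below and cite them by ID.\n"
        f"{context}\n"
        f"Question: {query}"
    )
    return prompt, ids


if __name__ == "__main__":
    prompt, ids = build_prompt(
        "How do exoplanet transits change the brightness of a star?", PAPERS
    )
    print(prompt)
```

In a deployed system the similarity function would typically be replaced by a vector search over paper embeddings, and the assembled prompt would be sent to an LLM; tagging each source with its ID is what allows the generated answer to cite the papers it drew on.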