AstroAI Lunch Talks - September 9, 2024 - Shivam Raval
09 Sep 2024 - Joshua Wing
The video can be found here: https://www.youtube.com/watch?v=TuWGOJ9lndA
Speaker: Shivam Raval (Harvard)
Title: If [0.32, 0.42, -0.18, … 0.86] is Monday, [0.48, -0.27, 0.98, … -0.22] is Interpretability, which direction is Shivam’s AstroAI Lunch Talk?
Abstract: Frontier language models have a unique ability to combine and connect seemingly unrelated concepts to produce novel, surprising, yet seemingly plausible responses. A natural question arises: do they really understand human-interpretable concepts, and if so, can we extract them from the model internals? One of the main goals of machine learning interpretability is to identify and disentangle complex representations of inputs into human-interpretable concepts for transparency, control, and safety. The main focus of this talk will be on techniques for understanding what a "brain scan" of a model encodes and how to decompose it into its most atomic units. Recent findings [1,2] suggest that, surprisingly, interpretable features may be represented as linear directions in the high-dimensional space of the model's activations. I will briefly discuss empirical findings that support this hypothesis and how they can be operationalized toward designing better, aligned AI systems [3]. This so-called linear representation hypothesis has motivated the use of sparse coding to decode the internal activations of large language models, and the introduction of Sparse Autoencoders (SAEs) for interpretability and model steering [4]. Using toy examples and synthetic datasets, I will highlight some benefits and challenges of using SAEs for interpretability, the effect of architectural choices on the learned features, and what the future could look like for language model interpretability. Finally, I will describe Lumiscope, an in-development platform for interactive interpretability that would allow researchers to study the internals of frontier models without having to implement interpretability techniques themselves. Through a case study of Patchscopes [5], a recently introduced interpretability framework, I will describe some early findings on training-free approaches to studying entity-attribute extraction and bias quantification using Lumiscope.
[1] S Marks and M Tegmark. “The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets.” arXiv preprint arXiv:2310.06824 (2023)
[2] A Arditi, O Obeso, A Syed, D Paleka, N Rimsky, W Gurnee, N Nanda. “Refusal in Language Models Is Mediated by a Single Direction” Mechanistic Interpretability Workshop at ICML (2024)
[3] Y Chen, A Wu, T DePodesta, C Yeh, K Li, NC Marin, O Patel, J Riecke, S Raval, O Seow, M Wattenberg and F Viégas. “Designing a Dashboard for Transparency and Control of Conversational AI” arXiv preprint arXiv:2406.07882 (2024)
[4] A Templeton, T Conerly, et al. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” (2024) https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
[5] N Hussein, A Ghandeharioun, R Mullins, E Reif, J Wilson, N Thain and L Dixon. “Can large language models explain their internal mechanisms?” (2024) https://pair.withgoogle.com/explorables/patchscopes/
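To make the linear representation hypothesis and the SAE setup from the abstract concrete, here is a minimal, illustrative sketch (not code from the talk or from any of the referenced papers): a toy sparse autoencoder trained on synthetic activations in which each "concept" contributes a fixed linear direction. All names, dimensions, and hyperparameters below are assumptions chosen for illustration.

```python
# Minimal sketch (illustrative only, not the speaker's implementation): train a toy
# sparse autoencoder (SAE) on synthetic activations that superpose sparse features,
# each tied to a fixed linear direction, in the spirit of the linear representation
# hypothesis. Dimensions and hyperparameters are assumed values.

import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_features, n_samples = 64, 256, 10_000

# Synthetic data: each sample activates a few sparse features, and each feature
# contributes a fixed random unit direction in activation space.
feature_dirs = torch.randn(n_features, d_model)
feature_dirs = feature_dirs / feature_dirs.norm(dim=-1, keepdim=True)
sparse_codes = (torch.rand(n_samples, n_features) < 0.02).float() * torch.rand(n_samples, n_features)
activations = sparse_codes @ feature_dirs  # shape: (n_samples, d_model)

class SparseAutoencoder(nn.Module):
    """One-hidden-layer autoencoder with an L1 penalty to encourage sparse codes."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))  # sparse, non-negative feature codes
        recon = self.decoder(codes)          # reconstruction of the activation
        return recon, codes

sae = SparseAutoencoder(d_model, d_hidden=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity penalty strength (assumed value)

for step in range(2_000):
    batch = activations[torch.randint(0, n_samples, (256,))]
    recon, codes = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row of the transposed decoder weight is a candidate "feature direction";
# comparing against the ground-truth directions shows how well they are recovered.
learned_dirs = sae.decoder.weight.T.detach()
learned_dirs = learned_dirs / learned_dirs.norm(dim=-1, keepdim=True)
best_match = (feature_dirs @ learned_dirs.T).max(dim=-1).values
print(f"mean best cosine similarity to ground-truth directions: {best_match.mean():.3f}")
```

In this toy setting, the decoder rows play the role of the interpretable linear directions discussed in the talk; with real model activations, the same recipe is applied to residual-stream or MLP activations, and the challenge becomes deciding which learned directions correspond to human-interpretable concepts.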
Watch the talk below!