CLASP
The Centre for Linguistic Theory and Studies in Probability

Causal abstraction for faithful, human-interpretable model explanations

Abstract: Explaining why a modern AI model makes the predictions it does has emerged as one of the most important questions in AI. In this era of ever-widening impact, the field has rightly turned its attention to questions of trust, safety, reliability, and bias mitigation for the models we deploy, and seriously addressing these questions will require us to understand whether and how these models represent and use human-interpretable concepts. In this talk, I’ll report on our recent efforts to achieve these explanations using a family of techniques called causal abstraction. In causal abstraction analysis, one assesses the extent to which an interpretable high-level model (say, a computer program) is a faithful proxy for a lower-level model (say, a neural network). Such analyses have already revealed a great deal about how models solve complex tasks. In particular, we are seeing that the best present-day large language models often induce interpretable, quasi-symbolic solutions that enable them to do well on hard, out-of-domain generalization tasks. This is encouraging, but it should be said that we are far from having the comprehensive understanding we need to offer even tightly circumscribed guarantees of safety and trust.
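To make the core idea concrete, the sketch below illustrates the basic operation behind causal abstraction analysis, the interchange intervention: run the low-level model on a base input, patch in a hidden representation taken from a source input at the units aligned with a high-level variable, and check whether the output changes the way the high-level model predicts it should. Everything in the sketch is illustrative rather than drawn from the talk: the "low-level model" is a tiny hand-built NumPy computation standing in for a real neural network, and the alignment between hidden units and the high-level variable S is stipulated rather than discovered.

```python
# Minimal sketch of an interchange-intervention analysis, the core operation in
# causal abstraction. Illustrative only: the "low-level model" is a toy NumPy
# network, and the alignment (which hidden units encode S) is stipulated.

import numpy as np

# ---- High-level causal model: S = X + Y, O = 1 iff S == Z ----
def high_level(x, y, z, s_override=None):
    s = x + y if s_override is None else s_override  # intermediate variable S
    return int(s == z)

# ---- Toy "low-level" model: hidden layer h, then a readout ----
# By construction, h[0] and h[1] are where the sum x + y is encoded.
W1 = np.array([[1.0, 1.0, 0.0],    # h[0] = x + y
               [0.5, 0.5, 0.0],    # h[1] = (x + y) / 2
               [0.0, 0.0, 1.0]])   # h[2] = z

def low_level_hidden(x, y, z):
    return W1 @ np.array([x, y, z], dtype=float)

def low_level_readout(h):
    # Output 1 iff the encoded sum matches the encoded z (within tolerance).
    return int(abs(h[0] - h[2]) < 1e-6)

def low_level(x, y, z, h_patch=None):
    h = low_level_hidden(x, y, z)
    if h_patch is not None:
        h[[0, 1]] = h_patch          # interchange intervention on aligned units
    return low_level_readout(h)

# ---- Interchange intervention accuracy over random base/source pairs ----
rng = np.random.default_rng(0)
n_trials, agree = 200, 0
for _ in range(n_trials):
    base = rng.integers(0, 5, size=3)      # (x, y, z) for the base input
    source = rng.integers(0, 5, size=3)    # input donating the value of S
    # High-level prediction: run the base input with S taken from the source.
    hl = high_level(*base, s_override=source[0] + source[1])
    # Low-level counterpart: patch the aligned hidden units from the source run.
    source_h = low_level_hidden(*source)
    ll = low_level(*base, h_patch=source_h[[0, 1]])
    agree += int(hl == ll)

print(f"interchange intervention accuracy: {agree / n_trials:.2f}")
# 1.00 here by construction; for a real network, this accuracy measures how
# faithfully the high-level program describes the network's computation.
```

In this toy setting the high-level program is a perfect abstraction of the low-level model, so the accuracy is 1.0 by design; with a trained network, the same procedure quantifies how faithful the proposed high-level explanation actually is.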