CLASP
The Centre for Linguistic Theory and Studies in Probability

Probing and Explaining Neural Language Models

Abstract

The impressive performance of neural language models such as GPT-3, PaLM and FLAN raises the question of to what extent these models have ‘learned’ language and how to reason with it. This talk will summarise recent work addressing this question along two lines of research: probing and explaining. Probing neural language models aims to find evidence of learned linguistic structure by empirically testing hypotheses about the learned representations on diagnostic tasks. While this approach has generated interesting insights, we have shown that it comes with several methodological issues, including uncertainty about the suitability and validity of performance measures and the lack of suitable baselines [1–3]. To investigate the reasoning capabilities of neural language models on tasks such as natural language inference and commonsense question answering, recent work has studied architectures augmented with the capability to generate free-text rationales that explain the model’s output. We have compared explanations generated by a generation-only model to those generated by a self-rationalizing model and found that, while the former score higher in terms of validity, factual correctness, and similarity to gold explanations, they are not more useful for downstream classification [4]. Our work raises important questions about the limitations of current methods for analysing neural language models and points to avenues for future work.
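
To make the probing paradigm concrete, the sketch below shows one common instantiation: a linear ‘diagnostic classifier’ trained on frozen hidden states of a pretrained encoder, testing whether a simple linguistic property is linearly decodable from a given layer. This is an illustrative example only, not the setup of the papers cited below; the model name, layer choice, and toy noun/verb labels are assumptions, and a real probe would use an annotated corpus with proper train/test splits.

# Minimal probing sketch (illustrative; not the authors' exact setup).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # assumed encoder; any model exposing hidden states works
LAYER = 6                          # assumed layer to probe

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Toy data: (sentence, word of interest, label). A real diagnostic task would
# use annotated corpus data (e.g. a POS-tagged treebank) and held-out splits.
examples = [
    ("the dog sleeps on the mat", "dog", "NOUN"),
    ("the dog sleeps on the mat", "sleeps", "VERB"),
    ("a cat chases the ball", "cat", "NOUN"),
    ("a cat chases the ball", "chases", "VERB"),
    ("the children play outside", "children", "NOUN"),
    ("the children play outside", "play", "VERB"),
]

def word_vector(sentence, word, layer):
    """Return the hidden state of the first subword of `word` at `layer`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    first_subword = tokenizer.tokenize(word)[0]
    return hidden[tokens.index(first_subword)]

# Extract frozen representations and fit the probe: a simple linear classifier.
X = torch.stack([word_vector(s, w, LAYER) for s, w, _ in examples]).numpy()
y = [label for _, _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", probe.score(X, y))

High probe accuracy is often read as evidence that the property is encoded in the layer; the methodological caveats discussed in the talk (choice of performance measure, baselines, what the probe itself learns) concern exactly this inference step.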

[1] Jenny Kunz and Marco Kuhlmann. Classifier Probes May Just Learn from Linear Context Features. COLING 2020.
[2] Jenny Kunz and Marco Kuhlmann. Test Harder Than You Train: Probing with Extrapolation Splits. BlackboxNLP 2021.
[3] Jenny Kunz and Marco Kuhlmann. Where Does Linguistic Information Emerge in Neural Language Models? Measuring Gains and Contributions Across Layers. COLING 2022.
[4] Jenny Kunz, Martin Jirénius, Oskar Holmström, and Marco Kuhlmann. Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions. Accepted to BlackboxNLP 2022.