CLASP
The Centre for Linguistic Theory and Studies in Probability

Comprehensively Evaluating Language in Language Models

Abstract

As Large Language Models (LLMs) are increasingly used in high-stakes situations, it is vital that we accurately assess not only their strengths but also their limitations. To this end, I ask: how can we ensure that we neither over- nor underestimate language models' linguistic capabilities? For this, evaluations must consider the full breadth of human language. In my talk, I will demonstrate how progress can be made towards this goal in two areas: multilingual evaluation and evaluation for the long tail of language. For multilingual evaluation, I will show how agreement evaluation can be scaled to over 100 languages. For the long tail of language, I will report results from two investigations of language models' understanding of the so-that construction, with which even state-of-the-art models struggle despite rich distributional information being available in their training data. I will further demonstrate how LLMs themselves can be leveraged to annotate corpora for long-tail constructions, which will stretch the boundaries of what we are able to test. Taken together, these evaluations paint a nuanced picture of the linguistic capabilities of large language models, showing achievements as well as remaining deficits.