CLASP
The Centre for Linguistic Theory and Studies in Probability

Not (yet) the whole story: The need for human-like evaluation of visual LLMs in multimodal communicative tasks

Abstract

The ability to describe and narrate what we see is a fundamental and pervasive element of human language communication, widely regarded as a hallmark of human intelligence. This core skill involves processing visual input (an image, a sequence of images, or a video) and generating natural language descriptions, commentaries, or stories. In multimodal NLP (language-and-vision) research, this line of work began with image captioning, which asks systems to generate plausible descriptions of single images. Advances in generative models now enable more complex tasks such as visual storytelling, where models generate entire narratives from temporally ordered sequences of images. Unlike more ‘factual’ tasks such as visual question answering, where there is often a single valid answer, the outputs for these tasks vary greatly in detail, coherence, visual grounding, repetition, and creativity. In this talk, I will present recent work from my lab focused on assessing these outputs in a human-like, cognitively and communicatively valid way. I argue that evaluation must consider all these aspects, going well beyond assessing the general plausibility of the generated description or story.

Bio: Sandro investigates human-like natural language understanding and generation in text-only large language models (LLMs) and their multimodal versions combining language and vision (VLMs). As such, his work combines methods and insights from Natural Language Processing, Computer Vision, and Cognitive Science. His current research interests span LLM and VLM evaluation and interpretability inspired by human cognition, how the learning of semantic and pragmatic abilities compares in humans and machines, and whether (and how) the cognitive mechanisms underlying human language communication can be used to develop better language models. He has co-authored articles in top-tier conference proceedings (ACL, EMNLP, EACL, NAACL, CoLM) and journals (TACL, Cognition, Cognitive Science). He is a member of the ELLIS society, a faculty member of the ELLIS Amsterdam Unit, and a board member of SigSem, the ACL special interest group in computational semantics.