CLASP
The Centre for Linguistic Theory and Studies in Probability

Scene Context, Object Reference, and Image Memorability: Insights into Visuo-Linguistic Processing in Humans and Models

Abstract

Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur where. Do Vision-Language Models (VLMs) learn to rely on scene context in a similar way when generating references to objects? In the first part of this talk, I will introduce the Common Objects Out-of-Context (COOCO) dataset, which we created to test the extent to which VLMs rely on scene context when referring to objects under varying degrees of scene-object congruency and under different perturbations. Our findings show that models leverage scene context adaptively, dynamically balancing local and contextual information during reference generation. In the second part, I will discuss how scenes vary in how memorable they are to humans. Inspired by findings from cognitive science and computer vision, we explored correlates of image memorability in pretrained vision encoders, investigating activations, attention distributions, and the uniformity of image patches. I will focus in particular on a sparse autoencoder's loss in reconstructing a scene, which proves a strong proxy for memorability. Collectively, our findings reveal interesting parallels between human visuo-linguistic processing, on the one hand, and models' adaptive use of scene context and their internal encoding of memorability signals, on the other.
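
To give a rough sense of the kind of evaluation the first part describes, here is a minimal sketch of measuring reference accuracy across congruency conditions. It is illustrative only: the `vlm_fn` wrapper, the sample format, the condition labels, and the substring-match accuracy metric are all hypothetical placeholders, not the actual COOCO protocol.

```python
from collections import defaultdict
from typing import Callable

def evaluate_by_congruency(samples, vlm_fn: Callable) -> dict:
    """Accuracy of object references, grouped by congruency condition.

    samples: iterable of (image, target_bbox, gold_label, condition),
    where condition tags e.g. a 'congruent' vs. 'incongruent' object
    placement, or a perturbation type.
    vlm_fn(image, target_bbox) -> referring expression string; any
    VLM wrapper can be plugged in here.
    """
    correct = defaultdict(list)
    for image, target_bbox, gold_label, condition in samples:
        prediction = vlm_fn(image, target_bbox)
        # Crude correctness check: gold label appears in the reference.
        correct[condition].append(gold_label.lower() in prediction.lower())
    # Per-condition accuracy.
    return {cond: sum(flags) / len(flags) for cond, flags in correct.items()}

# Toy usage with a dummy "model" that always produces the same reference.
dummy = lambda image, bbox: "a cat on a desk"
toy_samples = [
    (None, None, "cat", "congruent"),
    (None, None, "toaster", "incongruent"),
]
print(evaluate_by_congruency(toy_samples, dummy))
# {'congruent': 1.0, 'incongruent': 0.0}
```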
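
For the memorability part, the key quantity is a sparse autoencoder's reconstruction loss over a scene's representation in a pretrained vision encoder. The sketch below, in PyTorch, shows one way to compute such a per-image score; the architecture, hyperparameters, and the random tensors standing in for encoder patch embeddings are assumptions for illustration, not the talk's actual setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on its code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # sparse latent code
        return self.decoder(z), z

def memorability_proxy(sae: SparseAutoencoder, patches: torch.Tensor) -> torch.Tensor:
    """Per-image reconstruction error: mean squared error over all patch
    embeddings of a scene. This is the quantity the talk relates to
    human memorability judgments."""
    recon, _ = sae(patches)                    # (n_images, n_patches, d_model)
    return ((recon - patches) ** 2).mean(dim=(1, 2))

# Stand-in for patch embeddings from a pretrained vision encoder
# (in practice: e.g. ViT patch tokens for each image).
d_model, d_hidden = 768, 4096
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coef = 1e-3
patches = torch.randn(32, 196, d_model)        # 32 images, 196 patches each

# One training step: reconstruction MSE plus L1 sparsity on the code.
recon, z = sae(patches)
loss = ((recon - patches) ** 2).mean() + l1_coef * z.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()

# After training, score each image by its reconstruction error.
with torch.no_grad():
    scores = memorability_proxy(sae, patches)  # shape (32,)
print(scores[:5])
```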