CLASP
The Centre for Linguistic Theory and Studies in Probability

Towards Pragmatic Visual Description Generation

Abstract

Images have become an omnipresent communicative tool that we use in all aspects of life, such as in social settings (e.g., in social media and dating apps), for online shopping (e.g., clothes or vacations), and to educate (e.g., in news, textbooks, and scientific papers). However, the undeniable benefits they carry for sighted communicators turn into a serious accessibility challenge for people who are blind or have low vision (BLV). BLV users often have to rely on textual descriptions of those images to participate equally in an increasingly image-dominated (online) lifestyle. Despite the extraordinary performance of current models on many image-text tasks, they still face significant challenges when applied to image accessibility. This is a striking example of how neglecting fundamental pragmatic factors results in seemingly powerful systems that are largely unhelpful in complex interactive settings. I will suggest that the next frontier is moving from what can be said to the pragmatic problem of what should be said. Specifically, I argue that the communicative goal of the image and text needs to become a fundamental component of datasets, models, and evaluation protocols if we are to build useful image description systems. I further present progress in all three domains, which provides a basis for image description models that can promote equal access.