Grounded language learning, from sounds and images to meaning

Event: Seminar
Lecturer: Afra Alishahi from Tilburg University
Date: 09 February 2022
Duration: 2 hours
Venue: Online

Abstract: “In this talk, I will present and discuss the results from our recently published journal article on how language can affect the structure of visual representations captured in a multi-modal transformer (https://doi.org/10.3389/frai.2021.767971). This study examined learned self-attention patterns and focused on how the two modalities affect each other’s representations. In particular, these patterns captured various object-level relations (e.g., part-of vs. whole) in different layers. We also demonstrate the grounding of objects in text in deeper layers and observe a strong priming signal from the language modality that eventually shapes and determines the learned attention. In addition, we show that these findings echo several studies from cognitive science on how the human brain processes visual information. Our experiments demonstrate that the knowledge captured by a multi-modal transformer can be not only interpreted but also linked with how humans structure the visual world around them. Thus, the question is: do such structures occur randomly or due to an actual learning process, and why do we observe so many similarities with the hierarchical visual processing performed by humans?”