CLASP
The Centre for Linguistic Theory and Studies in Probability

On the Interplay between Language and Vision in Transformers: How Much of a "Multi-Modal Learning" Do We Observe?

Abstract: “In this talk, I will present and discuss the results from our recently published journal article on how language can affect the structure of visual representations captured in a multi-modal transformer (https://doi.org/10.3389/frai.2021.767971). This study examined learned self-attention patterns, focusing on how the two modalities affect each other’s representations. In particular, these patterns capture various object-level relations (e.g., part-of vs. whole) in different layers. We also demonstrate the grounding of objects in text in deeper layers, and we observe a strong priming signal from the language modality that eventually shapes and determines the learned attention. In addition, we show that these findings echo several studies from cognitive science on how the human brain processes visual information. Our experiments demonstrate that the knowledge captured by a multi-modal transformer can not only be interpreted but also linked with how humans structure the visual world around them. Thus, the question is: do such structures occur randomly or due to an actual learning process, and why do we observe so many similarities with the hierarchical visual processing performed by humans?”