Contextual understanding of language in the world: language-and-vision neural models as learners of multi-modal knowledge
- Event: Seminar
- Lecturer: Nikolai Ilinykh from University of Gothenburg
- Date: 27 March 2024
- Duration: 2 hours
- Venue: Gothenburg and online
Abstract: This "slutseminar" (final seminar) will discuss language-and-vision models and how they represent and structure multi-modal representations in different tasks. The primary question is whether computational models of language and vision can incorporate contextual information from different modalities to produce a contextual description of an image. The studies to be discussed cover a wide range of experiments and results. The thesis itself consists of three parts, each focusing on a different type of context that is important for multi-modal language modelling.
First, we will discuss what language-and-vision models learn from multi-modal representations and how these differ from uni-modal (text-only) representations. What mechanisms do such models employ to extract and structure knowledge about language and perception? How much of this knowledge can we extract and interpret?
Second, we will look at how multi-modal contexts can be represented for automatic image description generation. At what level of granularity should we represent vision and language? How informative are representations of different modalities for the image description model?
Finally, the third type of context is the context of the task. Here, we will examine the extent to which generated image descriptions are task-related and, more generally, what the task-related aspects of (human) perception and language are and whether these aspects can be interpreted from machine-produced language.