CLASP
The Centre for Linguistic Theory and Studies in Probability

Why the pond is not outside the frog? Grounding in contextual representations by neural language models

In this thesis, we study grounded neural language models with the aim of building a multi-modal system for language generation and understanding. The psychological literature tells us that spatial cognition draws on several kinds of knowledge, including visual perception and human interaction with the world. This makes spatial descriptions a compelling case for studying how spatial language is grounded in different kinds of knowledge. In six studies, we investigate what spatial knowledge neural language models (NLMs) encode and how they encode it.

In the first study, we ask whether the language model generalises systematically, learning the grounding of unseen compositions of representations. In the second study, we show the potential of uni-modal knowledge for detecting metaphors in adjective-noun compositions. In the third study, we look for traces of the functional-geometric distinction between spatial relations in uni-modal NLMs. This distinction is essential because knowledge about object-specific relations is not grounded in the visible situation. In the fourth study, we inspect representations of spatial relations in a uni-modal NLM to understand how they capture the concept of space from the corpus. Whether the grounding of spatial relations can be predicted from contextual embeddings is vital for evaluating grounding in multi-modal language models. In the fifth study, we evaluate the degree of grounding in language and vision using adaptive attention. In the sixth study, we use adaptive attention to examine whether and how additional geometric information from bounding boxes can improve the generation of relational image descriptions.

The primary argument of the thesis is that spatial expressions in natural language are not always grounded in direct interpretations of locations. In a joint model of vision and language, the neural language model provides spatial knowledge that contextualises the knowledge about locations coming from visual representations. The knowledge in the language model derives from locative expressions in the dataset used for the training task and is also shaped by aspects of the model's design.