A space is worth a thousand words: A new spectral analysis method to evaluate vector space similarity
- Event: Seminar
- Lecturer: Haim Dubossarsky from the University of Cambridge
- Date: 20 October 2021
- Duration: 2 hours
- Venue: Gothenburg
Abstract: “Vector-based models represent the meaning of words as numeric vectors, based on the words’ co-occurrence statistics as reflected in natural texts. These representations are ubiquitous in everyday language technology applications, and are also the object of scientific inquiry in computational linguistics, the social sciences, and other data-driven research domains. Despite significant differences in the architecture of different models (e.g., whether they produce static or contextualized word embeddings), all models can be thought of as implementing the distributional hypothesis. Perhaps due to the original theoretical framing of this hypothesis (“You shall know a word by the company it keeps”), word vectors are typically analyzed as separate units, and their potential interactions are thus overlooked. This unnecessarily limits the potential that lies in these representations for both scientific research and language technology applications.
I will present a novel framework that analyzes the entire vector space of a language, rather than focusing on individual vectors. Indeed, when the entire semantic space spanned by these vector representations is analyzed using spectral analysis, new information and language-related features emerge. I will present results from cross-lingual transfer learning tasks, which are particularly suitable for testing the current framework, since performance in these tasks is impacted by the similarity between the languages at hand (i.e., the assumption of isomorphism between vector spaces). I will present a large-scale study of the correlations between similarity scores developed and computed for vector spaces and task performance, covering thousands of language pairs and four different tasks: bilingual lexicon induction (BLI), syntactic parsing, part-of-speech (POS) tagging, and machine translation. I will further introduce several similarity-isomorphism measures between two vector spaces, based on the relevant statistics of their individual spectra. I will empirically show that: (a) similarity scores derived from such spectral isomorphism measures are strongly associated with performance observed in different cross-lingual tasks; (b) these spectral-based measures consistently outperform previous standard isomorphism measures, which are computed at the word level, while being computationally more tractable and easier to interpret; (c) these novel similarity-isomorphism measures capture information complementary to linguistic distance measures, and combining the two types of measures yields even better results. Overall, these findings make inroads into a new type of analysis, and demonstrate that richer and unique information lies beyond simple word-level analysis.”
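To make the idea concrete, a spectrum-based comparison of two vector spaces can be sketched as below. This is a minimal illustration under simple assumptions (centered embedding matrices, an L1 gap between normalized singular-value profiles as the dissimilarity), not the exact measures presented in the talk; the function names and the toy data are hypothetical.

```python
import numpy as np

def spectral_profile(emb, k=100):
    """Normalized top-k singular-value profile of an embedding space.

    emb: (n_words, dim) embedding matrix. The profile is normalized to
    sum to 1 so the comparison is invariant to the overall scale of
    the space.
    """
    # Center the space so the spectrum reflects its variance structure,
    # not the mean offset of the vectors.
    X = emb - emb.mean(axis=0)
    s = np.linalg.svd(X, compute_uv=False)[:k]
    return s / s.sum()

def spectral_distance(emb_a, emb_b, k=100):
    """Illustrative spectrum-level dissimilarity between two spaces:
    the L1 gap between their normalized singular-value profiles.
    Lower values suggest more nearly isomorphic spaces. Unlike
    word-level isomorphism measures, no bilingual dictionary or
    word alignment is needed - only one SVD per language.
    """
    sa = spectral_profile(emb_a, k)
    sb = spectral_profile(emb_b, k)
    m = min(len(sa), len(sb))
    return float(np.abs(sa[:m] - sb[:m]).sum())

# Toy usage with random stand-ins for two languages' embedding spaces.
rng = np.random.default_rng(0)
emb_lang1 = rng.standard_normal((5000, 300))
emb_lang2 = rng.standard_normal((4000, 300))
print(spectral_distance(emb_lang1, emb_lang2, k=50))
```

In a real cross-lingual study one would compute such a score for each language pair and correlate it with downstream task performance (e.g., BLI accuracy) across all pairs.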