CLASP
The Centre for Linguistic Theory and Studies in Probability

Composing Byte-Pair Encodings for Morphological Sequence Classification

In this talk I’ll present research on composing sub-word representations into word representations, specifically the representations that a large language model produces for byte-pair tokens. In our paper, we evaluate four methods of obtaining word representations for morphological sequence classification, that is, the task of assigning grammatical features to words. Our experiments show that using an RNN to compute word representations is consistently more effective than the other three methods across a sample of eight languages that differ in typology and in the number of byte-pair tokens per word.
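
To make the idea of composition concrete, here is a minimal sketch of turning the vectors of a word's byte-pair tokens into a single word vector. The abstract names only the RNN among the four methods, so the sum and mean functions are illustrative assumptions rather than the paper's baselines. The toy segmentation, the random vectors, the untrained Elman-style RNN weights, and every function name are likewise hypothetical.

```python
import math
import random


def compose_sum(vectors):
    """Element-wise sum of one word's sub-word vectors (assumed baseline)."""
    return [sum(dims) for dims in zip(*vectors)]


def compose_mean(vectors):
    """Element-wise mean of one word's sub-word vectors (assumed baseline)."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def compose_rnn(vectors, w_in, w_rec):
    """Run a simple Elman-style RNN over the sub-word vectors and
    return the final hidden state as the word representation."""
    hidden = [0.0] * len(w_rec)
    for vec in vectors:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[i], vec))
                + sum(w * h for w, h in zip(w_rec[i], hidden))
            )
            for i in range(len(hidden))
        ]
    return hidden


def compose_words(token_vectors, word_spans, compose):
    """Apply `compose` to each word's slice of the token sequence."""
    return [compose(token_vectors[start:end]) for start, end in word_spans]


rng = random.Random(0)
dim = 4

# Stand-ins for language-model outputs: one vector per byte-pair token.
token_vectors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]

# Hypothetical segmentation of two words: "un|believ|able" and "cat|s".
word_spans = [(0, 3), (3, 5)]

# Untrained random weights; in practice these would be learned.
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

sum_words = compose_words(token_vectors, word_spans, compose_sum)
mean_words = compose_words(token_vectors, word_spans, compose_mean)
rnn_words = compose_words(
    token_vectors, word_spans, lambda vecs: compose_rnn(vecs, w_in, w_rec)
)
```

Each call yields one vector per word, whatever the number of byte-pair tokens it was split into, and a classifier for grammatical features could then be applied to those vectors. Unlike the sum and the mean, the RNN is sensitive to token order.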