NLP beyond English: Do we need to think more about linguistics?

Event: Seminar
Lecturer: Marcel Bollmann from Linköping University
Date: 05 May 2023
Duration: 2 hours
Venue: Gothenburg and Online
Slides: Marcel Bollmann 5.5.2023.pdf

Abstract

From analysis of 16th-century text collections to machine translation for Creole languages: there are a lot of challenging application scenarios for NLP outside the “mainstream” English-language tasks. Yet many new NLP technologies are developed first and foremost for English, with “multilinguality” being achieved as a by-product of throwing more data at a model. Will this be the way forward? Are there still benefits in thinking about how we represent language for deep learning models, such as subword tokenization or incorporating linguistic structure?

In this talk, I will probably have more questions than answers, but will provide some perspectives from my own work on these topics — from failed attempts at building machine translation models for indigenous American languages to investigations of morphology and subword tokenization — with the overarching themes of: How good are we already at NLP beyond English? Is there value in thinking more about linguistics when building NLP models?