Digital Humanities

Massively Multilingual Natural Language Processing

Goran Glavaš
JMU Würzburg

7 Feb 2023, 14:45–15:30

Abstract

Multilingual language models (LMs) (e.g., multilingual BERT or XLM-R) have pushed the state of the art in multilingual NLP, yielding robust performance on a range of NLP tasks for languages with little or no task-specific training data. Multilingual LMs, however, suffer from a phenomenon known as the curse of multilinguality: for a fixed model capacity, the representations of individual languages deteriorate as more languages are included in pretraining. The quality of text encodings thus varies drastically across languages and correlates strongly with the size of each language's pretraining corpus. I will present work on remedying the curse of multilinguality and improving NLP models for low-resource languages, including approaches that leverage massively multilingual lexical resources (e.g., BabelNet, PanLex).
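
For readers who want to experiment with the kind of multilingual encoders the talk discusses, here is a minimal sketch (an illustration added for this announcement, not material from the talk) of encoding parallel sentences with XLM-R via the Hugging Face transformers library. The choice of the xlm-roberta-base checkpoint, the example sentences, and the mean-pooling strategy are all assumptions for demonstration purposes:

import torch
from transformers import AutoModel, AutoTokenizer

# Load a publicly available multilingual encoder (assumed checkpoint).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def encode(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)  # zero out padding
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return (summed / mask.sum(dim=1)).squeeze(0)

# The same sentence in a higher-resource language (German) and a
# lower-resource language (Swahili); sentences are illustrative.
de = encode("Das Wetter ist heute schön.")
sw = encode("Hali ya hewa ni nzuri leo.")
print(f"Cross-lingual cosine similarity: "
      f"{torch.cosine_similarity(de, sw, dim=0).item():.3f}")

Comparing such similarities across language pairs with differently sized pretraining corpora is one simple way to observe the representation-quality gap the abstract describes.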