Distilling monolingual models from large multilingual transformers

Publication type
A1
Publication status
Published
Authors
Singh, P., De Clercq, O., & Lefever, E.
Journal
ELECTRONICS
Volume
12
Issue
4
Issue title
AI for Text Understanding

Abstract

Although language modeling has been steadily gaining traction, the models available for low-resourced languages are limited to large multilingual models such as mBERT and XLM-RoBERTa, which come with significant deployment overheads in terms of model size, inference speed, etc. We attempt to tackle this problem by proposing a novel methodology that applies knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model. We demonstrate the viability of this methodology on two downstream tasks for each of six languages. We further examine possible modifications to the basic setup for low-resourced languages by exploring ideas for tuning the final vocabulary of the distilled models. Lastly, we perform a detailed ablation study to better understand the different components of the setup and to find out what works best for the two under-resourced languages, Swahili and Slovene.
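
To make the distillation idea in the abstract concrete, the following is a minimal sketch of a standard soft-target (Hinton-style) distillation loss, in which a small student model is trained to match the temperature-softened output distribution of a larger teacher while also fitting the gold labels. This is a generic PyTorch illustration under assumed hyperparameters (the temperature and the mixing weight alpha are placeholders), not the exact objective, vocabulary handling, or teacher-student pairing used in the paper.

    # Sketch of a generic soft-target knowledge-distillation loss.
    # Temperature, alpha, and the toy tensor shapes are illustrative only.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Blend a soft KL term (teacher -> student) with the usual hard-label loss."""
        # Soften both output distributions with the temperature, then match them with KL.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        kd_term = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
        kd_term = kd_term * (temperature ** 2)  # standard rescaling of the soft term

        # Ordinary cross-entropy against the gold labels.
        ce_term = F.cross_entropy(student_logits, labels)

        return alpha * kd_term + (1.0 - alpha) * ce_term

    # Toy usage: a batch of 8 examples with 3 output classes.
    student_logits = torch.randn(8, 3, requires_grad=True)
    teacher_logits = torch.randn(8, 3)
    labels = torch.randint(0, 3, (8,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()

In practice, the teacher logits would come from a frozen multilingual model such as mBERT or XLM-RoBERTa and the student would be the smaller monolingual model being trained; how the paper filters language-specific information and tunes the student vocabulary is described in the full text, not in this sketch.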