Publication

Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects

Niyati Bafna; Cristina España-Bonet; Josef van Genabith; Benoît Sagot; Rachel Bawden

In: 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN). Conférence sur le Traitement Automatique des Langues Naturelles (TALN-2023), June 5-9, Paris, France, Pages 28-42, ATALA, 6/2023.

Abstract

Neural language models play an increasingly central role for language processing, given their success for a range of NLP tasks. In this study, we compare some canonical strategies in language modeling for low-resource scenarios, evaluating all models by their (finetuned) performance on a POS-tagging downstream task. We work with five (extremely) low-resource dialects from the Indic dialect continuum (Braj, Awadhi, Bhojpuri, Magahi, Maithili), which are closely related to each other and the standard mid-resource dialect, Hindi. The strategies we evaluate broadly include from-scratch pretraining, and cross-lingual transfer between the dialects as well as from different kinds of off-the-shelf multilingual models; we find that a model pretrained on other mid-resource Indic dialects and languages, with extended pretraining on target dialect data, consistently outperforms other models. We interpret our results in terms of dataset sizes, phylogenetic relationships, and corpus statistics, as well as particularities of this linguistic system.

Further Links

https://coria-taln-2023.sciencesconf.org/data/proceedings_TALN_longs.pdf

TALNBafnaEtAl23.pdf (pdf, 878 KB)