Monolingual Pre-trained Language Models for Tigrinya

Pre-trained language models (PLMs) are driving much of the recent progress in natural language processing. However, due to the resource-intensive nature of these models, underrepresented languages without sizable curated data have not seen significant progress. Multilingual PLMs have been introduced with the potential to generalize across many languages, but their performance trails that of their monolingual counterparts and depends on the characteristics of the target language. For the Tigrinya language, recent studies report sub-optimal performance when applying current multilingual models. This may be due to its orthography and distinctive linguistic characteristics, which differ from those of the Indo-European and other typologically distant languages used to train the models. In this work, we pre-train three monolingual PLMs for Tigrinya on a newly compiled corpus, and we compare the models with their multilingual counterparts on two downstream tasks, part-of-speech tagging and sentiment analysis, achieving significantly better results and establishing a new state of the art. We make the data and trained models publicly available.
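
As a rough illustration of the downstream evaluation setup described in the abstract, the sketch below fine-tunes a pre-trained encoder for Tigrinya sentiment classification with the Hugging Face Transformers library. The checkpoint path, hyperparameters, and the two toy labeled sentences are placeholders for illustration only; they are not the paper's released models or data.

    # Minimal sketch (not from the paper): fine-tuning a pre-trained encoder for
    # Tigrinya sentiment classification with Hugging Face Transformers.
    # "path/to/tigrinya-plm" and the two toy sentences are placeholders.
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    checkpoint = "path/to/tigrinya-plm"  # placeholder: a monolingual or multilingual PLM
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Toy stand-in for a labeled Tigrinya sentiment corpus (1 = positive, 0 = negative).
    train_data = Dataset.from_dict({"text": ["ጽቡቕ እዩ", "ሕማቕ እዩ"], "label": [1, 0]})

    def tokenize(batch):
        # Convert raw text into fixed-length input IDs and attention masks.
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    train_data = train_data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3, per_device_train_batch_size=8),
        train_dataset=train_data,
    )
    trainer.train()  # fine-tune; in practice, evaluate on a held-out test set

The same loop applies to the part-of-speech tagging task by swapping in a token classification head (e.g., AutoModelForTokenClassification) and token-level labels.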
Publisher
Association for Computational Linguistics
Issue Date
2021-11-11
Language
English
Citation
The 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021
URI
http://hdl.handle.net/10203/289422
Appears in Collection
CS-Conference Papers (Conference Papers)
Files in This Item
There are no files associated with this item.
