GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Cited 0 time in webofscience Cited 0 time in scopus
  • Hit : 107
  • Download : 0
Language identification is one of the fundamental tasks in natural language processing that is a prerequisite to data processing and numerous applications. Low-resourced languages with similar typologies are generally confused with each other in real-world applications such as machine translation, affecting the user’s experience. In this work, we present a languageidentification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge’ez script as a writing system; namely Amharic, Blin, Ge’ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, but we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at https://github.com/fgaim/geezswitch.
Publisher
European Language Resources Association (ELRA)
Issue Date
2022-06-21
Language
English
Citation

The 13th Conference on Language Resources and Evaluation (LREC 2022), pp.6578 - 6584

URI
http://hdl.handle.net/10203/299503
Appears in Collection
CS-Conference Papers(학술회의논문)
Files in This Item
There are no files associated with this item.

qr_code

  • mendeley

    citeulike


rss_1.0 rss_2.0 atom_1.0