Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for Tigrinya

Question-Answering (QA) has seen significant advances recently, achieving near human-level performance on some benchmarks. However, these advances focus on high-resourced languages such as English, while the task remains unexplored for most other languages, mainly due to the lack of annotated datasets. This work presents a native QA dataset for an East African language, Tigrinya. The dataset contains 10.6K question-answer pairs spanning 572 paragraphs extracted from 290 news articles on various topics. We discuss the dataset construction method, which is applicable to building similar resources for related languages. We present comprehensive experiments and analyses of several resource-efficient approaches to QA, including monolingual, cross-lingual, and multilingual setups, along with comparisons against machine-translated silver data. Our strong baseline models reach an F1 score of 76%, while the estimated human performance is 92%, indicating that the benchmark presents a good challenge for future work. We make the dataset, models, and leaderboard publicly available.
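For context, span-extraction QA benchmarks of this kind are typically scored with a token-overlap F1 between the predicted and gold answer spans. The sketch below shows that standard SQuAD-style computation; it is an illustration only, and the exact normalization used for Tigrinya in the paper may differ (for instance, the English article stripping shown here would not apply).

```python
from collections import Counter
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace
    (SQuAD-style normalization; shown here for English as an illustration)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer span and one gold answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; otherwise no credit.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A partially correct span receives partial credit:
print(token_f1("the East African language Tigrinya", "Tigrinya"))  # 0.4
```

In benchmark evaluation, each prediction is usually scored against all annotated gold answers for a question and the maximum F1 is taken, then averaged over the dataset; the human-performance estimate is obtained by scoring one annotator's answers against the others in the same way.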
Publisher
Association for Computational Linguistics (ACL)
Issue Date
2023-07-11
Language
English
Citation

61st Annual Meeting of the Association for Computational Linguistics, ACL 2023, pp. 11857–11870

URI
http://hdl.handle.net/10203/314658
Appears in Collection
CS-Conference Papers (Conference Papers)
Files in This Item
There are no files associated with this item.
