Reducing annotation artifacts in crowdsourcing datasets for natural language processing

Abstract
Many NLP datasets are generated with crowdsourcing because it is a relatively low-cost and scalable solution. One important issue in datasets built with crowdsourcing is annotation artifacts: a model trained on such a dataset learns annotators' writing strategies that are irrelevant to the task itself. While this problem has already been identified and studied, there is limited research approaching it from the perspective of crowdsourcing workflow design. We suggest a simple but powerful adjustment to the dataset collection procedure: instruct workers not to use a word that is highly indicative of annotation artifacts. In a case study on natural language inference dataset construction, results from two rounds of studies on Amazon Mechanical Turk reveal that applying this word-level constraint reduces annotation artifacts in the generated dataset by 9.2% in terms of accuracy-gap score, at a time cost of a 19.7-second increase per unit task.
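The word-level constraint presupposes a way to identify words "highly indicative of annotation artifacts". A minimal sketch of one common scoring approach from the annotation-artifact literature, pointwise mutual information (PMI) between words and labels (the function name, toy data, and `min_count` threshold here are illustrative assumptions, not the thesis's exact procedure):

```python
import math
from collections import Counter

def artifact_words(examples, min_count=2, top_k=3):
    """Rank (word, label) pairs by pointwise mutual information (PMI).

    examples: list of (text, label) pairs, e.g. NLI hypotheses.
    Words whose presence is highly predictive of a label are
    candidate artifact words to ban in worker instructions.
    """
    word_label, word, label = Counter(), Counter(), Counter()
    n = 0
    for text, lab in examples:
        for w in set(text.lower().split()):  # count each word once per example
            word_label[(w, lab)] += 1
            word[w] += 1
            label[lab] += 1
            n += 1
    scored = [
        (w, lab, math.log((c / n) / ((word[w] / n) * (label[lab] / n))))
        for (w, lab), c in word_label.items()
        if word[w] >= min_count  # skip rare words, whose PMI is noisy
    ]
    return sorted(scored, key=lambda t: -t[2])[:top_k]

# Toy NLI-style data: negation correlates with the "contradiction"
# label, a well-known artifact pattern.
data = [
    ("a man is not sleeping", "contradiction"),
    ("nobody is not outside", "contradiction"),
    ("a man is sleeping", "entailment"),
    ("people are outside", "entailment"),
]
flagged = artifact_words(data)
print(flagged[0])  # highest-PMI (word, label, score) triple
```

On this toy data the highest-scoring pair is ("not", "contradiction"); in a real workflow the flagged words would be fed into the worker instructions as the banned-word list.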
Advisors
Oh, Haeyun (오혜연); Kim, Juho (김주호)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2021
Identifier
325007
Language
eng
Description

Master's thesis - KAIST : School of Computing, 2021.2, [iv, 23 p.]

Keywords

Datasets; annotation artifacts; crowdsourcing; bias

URI
http://hdl.handle.net/10203/296135
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=948468&flag=dissertation
Appears in Collection
CS-Theses_Master (Master's Theses)
Files in This Item
There are no files associated with this item.
