Reducing Annotation Artifacts in Crowdsourcing Datasets for Natural Language Processing

Abstract
Many datasets for natural language processing are built with crowdsourcing because of its low cost and scalability. However, datasets built from crowd workers’ generated language suffer from a problem called annotation artifacts: a model trained on such a dataset learns the annotators’ writing strategies, which are irrelevant to the task itself. Despite increasing attention, little work has addressed this issue from the perspective of crowdsourcing workflow design. We suggest a simple but powerful adjustment to the dataset collection procedure: instruct workers not to use a word that is highly indicative of annotation artifacts. In a case study of natural language inference (NLI) dataset construction, results from two rounds of studies on Amazon Mechanical Turk suggest that applying this word-level constraint reduces annotation artifacts in the generated dataset by 9.2% in terms of accuracy–gap score, at the cost of a 19.7 s increase in time per unit task.
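
The abstract does not specify how the artifact-indicative word is identified, so the sketch below is only a plausible illustration of the word-level constraint, not the authors' method: it ranks hypothesis words by pointwise mutual information (PMI) with the NLI label, a heuristic commonly used to surface annotation artifacts, and the top-ranked word for a label would be the one workers are instructed to avoid. The function and parameter names (pmi_by_label, min_count) are hypothetical.

# Hypothetical sketch: ranking artifact-indicative words with PMI.
# Assumption: words strongly associated with a label are treated as
# artifact-indicative; the top-ranked word per label is the one to ban.
import math
from collections import Counter

def pmi_by_label(hypotheses, labels, min_count=5):
    """Return {label: [(word, pmi), ...]} sorted by descending PMI."""
    word_counts = Counter()            # documents containing each word
    joint_counts = Counter()           # documents containing (word, label)
    label_counts = Counter(labels)     # documents per label
    total = len(labels)
    for text, label in zip(hypotheses, labels):
        for word in set(text.lower().split()):   # presence, not frequency
            word_counts[word] += 1
            joint_counts[(word, label)] += 1
    ranked = {}
    for label in label_counts:
        scores = []
        for word, wc in word_counts.items():
            if wc < min_count:
                continue
            joint = joint_counts[(word, label)]
            if joint == 0:
                continue
            # PMI(word, label) = log [ p(word, label) / (p(word) * p(label)) ]
            pmi = math.log((joint / total) /
                           ((wc / total) * (label_counts[label] / total)))
            scores.append((word, pmi))
        ranked[label] = sorted(scores, key=lambda x: x[1], reverse=True)
    return ranked

if __name__ == "__main__":
    # Toy NLI-style hypotheses and labels for illustration only.
    hyps = ["a man is sleeping", "nobody is outside", "a person is outdoors",
            "the cat is sleeping", "no animals are present", "a dog runs outside"]
    labs = ["contradiction", "contradiction", "entailment",
            "contradiction", "contradiction", "entailment"]
    top = pmi_by_label(hyps, labs, min_count=1)
    print(top["contradiction"][:3])

In this sketch, the highest-PMI word for a label (for example, negation words for contradiction) would be the word listed in the task instructions as off-limits when workers write a hypothesis for that label.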
Publisher
AAAI
Issue Date
2020-10-26
Language
English
Citation
The Eighth AAAI Conference on Human Computation and Crowdsourcing
URI
http://hdl.handle.net/10203/277210
Appears in Collection
CS-Conference Papers (Conference Papers)
