Many natural language processing datasets are built with crowdsourcing because of its low cost and scalability. However, datasets composed of crowd workers’ generated language suffer from a problem called annotation artifacts: a model trained on such a dataset learns the annotators’ writing strategies, which are irrelevant to the task itself. Despite increasing attention to this issue, little work has addressed it from the perspective of crowdsourcing workflow design. We suggest a simple but powerful adjustment to the dataset collection procedure: instruct workers not to use a word that is highly indicative of annotation artifacts. In a case study on natural language inference dataset construction, results from two rounds of studies on Amazon Mechanical Turk suggest that applying such a word-level constraint reduces annotation artifacts in the generated dataset by 9.2% in terms of the accuracy-gap score, at the cost of a 19.7 s increase in time per unit task.
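The abstract does not specify how the artifact-indicative word is chosen; a common measure in the annotation-artifacts literature is pointwise mutual information (PMI) between a word and a label. The sketch below is a minimal, hypothetical illustration of that idea on toy NLI-style hypotheses (the function name, data, and tie-breaking rule are assumptions, not the paper's actual procedure):

```python
import math
from collections import Counter

def most_indicative_word(examples, target_label):
    """Hypothetical sketch: find the word with the highest PMI
    with target_label.

    examples: list of (text, label) pairs, e.g. NLI hypotheses.
    PMI(w, l) = log[ p(w, l) / (p(w) * p(l)) ], with probabilities
    estimated by counting the examples each word appears in.
    """
    n = len(examples)
    word_counts = Counter()
    label_counts = Counter()
    joint_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        # Count each word once per example (document frequency).
        for w in set(text.lower().split()):
            word_counts[w] += 1
            joint_counts[(w, label)] += 1
    p_label = label_counts[target_label] / n
    best_word, best_pmi = None, float("-inf")
    for w in sorted(word_counts):  # sorted gives a deterministic tie-break
        joint = joint_counts[(w, target_label)]
        if joint == 0:
            continue
        pmi = math.log((joint / n) / ((word_counts[w] / n) * p_label))
        if pmi > best_pmi:
            best_word, best_pmi = w, pmi
    return best_word, best_pmi

# Toy data: negation words are known to correlate with "contradiction".
hypotheses = [
    ("a man is not sleeping", "contradiction"),
    ("the dog is not outside", "contradiction"),
    ("a man is sleeping", "entailment"),
    ("the dog is outside", "entailment"),
    ("a man may be sleeping", "neutral"),
    ("the dog may be outside", "neutral"),
]
word, pmi = most_indicative_word(hypotheses, "contradiction")
print(word)  # → not
```

Under the method described above, workers generating contradiction hypotheses would then be instructed not to use the flagged word (here, "not").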