Prioritizing informative features and examples for deep learning from noisy data

Deep neural networks (DNNs) have achieved remarkable success in fields such as computer vision and natural language processing, driven by vast amounts of high-quality data. Real-world data collections, however, are invariably noisy, and DNNs have been reported to unintentionally memorize much of this noise, resulting in severe performance degradation. Although noise-robust learning approaches have been actively developed, most focus on the model training stage. Yet noisy data disrupt DNNs not only during training but throughout the entire model development process, including sample selection, cleaning, and labeling. For example, unlabeled out-of-distribution data waste labeling budget because a human annotator cannot assign any valid label to them, while labeled noisy data that escape filtering can significantly degrade model performance. This calls for a systematic method that avoids such noise and exploits highly informative features and examples throughout the model development process.

In this dissertation, we propose a unified framework that prioritizes informative features and examples to enhance each stage of the development process, improving feature learning, data labeling, and data selection. First, we propose an approach that extracts only the informative features inherent to solving a target task by using auxiliary out-of-distribution data: noise features in the target distribution are deactivated by leveraging their counterparts in the out-of-distribution data. Next, we introduce an approach that prioritizes informative examples from unlabeled noisy data to reduce the labeling cost of active learning. To resolve the purity-information dilemma, in which attempting to select informative examples also draws in many noisy ones, we propose a meta-model that finds the best balance between purity and informativeness. Lastly, we present approaches that prioritize informative examples from labeled noisy data to preserve the performance of data selection. For labeled noisy image data, we propose a data selection method that considers the confidence of neighboring samples so as to maintain the performance of state-of-the-art Re-labeling models. For labeled noisy text data, we present an instruction selection method that accounts for diversity when ranking instruction quality via prompting, thereby enhancing the performance of aligned large language models.

Overall, our unified framework makes the deep learning development process robust to noisy data, effectively mitigating the impact of noisy features and examples in real-world applications.
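The purity-information dilemma above can be made concrete with a toy acquisition score. Below is a minimal sketch, assuming softmax outputs from a trained model over an unlabeled pool; the fixed weight `w`, the function names, and the choice of max-probability as a purity proxy and predictive entropy as informativeness are illustrative assumptions, not the dissertation's actual meta-model, which learns this trade-off rather than hand-setting it.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Predictive entropy; higher means more informative to label."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def query_scores(probs, w):
    """Hypothetical combined acquisition score for open-set active learning."""
    purity = probs.max(axis=1)                      # in-distribution proxy
    info = entropy(probs) / np.log(probs.shape[1])  # normalize to [0, 1]
    # A meta-model would choose the balance; `w` is a hand-set stand-in.
    return w * purity + (1.0 - w) * info

# Toy usage: softmax outputs for 1000 unlabeled examples, 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
picked = np.argsort(-query_scores(probs, w=0.5))[:50]  # labeling budget of 50
print("query these indices:", picked[:10])
```

Likewise, the neighborhood-confidence idea for selecting labeled noisy image data can be sketched as follows, assuming per-example embeddings and model confidences are available; the brute-force distance computation and all names here are hypothetical stand-ins for the selection method summarized in the abstract.

```python
import numpy as np

def neighbor_confidence_scores(embeddings, confidences, k=10):
    """Score each example by the mean confidence of its k nearest neighbors.

    Examples surrounded by confidently predicted neighbors are likely clean
    (or correctable by a Re-labeling model); isolated low-confidence
    neighborhoods suggest noise.
    """
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # pairwise Euclidean distances
    np.fill_diagonal(dists, np.inf)          # exclude each point itself
    nn_idx = np.argsort(dists, axis=1)[:, :k]
    return confidences[nn_idx].mean(axis=1)

def select_subset(embeddings, confidences, keep_ratio=0.8, k=10):
    """Keep the top `keep_ratio` fraction by neighborhood confidence."""
    scores = neighbor_confidence_scores(embeddings, confidences, k)
    return np.argsort(-scores)[: int(len(scores) * keep_ratio)]

# Toy usage: 100 examples with 16-dim embeddings and per-example confidences.
rng = np.random.default_rng(0)
kept = select_subset(rng.normal(size=(100, 16)), rng.uniform(size=100))
print(len(kept), "examples retained")
```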
Advisors
이재길 (Jae-Gil Lee)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Doctoral dissertation - Korea Advanced Institute of Science and Technology (KAIST): Graduate School of Data Science, 2024.2, [vii, 82 p.]

Keywords

Deep learning; Noisy data; Out-of-distribution data; Feature regularization; Active learning; Data pruning; Coreset selection; Large language models

URI
http://hdl.handle.net/10203/321986
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1098141&flag=dissertation
Appears in Collection
IE-Theses_Ph.D. (Doctoral Theses)
Files in This Item
There are no files associated with this item.
