Data Collection and Quality Challenges for Deep Learning

Software 2.0 refers to the fundamental shift in software engineering where using machine learning becomes the new norm in software with the availability of big data and computing infrastructure. As a result, many software engineering practices need to be rethought from scratch where data becomes a first-class citizen, on par with code. It is well known that 80-90% of the time for machine learning development is spent on data preparation. Also, even the best machine learning algorithms cannot perform well without good data or at least handling biased and dirty data during model training. In this tutorial, we focus on data collection and quality challenges that frequently occur in deep learning applications. Compared to traditional machine learning, there is less need for feature engineering, but more need for significant amounts of data. We thus go through state-of-the-art data collection techniques for machine learning. Then, we cover data validation and cleaning techniques for improving data quality. Even if the data is still problematic, hope is not lost, and we cover fair and robust training techniques for handling data bias and errors. We believe that the data management community is well poised to lead the research in these directions. The presenters have extensive experience in developing machine learning platforms and publishing papers in top-tier database, data mining, and machine learning venues.
Publisher
ASSOC COMPUTING MACHINERY
Issue Date
2020-08
Language
English
Article Type
Article
Citation

PROCEEDINGS OF THE VLDB ENDOWMENT, v.13, no.12, pp.3429 - 3432

ISSN
2150-8097
DOI
10.14778/3415478.3415562
URI
http://hdl.handle.net/10203/280093
Appears in Collection
EE-Journal Papers, CS-Journal Papers
