An Empirical Study of Utility and Disclosure Risk for Tabular Data Synthesis Models: In-Depth Analysis and Interesting Findings

DC Field | Value | Language
dc.contributor.author | Park, Dae-Young | ko
dc.contributor.author | Ko, In-Young | ko
dc.date.accessioned | 2024-06-18T08:19:38Z | -
dc.date.available | 2024-06-18T08:19:38Z | -
dc.date.created | 2024-06-18 | -
dc.date.issued | 2024-02-20 | -
dc.identifier.citation | 2024 IEEE International Conference on Big Data and Smart Computing, BigComp 2024, pp. 67-74 | -
dc.identifier.uri | http://hdl.handle.net/10203/319842 | -
dc.description.abstract | The ever-growing accumulation of data in various applications has spurred research into privacy-enhancing technologies. Synthetic data, in particular, has gained significant attention for enhancing machine learning model performance while preserving personal information. Although synthetic data studies have been on the rise, there are no clear criteria for how to measure the utility and disclosure risk of synthetic data. Furthermore, although many existing studies have concentrated on image data synthesis models, there is a notable scarcity of research on tabular data synthesis models, particularly concerning disclosure risk. This is crucial in domains such as finance, which rely heavily on tabular datasets containing sensitive information. In this paper, we perform an in-depth analysis of the utility and disclosure risk of tabular data synthesis models, from classical to state-of-the-art, across different metrics and various types of datasets. Our findings can be summarized as follows: (1) the utility of synthetic data tends to increase as the proportion of continuous attributes in the original data decreases; (2) disclosure risk also rises with a lower proportion of continuous attributes in the original data; (3) as the volume of synthetic data grows, both utility and disclosure risk metrics generally increase; (4) an inverse relationship is observed between the sparsity of the original data and a specific utility metric; and (5) Targeted Correct Attribution Probability (TCAP), a widely used disclosure risk metric, fails to measure certain outlier records that are potential vulnerabilities for malicious attacks. | -
dc.language | English | -
dc.publisher | Institute of Electrical and Electronics Engineers Inc. | -
dc.title | An Empirical Study of Utility and Disclosure Risk for Tabular Data Synthesis Models: In-Depth Analysis and Interesting Findings | -
dc.type | Conference | -
dc.type.rims | CONF | -
dc.citation.beginningpage | 67 | -
dc.citation.endingpage | 74 | -
dc.citation.publicationname | 2024 IEEE International Conference on Big Data and Smart Computing, BigComp 2024 | -
dc.identifier.conferencecountry | TH | -
dc.identifier.conferencelocation | Bangkok, Thailand | -
dc.identifier.doi | 10.1109/BigComp60711.2024.00020 | -
dc.contributor.localauthor | Ko, In-Young | -
dc.contributor.nonIdAuthor | Park, Dae-Young | -
Appears in Collection
CS-Conference Papers
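
The abstract names Targeted Correct Attribution Probability (TCAP) as a disclosure risk metric but this record does not define it. The following is a minimal sketch of one commonly described formulation, not necessarily the variant used in the paper: synthetic records whose target value is sufficiently determined by their key attributes within the synthetic data (within-equivalence-class probability at least `tau`) are treated as an attacker's confident guesses, and each guess is scored by how often that target value occurs among original records with the same key. The function name `tcap`, the parameter names, the toy records, and the convention of scoring an unmatched key as 0 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tcap(original, synthetic, keys, target, tau=1.0):
    """Mean TCAP over synthetic records the attacker is confident about.

    original, synthetic: lists of dict records (hypothetical input format).
    keys: attribute names the attacker is assumed to know.
    target: the sensitive attribute the attacker tries to infer.
    """
    def group(rows):
        # Map each key combination to a count of its target values.
        groups = defaultdict(Counter)
        for row in rows:
            groups[tuple(row[k] for k in keys)][row[target]] += 1
        return groups

    syn_groups = group(synthetic)
    orig_groups = group(original)

    scores = []
    for row in synthetic:
        key = tuple(row[k] for k in keys)
        value = row[target]
        syn_counts = syn_groups[key]
        # Share of same-key synthetic records that carry this target value.
        weap = syn_counts[value] / sum(syn_counts.values())
        if weap < tau:
            continue  # the synthetic data is not confident for this record
        orig_counts = orig_groups.get(key)
        if orig_counts is None:
            scores.append(0.0)  # assumption: an unmatched key scores 0
        else:
            scores.append(orig_counts[value] / sum(orig_counts.values()))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, invented for illustration only.
original_rows = [
    {"age": 30, "zip": "A", "disease": "flu"},
    {"age": 30, "zip": "A", "disease": "flu"},
    {"age": 40, "zip": "B", "disease": "cold"},
    {"age": 40, "zip": "B", "disease": "flu"},
]
synthetic_rows = [
    {"age": 30, "zip": "A", "disease": "flu"},
    {"age": 40, "zip": "B", "disease": "cold"},
    {"age": 50, "zip": "C", "disease": "flu"},
]
print(tcap(original_rows, synthetic_rows, keys=["age", "zip"], target="disease"))
```

In this toy case the three confident synthetic records score 1.0, 0.5, and 0.0, so the mean is 0.5. How unmatched keys are handled is a convention that varies between descriptions of the metric, so it is left here as an explicit, changeable assumption.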