Generation and assessment of useful and privacy preserving synthetic datasets

Gewirtz-O’Reilly, Flynn (2024) Generation and assessment of useful and privacy preserving synthetic datasets. PhD thesis, University of Glasgow.

Abstract

Synthetic datasets are gaining traction as a potential solution for allowing access to sensitive data while protecting the privacy of individuals. However, how to assess both the utility and the disclosure risk of synthetic data remains an open question on which there is little consensus. Theoretically sound solutions have been proposed, but they are not currently feasible for most use cases. Meanwhile, most practicable disclosure risk assessments are ad hoc, unsuitable for more than a few sensitive variables, and only consider a narrow range of risk scenarios. For greater uptake of synthetic data, it is important to establish a standard for its assessment.

In this thesis, we evaluate methods for the assessment of synthetic data and identify several clear issues in the literature. We develop a practical framework for the quantitative assessment of disclosure risk for synthetic data. Hierarchical regression models are used to evaluate and compare disclosure risk across multiple sensitive variables, synthetic datasets and intruder assumptions simultaneously. We demonstrate our methods on two example datasets: a small dataset containing fewer than 1000 samples and 9 variables, and a larger dataset containing over 50000 samples and 40 variables. We find that the method of prediction has a significantly larger effect on attribute disclosure risk than the synthetic data generation method.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Additional Information: Supported by funding from EPSRC and Canon Medical Research Europe.
Subjects: H Social Sciences > HA Statistics
Colleges/Schools: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Funder's Name: Engineering and Physical Sciences Research Council (EPSRC), Canon Medical Research Europe
Supervisor's Name: Ray, Professor Surajit
Date of Award: 2024
Depositing User: Theses Team
Unique ID: glathesis:2024-84327
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 17 May 2024 14:25
Last Modified: 21 May 2024 08:42
Thesis DOI: 10.5525/gla.thesis.84327
URI: https://theses.gla.ac.uk/id/eprint/84327
