Differentially Private Generative Adversarial Network (DP-GAN) to generate private synthetic data for analysis tasks.
Team members: Rachel Cummings, Dhamma Kimpara, Digvijay Boob, Kyle Zimmerman, Chris Waites, Uthaipon Tantipongpipat
Our approach is to generate differentially private synthetic data using Generative Adversarial Networks (GANs). This synthetic data can then be used for a variety of analysis tasks, including classification, regression, clustering, and answering unknown research questions. If the synthetic data are statistically similar to the original (sensitive) data, then analysis on the synthetic data should be accurate with respect to the original database. By generating synthetic data privately, any future analysis on the data will also be private, due to the post-processing guarantees of differential privacy.
GANs are a type of generative model, in which two neural networks are trained against each other in a zero-sum game. These neural networks are parameterized by their edge weights which specify the function computed by each network. The Generator takes as input a random vector drawn from a known distribution, and produces a new datapoint that (hopefully) has a similar distribution to the true data distribution. If we are given a finite-size database, then the true data distribution can be interpreted as the empirical distribution that would arise from sampling entries of the database with replacement. The Discriminator then tries to detect whether this new datapoint is from the Generator or from the true data distribution. If the Discriminator is too successful in distinguishing between the Generator’s outputs and the true data, then this feedback is used to improve the Generator’s data generation process.
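The two-network setup above can be sketched as follows. This is a minimal illustration, not our actual architecture: the layer sizes, the 8-dimensional noise vector, and the 4-dimensional synthetic record are hypothetical choices made only to show the Generator and Discriminator forward passes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dim input noise, 4-dim synthetic records, 16 hidden units.
NOISE_DIM, DATA_DIM, HIDDEN = 8, 4, 16

def init_layer(n_in, n_out):
    """One fully connected layer; the edge weights parameterize the network."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

G_w1, G_b1 = init_layer(NOISE_DIM, HIDDEN)
G_w2, G_b2 = init_layer(HIDDEN, DATA_DIM)
D_w1, D_b1 = init_layer(DATA_DIM, HIDDEN)
D_w2, D_b2 = init_layer(HIDDEN, 1)

def generator(z):
    """Map a random vector z (from a known distribution) to a synthetic datapoint."""
    h = np.tanh(z @ G_w1 + G_b1)
    return h @ G_w2 + G_b2

def discriminator(x):
    """Output the estimated probability that x came from the true data."""
    h = np.tanh(x @ D_w1 + D_b1)
    logit = h @ D_w2 + D_b2
    return float(1.0 / (1.0 + np.exp(-logit)))

z = rng.standard_normal(NOISE_DIM)   # random input vector
fake = generator(z)                  # a synthetic record
p_real = discriminator(fake)         # Discriminator's belief that it is real
```

In training, the Discriminator's classification error on samples like `fake` is the feedback signal used to improve the Generator.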
For the design of differentially private GANs (DP-GANs), the Discriminator is trained in a differentially private manner. GANs are typically trained with iterative stochastic gradient descent updates. Once we know the sensitivity of each update to the Discriminator, we add noise proportional to that sensitivity using the Gaussian Mechanism. The overall privacy guarantee of the algorithm then follows from composition of these private updates.
We reduce the sensitivity of these gradient descent updates (and hence improve overall accuracy) by clipping the stochastic gradients so that each lies within a bounded range. This yields an upper bound on the magnitude of each update, and hence on its sensitivity; we then only need to add noise proportional to this sensitivity to ensure differential privacy.
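A private update of this form can be sketched in a few lines. This is a generic clip-and-noise step in the style of DP-SGD, not our exact training code; the function name and parameters (`clip_norm`, `noise_multiplier`) are illustrative.

```python
import numpy as np

def dp_sgd_update(per_example_grads, clip_norm, noise_multiplier, rng):
    """One differentially private gradient step: clip each per-example
    gradient to L2 norm at most clip_norm, sum, add Gaussian noise
    calibrated to the sensitivity, then average."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    # Changing one example changes the clipped sum by at most clip_norm
    # in L2 norm, so the Gaussian Mechanism uses std noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

rng = np.random.default_rng(0)
grads = [rng.standard_normal(10) for _ in range(32)]  # toy per-example gradients
private_grad = dp_sgd_update(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping is what makes the sensitivity bound possible: without it, a single example could move the gradient arbitrarily far, and no finite amount of noise would suffice.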
Once training is complete, our Generator can produce private synthetic data that, within a practical privacy budget, closely approximates the original data with respect to standard statistical measures. When trained on moderate-sized datasets, the private synthetic data also retains high accuracy on common machine learning tasks. Our algorithm enjoys formal differential privacy guarantees through standard composition; in practice, we can employ a moments accountant to obtain even tighter privacy bounds. For the same privacy budget, this yields far better training accuracy than an analysis based on advanced composition.
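To give a sense of why the choice of composition analysis matters, the sketch below compares the basic composition bound with the advanced composition theorem for many small private steps (the moments accountant tightens this further still; it is not shown here). The parameter values are illustrative only.

```python
import math

def basic_composition(eps, delta, k):
    """k-fold basic composition: privacy parameters simply add up."""
    return k * eps, k * delta

def advanced_composition(eps, delta, k, delta_prime):
    """Advanced composition theorem: k steps of (eps, delta)-DP together
    satisfy (eps_total, k*delta + delta_prime)-DP, where eps_total grows
    roughly like sqrt(k) instead of k for small eps."""
    eps_total = (eps * math.sqrt(2 * k * math.log(1.0 / delta_prime))
                 + k * eps * (math.exp(eps) - 1.0))
    return eps_total, k * delta + delta_prime

# Illustrative numbers: 1000 gradient steps, each (0.01, 1e-6)-DP.
k = 1000
eps_basic, _ = basic_composition(0.01, 1e-6, k)
eps_adv, _ = advanced_composition(0.01, 1e-6, k, delta_prime=1e-5)
# eps_basic = 10.0, while eps_adv is well under 2 for these parameters.
```

The gap widens as the number of steps grows, which is why tighter accounting translates directly into more training iterations, and hence better accuracy, for the same privacy budget.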
Back to The Unlinkable Data Challenge