In this series of blog posts, we have tried to give an accessible overview of the state-of-the-art in differential privacy. In this final post, we review some of the open challenges in the practical use of differential privacy, and conclude with a summary of the contexts where differential privacy is ready for deployment today, as well as what comes next.
The impact of the privacy parameter (or privacy budget) ε has been a consistent theme throughout this series. Conceptually, the privacy parameter is simple: smaller values of ε yield better privacy, and larger values yield worse privacy. But there’s one very important question we haven’t answered: what, exactly, does ε mean, and how should we set it? Unfortunately, we still don’t have a consensus answer to this question.
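To make this concrete, here is a minimal sketch (plain NumPy, not a production library) of how ε controls the noise the Laplace mechanism adds to a counting query: the noise scale is sensitivity/ε, so smaller values of ε mean larger noise and stronger privacy.

```python
import numpy as np

# For a counting query, one person can change the result by at most 1,
# so the sensitivity is 1. The Laplace mechanism adds noise with
# scale = sensitivity / epsilon: smaller epsilon, larger noise.
sensitivity = 1.0
for epsilon in [0.1, 0.5, 1.0, 5.0]:
    scale = sensitivity / epsilon
    draw = np.random.laplace(loc=0.0, scale=scale)
    print(f"epsilon={epsilon:>4}  noise scale={scale:>5}  example draw={draw:+8.2f}")
```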
Researchers have recommended ε values around 1. Can we set ε higher than 1 without opening ourselves up to privacy attacks? Existing deployments of differential privacy - including the data collection systems built by Apple and Google, and the US Census Bureau's 2020 Census data products - have indeed set ε larger than 1.
You can find a longer list of these values in a blog post by Damien Desfontaines.
Despite concerns expressed by the academic community about the large ε values used in these systems, none of them has yet been the target of a successful privacy attack (in contrast to systems based on de-identification, which have been broken repeatedly).
Conclusive answers to questions about ε will require more experience with how the math of differential privacy translates to real-world outcomes. Until we have that experience, setting ε remains a challenging judgment call, guided largely by the values used in existing deployments.
The value of ε is one very visible parameter in differential privacy. A second parameter - less visible, but just as important - is the definition of neighboring databases (sometimes called the unit of privacy). In our first post, we defined neighboring databases as ones that differ in exactly one person's data. But this intuitive definition doesn't make sense in all contexts. In mobility data, where each individual submits multiple location reports, do neighboring databases differ in one report, or in all of one person's reports?
The distinction makes a big difference in the real-world privacy guarantee. If we protect just a single report, then it may still be possible to learn frequently-visited places for an individual (e.g. their home or work). The Apple and Google systems mentioned above, for example, define neighboring databases in terms of all of one person’s reports during one day - which represents a middle ground between the two.
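As an illustration (the data and column names below are hypothetical), the following sketch contrasts the two choices for a per-region histogram of location reports: under event-level privacy the histogram has sensitivity 1, while under user-level privacy the sensitivity grows with the number of reports one person can contribute, which is typically handled by bounding contributions and scaling the noise accordingly.

```python
import numpy as np
import pandas as pd

# Hypothetical location reports: many rows per person.
reports = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 3],
    "region":    ["A", "A", "B", "A", "C", "B"],
})
epsilon = 1.0

# Event-level privacy: neighboring databases differ in ONE report,
# so the per-region histogram has sensitivity 1.
event_level = reports["region"].value_counts() + \
    np.random.laplace(scale=1.0 / epsilon, size=reports["region"].nunique())

# User-level privacy: neighboring databases differ in ALL of one
# person's reports. Bounding each person to at most k reports bounds
# the sensitivity at k - and requires k times more noise.
k = 2
bounded = reports.groupby("person_id").head(k)
user_level = bounded["region"].value_counts() + \
    np.random.laplace(scale=k / epsilon, size=bounded["region"].nunique())
```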
Subtle points like these can have a big impact on the real-world privacy guarantee implied by differential privacy, but they are difficult to navigate and challenging to communicate to non-experts. Fortunately, as differential privacy gains prominence, researchers are beginning to study these issues.
A previous post in the series described the tradeoff between privacy and utility - how useful a differentially private release is for downstream users of the data. Researchers often use accuracy (how close the private results are to the “true” results) as a proxy for utility - but the two are not always the same. Intuitively, differential privacy’s impact on utility can be thought of in terms of how it affects the ability of data users to do their jobs.
The use of differential privacy by the US Census Bureau highlights the dual challenges of navigating this tradeoff. In an extensive process including multiple rounds of feedback, the Bureau’s Data Stewardship Executive Policy (DSEP) Committee carefully considered both privacy and utility requirements when setting the privacy parameters for the 2020 Census.
We’re still learning how to design processes like these that successfully address both challenges: measuring utility of differentially private data releases (as described in our previous post), and helping data users understand how to work with differentially private data (e.g. as described in this recent op-ed by differential privacy experts). Broader use of differential privacy will probably require both.
As described in earlier posts, performing counting, summation, and average queries on a single database table with differential privacy is a well-understood problem with mature solutions. For analyses like these, data users can reasonably expect to achieve highly accurate differentially private results for large datasets.
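For example, the following sketch (plain NumPy, with a made-up dataset) computes a differentially private count and sum over a single column; with a million records, the relative error introduced by the noise is negligible. (Each query below spends its own ε = 1; by sequential composition, the total budget would be 2.)

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1_000_000)   # a large, single-table dataset
epsilon = 1.0
upper = 90                                    # assumed public upper bound on age

# Count: sensitivity 1.
dp_count = len(ages) + rng.laplace(scale=1.0 / epsilon)
# Sum: sensitivity equals the clipping bound.
dp_sum = np.clip(ages, 0, upper).sum() + rng.laplace(scale=upper / epsilon)

print(f"count relative error: {abs(dp_count - len(ages)) / len(ages):.2e}")
print(f"sum   relative error: {abs(dp_sum - ages.sum()) / ages.sum():.2e}")
```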
However, other settings remain challenging. As discussed in previous posts, for queries on multiple tables, synthetic data generation, and deep learning, solutions are known - but they’re not yet as accurate as we would like. These areas represent the frontier of research in differential privacy, and improved algorithms are being developed all the time.
We also highlighted the challenges of ensuring correctness, in posts on testing and automatic proofs. Solutions in these areas remain primarily academic, and have only recently begun migrating to practical systems. Practical systems have also begun to implement protections against side-channel attacks, like the floating-point vulnerability present in naive implementations of the Laplace mechanism.
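For context, the vulnerable pattern looks roughly like the sketch below (our own illustrative code, not taken from any particular library): sampling Laplace noise via the inverse CDF on double-precision floats makes the set of reachable outputs depend on the true value, which Mironov showed an attacker can exploit. Deployed systems counter this with techniques such as the snapping mechanism or discrete noise distributions.

```python
import math
import random

def naive_laplace(true_value, sensitivity, epsilon):
    """Textbook Laplace mechanism via inverse-CDF sampling on doubles.
    WARNING (illustration only): because the uniform draw and the log
    are computed in floating point, the reachable outputs differ
    depending on true_value - the floating-point vulnerability."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5                 # uniform double in [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise
```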
Finally, open-source systems for differential privacy have recently started to focus on usability. The OpenDP and diffprivlib projects both provide notebook-based programming interfaces that will be familiar to many data scientists, as well as extensive documentation. Researchers are beginning to study the usability of systems like these, and this research is likely to lead to further improvements.
Is differential privacy ready for prime-time?
In some cases, the answer is yes, absolutely!
For use cases involving counting, summation, or average queries over a large, single table of data - for example, generation of histograms or aggregated microdata - there are software tools available now that can produce highly accurate results. These systems are open-source, well-supported by their authors, and carefully tuned to provide good performance.
Both tools - OpenDP and diffprivlib, mentioned earlier - allow differentially private analyses in Python notebooks, and provide programming environments designed to be accessible to existing data scientists.
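As a taste of what these interfaces look like, here is a minimal example using diffprivlib's tools module (check the project documentation for the current API; the dataset is made up and the bounds are assumed to be public knowledge):

```python
import numpy as np
from diffprivlib import tools as dp_tools  # IBM's diffprivlib

ages = np.random.randint(18, 90, size=100_000)  # a made-up dataset

# A differentially private mean: the (public) bounds are used to clip
# the data and to calibrate the noise.
private_mean = dp_tools.mean(ages, epsilon=1.0, bounds=(18, 90))
print(f"private mean: {private_mean:.2f}   true mean: {ages.mean():.2f}")
```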
For the other kinds of analyses we have discussed in this series - joins, synthetic data, deep learning, and more - accessible tools are still being developed. Progress on tools for differential privacy has accelerated rapidly in the past several years, and we look forward to the availability of accessible tools for these tasks in the near future!
We hope you have enjoyed this blog series on the basics of differential privacy! Stay tuned - in the coming year, we plan to use the series as a foundation for developing technical guidelines. We look forward to your continued engagement!
This post is part of a series on differential privacy. Learn more and browse all the posts on the differential privacy blog series page in NIST’s Privacy Engineering Collaboration Space.