Threat Models for Differential Privacy

differential privacy blog series banner — Credit: metamorworks/shutterstock.com

This post is part of a series on differential privacy. Learn more and browse all the posts published to date on the differential privacy blog series page in NIST’s Privacy Engineering Collaboration Space.

It's not so simple to deploy a practical system that satisfies differential privacy. Our example in the last post was a simple Python program that adds Laplace noise to a function computed over the sensitive data. For this to work in practice, we'd need to collect all of the sensitive data on one server to run our program.

What if that server gets hacked? Differential privacy provides no protection in this case—it only protects the output of our program!

When deploying differentially private systems, it's important to consider the threat model—that is, what kind of adversaries we want the system to protect against. If the threat model includes adversaries who might compromise the server holding the sensitive data, then we need to modify the system to protect against this kind of attack.

The design of systems for differential privacy thus needs to consider both privacy and security. As described in our last post, privacy refers to controlling what can be inferred from a data release. We can think of security as an orthogonal issue: security mechanisms control who is allowed to access a piece of data, but make no guarantees about what can be learned from that data.

Central Differential Privacy

The most commonly-used threat model in differential privacy research is called the central model of differential privacy (or simply, "central differential privacy").

The key component of the central model is a trusted data curator. Each individual submits their sensitive data to the data curator, who stores all of the data in a central location (i.e. on a single server). The data curator is trusted: we assume they will not look at the sensitive data directly, will not share it with anyone, and cannot be compromised by any other adversary. In other words, with this model we assume that the server holding the sensitive data cannot be hacked.

In the central model, we typically add noise to query results, as we did in our Laplace mechanism example. The advantage of this model is that it allows algorithms to add the smallest possible amount of noise, and therefore produce results with the maximum accuracy allowed under differential privacy. The figure below demonstrates this process. We place the privacy barrier in between the trusted data curator and the analyst; to the right of the privacy barrier, only differentially private results can be viewed, so the analyst does not need to be trusted.

The disadvantage of the central model is that it requires a trusted data curator—and many data curators are not considered trustworthy! In fact, lack of trust in the data collector is often a primary motivation for the use of differential privacy.

Local Differential Privacy

The local model of differential privacy addresses the security issue in the central model by eliminating the trusted data curator. Each individual adds noise to their own data before sending it to the data curator. This means that the data curator never sees the sensitive data, so does not need to be trusted. The figure below demonstrates the local model; here, the privacy barrier stands between each data owner and the (untrusted) data curator.

The local model of differential privacy avoids the security issues of the central model—if the data curator's server is hacked, the hackers only see noisy data that already satisfies differential privacy. This is a major benefit of the local model, and it's why the local model was adopted in Google's RAPPOR system [1] and Apple's data collection system [2].

The drawback? The local model produces less accurate answers than the central model. In the local model, each individual adds enough noise to satisfy differential privacy; thus the total noise for all participants is much larger than the single noise sample used in the central model.

As a result, the local model is only useful for queries with a very strong "signal." Apple's system, for example, uses the local model to estimate the popularity of emojis, but the results are only useful for the most popular emojis (i.e. where the "signal" is strongest). The local model is typically not used for more complex queries, like those used in the U.S. Census [3] or applications like machine learning.

Hybrid Models

The central and local models have their advantages and drawbacks, so achieving the best of both is a natural goal and an active area of research.

One alternative approach is the shuffling model, implemented in a system called Prochlo [4]. The shuffling model includes an untrusted data curator and individual data contributors, and adds a set of partially trusted shufflers. In this model, each individual adds a small amount of noise to their own data, and then submits it to the shuffler, which adds additional noise before forwarding batches of data to the data curator. The idea is that shufflers are unlikely to collude with the data curator or each other, so the small amount of noise added by individuals is sufficient to guarantee privacy. Each shuffler operates on a batch of inputs—in the same way as the central model—so a small amount of additional noise guarantees privacy for the whole batch. The shuffling model is a compromise between the local and central models: it allows adding less noise than the local model, but requires more noise than the central model.

Another possibility is to combine differential privacy with techniques from cryptography, such as secure multiparty computation (MPC) or fully homomorphic encryption (FHE). FHE allows computing on encrypted data without decrypting it first, and MPC allows a group of parties to securely compute functions over distributed inputs without revealing the inputs. Computing a differentially private function using secure computation is a promising way to achieve the accuracy of the central model with the security benefits of the local model. In this approach, the use of secure computation eliminates the need for a trusted data curator. Recent work [5] demonstrates the promise of combining MPC and differential privacy, and achieves most of the benefits of both the central and local models. In most cases, secure computation is several orders of magnitude slower than native execution, which is often impractical for large datasets or complex queries. Secure computation is an active area of research, however, and its performance is improving quickly. Stay tuned for a more complete blog post on this topic later in the series.

Coming up Next

In our next post, we'll look at our first open-source tools for enforcing differential privacy. We'll explore tools that can be deployed to answer counting queries on real databases, including tools accessible to non-experts and ones that can scale to very large databases, like the ones used by the U.S. Census Bureau to count people with differential privacy!

References

[1] Erlingsson, Úlfar, Vasyl Pihur, and Aleksandra Korolova. "Rappor: Randomized aggregatable privacy-preserving ordinal response." In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pp. 1054-1067. 2014.

[2] Apple Inc. "Apple Differential Privacy Technical Overview." Accessed 7/31/2020. https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf

[3] Garfinkel, Simson L., John M. Abowd, and Sarah Powazek. "Issues encountered deploying differential privacy." In Proceedings of the 2018 Workshop on Privacy in the Electronic Society, pp. 133-137. 2018.

[4] Bittau, Andrea, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. "Prochlo: Strong privacy for analytics in the crowd." In Proceedings of the 26th Symposium on Operating Systems Principles, pp. 441-459. 2017.

[5] Roy Chowdhury, Amrita, Chenghong Wang, Xi He, Ashwin Machanavajjhala, and Somesh Jha. "Cryptε: Crypto-Assisted Differential Privacy on Untrusted Servers." In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 603-619. 2020.

About the author

Joseph Near

Joseph Near is an assistant professor of computer science at the University of Vermont who supports NIST as a moderator for the Privacy Engineering Collaboration Space. His research interests include data privacy, computer security, and programming languages. Joseph received his B.S. in computer science from Indiana University, and his M.S. and Ph.D. in computer science from MIT.

David Darais

David Darais is a Principal Scientist at Galois, Inc. and supports NIST as a moderator for the Privacy Engineering Collaboration Space. David's research focuses on tools for achieving reliable software in critical, security-sensitive, and privacy-sensitive systems. David received his B.S. from the University of Utah, M.S. from Harvard University and Ph.D. from the University of Maryland.

Comments

Thank you for sharing your research on the subject, and the opportunity to comment.

Regardless of the instantiation - whether central, local or hybrid - the central server will act as a single point of failure (SPOF) if successfully breached. The three architectures described do not contemplate this attack vector, or an elegant fail-over capability as countermeasure as part of resilience-based non-functional requirements (NFRs). The SPOF is a choke point that will hinder the successful consummation of any, and all, differential privacy transactions without an as-designed redundancy factor, thereby undermining steady-state operation of the ecosystem.

Central Differential Privacy - "The data curator is trusted: we assume they will not look at the sensitive data directly, will not share it with anyone, and cannot be compromised by any other adversary. In other words, with this model we assume that the server holding the sensitive data cannot be hacked."

Positing, by way of assumption, that the server holding sensitive data cannot be hacked is counter-intuitive to the enterprise of threat modeling, according to which all components, and associated functionality, are assessed for security weaknesses. In order for all components, and associated functionality, to be deemed trustworthy, they must be explicitly assessed for assurance, or its absence, so that a "whole of" conclusion can be made on design in advance of any build/manufacture, and implementation, considerations.

Hybrid Models - "The idea is that shufflers are unlikely to collude with the data curator or each other, so the small amount of noise added by individuals is sufficient to guarantee privacy."

Positing that shufflers are unlikely to collude with the data curator or each other is counter-intuitive to the enterprise of threat modeling, according to which all components, and associated functionality, are assessed for security weaknesses. In order for all components, and associated functionality, to be deemed trustworthy, they must be explicitly assessed for assurance, or their absence, so that a "whole of" conclusion can be rendered on design in advance of any build/manufacture, and implementation, considerations. The threat model must also consider (a) in-line, "happy path" motivations, behaviors, and intentions of human actors, and (b) out-of-band (OOB), anomalistic motivations, behaviors, and intentions of the same parties. To do otherwise presents a material gap in threat modeling efforts that can seriously undermine overall trust assurance.

Reply

Thank you for your comment! You're right that many of these deployment models make a number of assumptions that won't always hold in practice - that's why we tried to state them explicitly, to make it possible for the threat modeling process to evaluate whether or not they are reasonable in a particular deployment of differential privacy.

Reply

Add new comment

Your name

CAPTCHA

This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.

Please be respectful when posting comments. We will post all comments without editing as long as they are appropriate for a public, family friendly website, are on topic and do not contain profanity, personal attacks, misleading or false information/accusations or promote specific commercial products, services or organizations. Comments that violate our comment policy or include links to non-government organizations/web pages will not be posted.

Cybersecurity Insights