Privacy Engineering Program

All Contributions

Approximate Minima Perturbation (AMP)

De-identification Tool
Keywords: Differential Privacy, Machine Learning

This work presents a novel algorithm called Approximate Minima Perturbation (AMP) for differentially private convex optimization, along with an extensive empirical evaluation, on real datasets, of both AMP and a number of previous approaches to this problem. The GitHub repository contains Python implementations of AMP, noisy stochastic gradient descent, noisy Frank-Wolfe, objective perturbation, and two variants of output perturbation, as well as a number of benchmarks for generating experimental results.

Notes: The AMP algorithm and associated experimental results are described in an IEEE Symposium on Security and Privacy 2019 paper available here.

Affiliation/Organization(s) Contributing: Carnegie Mellon University; Boston University; University of California, Berkeley; University of California, Santa Cruz; Peking University
GitHub POC: @jnear

AMP on GitHub

ARX Data Anonymization Tool

De-identification Tool
Keywords: Differential Privacy, K-Anonymity, Anonymization, Machine Learning

ARX is a comprehensive open source software for anonymizing sensitive personal data. It supports a wide variety of (1) privacy and risk models, (2) methods for transforming data and (3) methods for analyzing the usefulness of output data.

Affiliation/Organization(s) Contributing: TUM - Technical University of Munich
GitHub POC: @prasser

ARX

Chorus

De-identification Tool
Keywords: Differential Privacy

Chorus is a tool for answering SQL queries with differential privacy. Chorus works with a standard SQL database, and scales to large datasets by offloading the heavy lifting of query answering to the database. To implement differential privacy mechanisms, Chorus uses a combination of query rewriting and post-processing.
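The division of labor described above (the database computes the aggregate, and privacy is added in post-processing) can be illustrated with a minimal sketch. This is not Chorus's API: the `private_count` helper, the `trips` table, and the use of SQLite are assumptions made for illustration, and the Laplace mechanism is applied to a single counting query with sensitivity 1.

```python
import random
import sqlite3


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(conn, table, where_sql, params, epsilon, rng):
    """Run COUNT(*) inside the database, then add Laplace noise client-side.

    Hypothetical helper: `table` and `where_sql` are trusted strings here,
    not user input. A counting query has sensitivity 1, so the noise scale
    is 1 / epsilon.
    """
    (true_count,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where_sql}", params
    ).fetchone()
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy data: 1000 trips, half of them in "SF".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(i, "SF" if i % 2 else "LA") for i in range(1000)],
)

rng = random.Random(0)  # seeded for repeatability; not a secure noise source
noisy = private_count(conn, "trips", "city = ?", ("SF",), epsilon=1.0, rng=rng)
print(round(noisy, 2))
```

Python's `random` module is not a cryptographically secure source of randomness, so a sketch like this is suitable for experimentation only.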

Notes: Chorus is described in a EuroS&P paper available here.

Affiliation/Organization(s) Contributing: University of Vermont; University of California, Berkeley
GitHub POC: @jnear

Chorus on GitHub

City of Seattle Open Data Risk Assessment

Privacy Risk Assessment Use Case

While the transparency goals of the open data movement serve important functions in cities like Seattle, some municipal datasets about the city and its citizens’ activities carry inherent risks to individual privacy when shared publicly. In 2016, the City of Seattle declared in its Open Data Policy that the city’s data would be “open by preference,” except when doing so may affect individual privacy. To ensure its Open Data Program effectively protects individuals, Seattle committed to performing an annual risk assessment and tasked the Future of Privacy Forum (FPF) with creating and deploying an initial privacy risk assessment methodology for open data.

This Report first describes inherent privacy risks in an open data landscape, with an emphasis on potential harms related to re-identification, data quality, and fairness. To address these risks, the Report includes a Model Open Data Benefit-Risk Analysis (“Model Analysis”). The Model Analysis evaluates the types of data contained in a proposed open dataset, the potential benefits, and concomitant risks, of releasing the dataset publicly, and strategies for effective de-identification and risk mitigation. This holistic assessment guides city officials in determining whether to release the dataset openly, in a limited access environment, or to withhold it from publication (absent countervailing public policy considerations). The Report methodology builds on extensive work done in this field by experts at the National Institute of Standards and Technology, the University of Washington, the Berkman Klein Center for Internet & Society at Harvard University, and others, and adapts existing frameworks to the unique challenges faced by cities as local governments, technological system integrators, and consumer-facing service providers. The Report concludes by detailing concrete technical, operational, and organizational recommendations to enable the Seattle Open Data Program’s approach to identify and address key privacy, ethical, and equity risks, in light of the city’s current policies and practices.

Notes: Templates for the Model Benefit-Risk Assessment (https://fpf.org/wp-content/uploads/2018/01/Model-Benefit-Risk-Analysis.pdf) and the Program Maturity Assessment (https://fpf.org/wp-content/uploads/2018/01/Program-Maturity-Assessment.pdf) are also available separately.

Future of Privacy Forum website: https://fpf.org/

City of Seattle Executive Order 2016-01: http://murray.seattle.gov/wp-content/uploads/2016/02/2.26-EO.pdf

Affiliation/Organization(s) Contributing: Future of Privacy Forum (FPF)
GitHub POC and Email: @k-finch | kfinch [at] fpf.org

Risk Assessment Report (PDF)

Differential Privacy Synthetic Data Challenge Algorithms

De-identification Tool
De-identification Keywords: Differential Privacy, Synthetic Data Generation

Participants in Match #3 of NIST's 2018 Public Safety Communications Research Differential Privacy Synthetic Data Challenge developed these open source algorithms as part of an effort to advance differential privacy. Participants were challenged to create new methods, or improve existing methods of data de-identification, while preserving the dataset’s utility for analysis. All solutions were required to satisfy the differential privacy guarantee, a provable guarantee of individual privacy protection. Participants used a data set of emergency response events occurring in San Francisco and a sub-sample of the IPUMS USA data for the 1940 U.S. Census. Contributions are listed in alphabetical order.

DP_WGAN-UCLANESL

This repo contains an implementation of the award-winning solution to the 2018 Differential Privacy Synthetic Data Challenge by team UCLANESL. Our solution was awarded 5th place in Match #3 of the challenge, and an earlier version won 4th place in Match #1. The solution trains a Wasserstein generative adversarial network (WGAN) on the real private dataset. Differentially private training is applied by sanitizing the gradients of the discriminator (norm clipping and adding Gaussian noise). Once the model is trained, it can be used to generate a synthetic dataset by feeding random noise into the generator.

Team Members: Prof. Mani Srivastava (@msrivastava) - Team Captain (Match 1 and Match 3), Moustafa Alzantot (@malzantot) - (Match 1 and Match 3), Nat Snyder (@natsnyder1) - Match 1, Supriyo Chakraborty (@supriyogit) - Match 1

DP_WGAN-UCLANESL on GitHub More Information

DPFieldGroups

This is the fourth-place entry in the third round of the NIST Differential Privacy Synthetic Data Challenge. The goal of this challenge is to produce differentially private synthetic data while retaining as much useful information as possible about the original data set. Colorado census data from 1940 with 98 field columns were provided for algorithm development, with census data from other states used for testing. This solution groups together fields that have been found to be highly correlated. For each of these groups, a histogram is created to count the number of occurrences of every possible combination of values of all fields in the group. For privatization, Laplacian noise is added to every bin, with a scale proportional to (number of groups) / (total epsilon). Synthetic data is generated by selecting a random bin for each group with probability weighted by these noisy bin counts. The field values corresponding to each group's selected bin are written out as a single row of synthetic data.
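The pipeline described above (per-group histograms, Laplace noise on every bin, weighted sampling of bins) can be sketched roughly as follows. This is an illustrative reconstruction, not the entry's actual code: the field names, the domains, the even split of the privacy budget across groups, a proportionality constant of 1 in the noise scale, and the clamping of negative noisy counts to zero are all assumptions.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(rows, group, domains, scale, rng):
    """Count every possible value combination in `group`, then add noise per bin."""
    counts = Counter(tuple(row[f] for f in group) for row in rows)
    bins = product(*(domains[f] for f in group))
    # Assumption: negative noisy counts are clamped to 0 so they can act as weights.
    return {b: max(0.0, counts[b] + laplace_noise(scale, rng)) for b in bins}


def synthesize(rows, groups, domains, total_epsilon, n_out, seed=0):
    rng = random.Random(seed)
    # Scale proportional to (number of groups) / (total epsilon);
    # the proportionality constant is assumed to be 1 here.
    scale = len(groups) / total_epsilon
    hists = [noisy_histogram(rows, g, domains, scale, rng) for g in groups]
    out = []
    for _ in range(n_out):
        row = {}
        for group, hist in zip(groups, hists):
            bins, weights = list(hist), list(hist.values())
            if sum(weights) <= 0:  # all bins clamped to zero: fall back to uniform
                weights = [1.0] * len(bins)
            chosen = rng.choices(bins, weights=weights)[0]
            row.update(zip(group, chosen))
        out.append(row)
    return out


# Hypothetical toy data with two field groups.
domains = {"sex": ["F", "M"], "age_band": ["0-39", "40+"], "state": ["CO", "VT"]}
groups = [("sex", "age_band"), ("state",)]
rows = [{"sex": "F", "age_band": "0-39", "state": "CO"}] * 60 + [
    {"sex": "M", "age_band": "40+", "state": "VT"}
] * 40
synthetic = synthesize(rows, groups, domains, total_epsilon=1.0, n_out=5)
print(synthetic[0])
```

Because each group is sampled independently, correlations between fields in different groups are not preserved, which is why the grouping step matters.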

Team Member & Affiliation: John Gardner (no affiliation)

DPFieldGroups on GitHub

DPSyn

We present DPSyn, an algorithm for synthesizing microdata while satisfying differential privacy, and its instantiation on the dataset used in the competition, namely the Public Use Microdata Sample (PUMS) of the 1940 U.S. Census data.

Team Members & Affiliations: Ninghui Li (Purdue University), Zhikun Zhang (Zhejiang University), Tianhao Wang (Purdue University)

DPSyn on GitHub

rmckenna

This is the first-place entry in the third round of the NIST Differential Privacy Synthetic Data Challenge. The high-level idea is to (1) use the Gaussian mechanism to obtain noisy answers to a carefully selected set of counting queries (1-, 2-, and 3-way marginals) and (2) find a synthetic data set that approximates the true data with respect to those queries. The latter step is accomplished with [3], and the former step uses ideas inspired by [1] and [2]. More specifically, the queries are selected by calculating the mutual information (on the public dataset) for each pair of attributes and choosing the marginal queries that have high mutual information.
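The query-selection idea can be illustrated with a small sketch that scores attribute pairs by empirical mutual information on a public dataset and keeps the top-scoring pairs. This is not the entry's code: the attribute names, the toy data, and the simple "top k pairs" rule are assumptions, and the Gaussian mechanism and the estimation step from [3] are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(rows, a, b):
    """Empirical mutual information (in nats) between attributes `a` and `b`."""
    n = len(rows)
    joint = Counter((r[a], r[b]) for r in rows)
    count_a = Counter(r[a] for r in rows)
    count_b = Counter(r[b] for r in rows)
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def select_pairs(rows, attributes, k):
    """Return the k attribute pairs with the highest mutual information."""
    scored = [
        (mutual_information(rows, a, b), (a, b))
        for a, b in combinations(attributes, 2)
    ]
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:k]]


# Hypothetical public data: `age` and `employed` move together,
# while `coin` is independent of both.
public_rows = [
    {
        "age": "young" if i % 2 == 0 else "old",
        "employed": "no" if i % 2 == 0 else "yes",
        "coin": "H" if (i // 2) % 2 == 0 else "T",
    }
    for i in range(200)
]
selected = select_pairs(public_rows, ["age", "employed", "coin"], k=1)
print(selected)
```

Computing the scores on public data means the selection itself consumes none of the privacy budget; only the noisy marginal answers do.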

[1] Zhang, Jun, et al. "Privbayes: Private data release via bayesian networks." ACM Transactions on Database Systems (TODS) 42.4 (2017): 25.
[2] Chen, Rui, et al. "Differentially private high-dimensional data publication via sampling-based inference." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
[3] McKenna, Ryan, Daniel Sheldon, and Gerome Miklau. "Graphical-model based estimation and inference for differential privacy." Proceedings of the 36th International Conference on Machine Learning. 2019.

Team Member & Affiliation: Ryan McKenna (UMass Amherst)

rmckenna Algorithm on GitHub

Differentially Private Stochastic Gradient Descent (DP-SGD)

De-identification Tool
De-identification Keywords: Differential Privacy, Machine Learning

Train machine learning models with differential privacy by clipping and noising gradients during stochastic gradient descent.
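A single update step of this kind might look like the following sketch, written with plain Python lists to stay dependency-free. It is an illustration of the clip-then-noise pattern rather than the linked implementation: the function names and parameters are hypothetical, and privacy accounting (tracking the cumulative epsilon over many steps) is not shown.

```python
import math
import random


def clip(grad, max_norm):
    """Scale `grad` down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy gradient step: clip each example's gradient, sum, add noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Gaussian noise with standard deviation noise_multiplier * max_norm
    # is added to each coordinate of the summed gradient.
    sigma = noise_multiplier * max_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / len(clipped) for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


rng = random.Random(0)  # seeded for repeatability; not a secure noise source
weights = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4], [-1.0, 2.0]]  # one gradient per training example
weights = dp_sgd_step(weights, grads, lr=0.1, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(weights)
```

Clipping bounds how much any one example can move the sum, which is what lets a fixed amount of Gaussian noise mask each individual's contribution.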

Notes: Paper with full details: https://arxiv.org/abs/1607.00133

Affiliation/Organization(s) Contributing: Google
GitHub POC: @ilyamironov

DP-SGD on GitHub

Diffprivlib

De-identification Tool
De-identification Keywords: Differential Privacy, Machine Learning, Data Analytics

Diffprivlib is a general-purpose Python library for experimenting with, and building tools for, differential privacy. Diffprivlib includes a number of off-the-shelf algorithms for machine learning and data analytics with differential privacy, using the familiar scikit-learn and NumPy syntax.

Notes: Introductory whitepaper

Affiliation/Organization(s) Contributing: IBM Research
GitHub POC and Email: @naoise-h | naoise [at] ibm.com

Diffprivlib on GitHub

Duet

De-identification Tool
De-identification Keywords: Differential Privacy, Verification of Algorithms, Machine Learning

Duet is a programming language that automatically derives (and checks) differential privacy bounds for programs written in the language. Duet is designed to support modern machine learning algorithms and advanced variants of differential privacy, so that only minimal noise needs to be added to algorithm results to ensure privacy.

Notes: paper [arXiv]

Affiliation/Organization(s) Contributing: University of Vermont, University of California at Berkeley, University of Utah
GitHub User Serving as POC: @jnear

Duet on GitHub

Ektelo

De-identification Tool
De-identification Keywords: Differential Privacy

Ektelo is a programming framework and system that aids programmers in developing differentially private programs with high utility. Ektelo can be used to author programs for a variety of statistical tasks that involve answering counting queries over a table of arbitrary dimension.

Notes: Ektelo is described in detail in a SIGMOD 2018 paper, titled "EKTELO: A Framework for Defining Differentially-Private Computations." https://dl.acm.org/citation.cfm?id=3196921

Affiliation/Organization(s) Contributing: UMass Amherst, Duke University, Colgate University
GitHub POC: @michaelghay

Ektelo on GitHub

FAIR Privacy

Privacy Risk Assessment Tool

FAIR Privacy is a quantitative privacy risk framework based on FAIR (Factor Analysis of Information Risk). FAIR Privacy examines personal privacy risks (to individuals), not organizational risks. Included in this tool is a PowerPoint deck illustrating the components of FAIR Privacy and an example based on a hypothetical smart lock manufacturer. In addition, an Excel spreadsheet provides a risk calculator that uses Monte Carlo simulation.

Notes: V2.11 March 2022 Update: Revised versions of the PowerPoint deck and calculator are provided, based on the example used in the paper "Quantitative Privacy Risk" presented at the 2021 International Workshop on Privacy Engineering (https://ieeexplore.ieee.org/document/9583709). The newer Excel-based calculator:

uses a Poisson distribution for threat opportunity (previously Beta-PERT)
uses a Binomial distribution for Attempt Frequency and Violation Frequency (Note: inherent baseline risk assumes 100% vulnerability)
provides a method of calculating organizational risk tolerance
provides a second risk calculator for comparing two risks, to help prioritize efforts
provides a tab for comparing inherent/baseline risk to residual risk, risk tolerance, and the other risk tab
includes more instructional text
genericizes privacy harm and adverse tangible consequences
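A Monte Carlo simulation built from these distributions could be sketched as below. This is not the spreadsheet's logic: the chaining of the three quantities (opportunities, then attempts, then violations), the parameter names, and the example rates are assumptions made purely for illustration.

```python
import math
import random


def poisson(lam, rng):
    """Draw from a Poisson distribution using Knuth's method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def binomial(n, p, rng):
    """Number of successes in n independent trials with success probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_violations(opportunity_rate, p_attempt, p_vulnerable, trials, seed=0):
    """Simulate yearly violation counts.

    Assumed chain: threat opportunities ~ Poisson(opportunity_rate),
    attempts ~ Binomial(opportunities, p_attempt),
    violations ~ Binomial(attempts, p_vulnerable).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        opportunities = poisson(opportunity_rate, rng)
        attempts = binomial(opportunities, p_attempt, rng)
        results.append(binomial(attempts, p_vulnerable, rng))
    return results


# Inherent (baseline) risk assumes 100% vulnerability; residual risk assumes less.
inherent = simulate_violations(12.0, 0.5, 1.0, trials=5000)
residual = simulate_violations(12.0, 0.5, 0.2, trials=5000)
print(sum(inherent) / len(inherent), sum(residual) / len(residual))
```

Comparing the two simulated distributions mirrors the inherent-versus-residual comparison that the calculator's tabs are described as providing.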

Some additional resources are provided in the PowerPoint deck. Feedback and suggestions for improvement on both the framework and the included calculator are welcome. Additionally, analysis of the spreadsheet by a statistician is most welcome.

Affiliation/Organization(s) Contributing: Enterprivacy Consulting Group
GitHub POC: @privacymaverick

FAIR Privacy on GitHub

Google Differential Privacy Library

De-identification Tool
De-identification Keywords: Differential Privacy

Google's differential privacy library provides a set of building block components that allow developers to build differentially private applications in C++, Java, and Go. Furthermore, Google's DP library offers 'Privacy on Beam', an end-to-end implementation of differential privacy that helps developers perform operations in a differentially private manner. This codelab gives further insight.

Affiliation/Organization(s) Contributing: Google
Email POC: dp-open-source [at] google.com

Google Differential Privacy on GitHub

GUPT: Privacy preserving data analysis made easy

De-identification Tool
De-identification Keywords: Differential Privacy, Machine Learning, Database Queries

The tool provides differential privacy guarantees to statistical/machine learning algorithms by treating the underlying algorithm as a black box and relying only on input/output signatures. It implements a variant of the celebrated sample and aggregate framework by Nissim, Raskhodnikova, and Smith, 2007. The empirical evaluation shows that the system scores well on various learning tasks (like clustering and regression).
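The sample and aggregate idea can be sketched in a few lines: split the data into disjoint blocks, run the black-box analysis on each block, clamp each output to a known range, and release a noisy average. This is a simplified illustration, not GUPT's code: the function names, the scalar output, the caller-supplied output range, and the use of Laplace noise on the mean are assumptions.

```python
import random
import statistics


def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sample_and_aggregate(data, analysis, num_blocks, lower, upper, epsilon, rng):
    """Release a noisy average of a black-box `analysis` run on disjoint blocks.

    Requires len(data) >= num_blocks so that every block is non-empty.
    """
    shuffled = list(data)
    rng.shuffle(shuffled)
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]
    # Only the input/output signature of `analysis` is used; outputs are clamped.
    outputs = [min(upper, max(lower, analysis(block))) for block in blocks]
    mean = sum(outputs) / num_blocks
    # One record lives in exactly one block, so it can shift the mean
    # by at most (upper - lower) / num_blocks.
    sensitivity = (upper - lower) / num_blocks
    return mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # seeded for repeatability; not a secure noise source
ages = [20 + (i % 60) for i in range(6000)]
noisy_median = sample_and_aggregate(
    ages, statistics.median, num_blocks=100, lower=0.0, upper=120.0, epsilon=1.0, rng=rng
)
print(round(noisy_median, 2))
```

More blocks reduce the noise scale but leave each block with less data, so the block count trades privacy noise against the accuracy of each per-block estimate.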

Notes: GUPT is described in detail in a SIGMOD 2012 paper, titled "GUPT: Privacy Preserving Data Analysis Made Easy." A PDF is available here.

Affiliation/Organization(s) Contributing: University of California, Berkeley; University of California, Santa Cruz; Cornell University
GitHub POC: @prashmohan

GUPT on GitHub

NIST Privacy Risk Assessment Methodology (PRAM)

Privacy Risk Assessment Tool

The PRAM is a tool that applies the risk model from NISTIR 8062 and helps organizations analyze, assess, and prioritize privacy risks to determine how to respond and select appropriate solutions. The PRAM can help drive collaboration and communication between various components of an organization, including privacy, cybersecurity, business, and IT personnel.

Worksheet 1: Framing Business Objectives and Organizational Privacy Governance
Worksheet 2: Assessing System Design; Supporting Data Map
Worksheet 3: Prioritizing Risk
Worksheet 4: Selecting Controls
Catalog of Problematic Data Actions and Problems

Notes: NIST welcomes organizations to use the PRAM and share feedback to improve the PRAM.

Affiliation/Organization(s) Contributing: NIST
GitHub POC: @kboeckl

PRAM on GitHub

PixelDP

De-identification Tool
De-identification Keywords: Differential Privacy, Verification of Algorithms, Machine Learning, Adversarial Examples

Adversarial examples that fool prediction models are a new class of attacks introduced by machine learning deployments. PixelDP is the first certified defense that both offers provable guarantees of robustness against these attacks and scales to large models and datasets, such as Google’s Inception on the ImageNet dataset. PixelDP's design relies on a novel use of differential privacy at prediction time.

Notes: This IEEE S&P 2019 research paper describes PixelDP.

Affiliation/Organization(s) Contributing: Columbia University
GitHub POC: @matlecu

PixelDP on GitHub

Privacy Protection Application (PPA)

De-identification Tool
Keywords: K-Anonymity, Anonymization, Information Leakage, Algorithmic Fairness, Database Queries, Location Data

The Privacy Protection Application de-identifies databases that contain sequential geolocation data, sometimes called moving object databases. A record of a personally-owned vehicle’s route of travel is an example, but the tool can process other types of geolocation sequences. The application has a graphical user interface and operates on Linux, OS X, and Windows. Location suppression is the de-identification strategy used, and decisions about which locations to suppress are based on information theory. This strategy does not modify the precision of retained location information. One of the objectives is to produce data usable for vehicle safety analysis and transportation application development.

Notes: This tool handles static databases and has two versions. The main GUI version uses a very efficient map matching strategy that may identify false roads for certain types of road structures. The tagged version (https://github.com/usdot-its-jpo-data-portal/privacy-protection-application/releases/tag/hmm-mm) uses a Hidden Markov Model map matching algorithm that is more accurate but less efficient. This version is a command line tool that runs in Docker. Additionally, a streaming de-identification tool was developed for a USDOT Safety Pilot Study. This tool uses geofencing to identify locations that can be retained. It can also be found on GitHub: https://github.com/usdot-jpo-ode/jpo-cvdp

POC: carterjm [at] ornl.gov

PPA on GitHub