Differential PrivacyWebsite for the differential privacy research community
https://differentialprivacy.org
Conference Digest - NeurIPS 2021<p>The accepted papers for <a href="https://neurips.cc/Conferences/2020">NeurIPS 2021</a> were recently announced, and there’s a huge amount of differential privacy content.
We found one relevant workshop and 48 papers.
This is up from 31 papers last year, an over 50% increase!
It looks like there’s huge growth in interest on differentially private machine learning.
Impressively, at the time of this writing, all but five papers are already posted on arXiv!
For the full list of accepted papers, see <a href="https://neurips.cc/Conferences/2021/AcceptedPapersInitial">here</a>.
Please let us know if we missed relevant papers on differential privacy!</p>
<h2 id="workshops">Workshops</h2>
<ul>
<li><a href="https://priml2021.github.io/">Privacy in Machine Learning (PriML) 2021</a></li>
</ul>
<h2 id="papers">Papers</h2>
<ul>
<li>
<p><a href="https://arxiv.org/abs/2103.08721">A Central Limit Theorem for Differentially Private Query Answering</a><br />
Jinshuo Dong, Weijie Su, Linjun Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2108.02391">Adapting to function difficulty and growth conditions in private optimization</a><br />
Hilal Asi, Daniel Levy, John Duchi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.04378">Adaptive Machine Unlearning</a><br />
Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, Chris Waites</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.13239">An Uncertainty Principle is a Price of Privacy-Preserving Microdata</a><br />
John Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Daniel Kifer, Philip Leclerc, William Sexton, Ashley Simpson, Christine Task, Pavel Zhuravlev</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03408">Antipodes of Label Differential Privacy: PATE and ALIBI</a><br />
Mani Malek Esmaeili, Ilya Mironov, Karthik Prasad, Igor Shilov, Florian Tramer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.13329">Covariance-Aware Private Mean Estimation Without Private Covariance Estimation</a><br />
Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, Lydia Zakynthinou</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.06062">Deep Learning with Label Differential Privacy</a><br />
Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, Chiyuan Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.05855">Differential Privacy Dynamics of Langevin Diffusion and Noisy Gradient Descent</a><br />
Rishav Chourasia, Jiayuan Ye, Reza Shokri</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02674">Differentially Private Empirical Risk Minimization under the Fairness Lens</a><br />
Cuong Tran, My Dinh, Ferdinando Fioretto</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.14153">Differentially Private Federated Bayesian Optimization with Distributed Exploration</a><br />
Zhongxiang Dai, Bryan Kian Hsiang Low, Patrick Jaillet</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/1905.03871">Differentially Private Learning with Adaptive Clipping</a><br />
Galen Andrew, Om Thakkar, Swaroop Ramaswamy, Brendan McMahan</p>
</li>
<li>
<p>Differentially Private Model Personalization<br />
Prateek Jain, John Rush, Adam Smith, Shuang Song, Abhradeep Guha Thakurta</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02900">Differentially Private Multi-Armed Bandits in the Shuffle Model</a><br />
Jay Tenenbaum, Haim Kaplan, Yishay Mansour, Uri Stemmer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2108.02831">Differentially Private n-gram Extraction</a><br />
Kunho Kim, Sivakanth Gopi, Janardhan Kulkarni, Sergey Yekhanin</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.02516">Differential Privacy Over Riemannian Manifolds</a><br />
Matthew Reimherr, Karthik Bharath, Carlos Soto</p>
</li>
<li>
<p>Differentially Private Sampling from Distributions<br />
Sofya Raskhodnikova, Satchit Sivakumar, Adam Smith, Marika Swanberg</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.05585">Differentially Private Stochastic Optimization: New Results in Convex and Non-Convex Settings</a><br />
Raef Bassily, Cristóbal Guzmán, Michael Menart</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.01177">Don’t Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence</a><br />
Tianshi Cao, Alex Bie, Arash Vahdat, Sanja Fidler, Karsten Kreis</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2010.09063">Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization</a><br />
Pranav Subramani, Nicholas Vadivelu, Gautam Kamath</p>
</li>
<li>
<p>Exact Privacy Guarantees for Markov Chain Implementations of the Exponential Mechanism with Artificial Atoms<br />
Jeremy Seeman, Matthew Reimherr, Aleksandra Slavković</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.03013">Fast and Memory Efficient Differentially Private-SGD via JL Projections</a><br />
Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Hanwen Shen, Uthaipon Tantipongpipat</p>
</li>
<li>
<p>G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators<br />
Yunhui Long, Boxin Wang, Zhuolin Yang, Bhavya Kailkhura, Aston Zhang, Carl Gunter, Bo Li</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03365">Generalized Linear Bandits with Local Differential Privacy</a><br />
Yuxuan Han, Zhipeng Liang, Yang Wang, Jiheng Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2008.11193">Individual Privacy Accounting via a Rényi Filter</a><br />
Vitaly Feldman, Tijana Zrnic</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2104.00979">Information-constrained optimization: can adaptive processing of gradients help?</a><br />
Jayadev Acharya, Clement Canonne, Prathamesh Mayekar, Himanshu Tyagi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.00463">Instance-optimal Mean Estimation Under Differential Privacy</a><br />
Ziyue Huang, Yuting Liang, Ke Yi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.07153">Iterative Methods for Private Synthetic Data: Unifying Framework and New Methods</a><br />
Terrance Liu, Giuseppe Vietri, Steven Wu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.11845">Learning with User-Level Privacy</a><br />
Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, Ananda Theertha Suresh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.13513">Littlestone Classes are Privately Online Learnable</a><br />
Noah Golowich, Roi Livni</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2010.07778">Local Differential Privacy for Regret Minimization in Reinforcement Learning</a><br />
Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.03940">Locally differentially private estimation of functionals of discrete distributions</a><br />
Cristina Butucea, Yann Issartel</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2105.10675">Locally private online change point detection</a><br />
Tom Berrett, Yi Yu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.10870">Multiclass versus Binary Differentially Private PAC Learning</a><br />
Satchit Sivakumar, Mark Bun, Marco Gaboardi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02848">Numerical Composition of Differential Privacy</a><br />
Sivakanth Gopi, Yin Tat Lee, Lukas Wutschitz</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.11526">On the Sample Complexity of Privately Learning Axis-Aligned Rectangles</a><br />
Menachem Sadigurschi, Uri Stemmer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03645">Photonic Differential Privacy with Direct Feedback Alignment</a><br />
Ruben Ohana, Hamlet Medina, Julien Launay, Alessandro Cappelli, Iacopo Poli, Liva Ralaivola, Alain Rakotomamonjy</p>
</li>
<li>
<p>Private and Non-private Uniformity Testing for Ranking Data<br />
Róbert Busa-Fekete, Dimitris Fotakis, Emmanouil Zampetakis</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.07171">Private learning implies quantum stability</a><br />
Yihui Quek, Srinivasan Arunachalam, John A Smolin</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.15352">Private Non-smooth ERM and SCO in Subquadratic Steps</a><br />
Janardhan Kulkarni, Yin Tat Lee, Daogao Liu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02162">Privately Learning Mixtures of Axis-Aligned Gaussians</a><br />
Ishaq Aden-Ali, Hassan Ashtiani, Christopher Liaw</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.00001">Privately Learning Subspaces</a><br />
Vikrant Singhal, Thomas Steinke</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.02281">Privately Publishable Per-instance Privacy</a><br />
Rachel Redberg, Yu-Xiang Wang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2109.06153">Relaxed Marginal Consistency for Differentially Private Query Answering</a><br />
Ryan McKenna, Siddhant Pradhan, Daniel Sheldon, Gerome Miklau</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.03279">Remember What You Want to Forget: Algorithms for Machine Unlearning</a><br />
Ayush Sekhari, Jayadev Acharya, Gautam Kamath, Ananda Theertha Suresh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.08763">Renyi Differential Privacy of The Subsampled Shuffle Model In Distributed Learning</a><br />
Antonious Girgis, Deepesh Data, Suhas Diggavi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.09159">Robust and differentially private mean estimation</a><br />
Xiyang Liu, Weihao Kong, Sham Kakade, Sewoong Oh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.04995">The Skellam Mechanism for Differentially Private Federated Learning</a><br />
Naman Agarwal, Peter Kairouz, Ken Liu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.11208">User-Level Differentially Private Learning via Correlated Sampling</a><br />
Badih Ghazi, Ravi Kumar, Pasin Manurangsi</p>
</li>
</ul>
Gautam KamathTue, 09 Nov 2021 11:00:00 -0400
https://differentialprivacy.org/neurips2021/
https://differentialprivacy.org/neurips2021/How to deploy machine learning with differential privacy?<p>In many applications of machine learning, such as machine learning for medical diagnosis, we would like to have machine learning algorithms that do not memorize sensitive information about the training set, such as the specific medical histories of individual patients. Differential privacy is a notion that allows quantifying the degree of privacy protection provided by an algorithm on the underlying (sensitive) data set it operates on. Through the lens of differential privacy, we can design machine learning algorithms that responsibly train models on private data.</p>
<h2 id="why-do-we-need-private-machine-learning-algorithms">Why do we need private machine learning algorithms?</h2>
<p>Machine learning algorithms work by studying a lot of data and updating their parameters to encode the relationships in that data. Ideally, we would like the parameters of these machine learning models to encode general patterns (e.g., ‘‘patients who smoke are more likely to have heart disease’’) rather than facts about specific training examples (e.g., “Jane Smith has heart disease”). Unfortunately, machine learning algorithms do not learn to ignore these specifics by default. If we want to use machine learning to solve an important task, like making a cancer diagnosis model, then when we publish that machine learning model (for example, by making an open source cancer diagnosis model for doctors all over the world to use) we might also inadvertently reveal information about the training set. A malicious attacker might be able to inspect the published model’s predictions and learn private information about Jane Smith. For instance, the adversary could mount a membership inference attack to know whether or not Jane Smith contributed her data to the model’s training set [SSS17]. The adversary could also build on membership inference attacks to extract training data by repeatedly guessing possible training points until they result in a sufficiently strong membership signal from the model’s prediction [CTW20]. In many instances, the model itself may be represented by a few of the data samples (e.g., Support Vector Machine in its dual form).</p>
<p>A common misconception is that if a model generalizes (i.e., performs well on the test examples), then it preserves privacy. As mentioned earlier, this is far from being true. One of the main reasons being that generalization is an average case behavior of a model (over the distribution of data samples), whereas privacy must be provided for everyone, including outliers (which may deviate from our distributional assumptions).</p>
<p>Over the years, researchers have proposed various approaches towards protecting privacy in learning algorithms (k-anonymity [SS98], l-diversity [MKG07], m-invariance [XT07], t-closeness [LLV07] etc.). Unfortunately, [GKS08] all these approaches are vulnerable to what are called composition attacks, that use auxiliary information to violate the privacy protection. Famously, this strategy allowed researchers to de-anonymize part of a movie ratings dataset released to participants of the Netflix Prize when the individuals had also shared their movie ratings publicly on the Internet Movie Database (IMDb) [NS08]. If Jane Smith had assigned the same ratings to movies A, B and C in the Netflix Prize dataset and publicly on IMDb at similar times, then researchers could link data corresponding to Jane across both datasets. This would in turn give them the means to recover ratings that were included in the Netflix Prize but not on IMDb. This example shows how difficult it is to define and guarantee privacy because it is hard to estimate the scope of knowledge—about individuals—available to adversaries. While the dataset released by Netflix has since been taken down, it is difficult to ensure that all of its copies have been deleted. In recent years, data sample instance encoding based methods like InstaHide [HSL20], and NeuraCrypt [YEO21] have been demonstrated to be vulnerable to such composition attacks as well.</p>
<p>As a result, the research community has converged on differential privacy [DMNS06], which provides the following semantic guarantee, as opposed to ad-hoc approaches: An adversary learns almost the same information about an individual whether or not they are present in or absent from the training data set. In particular, it provides a condition on the algorithm, independent from who might be attacking it, or the specifics of the data set instantiation. Put another way, differential privacy is a framework for evaluating the guarantees provided by a system that was designed to protect privacy. Such systems can be applied directly to “raw” data which potentially still contains sensitive information, altogether removing the need for procedures that sanitize or anonymize data and are prone to the failures described previously. That said, minimizing data collection in the first place remains a good practice to limit other forms of privacy risk.</p>
<h2 id="designing-private-machine-learning-algorithms-via-differential-privacy">Designing Private Machine Learning Algorithms via Differential Privacy</h2>
<p>Differential privacy [DMNS06] is a semantic notion of privacy that addresses a lot of the limitations of previous approaches like k-anonymity. The basic idea is to randomize part of the mechanism’s behavior to provide privacy. In our case, the mechanism considered is a learning algorithm, but the differential privacy framework can be applied to study any algorithm.</p>
<p>The intuition for why we introduce randomness into the learning algorithm is that it obscures the contribution of an individual, but does not obscure important statistical patterns. Without randomness, we would be able to ask questions like: “What parameters does the learning algorithm choose when we train it on this specific dataset?” With randomness in the learning algorithm, we instead ask questions like: “What is the probability that the learning algorithm will choose parameters in this set of possible parameters, when we train it on this specific dataset?”</p>
<p>We use a version of differential privacy which requires (in our use case of machine learning) that the probability of learning any particular set of parameters stays roughly the same if we change a single data record in the training set. A data record can be a single training example from an individual, or the collection of all the training examples provided by an individual. The former is often referred to as example level/item level privacy, and the latter is referred to as user level differential privacy. While user level privacy provides stronger semantics, it may be harder to achieve. For a more thorough discussion about the taxonomy of these notions, see [DNPR10, JTT18, HR12, HR13]. In this document, for the ease of exposition of the technical results, we focus on the example level notion. This could mean to add a training example, remove a training example, or change the values within one training example. The intuition is that if a single patient (Jane Smith) does not affect the outcome of learning much, then that patient’s records cannot be memorized and her privacy is respected. In the rest of this post, how much a single record can affect the outcome of learning is called the sensitivity of an algorithm.</p>
<p>The guarantee of differential privacy is that the adversary is not able to distinguish the answers produced by the randomized algorithm based on the data of two of the three users from the answers returned by the same algorithm based on the data of all three users. We also refer to the degree of indistinguishability as the privacy loss. Smaller privacy loss corresponds to a stronger privacy guarantee.</p>
<p>It is often thought that privacy is a fundamental bottleneck in obtaining good prediction accuracy/generalization for machine learning algorithms. In fact, recent research has shown that in many instances it actually helps in designing algorithms with strong generalization ability. Some of the examples where DP has resulted in designing better learning algorithms are Online linear predictions [KV05] and online PCA [DTTZ13]. Notably, [DFH15] formally showed that generalization for any DP learning algorithm comes for free. More concretely, if a DP learning algorithm has good training accuracy, it is guaranteed to have good test accuracy.
This is true because differential privacy itself acts as a very strong form of regularization.</p>
<p>One might argue that the generalization guarantee which a DP algorithm can achieve may be sub-par to that of its non-private baselines. For a large class of learning tasks, one can show that asymptotically DP does not introduce any further error beyond the inherent statistical error [SSTT21]. [ACG16,BFTT19] highlights that in the presence of enough data, a DP algorithm can get arbitrarily close to the inherent statistical error, even under strong privacy parameters.</p>
<h2 id="private-empirical-risk-minimization">Private Empirical Risk Minimization</h2>
<p>Before we go into the design of specific differentially private learning algorithms, we first formalize the problem setup, and standardize some notation. Consider a training data set \(D={(x_1,y_1),…,(x_n,y_n)}\) drawn i.i.d. from some fixed (unknown) distribution \(\Pi\), with the feature vector being \(x_i\) and label/response being \(y_i\). We define the training loss at any model \(\theta\) as \(L_{train} (\theta, D) = \frac{1}{n} \sum_{i=1}^{n} l(\theta; (x_i, y_i))\), and the corresponding test loss as \(L_{test} (\theta) = E_{(x,y) \sim \Pi} l(\theta;(x,y)) \).
We will design DP algorithms to output models that approximately minimize the test loss while having access only to the training loss.</p>
<p>In the literature, there are a variety of approaches towards designing these DP learning algorithms [CMS11, KST12, BST14, PAE16, BTT18]. One can categorize them broadly as: i) algorithms that assume that the individual loss function \(l(\theta;\cdot) \) is convex in the model parameter to ensure differential privacy, ii) algorithms that are differentially private even when the loss function is non-convex in nature (e.g., deep learning models), and iii) model agnostic algorithms, that do not require any information about the representation of the model \(\theta\), or the loss function \(l(\theta;\cdot) \). In our current discussion, we will only focus on designing algorithms for (ii), and (iii). This is because it turns out that the best known algorithms for (ii) are already competitive to algorithms that are specific for (i) [INS19].</p>
<h2 id="private-algorithms-for-training-deep-learning-models">Private Algorithms for Training Deep Learning Models</h2>
<p>The first approach, due to SCS13, BST14, and ACG16, is named differentially private stochastic gradient descent (DP-SGD). It proposes to modify the model updates computed by the most common optimizer used in deep learning: stochastic gradient descent (SGD). Typically, stochastic gradient descent trains iteratively. At each iteration, a small number of training examples (a “minibatch”) are sampled from the training set. The optimizer computes the average model error on these examples, and then differentiates this average error with respect to each of the model parameters to obtain a gradient vector. Finally, the model parameters (\(\theta_t\)) are updated by subtracting this gradient (\(\nabla_t\)) multiplied by a small constant \(\eta\) (the learning rate controls how quickly the optimizer updates the model’s parameters). At a high level, two modifications are made by DP-SGD to obtain differential privacy: gradients, which are computed on a per-example basis (rather than averaged over multiple examples), are first clipped to control their sensitivity, and, second, spherical Gaussian noise \(b_t\) is added to their sum to obtain the indistinguishability needed for DP. Succinctly, the update step can be written as follows: \(\theta_{t+1} \leftarrow \theta_t - \eta \cdot (\nabla_t + b_t)\).</p>
<p>Let us take the example of a hospital training a model to predict whether patients will be readmitted after being released. To train the model, the hospital uses information from patient records, such as demographic variables and admission variables (e.g., age, ethnicity, insurance type, type of Intensive Care Unit admitted to) but also time-varying vitals and labs (e.g., heart rate, blood pressure, white blood cell counts) [JPS16]. The modifications made by DP-SGD ensure that if (1) Jane Smith’s individual patient record contained unusual features, e.g., her insurance provider was uncommon for people of her age or her heart rate followed an unusual pattern, the resulting signal will have a bounded impact on our model updates, and (2) the model’s final parameters would be essentially identical should Jane Smith have chosen to not contribute (i.e., opt-out) her patient record to the training set. Stronger differential privacy is achieved when one is able to introduce more noise (i.e., sample noise with larger standard deviation) and train for as few iterations as possible.</p>
<p>Two main components in the above DP-SGD algorithm that distinguishes itself from traditional SGD are: i) per-example clipping and ii) Gaussian noise addition. In addition, for the analysis to hold, DP-SGD requires that sub-sampling of mini batches is uniform at random from the training data set. While this is not a requirement of DP-SGD per se, in practice many implementations of SGD do not satisfy this requirement and instead analyze different permutations of the data at each epoch of training.</p>
<p>While gradient clipping is common in deep learning, often used as a form of regularization, it differs from that in DP-SGD as follows: The average gradient over the minibatch is clipped, as opposed to clipping the gradient of individual examples (i.e., \(l(\theta_t;(x,y)) \) before averaging. It is an ongoing research direction to both understand the effect of per-example clipping in DP-SGD in model training [SSTT21], and also effective ways to mitigate its impact both in terms of accuracy [PTS21], and training time [ZHS19].</p>
<p>In standard stochastic gradient descent, subsampling is usually used either as a way to speed up the training process [CAR16], or as a a form of regularization [RCR15]. In DP-SGD, the randomness in the subsampling of the minibatch is used to guarantee DP. The technical component for this sort of privacy analysis is called privacy amplification by subsampling [KLNRS08,BBG18]. Since the sampling randomness is used to guarantee DP, it is crucial that the uniformity in the sampling step is of cryptographic strength. Another, (possibly) counterintuitive feature of DP-SGD is that for best privacy/utility trade-off it is in general better to have larger batch sizes. In fact, full-batch DP-gradient descent may provide the best privacy/utility trade-offs, albeit at the expense of computational feasibility.</p>
<p>For a fixed DP guarantee, the magnitude of the Gaussian noise that gets added to the gradient updates in each step in DP-SGD is proportional to \(\sqrt{the\ number\ of\ steps}\) the model is trained for. As a result, it is important to tune the number of training steps for best privacy/utility trade-offs.</p>
<p>In the <a href="https://github.com/tensorflow/privacy/blob/master/tutorials/mnist_dpsgd_tutorial.py">following tutorial</a>, we provide a small code snippet to train a model with DP-SGD.</p>
<h2 id="model-agnostic-private-learning">Model Agnostic Private Learning</h2>
<p>The Sample and Aggregate framework [NRS07] is a generic method to add differential privacy to a non-private algorithm without caring about the internal workings of it, a.k.a. model agnostic. In the context of machine learning, one can state the main idea as follows: Consider a multi-class classification problem. Take the training data, and split into k disjoint subsets of equal size. Train independent models \(\theta_1, \theta_2, …, \theta_k \) on the disjoint subsets. In order to predict on an test example x, first, compute a private histogram over the set of k predictions \(\theta_1(x), \theta_2(x), …, \theta_k(x) \). Then, select and output the bin in the histogram based on the highest count, after adding a small amount of Laplace/Gaussian noise to the counts. In the context of DP learning, this particular approach was used in two different lines of work: i) PATE [PAE16], and ii) Model agnostic private learning [BTT18]. While the latter focussed on obtaining theoretical privacy/utility trade-offs for a class of learning tasks (e.g., agnostic PAC learning), the PATE approach focuses on practical deployment. Both these lines of work make one common observation. If the predictions from \(\theta_1(x), \theta_2(x), …, \theta_k(x) \) are fairly consistent, then the privacy cost in terms of DP is very small. Hence, one can run a large number of prediction queries, without violating DP constraints. In the following, we describe the PATE approach in detail.</p>
<p>The private aggregation of teacher ensembles (PATE) demonstrated in particular that this approach allows one to learn deep neural networks with differential privacy. It proposes to have an ensemble of models trained without privacy predict with differential privacy by having these models predict in aggregate rather than revealing their individual predictions. In PATE, we start by partitioning the private dataset into smaller subsets of data. These subsets are partitions, so there is no overlap between the data included in any pair of partitions. If Jane Smith’s record was in our private dataset, then it is included in one of the partitions only. That is, only one of the teachers has analyzed Jane Smith’s record during training. We train a ML model, called a teacher, on each of these partitions. We now have an ensemble of teacher models that were trained independently, but without any guarantees of privacy. How do we use this ensemble to make predictions that respect privacy? In PATE, we add noise while aggregating the predictions made individually by each teacher to form a single common prediction. We count the number of teachers who voted for each class, and then perturb that count by adding random noise sampled from the Laplace or Gaussian distribution. Each label predicted by the noisy aggregation mechanism comes with rigorous differential privacy guarantees that bound the privacy budget spent to label that input. Again, stronger differential privacy is achieved when we are able to introduce more noise in the aggregation and are able to answer as few queries as possible. Let us now come back to our running example. Imagine that we’d like to use the output of PATE to know if Jane likes a particular movie. The only teacher trained on the partition containing Jane Smith’s data—has now learned that a record similar to Jane’s is characteristic of an individual who likes similar movies, and as a consequence changes its prediction on a test input which is similar to Jane’s to predict the movie rating assigned by Jane. However, because the teacher only contributes a single vote to the aggregation, and that the aggregation injects noise, we won’t be able to know whether the teacher changed its prediction to the movie rating assigned by Jane because the teacher indeed trained on Jane’s data or because the noise injected during the aggregation “flipped” that teacher’s vote. The random noise added to vote counts prevents the outcome of aggregation from reflecting the votes of any individual teachers to protect privacy.</p>
<h2 id="practically-deploying-differential-privacy-in-machine-learning">Practically deploying differential privacy in machine learning</h2>
<p>The two approaches we introduced have the advantage of being conceptually simple to understand. Fortunately, there also exist several open-source implementations of these approaches. For instance, DP-SGD is implemented in TensorFlow Privacy, Objax, and Opacus. This means that one is able to take an existing TensorFlow, JAX, or PyTorch pipeline for training a machine learning model and replace a non-private optimizer with DP-SGD. An example implementation of PATE is also available in TensorFlow Privacy. So what are the concrete potential obstacles to deploying machine learning with differential privacy?</p>
<p>The first obstacle is the accuracy of privacy-preserving models. Datasets are often sampled from distribution with heavy tails. For instance, in a medical application, there are typically (and fortunately) fewer patients with a given medical condition than patients without that condition. This means that there are fewer training examples for patients with each medical condition to learn from. Because differential privacy prevents us from learning patterns which are not found generally across the training data, it limits our ability to learn from these patients for which we have very few examples of [SPG]. More generally, there is often a trade-off between the accuracy of a model and the strength of the differential privacy guarantee it was trained with: the smaller the privacy budget is, the larger the impact on accuracy typically is. That said, this tension is not always inevitable and there are instances where privacy and accuracy are synergical because differential privacy implies generalization [DFH15] (but not vice versa).</p>
<p>The second obstacle to deploying differentially private machine learning can be the computational overhead. For instance, in DP-SGD one must compute per-example gradients rather than average gradients. This often means that optimizations implemented in machine learning frameworks to exploit matrix algebra supported by underlying hardware accelerators (e.g., GPUs) are harder to take advantage of. In another example, PATE requires that one train multiple models (the teachers) rather than a single model so this can also introduce overhead in the training procedure. Fortunately, this cost is mostly mitigated in recent implementations of private learning algorithms, in particular in Objax and Opacus.</p>
<p>The third obstacle to deploying differential privacy, in machine learning but more generally in any form of data analysis, is the choice of privacy budget. The smaller the budget, the stronger the guarantee is. This means one can compare two analyses and say which one is “more private”. However, this also means that it is unclear what is “small enough” of a privacy budget. This is particularly problematic given that applications of differential privacy to machine learning often require a privacy budget that provides little theoretical guarantees in order to train a model whose accuracy is large enough to warrant a useful deployment. Thus, it may be interesting for practitioners to evaluate the privacy of their machine learning algorithm by attacking it themselves. Whereas the theoretical analysis of an algorithm’s differential privacy guarantees provides a worst-case guarantee limiting how much private information the algorithm can leak against any adversary, implementing a specific attack can be useful to know how successful a particular adversary or class of adversaries would be. This helps interpret the theoretical guarantee but may not be treated as a direct substitute for it. Open-source implementations of such attacks are increasingly available: e.g., for membership inference <a href="https://github.com/tensorflow/privacy/tree/master/tensorflow_privacy/privacy/privacy_tests/membership_inference_attack">here</a> and <a href="https://github.com/cchoquette/membership-inference">here</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In the above, we discussed some of the algorithmic approaches towards differentially private model training which have been effective both in theoretical and practical settings. Since it is a rapidly growing field, we could not cover all the important aspects of the research space. Some prominent ones include: i) Choice of the best hyperparameters in the training of DP models.In order to ensure that the overall algorithm preserves differential privacy, one needs to ensure that the choice of hyperparameters itself preserves DP. Recent research has provided algorithms for selecting the best hyperparameters in a differentially private fashion [LT19]. ii) Choice of network architecture: it is not always true that the best known model architectures for non-private model training are indeed the best for training with differential privacy. In particular, we know that the number of model parameters may have adverse effects on the privacy/utility trade-offs [BST14]. Hence, choosing the right model architecture is important for providing a good privacy/utility trade-off [PTS21]. (iii) Training in the federated/distributed setting: in the above exposition, we assumed that the training data lies in a single centralized location. However, in settings like Federated Learning (FL) [MMRHA17], the data records can be highly distributed, e.g., across various mobile devices. Running DP-SGD in the FL setting, which is required for FL to provide privacy guarantees for the training data, raises a series of challenges [KMA19] which are often facilitated by distributed private learning algorithms designed specifically for FL settings [BKMTT20, KMSTTZ21]. Some of the specific challenges in the context of FL include, limited and non-uniform availability of clients (holding individual data records) and unknown (and variable) size of the training data [BKMTT18]. On the other hand, PATE style algorithms lend themselves naturally to the distributed setting once combined with existing cryptographic primitives, as demonstrated by the CaPC protocol [CDD21]. It is an active area of research to address these above challenges.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>The authors would like to thank Thomas Steinke and Andreas Terzis for detailed feedback and edit suggestions. Parts of this blog post previously appeared on <a href="www.cleverhans.io">www.cleverhans.io</a>.</p>
<h2 id="citations">Citations</h2>
<p>[ACG16] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016, October). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 308-318). ACM.</p>
<p>[BBG18] Balle, B., Barthe, G., & Gaboardi, M. (2018). Privacy amplification by subsampling: Tight analyses via couplings and divergences. arXiv preprint arXiv:1807.01647.</p>
<p>[BKMTT18] Balle, B., Kairouz P., McMahan M., Thakkar O. & Thakurta A. (2020). Privacy amplification via random check-ins. In NeurIPS.</p>
<p>[MMRHA17] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics (pp. 1273-1282). PMLR.</p>
<p>[KMSTTZ18] Kairouz P., McMahan M., Song S., Thakkar O., Thakurta A., & Xu Z. (2021). Practical and Private (Deep) Learning without Sampling or Shuffling. In ICML.</p>
<p>[BFTT19] Bassily, R., Feldman, V., Talwar, K., & Thakurta, A. Private Stochastic Convex Optimization with Optimal Rates. In NeurIPS 2019.</p>
<p>[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science.</p>
<p>[BTT18] Bassily, R., Thakurta, A. G., & Thakkar, O. D. (2018). Model-agnostic private learning. Advances in Neural Information Processing Systems.</p>
<p>[CDD21] Choquette-Choo, C. A., Dullerud, N., Dziedzic, A., Zhang, Y., Jha, S., Papernot, N., & Wang, X. (2021). CaPC Learning: Confidential and Private Collaborative Learning. arXiv preprint arXiv:2102.05188.</p>
<p>[CMS11] Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3).</p>
<p>[CTW20] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., … & Raffel, C. (2020). Extracting training data from large language models. arXiv preprint arXiv:2012.07805.</p>
<p>[DFH15] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). Generalization in adaptive data analysis and holdout reuse. arXiv preprint arXiv:1506.02629.</p>
<p>[DMNS06] Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006, March). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (pp. 265-284). Springer, Berlin, Heidelberg.</p>
<p>[DNPR10] Dwork, C., Naor, M., Pitassi, T., & Rothblum, G. N. (2010, June). Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing (pp. 715-724).</p>
<p>[DTTZ14] Dwork, C., Talwar, K., Thakurta, A., & Zhang, L. (2014, May). Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (pp. 11-20).</p>
<p>[HSL20] Huang, Y., Song, Z., Li, K., & Arora, S. (2020, November). Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning (pp. 4507-4518). PMLR.</p>
<p>[HR12] Hardt, M., & Roth, A. (2012, May). Beating randomized response on incoherent matrices. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (pp. 1255-1268).</p>
<p>[HR13] Hardt, M., & Roth, A. (2013, June). Beyond worst-case analysis in private singular vector computation. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (pp. 331-340).</p>
<p>[JPS16] Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35</p>
<p>[JTT18] Jain, P., Thakkar, O. D., & Thakurta, A. (2018, July). Differentially private matrix completion revisited. In International Conference on Machine Learning (pp. 2215-2224). PMLR.</p>
<p>[INS19] Iyengar, R., Near, J. P., Song, D., Thakkar, O., Thakurta, A., & Wang, L. (2019, May). Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP) (pp. 299-316). IEEE.</p>
<p>[KST12] Kifer, D., Smith, A., & Thakurta, A. (2012, June). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory (pp. 25-1). JMLR Workshop and Conference Proceedings.</p>
<p>[KMA19] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., … & Zhao, S. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.</p>
<p>[KV05] Kalai, Adam, and Santosh Vempala. “Efficient algorithms for online decision problems.” Journal of Computer and System Sciences 71.3 (2005): 291-307.</p>
<p>[KLNRS08] Raskhodnikova, S., Smith, A., Lee, H. K., Nissim, K., & Kasiviswanathan, S. P. (2008). What can we learn privately. In Proceedings of the 54th Annual Symposium on Foundations of Computer Science (pp. 531-540).</p>
<p>[LLV07] Li, N., Li, T., & Venkatasubramanian, S. (2007, April). t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering (pp. 106-115). IEEE.</p>
<p>[LT19] Liu, J., & Talwar, K. (2019, June). Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (pp. 298-309).</p>
<p>[M17] Mironov, I. (2017, August). Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th (pp. 263-275). IEEE.</p>
<p>[MKG07] Machanavajjhala, Ashwin; Kifer, Daniel; Gehrke, Johannes; Venkitasubramaniam, Muthuramakrishnan (March 2007). “L-diversity: Privacy Beyond K-anonymity”. ACM Transactions on Knowledge Discovery from Data.</p>
<p>[NRS07] Nissim, K., Raskhodnikova, S., & Smith, A. (2007, June). Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (pp. 75-84).</p>
<p>[NS08] Narayanan, A., & Shmatikov, V. (2008, May). Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on (pp. 111-125). IEEE.</p>
<p>[PAE16] Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., & Talwar, K. (2016). Semi-supervised knowledge transfer for deep learning from private training data. ICLR 2017.</p>
<p>[PTS21] Papernot, N., Thakurta, A., Song, S., Chien, S., & Erlingsson, U. (2020). Tempered sigmoid activations for deep learning with differential privacy. AAAI 2021.</p>
<p>[RCR15] Rudi, A., Camoriano, R., & Rosasco, L. (2015, December). Less is More: Nyström Computational Regularization. In NIPS (pp. 1657-1665).</p>
<p>[SCS13] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, GlobalSIP ’13, pages 245–248, Washington, DC, USA, 2013. IEEE Computer Society.</p>
<p>[SPG] Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings. Vinith Suriyakumar, Nicolas Papernot, Anna Goldenberg, Marzyeh Ghassemi. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.</p>
<p>[SS98] Samarati, Pierangela; Sweeney, Latanya (1998). “Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression” (PDF). Harvard Data Privacy Lab. Retrieved April 12, 2017</p>
<p>[SSS17] Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017, May). Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on (pp. 3-18). IEEE.</p>
<p>[SSTT21] Song, S., Thakkar, O., & Thakurta, A. (2020). Evading the Curse of Dimensionality in Unconstrained Private GLMs. In AISTATS 2021.</p>
<p>[XT07] Xiao X, Tao Y (2007) M-invariance: towards privacy preserving re-publication of dynamic datasets. In: SIGMOD conference, Beijing, China, pp 689–700</p>
<p>[YEO21] Yala, A., Esfahanizadeh, H., Oliveira, R. G. D., Duffy, K. R., Ghobadi, M., Jaakkola, T. S., … & Medard, M. (2021). NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training. arXiv preprint arXiv:2106.02484.</p>
<p>[ZHS19] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019.</p>
Nicolas PapernotAbhradeep ThakurtaMon, 25 Oct 2021 11:00:00 -0700
https://differentialprivacy.org/how-to-deploy-ml-with-dp/
https://differentialprivacy.org/how-to-deploy-ml-with-dp/One-shot DP Top-k mechanisms<p>In the last <a href="https://differentialprivacy.org/exponential-mechanism-bounded-range/"><em>blog post</em></a>, we showed that the exponential mechanism enjoys improved composition bounds over general pure DP mechanisms due to a property called <strong>bounded range</strong>. For this post, we will present another useful, and somewhat surprising, property of the exponential mechanism in its application of top-\(k\) selection.</p>
<h2 id="differentially-private-top-k-selection">Differentially Private Top-\(k\) Selection</h2>
<p>We will focus on datasets that are a vector of counts \(h = (h_1, \cdots, h_d) \in \mathbb{N}^d\), which consist of counts \(h_i\) for elements from a universe \(\mathcal{U}\) where \(|\mathcal{U}| = d\). Let’s assume that a user’s data can modify each count by at most 1, yet can change all \(d\) counts, i.e. the \(\ell_\infty\)-sensitivity is 1 and the \(\ell_0\)-sensitivity is \(d\). The task here is to return the top-\(k\) elements from the input counts in a differentially private way.</p>
<p>For top-\(1\), this is simply returning the element with the max count, and this is precisely the problem that the exponential mechanism is set up to solve. Let’s write out the exponential mechanism \(M^{(1)}: \mathbb{N}^d \to [d]\) for this instance:
\[
\mathbb{P}[M^{(1)}(h) = i] = \frac{e^{ \varepsilon h_i }}{\sum_{j \in [d] } e^{ \varepsilon h_j } }, \qquad \forall i \in [d].
\]
For those wondering why this formula omits the factor of \(1/2\) in the exponent, we are using the <a href="https://dongjs.github.io/2020/02/10/ExpMech.html">stronger result</a> of the exponential mechanism which replaces global sensitivity with the range of the loss function \(\ell(i,h) = - h_i\), which is \(1\) in this case. Recall from the last blog post that the exponential mechanism is \(\varepsilon^2/8\)-CDP.</p>
<p>Hence, to generalize this to top-\(k\) selection, we can simply iteratively apply this exponential mechanism by removing the <em>discovered</em> element from each previous round. That is, we write \(M^{(k)}: \mathbb{N}^d \to [d]^k\) as the following for any outcome \( (i_1, i_2, \cdots, i_k) \in [d]^k\),</p>
<p>\[
\mathbb{P}[M^{(k)}(h) = (i_1, i_2, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
<a name="eq:peelingEM"></a>
\[\qquad = \frac{e^{ \varepsilon h_{i_1} }}{\sum_{j \in [d] } e^{ \varepsilon h_j } } \cdot \frac{ e^{\varepsilon h_{i_2} } }{\sum_{j\in [d]\setminus \{ i_1\}} e^{\varepsilon h_j } } \cdot \cdots \cdot \frac{ e^{ \varepsilon h_{i_2} } }{\sum_{j\in [d]\setminus \{ i_1, \cdots, i_{k-1}\}} e^{ \varepsilon h_j } }.
\tag{1}
\]</p>
<p>We can then apply composition to conclude that \(M^{(k)}\) is \(k \varepsilon^2/8\)-CDP.</p>
<h2 id="gumbel-noise-and-the-exponential-mechanism">Gumbel Noise and the Exponential Mechanism</h2>
<p>As we discussed in our last post, we can implement the exponential mechanism by adding <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel</a> noise to each count and reporting the noisy max element. A Gumbel random variable \(X \sim \text{Gumbel}(\beta) \), parameterized by scale parameter \(\beta>0\), has the following density function
<a name="eq:GumbelDensity"></a>
\[
p(x;\beta) = \frac{1}{\beta} \exp\left( - x/\beta - e^{-x/\beta} \right), \qquad \forall x \in \mathbb{R}.
\tag{2}
\]</p>
<p>Hence, we can write the exponential mechanism in the following way
\[
M^{(1)}(h) = \arg\max \{ h_i + X_i : i \in [d] \}, \qquad \{X_i \} \stackrel{i.i.d.}{\sim} \text{Gumbel}(1/\varepsilon).
\]</p>
<p>We can then extend this to top-\(k\) by repeatedly adding independent Gumbel noise to each count and removing the discovered element for the next round. However, something that would significantly improve run time would be to add Gumbel noise to each count <em>once</em> and then take the elements with the top-\(k\) noisy counts. We could then add only \(d\) many noise terms, rather than \(O(d^k)\) noise terms if we were to iteratively run \(k\) different exponential mechanisms. The question is, does this one-shot top-\(k\) Gumbel noise mechanism ensure the same level of privacy?</p>
<p>Let’s denote the one-shot Gumbel mechanism as \(\tilde{M}^{(k)}\). At first glance, it does not seem like the one-shot Gumbel mechanism \(\tilde{M}^{(k)}\) should be just as private as the iterative exponential mechanism \(M^{(k)}\), but it turns out they are exactly the same mechanism! The following result is due to <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a>.</p>
<blockquote>
<p><strong>Theorem 1</strong>
For any input vector of counts \(h \in \mathbb{N}^d\), the one-shot Gumbel mechanism \(\tilde{M}^{(k)}(h)\) and iteratively applying the exponential mechanism \(M^{(k)}(h)\) are equal in distribution.</p>
</blockquote>
<p><em>Proof.</em>
Recall the distribution of the iterative exponential mechanism \(M^{(k)}(h)\) from <a href="#eq:peelingEM">(1)</a>.
Now we consider the one-shot Gumbel mechanism \(\tilde{M}^{(k)}(h)\) where we use the density of \(X \sim \) Gumbel\( (1/\varepsilon)\) from <a href="#eq:GumbelDensity">(2)</a>.<br />
\[
\mathbb{P}[\tilde{M}^{(k)}(h) = (i_1, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \int_{-\infty}^\infty p(u_1 - h_{i_1}) \int_{-\infty}^{u_1} p(u_2 - h_{i_2}) \cdots \int_{-\infty}^{u_{k-1}} p(u_k - h_k)
\]
<a name="eq:integral"></a>
\[
\qquad \qquad \cdot \prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[ X < u_k - h_j]du_k \cdots du_2 du_1.
\tag{3}
\]
Note that we have
\[
\mathbb{P}[X < y] = \exp\left( - \exp\left( -\varepsilon y \right) \right).
\]
Let’s focus on the inner integral over \(u_k\) in <a href="#eq:integral">(3)</a>.<br />
\[
\int_{-\infty}^{u_{k-1}} p(u_k - h_{i_k} )\prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[X < u_k - h_j ]du_k \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\quad = \int_{-\infty}^{u_{k-1}} \varepsilon \cdot \exp\left( - \varepsilon (u_k - h_{i_k}) - e^{ -\varepsilon (u_k - h_{i_k}) } \right) \cdot \exp\left( -e^{-\varepsilon u_k} \sum_{j \in [d] \setminus \{i_1, \cdots, i_k \} } e^{\varepsilon h_j} \right) du_k
\]
\[
\quad = \varepsilon e^{\varepsilon h_{i_k}} \int_{-\infty}^{u_{k-1}} \exp\left( -\varepsilon u_k - e^{-\varepsilon u_k} \left( e^{\varepsilon h_{i_k}} + \sum_{j \in [d] \setminus \{i_1, \cdots i_k \} } e^{\varepsilon h_j} \right) \right) du_k \qquad
\]
<a name="eq:lastLine"></a>
\[
\qquad = \varepsilon e^{\varepsilon h_{i_k}} \int_{-\infty}^{u_{k-1}} \exp\left( -\varepsilon u_k - e^{-\varepsilon u_k} \left(\sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j} \right) \right) du_k. \qquad \qquad
\tag{4}
\]
We now integrate with a \(v\)-substitution,
\[
v =e^{-\varepsilon u_{k}} \sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j}<br />
\]
\[
dv = - \varepsilon \sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j} \cdot e^{-\varepsilon u_{k}} du_{k}.
\]</p>
<p>Continuing with <a href="#eq:lastLine">(4)</a>, we get
\[
\int_{-\infty}^{u_{k-1}} p(u_k - h_{i_k} )\prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[X < u_k - h_j ]du_k \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_k} }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}} \cdot \exp\left( - e^{-\varepsilon u_{k-1}} \cdot \sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j} \right)
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_k} }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}} \cdot \prod_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \} } \mathbb{P}[X < u_{k-1} - h_j ] .
\]
Note how this line has the last term in the expression for \(M^{(k)}(h)\) in <a href="#eq:peelingEM">(1)</a>, which is independent of \(u_{k-1}\) and can hence be pulled out of the larger integral in <a href="#eq:integral">(3)</a>. By induction, we have
\[
\mathbb{P}[\tilde{M}^{(k)}(h) = (i_1, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_1} }}{\sum_{j \in [d]} e^{\varepsilon h_j}} \cdot \frac{e^{\varepsilon h_{i_2} }}{\sum_{j \in [d] \setminus \{ i_1\}} e^{\varepsilon h_j}} \cdot \cdots \cdot \frac{e^{\varepsilon h_k }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}}
\]
\[
\qquad = \mathbb{P}[M^{(k)}(h) =(i_1, \cdots, i_k)].
\] ∎</p>
<p>So that’s great! We can now run the one-shot Gumbel mechanism for top-\(k\) and still get the improved composition bounds of the exponential mechanism. In addition to this achieving better runtime, this analysis can help with proving top-\(k\) DP algorithms over a large domain universe despite giving access to only the true top-\(\bar{k}\) items and their counts where \(\bar{k} > k \), see <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a> for more details.</p>
<h2 id="report-noisy-max-for-dp-top-k">Report Noisy Max for DP Top-\(k\)</h2>
<p>We now turn to comparing this algorithm to some natural alternatives. As we discussed in the last post, there is a family of mechanisms, report noisy max (RNM) mechanisms, that ensure differential privacy for the selection problem, and hence the top-\(1\) problem. We showed that the exponential mechanism is equivalent to RNM with Gumbel noise, there is also RNM with Laplace and with Exponential noise, the last being the recently discovered <em>permute-and-flip</em> mechanism <a href="https://arxiv.org/abs/2010.12603" title="Ryan McKenna, Daniel Sheldon. Permute-and-Flip: A new mechanism for differentially private selection
. NeurIPS 2020."><strong>[MS20]</strong></a> <a href="https://arxiv.org/abs/2105.07260" title="Zeyu Ding, Daniel Kifer, Sayed M. Saghaian N. E., Thomas Steinke, Yuxin Wang, Yingtai Xiao, Danfeng Zhang. The Permute-and-Flip Mechanism is Identical to Report-Noisy-Max with Exponential Noise. 2021."><strong>[DKSSWXZ21]</strong></a>.</p>
<p>To then use RNM mechanisms for top-\(k\), we can again iteratively apply them and use composition to get the overall privacy guarantee. However, it turns out that you can also use the Laplace noise version of RNM in one-shot <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>.</p>
<p>We can compare the relative noise that is added to each count in both the Laplace and Gumbel versions. Since <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a> gives their privacy guarantee in terms of approximate \((\varepsilon,\delta )\)-DP, we will now make the comparison there. We first look at the standard deviation for Laplace \( \sigma_{\text{Lap}}\) (using Theorem 2.2 in <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>).
\[
\sigma_{\text{Lap}} = \frac{8 \sqrt{2k \ln(d/\delta)}}{\varepsilon}.
\]
Note that the one-shot Laplace mechanism returns counts as well as the indices of the top-\(k\), both of which use Laplace noise with standard deviation \(\sigma_{\text{Lap}}\), so we will also include Laplace noise to the discovered elements in the Gumbel version. That is, we add Gumbel noise with scale \(\sqrt{k}/\varepsilon’\) for the discovery portion and Laplace noise with scale \(2\sqrt{k}/\varepsilon’\) for obtaining their counts, resulting in standard deviation noise \(\sigma_{\text{Gumb}}’\) and \(\sigma_{\text{Lap}}’\), respectively.<br />
\[
\sigma_{\text{Gumb}}’ = \frac{\pi \sqrt{k} }{\sqrt{6}\varepsilon’}, \qquad
\sigma_{\text{Lap}}’ = \frac{2\sqrt{2k}}{\varepsilon’}.
\]
Recall that adding this scale of Gumbel noise and Laplace noise will ensure \(\tfrac{\varepsilon’^2}{8} \)-CDP each, so combining will ensure \(\tfrac{\varepsilon’^2}{4}\)-CDP. We could also use Gaussian noise to return the counts since we are using CDP, but we will analyze it with Laplace noise for comparison. To ensure \((\varepsilon,\delta)\)-DP, we use the CDP to DP conversion from Lemma 3.5 in <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a> and solve for \(\varepsilon’\). Hence, we get for any \(\delta>0\)
\[
\varepsilon’^2/4 = \left( \sqrt{\ln(1/\delta) + \varepsilon} - \sqrt{\ln(1/\delta)} \right)^2
\]
\[ \implies \varepsilon’ = 2 \sqrt{\ln(1/\delta)} \left( \sqrt{1 + \tfrac{\varepsilon}{\ln(1/\delta)}} - 1 \right).
\]</p>
<p>Let’s consider a typical privacy setting where \(\varepsilon < \ln(1/\delta)\), and use the inequality \(\sqrt{1+x} \geq 1 + x/4\) for \(0<x<1\). Here is a short proof of this inequality:
\[
(1 + x/4)^2 = 1 + x/2 + x^2/16 \leq 1 + x/2 + x/2 = 1 + x.<br />
\]
Note that the privacy guarantee for one-shot Laplace noise only holds when \(\varepsilon < 0.2\) and \(\delta < 0.05\) as stated in Theorem 2.2 in <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>. In this case, we have
\[
\varepsilon’ \geq 2 \sqrt{\ln(1/\delta)} \left( 1 + 1/4 \cdot \tfrac{\varepsilon}{\ln(1/\delta)} - 1 \right) = 1/2 \cdot \tfrac{\varepsilon}{\sqrt{\ln(1/\delta)}}.
\]
Plugging \(\varepsilon’\) into the standard deviation of Gumbel and Laplace, we get
\[
\sigma_{\text{Gumb}}’ \leq \frac{2\pi\sqrt{k \ln(1/\delta)}}{\sqrt{6}\cdot \varepsilon}, \qquad \sigma_{\text{Lap}}’ \leq \frac{4\sqrt{2k\ln(1/\delta)}}{\varepsilon}.
\]</p>
<p>Putting this together, we can show that we add significantly less noise for the discovery and releasing noisy count phases,
\[
\sigma_{\text{Gumb}}’\leq \sigma_{\text{Lap}}/4 ,\qquad \sigma_{\text{Lap}}’ \leq \sigma_{\text{Lap}}/2.
\]
Note that these bounds can be improved further with similar analysis.</p>
<p>Although it has not been studied yet whether the permute-and-flip mechanism \(M_{\text{PF}} \) can also ensure DP in one shot by using Exponential noise, we briefly discuss whether it can be bounded range for a similar parameter as the Exponential Mechanism, and hence achieve similar composition bounds. Consider running permute-and-flip on two items \(\{1,2 \}\) with a monotonic quality score \(q: \mathcal{X} \times \{1,2 \} \to \mathbb{R} \) whose sensitivity is 1. Let \(x, x’ \in \mathcal{X} \) be neighbors where
\[
q(x,1) = q(x,2) = 0
\]
\[
q(x’,1) = 0 , \quad q(x’,2) = 1.
\]
Hence, permute-and-flip will return outcome \(1\) or \(2\) with half probability each on dataset \(x\), while with dataset \(x’\) outcome \(1\) occurs with probability \(1/2 \cdot e^{-\varepsilon}\) and outcome \(2\) occurs with probability \( 1/2 + 1/2 \cdot (1 - e^{-\varepsilon})\). We can then compute the bounded range parameter \(\alpha\) as
\[
\frac{\mathbb{P}[M_{\text{PF}}(x’) = 2 ] }{\mathbb{P}[M_{\text{PF}}(x) = 2 ]}\leq e^{\alpha}\frac{\mathbb{P}[M_{\text{PF}}(x’) = 1 ]}{\mathbb{P}[M_{\text{PF}}(x) = 1 ]} \implies \alpha \geq \varepsilon + \ln(2 - e^{-\varepsilon} ).
\]
Note that with \(\varepsilon \gg 1\), we get \(\alpha \) close to \(\varepsilon\), which would be the same bounded range parameter as the exponential mechanism. However, with \(\varepsilon< 1\), we get \(\alpha\) close to \(2\varepsilon\). This example provides a lower bound on the BR parameter for permute-and-flip.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We have looked at the top-\(k\) selection problem subject to differential privacy and although there are many different mechanisms to use, the exponential mechanism stands out for several reasons:</p>
<ol>
<li>The exponential mechanism is \(\varepsilon\)-DP and \( \varepsilon^2/8\)-CDP and hence gets improved composition.</li>
<li>Iteratively applying the exponential mechanism for top-\(k\) can be implemented by adding Gumbel noise to each count and returning the elements with the top-\(k\) noisy counts in one-shot.</li>
<li>The one-shot Gumbel mechanism returns a ranked list of \(k\) elements, rather than a set of \(k\) elements.</li>
</ol>
David DurfeeRyan RogersMon, 09 Aug 2021 10:00:00 -0700
https://differentialprivacy.org/one-shot-top-k/
https://differentialprivacy.org/one-shot-top-k/A Better Privacy Analysis of the Exponential Mechanism<p>A basic and frequent task in data analysis is <em>selection</em> – given a set of options \(\mathcal{Y}\), output the (approximately) best one, where “best” is defined by some loss function \(\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}\) and a dataset \(x \in \mathcal{X}^n\). That is, we want to output some \(y \in \mathcal{Y}\) that approximately minimizes \(\ell(y,x)\). Naturally, we are interested in <em>private selection</em> – i.e., the output should be differentially private in terms of the dataset \(x\).
This post discusses algorithms for private selection – in particular, we give an improved privacy analysis of the popular exponential mechanism.</p>
<h2 id="the-exponential-mechanism">The Exponential Mechanism</h2>
<p>The most well-known algorithm for private selection is the <a href="https://en.wikipedia.org/wiki/Exponential_mechanism_(differential_privacy)"><em>exponential mechanism</em></a> <a href="https://doi.org/10.1109/FOCS.2007.66" title="Frank McSherry, Kunal Talwar. Mechanism Design via Differential Privacy. FOCS 2007."><strong>[MT07]</strong></a>. The exponential mechanism \(M : \mathcal{X}^n \to \mathcal{Y} \) is a randomized algorithm given by \[\forall x \in \mathcal{X}^n ~ \forall y \in \mathcal{Y} ~~~~~ \mathbb{P}[M(x) = y] = \frac{\exp(-\frac{\varepsilon}{2\Delta} \ell(y,x))}{\sum_{y’ \in \mathcal{Y}} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x)) }, \tag{1}\] where \(\Delta\) is the sensitivity of the loss function \(\ell\) given by \[\Delta = \sup_{x,x’ \in \mathcal{X}^n : d(x,x’) \le 1} \max_{y\in\mathcal{Y}} |\ell(y,x) - \ell(y,x’)|,\tag{2}\] where the supremum is taken over all datasets \(x\) and \(x’\) differing on the data of a single individual (which we denote by \(d(x,x’)\le 1\)).</p>
<p>In terms of utility, we can easily show that <a href="https://arxiv.org/abs/1511.02513" title="Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, Jonathan Ullman. Algorithmic Stability for Adaptive Data Analysis. STOC 2016."><strong>[BNSSSU16]</strong></a> \[\mathbb{E}[\ell(M(x),x)] \le \min_{y \in \mathcal{Y}} \ell(y,x) + \frac{2\Delta}{\varepsilon} \log |\mathcal{Y}|\] for all \(x \in \mathcal{X}^n\) (and we can also give high probability bounds).</p>
<p>It is easy to show that the exponential mechanism satisfies \(\varepsilon\)-differential privacy.
But there is more to this story! We’re going to look at a more refined privacy analysis.</p>
<h2 id="bounded-range">Bounded Range</h2>
<p>The privacy guarantee of the exponential mechanism is more precisely characterized by <em>bounded range</em>. This was observed and defined by David Durfee and Ryan Rogers <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a> and further analyzed later <a href="https://arxiv.org/abs/1909.13830" title="Jinshuo Dong, David Durfee, Ryan Rogers. Optimal Differential Privacy Composition for Exponential Mechanisms. ICML 2020."><strong>[DDR20]</strong></a>.</p>
<blockquote>
<p><strong>Definition 1 (Bounded Range).</strong><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\eta\)-bounded range if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, there exists some \(t \in \mathbb{R}\) such that \[\forall y \in \mathcal{Y} ~~~~~ \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\right) \in [t, t+\eta].\] Here \(t\) may depend on the pair of input datasets \(x,x’\), but not on the output \(y\).</p>
</blockquote>
<p>To interpret this definition, we <a href="/flavoursofdelta/">recall the definition of the privacy loss random variable</a>: Define \(f : \mathcal{Y} \to \mathbb{R}\) by \[f(y) = \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\right).\] Then the privacy loss random variable \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x’))\) is given by \(Z = f(M(x))\).</p>
<p>Pure \(\varepsilon\)-differential privacy is equivalent to demanding that the privacy loss is bounded by \(\varepsilon\) – i.e., \(\mathbb{P}[|Z|\le\varepsilon]=1\). Approximate \((\varepsilon,\delta)\)-differential privacy is, roughly, equivalent to demanding that \(\mathbb{P}[Z\le\varepsilon]\ge1-\delta\).<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>Now \(\eta\)-bounded range is simply demanding that the privacy loss \(Z\) is supported on some interval of length \(\eta\). This interval \([t,t+\eta]\) may depend on the pair \(x,x’\).</p>
<p>Bounded range and pure differential privacy are equivalent up to a factor of 2 in the parameters:</p>
<blockquote>
<p><strong>Lemma 2 (Bounded Range versus Pure Differential Privacy).</strong></p>
<ul>
<li>\(\varepsilon\)-differential privacy implies \(\eta\)-bounded range with \(\eta \le 2\varepsilon\).</li>
<li>\(\eta\)-bounded range implies \(\varepsilon\)-differential privacy with \(\varepsilon \le \eta\).</li>
</ul>
</blockquote>
<p><em>Proof.</em> The first part of the equivalence follows from the fact that pure \(\varepsilon\)-differential privacy implies the privacy loss is supported on the interval \([-\varepsilon,\varepsilon]\). Thus, if we set \(t=-\varepsilon\) and \(\eta=2\varepsilon\), then \([t,t+\eta] = [-\varepsilon,\varepsilon]\).
The second part follows from the fact that the support of the privacy loss \([t,t+\eta]\) must straddle \(0\). That is, the privacy loss cannot be always positive nor always negative, so \(0 \in [t,t+\eta]\) and, hence, \([t,t+\eta] \subseteq [-\eta,\eta]\). Otherwise \(\forall y ~ f(y)>0\) or \(\forall y ~ f(y)<0\) would imply \(\forall y ~ \mathbb{P}[M(x)=y]>\mathbb{P}[M(x’)=y]\) or \(\forall y ~ \mathbb{P}[M(x)=y]<\mathbb{P}[M(x’)=y]\), contradicting the fact that \(\sum_{y \in \mathcal{Y}} \mathbb{P}[M(x)=y] = 1\) and \(\sum_{y \in \mathcal{Y}} \mathbb{P}[M(x’)=y] = 1\). ∎</p>
<p>OK, back to the exponential mechanism:</p>
<blockquote>
<p><strong>Lemma 3 (The Exponential Mechanism is Bounded Range).</strong>
The exponential mechanism (given in Equation 1 above) satisfies \(\varepsilon\)-bounded range .<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
</blockquote>
<p><em>Proof.</em>
We have \[e^{f(y)} = \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} = \frac{\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x))}{\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x’))} \cdot \frac{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x’))}{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x))}.\]
Setting \(t = \log\left(\frac{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x’))}{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x))}\right) - \frac{\varepsilon}{2}\), we have \[ f(y) = \frac{\varepsilon}{2\Delta} (\ell(y,x’)-\ell(y,x)+\Delta) + t.\]
By the definition of sensitivity (given in Equation 2), we have \( 0 \le \ell(y,x’)-\ell(y,x)+\Delta \le 2\Delta\), whence \(t \le f(y) \le t + \varepsilon\). ∎</p>
<p>Bounded range is not really a useful privacy definition on its own. Thus we’re going to relate it to a relaxed version of differential privacy next.</p>
<h2 id="concentrated-differential-privacy">Concentrated Differential Privacy</h2>
<p>Concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a> and its variants <a href="https://arxiv.org/abs/1603.01887" title="Cynthia Dwork, Guy N. Rothblum. Concentrated Differential Privacy. 2016."><strong>[DR16]</strong></a> <a href="https://arxiv.org/abs/1702.07476" title="Ilya Mironov. Rényi Differential Privacy. CCS 2017."><strong>[M17]</strong></a> are relaxations of pure differential privacy with many nice properties. In particular, it composes very cleanly.</p>
<blockquote>
<p><strong>Definition 4 (Concentrated Differential Privacy).</strong>
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-concentrated differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual,
\[\forall \lambda > 0 ~~~~~ \mathbb{E}[\exp( \lambda Z)] \le \exp(\lambda(\lambda+1)\rho),\tag{3}\]
where \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x’))\) is the privacy loss random variable.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
</blockquote>
<p>Intuitively, concentrated differential privacy requires that the privacy loss is subgaussian. Specifically, the bound on the moment generating function of \(\rho\)-concentrated differential privacy is tight if the privacy loss \(Z\) follows the distribution \(\mathcal{N}(\rho,2\rho)\). Indeed, the privacy loss random variable of the Gaussian mechanism has such a distribution.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
<p>OK, back to the exponential mechanism:
We know that \(\varepsilon\)-differential privacy implies \(\frac12 \varepsilon^2\)-concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a>.
This, of course, applies to the exponential mechaism. A cool fact – that we want to draw more attention to – is that we can do better!
Specifically, \(\eta\)-bounded range implies \(\frac18 \eta^2\)-concentrated differential privacy <a href="https://arxiv.org/abs/2004.07223" title="Mark Cesar, Ryan Rogers. Bounding, Concentrating, and Truncating: Unifying Privacy Loss Composition for Data Analytics. ALT 2021."><strong>[CR21]</strong></a>.
What follows is a proof of this fact following that of Mark Cesar and Ryan Rogers, but with some simplification.</p>
<blockquote>
<p><strong>Theorem 5 (Bounded Range implies Concentrated Differential Privacy).</strong>
If \(M\) is \(\eta\)-bounded range, then it is \(\frac18\eta^2\)-concentrated differentially private.</p>
</blockquote>
<p><em>Proof.</em>
Fix datasets \(x,x’ \in \mathcal{X}^n\) differing on a single individual’s data.
Let \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x’))\) be the privacy loss random variable of the mechanism \(M\) on this pair of datasets.
By the definition of bounded range (Definition 1), there exists some \(t \in \mathbb{R}\) such that \(Z \in [t, t+\eta]\) with probability 1.
Now we employ <a href="https://en.wikipedia.org/wiki/Hoeffding%27s_lemma">Hoeffding’s Lemma</a> <a href="https://doi.org/10.1080%2F01621459.1963.10500830" title="Wassily Hoeffding. Probability inequalities for sums of bounded random variables. JASA 1963."><strong>[H63]</strong></a>:</p>
<blockquote>
<p><strong>Lemma 6 (Hoeffding’s Lemma).</strong>
Let \(X\) be a random variable supported on the interval \([a,b]\). Then, for all \(\lambda \in \mathbb{R}\), we have \[\mathbb{E}[\exp(\lambda X)] \le \exp \left( \mathbb{E}[X] \cdot \lambda + \frac{(b-a)^2}{8} \cdot \lambda^2 \right).\]</p>
</blockquote>
<p>Applying the lemma to the privacy loss gives \[\forall \lambda \in \mathbb{R} ~~~~~ \mathbb{E}[\exp(\lambda Z)] \le \exp \left( \mathbb{E}[Z] \cdot \lambda + \frac{\eta^2}{8} \cdot \lambda^2 \right).\]
The only remaining thing we need is to show is that \(\mathbb{E}[Z] \le \frac18 \eta^2\).<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<p>If we set \(\lambda = -1 \), then we get \( \mathbb{E}[\exp( - Z)] \le \exp \left( -\mathbb{E}[Z] + \frac{\eta^2}{8} \right)\), which rearranges to \(\mathbb{E}[Z] \le \frac18 \eta^2 - \log \mathbb{E}[\exp( - Z)]\).
Now we have \[ \mathbb{E}[\exp( - Z)] \!=\! \sum_y \mathbb{P}[M(x)\!=\!y] \exp(-f(y)) \!=\! \sum_y \mathbb{P}[M(x)\!=\!y] \!\cdot\! \frac{\mathbb{P}[M(x’)\!=\!y]}{\mathbb{P}[M(x)\!=\!y]} \!=\! 1.\]
∎</p>
<p>This brings us to the TL;DR of this post:</p>
<blockquote>
<p><strong>Corollary 7.</strong> The exponential mechanism (given by Equation 1) is \(\frac18 \varepsilon^2\)-concentrated differentially private.</p>
</blockquote>
<p>This is great news. The standard analysis only gives \(\frac12 \varepsilon^2\)-concentrated differential privacy. Constants matter when applying differential privacy, and we save a factor of 4 in the concentrated differential privacy analysis of the exponential mechanism for free with this improved analysis.</p>
<p>Combining Lemma 2 with Theorem 5 also gives a simpler proof of the conversion from pure differential privacy to concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a>:</p>
<blockquote>
<p><strong>Corollary 8.</strong> \(\varepsilon\)-differential privacy implies \(\frac12 \varepsilon^2\)-concentrated differential privacy.</p>
</blockquote>
<h2 id="beyond-the-exponential-mechanism">Beyond the Exponential Mechanism</h2>
<p>The exponential mechanism is not the only algorithm for private selection. A closely-related algorithm is <em>report noisy max/min</em>:<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> Draw independent noise \(\xi_y\) from some distribution for each \(y \in \mathcal{Y}\) then output \[M(x) = \underset{y \in \mathcal{Y}}{\mathrm{argmin}} ~ \ell(y,x) - \xi_y.\]</p>
<p>If the noise distribution is an appropriate <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel distribution</a>, then report noisy max is exactly the exponential mechanism. (This equivalence is known as the “Gumbel max trick.”)</p>
<p>We can also use the Laplace distribution or the exponential distribution. Report noisy max with the exponential distribution is equivalent to the <em>permute and flip</em> algorithm <a href="https://arxiv.org/abs/2010.12603" title="Ryan McKenna, Daniel Sheldon. Permute-and-Flip: A new mechanism for differentially private selection
. NeurIPS 2020."><strong>[MS20]</strong></a> <a href="https://arxiv.org/abs/2105.07260" title="Zeyu Ding, Daniel Kifer, Sayed M. Saghaian N. E., Thomas Steinke, Yuxin Wang, Yingtai Xiao, Danfeng Zhang. The Permute-and-Flip Mechanism is Identical to Report-Noisy-Max with Exponential Noise. 2021."><strong>[DKSSWXZ21]</strong></a>. However, these algorithms don’t enjoy the same improved bounded range and concentrated differential privacy guarantees as the exponential mechanism.</p>
<p>There are also other variants of the selection problem. For example, in some cases we can assume that only a few options have low loss and the rest of the options have high loss – i.e., there is a gap between the minimum loss and the second-lowest loss (or, more generally, the \(k\)-th lowest loss). In this case there are algorithms that attain better accuracy than the exponential mechanism under relaxed privacy definitions <a href="https://arxiv.org/abs/1409.2177" title="Kamalika Chaudhuri, Daniel Hsu, Shuang Song. The Large Margin Mechanism for Differentially Private Maximization. NIPS 2014."><strong>[CHS14]</strong></a> <a href="https://dl.acm.org/doi/10.1145/3188745.3188946" title=" Mark Bun, Cynthia Dwork, Guy N. Rothblum, Thomas Steinke. Composable and versatile privacy via truncated CDP. STOC 2018."><strong>[BDRS18]</strong></a> <a href="https://arxiv.org/abs/1905.13229" title="Mark Bun, Gautam Kamath, Thomas Steinke, Zhiwei Steven Wu. Private Hypothesis Selection. NeurIPS 2019."><strong>[BKSW19]</strong></a>.</p>
<p>There are a lot of interesting aspects of private selection, including questions for further research! We hope to have further posts about some of these topics.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>For simplicity, we restrict our discussion here to finite sets of outputs, although the definitions, algorithms, and results can be extended to infinite sets. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>To be more precise, \((\varepsilon,\delta)\)-differential privacy is equivalent to demanding that \(\mathbb{E}[\max\{0,1-\exp(\varepsilon-Z)\}]\le\delta\) <a href="https://arxiv.org/abs/2004.00010" title="Clément L. Canonne, Gautam Kamath, Thomas Steinke. The Discrete Gaussian for Differential Privacy. NeurIPS 2020."><strong>[CKS20]</strong></a>. (To be completely precise, we must appropriately deal with the \(Z=\infty\) case, which we ignore in this discussion for simplicity.) <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>This proof actually gives <a href="https://dongjs.github.io/2020/02/10/ExpMech.html">a slightly stronger result</a>: We can replace the sensitivity \(\Delta\) (defined in Equation 2) by half the range \[\hat\Delta = \frac12 \sup_{x,x’ \in \mathcal{X}^n : d(x,x’) \le 1} \left( \max_{\overline{y}\in\mathcal{Y}} \ell(\overline{y},x) - \ell(\overline{y},x’) - \min_{\underline{y}\in\mathcal{Y}} \ell(\underline{y},x) - \ell(\underline{y},x’) \right).\] We always have \(\hat\Delta \le \Delta\) but it is possible that \(\hat\Delta < \Delta\) and the privacy analysis of the exponential mechanism still works if we replace \(\Delta\) by \(\hat\Delta\). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Equivalently, a randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-concentrated differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, \[\forall \lambda > 0 ~~~~~ \mathrm{D}_{\lambda+1}(M(x)\|M(x’)) \le \lambda(\lambda+1)\rho,\] where \(\mathrm{D}_{\lambda+1}(M(x)\|M(x’)))\) is the order \(\lambda+1\) Rényi divergence of \(M(x)\) from \(M(x’)\). <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>To be precise, if \(M(x) = q(x) + \mathcal{N}(0,\sigma^2I)\), then \(M : \mathcal{X}^n \to \mathbb{R}^d\) satisfies \(\frac{\Delta_2^2}{2\sigma^2}\)-concentrated differential privacy, where \(\Delta_2 = \sup_{x,x’\in\mathcal{X}^n : d(x,x’)\le1} \|q(x)-q(x’)\|_2\) is the 2-norm sensitivity of \(q:\mathcal{X}^n \to \mathbb{R}^d\). Furthermore, the privacy loss of the Gaussian mechanism is itself a Gaussian and it makes the inequality defining concentrated differential privacy (Equation 3) an equality for all \(\lambda\) <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Note that the expectation of the privacy loss is simply the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>: \(\mathbb{E}[Z] = \mathrm{D}_1( M(x) \| M(x’) )\). <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>We have presented selection here in terms of minimization, but most of the literature is in terms of maximization. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Ryan RogersThomas SteinkeMon, 12 Jul 2021 10:00:00 -0700
https://differentialprivacy.org/exponential-mechanism-bounded-range/
https://differentialprivacy.org/exponential-mechanism-bounded-range/Open Problem - Optimal Query Release for Pure Differential Privacy<p>Releasing large sets of statistical queries is a centerpiece of the theory of differential privacy. Here, we are given a <em>dataset</em> \(x = (x_1,\dots,x_n) \in [T]^n\), and a set of <em>statistical queries</em> \(f_1,\dots,f_k\), where each query is defined by some bounded function \(f_j : [T] \to [-1,1]\), and (abusing notation) is defined as
\[
f_j(x) = \frac{1}{n} \sum_{i=1}^{n} f_j(x_i).
\]
We use \(f(x) = (f_1(x),\dots,f_k(x))\) to denote the vector consisting of the true answers to all these queries.
Our goal is to design an \((\varepsilon, \delta)\)-differentially private algorithm \(M\) that takes a dataset \(x\in [T]^n\) and outputs a random vector \(M(x)\in \mathbb{R}^k\) such that \(\| M(x) - f(x) \|\) is small in expectation for some norm \(\|\cdot\|\). Usually algorithms for this problem also give high probability bounds on the error, but we focus on expected error for simplicity.</p>
<p>This problem has been studied for both <em>pure differential privacy</em> (\(\delta = 0\)) and <em>appproximate differential privacy</em> (\(\delta > 0\)), and for both \(\ell_\infty\)-error
\[
\mathbb{E}( \| M(x) - f(x)\|_{\infty} ) \leq \alpha,
\]
and \(\ell_2\)-error
\[
\mathbb{E}( \| M(x) - f(x)\|_{2} ) \leq \alpha k^{1/2},
\]
giving four variants of the problem. By now we know tight worst-case upper and lower bounds for two of these variants, and nearly tight bounds (up to logarithmic factors) for a third. The tightest known upper bounds are given in the following table.</p>
<table>
<tbody>
<tr>
<td> </td>
<td>Pure DP</td>
<td>Approx DP</td>
</tr>
<tr>
<td>\( \ell_2 \)<br />error</td>
<td>\( \alpha \lesssim \left(\frac{\log^2 k ~\cdot~ \log^{3/2}T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://arxiv.org/abs/1212.0297">NTZ13</a>]</td>
<td>\( \alpha \lesssim \left(\frac{\log^{1/2} T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://guyrothblum.files.wordpress.com/2014/11/drv10.pdf">DRV10</a>]</td>
</tr>
<tr>
<td>\( \ell_\infty \)<br />error</td>
<td>\( \alpha \lesssim \left(\frac{\log k ~\cdot~ \log T}{\varepsilon n} \right)^{1/3} \) <br /> [<a href="https://arxiv.org/abs/1109.2229">BLR13</a>]</td>
<td>\( \alpha \lesssim \left(\frac{\log k ~\cdot~ \log^{1/2} T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://guyrothblum.files.wordpress.com/2014/11/hr10.pdf">HR10</a>, <a href="https://arxiv.org/abs/1107.3731">GRU12</a>]</td>
</tr>
</tbody>
</table>
<p>The bounds for approximate DP are known to be tight [<a href="https://arxiv.org/abs/1311.3158">BUV14</a>]. Our two open problems both involve improving the best known upper bounds for pure differential privacy.</p>
<blockquote>
<p><b>Open Problem 1:</b> What is the best possible \(\ell_\infty\)-error for answering a worst-case set of \(k\) statistical queries over a domain of size \(T\) subject to \((\varepsilon,0)\)-differential privacy?</p>
</blockquote>
<p>We conjecture that the known upper bound in the table can be improved to
\[
\alpha = \left(\frac{\log k \cdot \log T}{\varepsilon n} \right)^{1/2},
\]
which is known to be the best possible [<a href="https://dataspace.princeton.edu/handle/88435/dsp01vq27zn422">Har11</a>, Theorem 4.5.1].</p>
<blockquote>
<p><b>Open Problem 2:</b> What is the best possible \(\ell_2\)-error for answering a worst-case set of \(k\) statistical queries over a domain of size \(T\) subject to \((\varepsilon,0)\)-differential privacy?</p>
</blockquote>
<p>We conjecture that the upper bound can be improved to
\[
\alpha = \left(\frac{\log T}{\varepsilon n} \right)^{1/2}.
\]
The construction used in [<a href="https://dataspace.princeton.edu/handle/88435/dsp01vq27zn422">Har11</a>, Theorem 4.5.1] can be analyzed to show this bound would be tight. Note, in particular, that this conjecture implies that the tight upper bound has no dependence on the number of queries, similarly to the case of \(\ell_2\) error and approximate DP.</p>
Sasho NikolovJonathan UllmanWed, 07 Jul 2021 13:45:00 -0400
https://differentialprivacy.org/open-problem-optimal-query-release/
https://differentialprivacy.org/open-problem-optimal-query-release/Conference Digest - ICML 2021<p><a href="https://icml.cc/Conferences/2021">ICML 2021</a>, one of the biggest conferences in machine learning, naturally has a ton of interesting sounding papers on the topic of differential privacy.
We went through this year’s <a href="https://icml.cc/Conferences/2021/AcceptedPapersInitial">accepted papers</a> and aggregated all the relevant papers we could find.
In addition, this year features three workshops on the topic of privacy, as well as a tutorial.
As always, please inform us if we overlooked any papers on differential privacy.</p>
<h2 id="workshops">Workshops</h2>
<ul>
<li>
<p><a href="http://federated-learning.org/fl-icml-2021/">Federated Learning for User Privacy and Data Confidentiality</a></p>
</li>
<li>
<p><a href="https://sites.google.com/view/ml4data">Machine Learning for Data: Automated Creation, Privacy, Bias</a></p>
</li>
<li>
<p><a href="https://tpdp.journalprivacyconfidentiality.org/2021/">Theory and Practice of Differential Privacy</a></p>
</li>
</ul>
<h2 id="tutorial">Tutorial</h2>
<ul>
<li><a href="https://icml.cc/Conferences/2021/Schedule?showEvent=10839">Privacy in Learning: Basics and the Interplay</a><br />
<a href="https://www.microsoft.com/en-us/research/people/huzhang/">Huishuai Zhang</a>, <a href="https://www.microsoft.com/en-us/research/people/weic/">Wei Chen</a></li>
</ul>
<h2 id="papers">Papers</h2>
<ul>
<li>
<p><a href="https://arxiv.org/abs/2009.02668">A Framework for Private Matrix Analysis in Sliding Window Model</a><br />
<a href="https://sites.google.com/view/jalajupadhyay/home">Jalaj Upadhyay</a>, <a href="https://www.fujitsu.com/us/about/businesspolicy/tech/rd/research-staff/sarvagya.html">Sarvagya Upadhyay</a></p>
</li>
<li>
<p>Accuracy, Interpretability, and Differential Privacy via Explainable Boosting<br />
<a href="https://scholar.google.com/citations?user=HmxjgMAAAAAJ">Harsha Nori</a>, <a href="https://www.microsoft.com/en-us/research/people/rcaruana/">Rich Caruana</a>, <a href="https://sites.google.com/view/zhiqi-bu">Zhiqi Bu</a>, <a href="https://heyyjudes.github.io/">Judy Hanwen Shen</a>, <a href="https://www.microsoft.com/en-us/research/people/jakul/">Janardhan Kulkarni</a></p>
</li>
<li>
<p>Differentially Private Aggregation in the Shuffle Model: Almost Central Accuracy in Almost a Single Message<br />
<a href="https://sites.google.com/view/badihghazi/home">Badih Ghazi</a>, <a href="https://sites.google.com/site/ravik53/">Ravi Kumar</a>, <a href="https://pasin30055.github.io/">Pasin Manurangsi</a>, <a href="https://rasmuspagh.net/">Rasmus Pagh</a>, <a href="https://www.linkedin.com/in/amersinha/">Amer Sinha</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2011.00467">Differentially Private Bayesian Inference for Generalized Linear Models</a><br />
<a href="https://warwick.ac.uk/fac/sci/dcs/people/u1554597">Tejas Kulkarni</a>, <a href="https://users.aalto.fi/~jalkoj1/">Joonas Jälkö</a>, <a href="https://scholar.google.com/citations?user=Y_EvCPAAAAAJ">Antti Koskela</a>, <a href="https://people.aalto.fi/samuel.kaski">Samuel Kaski</a>, <a href="https://www.cs.helsinki.fi/u/ahonkela/">Antti Honkela</a></p>
</li>
<li>
<p>Differentially-Private Clustering of Easy Instances<br />
<a href="http://www.cohenwang.com/edith/">Edith Cohen</a>, <a href="http://www.cs.tau.ac.il/~haimk/">Haim Kaplan</a>, <a href="https://www.tau.ac.il/~mansour/">Yishay Mansour</a>, <a href="https://www.uri.co.il/">Uri Stemmer</a>, <a href="https://www.linkedin.com/in/eliad-tsfadia-21482b96/">Eliad Tsfadia</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.08885">Differentially Private Correlation Clustering</a><br />
<a href="https://cs-people.bu.edu/mbun/">Mark Bun</a>, <a href="https://elias.ba30.eu/">Marek Elias</a>, <a href="https://www.microsoft.com/en-us/research/people/jakul/">Janardhan Kulkarni</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2105.13287">Differentially Private Densest Subgraph Detection</a><br />
<a href="https://biocomplexity.virginia.edu/person/dung-nguyen">Dung Nguyen</a>, <a href="https://engineering.virginia.edu/faculty/anil-vullikanti">Anil Vullikanti</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.08244">Differentially Private Quantiles</a><br />
<a href="http://jgillenw.com/">Jennifer Gillenwater</a>, <a href="https://www.majos.net/">Matthew Joseph</a>, <a href="https://www.alexkulesza.com/">Alex Kulesza</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.06641">Differentially Private Query Release Through Adaptive Projection</a><br />
<a href="https://sergulaydore.github.io/">Sergul Aydore</a>, <a href="https://wibrown.github.io/">William Brown</a>, <a href="https://www.cis.upenn.edu/~mkearns/">Michael Kearns</a>, <a href="http://www-cs-students.stanford.edu/~kngk/">Krishnaram Kenthapadi</a>, <a href="https://www.lucamel.is/">Luca Melis</a>, <a href="https://www.cis.upenn.edu/~aaroth/">Aaron Roth</a>, <a href="https://ankitsiva.xyz/">Ankit Siva</a></p>
</li>
<li>
<p>Differentially Private Sliced Wasserstein Distance<br />
<a href="http://asi.insa-rouen.fr/enseignants/~arakoto/">Alain Rakotomamonjy</a>, <a href="https://pageperso.lif.univ-mrs.fr/~liva.ralaivola/doku.php">Liva Ralaivola</a></p>
</li>
<li>
<p>Large Scale Private Learning via Low-rank Reparametrization<br />
<a href="https://scholar.google.com/citations?user=FcRGdiwAAAAJ">Da Yu</a>, <a href="https://www.microsoft.com/en-us/research/people/huzhang/">Huishuai Zhang</a>, <a href="https://www.microsoft.com/en-us/research/people/weic/">Wei Chen</a>, Jian Yin, <a href="https://www.microsoft.com/en-us/research/people/tyliu/">Tie-Yan Liu</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.08598">Leveraging Public Data for Practical Private Query Release</a><br />
<a href="https://www.linkedin.com/in/terrance-liu-26796974/">Terrance Liu</a>, <a href="https://sites.google.com/umn.edu/giuseppe-vietri/home">Giuseppe Vietri</a>, <a href="http://www.thomas-steinke.net/">Thomas Steinke</a>, <a href="https://www.ccs.neu.edu/home/jullman/">Jonathan Ullman</a>, <a href="https://zstevenwu.com/">Steven Wu</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2104.09734">Locally Private k-Means in One Round</a><br />
Alisa Chang, <a href="https://sites.google.com/view/badihghazi/home">Badih Ghazi</a>, <a href="https://sites.google.com/site/ravik53/">Ravi Kumar</a>, <a href="https://pasin30055.github.io/">Pasin Manurangsi</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.12099">Lossless Compression of Efficient Private Local Randomizers</a><br />
<a href="http://vtaly.net/">Vitaly Feldman</a>, <a href="http://kunaltalwar.org/">Kunal Talwar</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2105.08233">Oneshot Differentially Private Top-k Selection</a><br />
<a href="https://lsa.umich.edu/stats/people/phd-students/qiaogang.html">Gang Qiao</a>, <a href="http://www-stat.wharton.upenn.edu/~suw/">Weijie Su</a>, <a href="https://research.google/people/LiZhang/">Li Zhang</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2002.12321">PAPRIKA: Private Online False Discovery Rate Control</a><br />
<a href="https://wanrongz.github.io/">Wanrong Zhang</a>, <a href="http://www.gautamkamath.com/">Gautam Kamath</a>, <a href="https://sites.gatech.edu/rachel-cummings/">Rachel Cummings</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.00039">Practical and Private (Deep) Learning without Sampling or Shuffling</a><br />
<a href="https://kairouzp.github.io/">Peter Kairouz</a>, <a href="https://research.google/people/author35837/">Brendan McMahan</a>, <a href="https://shs037.github.io/">Shuang Song</a>, <a href="http://www.omthakkar.com/">Om Thakkar</a>, <a href="https://athakurta.squarespace.com/">Abhradeep Thakurta</a>, <a href="https://research.google/people/106689/">Zheng Xu</a></p>
</li>
<li>
<p>Private Adaptive Gradient Methods for Convex Optimization<br />
<a href="http://web.stanford.edu/~asi/">Hilal Asi</a>, <a href="https://web.stanford.edu/~jduchi/">John Duchi</a>, <a href="https://afallah.lids.mit.edu/">Alireza Fallah</a>, <a href="https://scholar.google.com/citations?user=_JXjrEp9FhYC">Omid Javidbakht</a>, <a href="http://kunaltalwar.org/">Kunal Talwar</a></p>
</li>
<li>
<p>Private Alternating Least Squares: (Nearly) Optimal Privacy/Utility Trade-off for Matrix Completion<br />
Steve Chien, <a href="https://www.prateekjain.org/">Prateek Jain</a>, <a href="http://walid.krichene.net/">Walid Krichene</a>, <a href="https://scholar.google.com/citations?user=yR-ugIoAAAAJ">Steffen Rendle</a>, <a href="https://shs037.github.io/">Shuang Song</a>, <a href="https://athakurta.squarespace.com/">Abhradeep Thakurta</a>, <a href="https://research.google/people/LiZhang/">Li Zhang</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.01516">Private Stochastic Convex Optimization: Optimal Rates in L1 Geometry</a><br />
<a href="http://web.stanford.edu/~asi/">Hilal Asi</a>, <a href="http://vtaly.net/">Vitaly Feldman</a>, <a href="https://tomerkoren.github.io/">Tomer Koren</a>, <a href="http://kunaltalwar.org/">Kunal Talwar</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.06387">The Distributed Discrete Gaussian Mechanism for Federated Learning with Secure Aggregation</a><br />
<a href="https://kairouzp.github.io/">Peter Kairouz</a>, <a href="https://kenziyuliu.github.io/">Ziyu Liu</a>, <a href="http://www.thomas-steinke.net/">Thomas Steinke</a></p>
</li>
</ul>
Gautam KamathMon, 07 Jun 2021 12:30:00 -0400
https://differentialprivacy.org/icml2021/
https://differentialprivacy.org/icml2021/Statistical Inference is Not a Privacy Violation<p>On April 28, 2021, the US Census Bureau <a href="https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance/2020-das-updates.html">released</a> a new demonstration of its differentially private Disclosure Avoidance System (DAS) for the 2020 US Census. The public were given a month to submit feedback before the system is finalized.
This demonstration data and the feedback has generated a lot of discussion, including media coverage on <a href="https://www.npr.org/2021/05/19/993247101/for-the-u-s-census-keeping-your-data-anonymous-and-useful-is-a-tricky-balance">National Public Radio</a>, in <a href="https://www.washingtonpost.com/local/social-issues/2020-census-differential-privacy-ipums/2021/06/01/6c94b46e-c30d-11eb-93f5-ee9558eecf4b_story.html">the Washington Post</a>, and via <a href="https://apnews.com/article/business-census-2020-technology-e701e313e841674be6396321343b7e49">the Associated Press</a>. The DAS is also the subject of an <a href="https://www.courtlistener.com/docket/59728874/state-v-united-states-department-of-commerce/">ongoing lawsuit</a>.</p>
<p>The following is a response from experts on differential privacy and cryptography to the <a href="https://alarm-redist.github.io/posts/2021-05-28-census-das/Harvard-DAS-Evaluation.pdf">working paper of Kenny et al.</a> on the impact of the 2020 U.S. Census Disclosure Avoidance System (DAS) on redistricting.</p>
<p>This paper makes a <a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-06-14.md">common but serious mistake</a>, from which the authors wrongfully conclude the Census Bureau should not modernize its privacy-protection technology. Not only do the results not support this conclusion, but they instead show the power of the methodology, known as differential privacy, adopted by the Bureau, precisely the opposite of the authors’ erroneous conclusions.</p>
<p>Trust is essential; once destroyed it can be nearly impossible to rebuild, and getting privacy wrong in this Census will have an impact on all future government surveys. The Census Bureau has shown that their <a href="https://desfontain.es/privacy/index.html">2010 (DAS) does not survive modern privacy threats</a>, and in fact was roughly equivalent to publishing nearly three quarters of the responses. The Census Bureau’s decision to modernize its Disclosure Avoidance System (DAS) for the 2020 Decennial Census to be differentially private is the correct response to decades of theoretical and empirical work on the privacy risks inherent in releasing large numbers of statistics derived from a dataset.</p>
<p>The importance of the Census, and the reality that no technology competing with differential privacy exists for meeting their confidentiality obligations, makes it very important that the public and policy makers have accurate information. We imagine you will be reporting on this topic in the future. Others have <a href="https://gerrymander.princeton.edu/DAS-evaluation-Kenny-response">addressed flaws</a> in the paper regarding implications for redistricting; we want to provide you with an understanding of the privacy mistake in the study.</p>
<p>To understand the flaw in the paper’s argument, consider the role of smoking in determining cancer risk. Statistical study of medical data has taught us that smoking causes cancer. Armed with this knowledge, if we are told that 40 year old Mr. S is a smoker, we can conclude that he has an elevated cancer risk. The statistical inference of elevated cancer risk—made before Mr. S was born—did not violate Mr. S’s privacy. To conclude otherwise is to define science to be a privacy attack. This is the mistake made in the paper.</p>
<p>This is basically what Kenny et al. found.</p>
<p>The authors looked at three different predictors: one built directly from (swapped) 2010 Census data and the other two built using differential privacy applied to (swapped) 2010 Census data, and evaluated all three “on approximately 5.8 million registered voters included in the North Carolina February 2021 voter file.” What did they find?</p>
<blockquote>
<p>“Our analysis shows that across three main racial and ethnic groups, the predictions based on the [differential privacy based] DAS data appear to be as accurate as those based on the 2010 Census data.”</p>
</blockquote>
<p>This makes perfect sense. Bayesian Improved Surname Geocoding, or BISG, is a statistical method of building a predictor inferring ethnicity (or race) from name and geography. Here, name and geography play the role of the information as to whether or not one smokes, and the prediction of ethnicity corresponds to the cancer risk prediction. The predictor is constructed from census data on the ethnic makeup of individual census blocks and statistical information about the popularity of individual surnames within different ethnic groups. With such a predictor, moving across the country can change the outcome, as can changing one’s name. But a BISG prediction is not about the individual, it is about the statistical—population-level—relationship between name, geography, and ethnicity.</p>
<p>The differentially private DAS enabled learning to make statistical inferences about ethnicity from name and geography, without compromising the privacy of any Census respondent, exactly as it was intended to do. In other words, the paper establishes fitness-for-use of the DAS data for the BISG statistical method! Because differential privacy permits learning statistical patterns without compromising the privacy of individual members of the dataset, it should not interfere with learning the predictor, which is exactly what the authors found. Returning to our “smoking causes cancer” example, the researchers found that it was just as easy to detect this statistical pattern with a modern disclosure avoidance system in place as it was with the older, less protective system.</p>
<p>The authors’ conclusions –“ the DAS data may not provide universal privacy protection” – are simply not supported by their findings.</p>
<p>They have confused learning that smoking causes cancer—and applying this predictor to an individual smoker—with learning medical details of individual patients in the dataset. Change the input to the predictor—replace “smoker” with “non-smoker” or move across the country, for example—and the prediction changes.</p>
<p>The BISG prediction is not about the individual, it does not accompany her as she relocates from one neighborhood to another, it is a statistical relationship between name, geography, and ethnicity. It is not a privacy compromise, it is science.</p>
<p>Signed:</p>
<ul>
<li>Mark Bun, Assistant Professor of Computer Science, Boston University</li>
<li>Damien Desfontaines, Privacy Engineer, Google</li>
<li>Cynthia Dwork, Professor of Computer Science, Harvard University</li>
<li>Moni Naor, Professor of Computer Science, The Weizmann Institute of Science</li>
<li>Kobbi Nissim, Professor of Computer Science, Georgetown University</li>
<li>Aaron Roth, Professor of Computer and Information Science, University of Pennsylvania</li>
<li>Adam Smith, Professor of Computer Science, Boston University</li>
<li>Thomas Steinke, Research Scientist, Google</li>
<li>Jonathan Ullman, Assistant Professor of Computer Science, Northeastern University</li>
<li>Salil Vadhan, Professor of Computer Science and Applied Mathematics, Harvard University</li>
</ul>
<p>Please contact Cynthia Dwork for contact information for authors happy to speak about this on the record.</p>
Jonathan UllmanThu, 03 Jun 2021 18:30:00 -0500
https://differentialprivacy.org/inference-is-not-a-privacy-violation/
https://differentialprivacy.org/inference-is-not-a-privacy-violation/Call for Papers - Workshop on the Theory and Practice of Differential Privacy (TPDP 2021)<p>Work on differential privacy spans a number of different research communities, including theoretical computer science, machine learning, statistics, security, law, databases, cryptography, programming languages, social sciences, and more.
Each of these communities may choose to publish their work in their own community’s venues, which could result in small groups of differential privacy researchers becoming isolated.
To alleviate these issues, we have the Workshop on the <a href="https://tpdp.journalprivacyconfidentiality.org/">Theory and Practice of Differential Privacy</a> (TPDP), which is intended to bring these subcommunities together under one roof (well, a virtual one at least for 2020 and 2021).</p>
<p>We have just posted the <a href="https://tpdp.journalprivacyconfidentiality.org/2021/TPDP2021CfP.pdf">Call for Papers</a> for <a href="https://tpdp.journalprivacyconfidentiality.org/2021/">TPDP 2021</a>, which will be a workshop affiliated with <a href="https://icml.cc/Conferences/2021/">ICML 2021</a>.
The submission deadline is Friday, May 28, 2021, Anywhere on Earth (conveniently, two days after the deadline for NeurIPS 2021).
Submissions are extended abstracts of up to four pages in length, and will undergo a lightweight review process, based mostly on relevance and interest to the differential privacy community.
The workshop is non-archival, so feel free to submit recent work at any stage of publication.
Submissions will be on <a href="https://openreview.net/group?id=ICML.cc/2021/Workshop/TPDP">OpenReview</a>, but since submitted work may be preliminary, the process will be “closed” similar to traditional review processes.
One goal of the workshop is to be inclusive and welcoming to newcomers to the differential privacy community, so please consider participating even if you are new to the field.</p>
<p>Most papers will be presented as posters at a (virtual) poster session, while a few papers will be selected for spotlight talks.
There will also be plenary talks by <a href="https://www.cs.huji.ac.il/~katrina/">Katrina Ligett</a> (Hebrew University of Jerusalem) and <a href="https://www2.math.upenn.edu/~ryrogers/">Ryan Rogers</a> (LinkedIn).
The program co-chairs are <a href="https://sites.gatech.edu/rachel-cummings/">Rachel Cummings</a> and <a href="http://www.gautamkamath.com/">myself</a>.
Please submit your best work on differential privacy, and hope to see you there!</p>
<p align="center">
<img src="/images/Ligett.png" />
<img src="/images/Rogers.png" /> <br />
<i>Invited speakers Katrina Ligett and Ryan Rogers</i>
</p>
Gautam KamathWed, 05 May 2021 10:00:00 -0400
https://differentialprivacy.org/tpdp21-cfp/
https://differentialprivacy.org/tpdp21-cfp/ALT Highlights - An Equivalence between Private Learning and Online Learning (ALT '21 Tutorial)<p>Welcome to ALT Highlights, a series of blog posts spotlighting various happenings at the recent conference <a href="http://algorithmiclearningtheory.org/alt2021/">ALT 2021</a>, including plenary talks, tutorials, trends in learning theory, and more!
To reach a broad audience, the series will be disseminated as guest posts on different blogs in machine learning and theoretical computer science.
Given the topic of this post, we felt <a href="https://differentialprivacy.org/">DifferentialPrivacy.org</a> was a great fit.
This initiative is organized by the <a href="https://www.let-all.com/">Learning Theory Alliance</a>, and overseen by <a href="http://www.gautamkamath.com/">Gautam Kamath</a>.
All posts in ALT Highlights are indexed on the official <a href="https://www.let-all.com/blog/2021/04/20/alt-highlights-2021/">Learning Theory Alliance blog</a>.</p>
<p>The second post is coverage of <a href="http://www.cs.technion.ac.il/~shaymrn">Shay Moran</a>’s <a href="https://www.youtube.com/watch?v=wk910Aj559A">tutorial</a>, by <a href="https://people.eecs.berkeley.edu/~kush/">Kush Bhatia</a> and <a href="https://web.stanford.edu/~mglasgow/">Margalit Glasgow</a>.</p>
<hr />
<p>The tutorial at ALT was given by <a href="http://www.cs.technion.ac.il/~shaymrn/">Shay Moran</a>, an assistant professor at the Technion in Haifa.
His <a href="https://www.youtube.com/watch?v=wk910Aj559A">talk</a> focused on recent results showing a deep connection between two important areas in learning theory: online learning and differentially private learning.
While online learning is a well-established area that has been studied since the invention of the Perceptron algorithm in 1954, differential privacy (DP) - introduced in the seminal work of Dwork, McSherry, Nissim, and Smith in 2006 <a href="https://journalprivacyconfidentiality.org/index.php/jpc/article/view/405"><strong>[DMNS06]</strong></a> - has received increasing attention in recent years from both theoretical and applied research communities along with industry and the government.
This recent interest in differential privacy comes from a need to protect the privacy rights of the individuals, while still allowing one to derive useful conclusions from the dataset.
Shay, in a sequence of papers with co-authors Noga Alon, Mark Bun, Roi Livni, and Maryanthe Malliaris, revealed a surprising qualitative connection between these two models of learning: a concept class can be learned in an online fashion if and only if this concept class can be learned in an offline fashion by a differentially private algorithm.</p>
<p>The main objectives of this tutorial were to give an in-depth foray into this recent work, and present an opportunity to young researchers to identify interesting research problems at the intersection of these two fields. This line of work (<a href="https://arxiv.org/abs/1806.00949"><strong>[ALMM19]</strong></a>, <a href="https://arxiv.org/abs/2003.00563"><strong>[BLM20]</strong></a>) - which has been primarily featured in general CS Theory conferences, including recently winning a best paper award at FOCS - introduced several new techniques which could be of use to the machine learning theory community. The first part of this article focuses on the technical challenges of characterizing DP learnability and the solutions used in Shay’s work, which originate from combinatorics and model theory. In the rest of this article, we highlight some of the exciting open directions Shay sees more broadly in learning theory.</p>
<h2 id="background-pac-learning-online-learning-and-dp-pac-learning">Background: PAC Learning, Online Learning, and DP PAC Learning</h2>
<p>We begin by reviewing the classical setting of PAC learning.
The goal of PAC learning is to learn some function from a <em>concept class</em> \(\mathcal{H} \), a set of functions from some domain \( \mathcal{X} \) to \( \{0, 1\} \).
<img align="right" src="/images/PACLearning.png" style="width:300px;height:300px;" />
In the realizable setting of PAC learning, which we will focus on here, the learner is presented with \( n \) labeled training samples \( \{(x_i, y_i)\}_{i=1}^{n} \) where \( x_i \in \mathcal{X} \) and \( y_i = h(x_i) \) for some function \( h \in \mathcal{H} \). While the learner knows the concept class \( \mathcal{H} \), the function \( h \) is unknown, and after seeing the \( n \) samples, the learner algorithm \( A \) must output some function \( \hat{h}: \mathcal{X} \rightarrow \{0, 1\} \). A concept class is <em>PAC-learnable</em> if there exists an algorithm \( A \) such that for any distribution over samples, as the number of labeled training samples goes to infinity, the probability that \( \hat{h} \) incorrectly labels a new random sample goes to 0. We will say that \( A \) is a <em>proper</em> learner if \( A \) outputs a function in \(\mathcal{H}\), while \( A \) is <em>improper</em> if it may output a function outside of \( \mathcal{H} \). More quantitative measures of learnability concern the exact number of samples \( n \) needed for the learner to correctly predict future labels with nontrivial probability.</p>
<p>DP PAC learning imposes an additional restriction on PAC learning: the learner must output a function which does not reveal too much about any one sample in the input. Formally, we say an algorithm \( A \) is \( (\varepsilon, \delta) \)-DP if for any two neighboring inputs \( X = (X_1, \cdots, X_i, \cdots, X_n) \) and \( X’ = (X_1, \cdots, X_i’,\cdots, X_n) \) which differ at exactly one sample, for any set \( S \), \( \Pr[A(X) \in S] \leq e^\varepsilon Pr[A(X’) \in S] + \delta \) . A concept class is <em>DP PAC learnable</em> if it can be PAC-learned by a \( (0.1, o(1/n)) \)-DP algorithm \( A \). That is, changing any one training sample \( (x_i, y_i) \) should not affect the distribution over concepts output by \( A \) by too much.</p>
<p>Online learning considers a setting where the samples arrive one-by-one, and the learner must make predictions as this process unfolds. <img align="right" src="/images/OnlineLearning.png" style="width:300px;height:300px;" />Formally, in realizable online learning, at each round \( i = 1,\dots,T \), the learner is presented with a sample \( x_i\). The learner must predict the label, and then the true label \( y_i \) is revealed to the learner. The sequence of examples \( x_i \) and the labels \( y_i \) may be chosen adversarially, but they must be consistent with some function \( h \in \mathcal{H} \). The goal of the learner is to minimize the total number of mistakes made by round \( T \), also called the <em>mistake bound</em>. If there is a learner such that at \( T \rightarrow \infty \), the number of mistakes is \( o(T) \), we say that the class of functions \( \mathcal{H} \) is <em>online learnable</em>. Because of the adversarial nature of the examples, online learning is well known to be much harder than PAC learning, and is possible precisely when the <em>Littlestone Dimension</em> of the concept class is finite. Figure 1 below illustrates the definition of the Littlestone Dimension. One important PAC-learnable concept class which is not online learnable is the infinite class of thresholds on \( \mathbb{R} \): The set of functions \(\{h_t\}_{t \in \mathbb{R}} \) where \( h_t(x) = \textbf{1}(x > t) \).</p>
<p><img src="/images/LD.png" alt="Figure 1: Littlestone Dimension" title="Figure 1: Littlestone Dimension" /></p>
<p><strong>Figure 1</strong>: The Littlestone dimension is the maximum depth of a tree shattered by \( \mathcal{H} \), where each node may contain a unique element from \(\mathcal{X}\). The tree is shattered if each leaf can be labeled by a function in \(\mathcal{H} \) that labels each element on the path to the root according to the value of the edge entering it. In this figure, we show how the set of thresholds on \( [0, 1] \) shatters this tree of depth 3.</p>
<p>It’s worth taking a moment to understand intuitively why learning thresholds might be hard for both an online learner and a DP learner. In an online setting, suppose the adversary chooses the next example to be any value of \( x \) in between all previously 0-labeled examples and all 1-labeled examples. Then no matter what label \( A \) chooses, the adversary can say \( A \) was incorrect. In a DP setting, a simple proper learner that outputs a threshold that makes no errors on the training data will reveal too much information about the samples at the boundary of 0-labeled samples and 1-labeled samples.</p>
<h2 id="a-challenging-question">A Challenging Question</h2>
<p>The journey to characterizing DP PAC learnability started soon after the introduction of DP, in <a href="https://arxiv.org/abs/0803.0924"><strong>[KLNRS08]</strong></a>, where the concept of DP PAC learning was introduced. This work primarily considered the more stringent <em>pure</em> DP-learning, where \( \delta = 0\). This work established that any finite concept class could be learned privately by applying the <em>exponential mechanism</em> (a standard technique in DP) to the empirical error of each candidate concept. Using the exponential mechanism, the learner could output each function \( h \in \mathcal{H}\) with probability proportional to \( \exp(- \varepsilon(\# \text{ of errors h on the training data})) \). This was sufficiently private because the empirical error is not too sensitive to a change of one training sample. Later works <a href="https://arxiv.org/abs/1402.2224"><strong>[BNS14]</strong></a> combinatorially characterized the complexity of pure DP PAC learning in terms of a measure called <em>representation dimension</em>, showing that in some cases where PAC learning was possible, pure DP PAC learning was not. <a href="https://arxiv.org/abs/1504.07553"><strong>[BNSV15]</strong></a> further yielded lower bounds on the limits of proper DP PAC learning.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup><br />
Both pure and proper DP learning though are significantly more stringent than improper DP learning, and proving lower bounds against an improper DP learner posed a serious challenge. Even the following simple-sounding question was unsolved:</p>
<blockquote>
<p>Can an improper DP algorithm learn the infinite class of thresholds over [0, 1]? (*)</p>
</blockquote>
<p>This question stands at the center of Shay’s work, which unfolded while Shay was residing at the Institute for Advanced Study (IAS) at Princeton from 2017 to 2019.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> At the time, Shay was working on understanding the expressivity of limited mutual-information algorithms, that is, algorithms which expose little information about the total input. “If we were to have directly worked on this problem, I believe we wouldn’t have solved it,” Shay says. Instead, they came from the angle of mutual-information, a concept qualitatively similar to DP, but armed with a rich toolkit from 70 years of information theory. One of Shay’s prior works with Raef Bassily, Ido Nachum, Jonathan Shafer, and Amir Yehudayof established lower bounds on the mutual information of any proper algorithm that learns thresholds <a href="https://arxiv.org/abs/1710.05233"><strong>[BMNSY18]</strong></a>, though this didn’t yet address the challenge presented by improper learners.</p>
<p>Unlike most lower bounds in theoretical computer science, proving hardness of learning the infinite class of thresholds on the line against an improper DP algorithm would require coming up with algorithm-specific distributions over samples. That is, instead of showing that one distribution over samples would be impossible to learn for all algorithms — in the way the a uniform distribution over the set in \(\mathcal{X}\) shattered by the concept class is hard to PAC-learn for any algorithm — they would have to come up with a distribution on \(\mathcal{X}\) specific to each candidate learning algorithm. Indeed, if the distribution was known to the learner, it was possible to devise a DP algorithm using the exponential mechanism (again!) which could learn any PAC-learnable concept class. Similar to the case of finite concept classes, here we can apply the exponential mechanism to some finite set of representative functions forming a cover of the concept class.</p>
<h2 id="uncovering-a-solution">Uncovering a Solution</h2>
<p>At the IAS, Shay met his initial team to tackle this obstacle: Noga Alon, a combinatorialist, Roi Livni, a learning theorist, and Maryanthe Malliaris, a model theorist. Shay and Maryanthe would walk together from IAS to their combinatorics class at Princeton taught by Noga and discuss mathematics. While Marayathe studied the abstract mathematical field of model theory, Shay and Maryanthe soon noticed a connection between model theory and machine learning: the Littlestone dimension. “There is applied math, then pure math, and then model theory is way over there,” Shay elaborates. “If theoretical machine learning is between applied math and pure math, model theory is on the other extreme.” This surprising interdisciplinary connection led them to use a result from model theory: a concept class had finite Littlestone Dimension precisely when the concept class had finite threshold dimension (formally, the maximum number of threshold functions that could be embedded in the class). This meant that answering (*) negatively was enough to show that DP PAC learning was as hard as online learning.</p>
<p>The idea for showing a lower bound for improper DP learning of thresholds came from Ramsey Theory, a famous area in combinatorics. Ramsey theory guarantees the existence of structured subsets among large, but arbitrary, combinatorial objects. A toy example of Ramsey Theory is that in any graph on 6 or more nodes, there must be a group of 3 nodes which form either a clique or an independent set. In the case of DP learning thresholds, the learning algorithm \( A \) is the arbitrary (and unstructured) combinatorial object. Ramsey Theory guarantees that for any PAC learner \( A \), there exists some large subset \( \mathcal{X}’ \) of the domain \(\mathcal{X}=[0, 1]\) on which \( A \) is close to behaving “normally”. The next step is to show that behaving normally contradicts differentially privacy. We’ll dig more technically into this argument in the next couple paragraphs. Note that we invent a couple definitions for the sake of exposition (“proper-normal” and “improper-normal”) that don’t appear in <a href="https://arxiv.org/abs/1504.07553"><strong>[BNSV15]</strong></a> or <a href="https://arxiv.org/abs/1806.00949"><strong>[ALMM19]</strong></a>.</p>
<p>Let’s start by seeing how a simpler version of this argument from <a href="https://arxiv.org/abs/1504.07553"><strong>[BNSV15]</strong></a> works to show that proper DP algorithms cannot learn thresholds. Recall that in a proper algorithm, after seeing \( n \) labeled samples, \( A \) must output a threshold. To be a PAC learner, if exactly half of the \( n \) samples are labeled \( 1 \) (which we will call a <em>balanced</em> sample), \( A \) must output a threshold between the smallest and largest sample with constant probability. Indeed, otherwise the empirical error of \( A \) will be too large to even hope to generalize. This implies that for a balanced list of samples \( S = [(x_1, 0), \ldots ,(x_{n/2}, 0), (x_{n/2+1}, 1), \ldots ,(x_n, 1)] \) (ordered by \(x\)-value), there must exist an integer \( k \in [n] \) for which with probability \( \Omega(1/n) \), \(A \) outputs a threshold in between
\( x_k \) and \( x_{k+1} \). We’ll say that \( A \) is <em>\(k\)-proper-normal</em> on a set \( \mathcal{X}’ \) if for any balanced sample \( S \) of \( n \) points in \( \mathcal{X}’ \), \( A \) outputs a threshold in between the \(k\)th and \(k+1\)th ordered samples with probability \( \Omega(1/n) \). For example, the naive algorithm that always outputs a threshold between the 0-labeled samples and 1-labeled samples is \(n/2\)-normal on the entire domain \([0, 1]\). Ramsey theory guarantees that there is an arbitrary large subset of the domain \([0, 1]\) on which \(A \) is \(k\)-proper-normal for some \(k\). (See Figure 2).</p>
<p><img src="/images/RamseyTheorem.png" alt="Figure 2: Ramsey's Theorem" title="Figure 2: Ramsey's Theorem" />
<strong>Figure 2</strong>: Ramsey’s Theorem guarantees a large subset \( \mathcal{X}’ \) on which \( A \) is \( k \)-proper-normal for some \( k \).</p>
<p>The second part of the argument shows that being \( k \)-proper-normal on a large set is in direct conflict with differential privacy. We show this argument in Figure 3 by constructing a set of \( n \) samples \( S^* \) on which \( A \) must output a threshold in many distinct regions with substantial probability.</p>
<p><img src="/images/DP_Construction.png" alt="Figure 3: A Hard Distribution" title="Figure 2: Ramsey's Theorem" /></p>
<p><strong>Figure 3:</strong> A distribution showing the conflict between \(A \) being DP and \(k\)-proper normal. By DP, the behaviour of \(A\) on \(S^* \) must be similar to its behaviour on \(S_i\) for \(i = 1 \ldots \Omega(n).\) Namely, since \(S^* \) and \( S_i \) differ by at most two points, \( A(S^*) \) must output a threshold in \( I_i \) with probability \( p \geq (q - 2\delta)e^{-(2\varepsilon)} \) for each \( i \), where \( q \) is a lower bound on the probability that \( A(S_i) \) outputs a threshold in \( I_i \). Because \( A \) is \(k\)-proper-normal, \( q > c/n \), so \( p = \Omega(1/n) \). This yields a contradiction for \( m >1/p \) because \( A \) cannot simultaneously output a threshold in two of the intervals \( I_i \).</p>
<p>For the case of improper algorithms, Shay and his coauthors considered an alternative notion of normality, which we will term <em>improper-normal</em>. Recall that in this case the output \( A(S) \) of the learner on a sample \( S \) is <em>any function</em> from \( [0, 1] \) to \( \{0, 1\} \) and not necessarly a threshold. We’ll say that \( A \) is \(k\)-improper-normal on a set \( \mathcal{X}’ \) if for any balanced sample \( S = [(x_1, 0), \ldots, (x_{n/2}, 0), (x_{n/2 +1}, 1), \ldots, (x_n, 1)] \) of \( n \) points in \( \mathcal{X}’, \Pr[A(S)(w) = 1] - \Pr[A(S)(v) = 1] > \Omega(1/n) \) for any \( w \in (x_{k - 1}, x_k) \) and \( v \in (x_k, x_{k + 1}) \) in \( \mathcal{X}’ \setminus \{x_1, … x_n\} \). To PAC learn, A must be \(k\)-improper-normal for some \(k\) on any set of \(n + 1\) points. Applying the same Ramsey Theorem to a graph with colored hyperedges of size \(n + 1\) shows that there must exist an arbitrarily large set \( \mathcal{X}’ \) on which \( A \) is \(k\)-improper normal for some \( k \). A similar (but more nuanced) argument as before shows that a learner cannot be simultaneously private and \(k\)-improper-normal on some distribution over \( \mathcal{X}’ \).</p>
<p>For the last piece of the puzzle, showing the upper bound converse, Mark Bun, an expert in differential privacy at Princeton at the time, joined in. Beginning in the fall of 2019, Mark, Shay, and Roi worked out an upper bound that showed that any class with finite Littlestone dimension could be learned privately. Their technique introduced a new notion of stability, called <em>global stability</em>, which is a property of algorithms that frequently output the same hypothesis. Given a globally stable PAC learner \( A \), to obtain a DP PAC learner, one can run \( A \) many times and produce a histogram of the output hypotheses, add noise to this histogram for the sake of privacy, and then select the most frequent output hypothesis. The construction for the globally stable learner uses the Standard Optimal Algorithm for online learning as a black box - though the reduction is very computationally intensive and results in a sample complexity depending exponentially on the Littlestone Dimension. This reduction was improved in <a href="https://arxiv.org/abs/2012.03893"><strong>[GGKM20]</strong></a>, where the authors gave a reduction requiring only polynomially many samples in the Littlestone dimension.</p>
<h2 id="outlook">Outlook</h2>
<p>Shay mentions that these results are only a first step towards understanding differentially private PAC learning. This work establishes a deep connection between online learning and DP learning, qualitatively showing that these two problems have similar underlying complexity. Recent works <a href="https://arxiv.org/abs/1905.11311"><strong>[GHM19]</strong></a> have gone a step further and established polynomial time reductions from DP learning to online learning under certain conditions. At the same time, Bun <a href="https://arxiv.org/abs/2007.05665"><strong>[B20]</strong></a> demonstrates a computational gap between the two problems: they exhibit a concept class which is DP PAC learnable in polynomial time, but no algorithm can learn it online in polynomial time and sample complexity.</p>
<p>An interesting question here is whether a polynomial time online learning algorithm implies a polynomial time DP learning algorithm? With regards to sample complexity, tighter quantitative bounds relating the sample complexity of DP learning to the Threshold dimension and the Littlestone Dimension, along with constructive reductions from DP learning to online learning are wide open. An interesting conjecture, also highlighted in the talk, is whether DP PAC learning and PAC learning are actually equivalent up to a \(log*\) factor of the Littlestone dimension? Solving this would mean that for most natural function classes, one need not pay a very high price for private learning as compared to PAC learning.</p>
<p>From a more practical perspective, studying such qualitative equivalences can lay the groundwork allowing one to use the vast existing knowledge in the field of online learning to design better algorithms for DP learning, and vice versa. Despite its abstractness, Shay believes this result still has significance for engineers: “If they want to do something differentially privately, if they already have a good online learning algorithm for this, maybe they can modify it,” Shay says. “It gives some kind of inspiration.”</p>
<p>From a broader perspective, Shay believes that discovering clean, beautiful mathematical models that are more realistic than the PAC learning model and understanding the price one pays for privacy in those models are important directions for future research. One model, highlighted in the tutorial by Shay is that of <em>Universal Learning</em>, introduced in a recent work <a href="https://arxiv.org/abs/2011.04483"><strong>[BHMVY21]</strong></a> with Olivier Bousquet, Steve Hanneke, Ramon van Handel and Amir Yehudayoff. This model considers a learning task with a fixed distribution of samples and studies the hardness of the problem as the number of samples \( n \to \infty \). This setup better captures the practical aspects of modern machine learning and overcomes the limitations of the PAC model, which studies worst-case distributions for each sample size.</p>
<p>And how should one go about identifying such better mathematical models? “Pick the simplest problem that you don’t know how to solve and see where that leads you”, says Shay. One should begin by trying to understand the deficiencies of existing learning models, by identifying simple examples which go beyond these existing models. For example, Livni and Moran <a href="https://arxiv.org/abs/2006.13508"><strong>[LM20]</strong></a> exhibit the limitations of PAC-Bayes framework through a simple 1D linear classification problem. Fixing such limitations can often lead one to discover better learning models.</p>
<hr />
<p><em>Thanks to Gautam Kamath, Shay Moran, and Keziah Naggita for helpful conversations and comments.</em></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>For a more complete background on the progress in DP PAC learning, we refer the reader to the excellent survey blog post <a href="https://differentialprivacy.org/private-pac/">here</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>At the time, Shay was additionally affiliated with Princeton University and Google Brain. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Kush BhatiaMargalit GlasgowMon, 26 Apr 2021 12:00:00 -0400
https://differentialprivacy.org/alt-highlights/
https://differentialprivacy.org/alt-highlights/What is δ, and what δifference does it make?<p>There are many variants or flavours of differential privacy (DP) some weaker than others: often, a given variant comes with own guarantees and “conversion theorems” to the others. As an example, “pure” DP has a single parameter \(\varepsilon\), and corresponds to a very stringent notion of DP:</p>
<blockquote>
<p>An algorithm \(M\) is \(\varepsilon\)-DP if, for all neighbouring inputs \(D,D'\) and all measurable \(S\), \( \Pr[ M(D) \in S ] \leq e^\varepsilon\Pr[ M(D’) \in S ] \).</p>
</blockquote>
<p>By relaxing this a little, one obtains the standard definition of approximate DP, a.k.a. \((\varepsilon,\delta)\)-DP:</p>
<blockquote>
<p>An algorithm \(M\) is \((\varepsilon,\delta)\)-DP if, for all neighbouring inputs \(D,D'\) and all measurable \(S\), \( \Pr[ M(D) \in S ] \leq e^\varepsilon\Pr[ M(D’) \in S ]+\delta \).</p>
</blockquote>
<p>This definition is very useful, as in many settings achieving the stronger \(\varepsilon\)-DP guarantee (i.e., \(\delta=0\)) is impossible, or comes at a very high utility cost. But how to interpret it? The above definition, on its face, doesn’t preclude what one may call “<em>catastrophic failures of privacy</em> 💥:” most of the time, things are great, but with some small probability \(\delta\) all hell breaks loose. For instance, the following algorithm is \((\varepsilon,\delta)\)-DP:</p>
<ul>
<li>Get a sensitive database \(D\) of \(n\) records</li>
<li>Select uniformly at random a fraction \(\delta\) of the database (\(\delta n\) records)</li>
<li>Output that subset of records in the clear 💥</li>
</ul>
<p>(actually, this is even \((0,\delta)\)-DP!). This sounds preposterous, and obviously something that one would want to avoid in practice (lest one wants to face very angry customers or constituents). This is one of the rules of thumb for picking \(\delta\) small enough (or even “cryptographically small”), typically \(\delta \ll 1/n\), so that the records are safe (hard to disclose \(\delta n \ll 1\) records).</p>
<p>So: good privacy most of the time, but with probably \(\delta\) then all bets are off.</p>
<p>However, those catastrophic failure of privacy, while technically allowed by the definition of \((\varepsilon,\delta)\)-DP, <strong>are not something that can really happen with the DP algorithms and techniques used both in practice and in theoretical work.</strong> Before explaining why, let’s see what is the kind of desirable behaviour one would expect: a <em>“smooth, manageable tradeoff of privacy parameters.”</em> For that discussion, let’s introduce the <em>privacy loss random variable</em>: given an algorithm M and two neighbouring inputs D,D’, let \(f(y)\) be defined as
\[
f(y) = \log\frac{\Pr[M(D)=y]}{\Pr[M(D’)=y]}
\]
for every possible output \(y\in\Omega\). Now, define the random variable \(Z := f(M(D))\) (implicitly, \(Z\) depends on \(D,D',M\)). This random variable quantifies how much observing the output of the algorithm \(M\) helps distinguishing between \(D\) and \(D'\).</p>
<p>Now, going a little bit fast, you can check that saying that \(M\) is \(\varepsilon\)-DP corresponds to the guarantee “<em>\(\Pr[Z > \varepsilon] = 0\) for all neighbouring inputs \(D,D'\).</em>”
Similarly, \(M\) being \((\varepsilon,\delta)\)-DP is the guarantee \(\Pr[Z > \varepsilon] \leq \delta\).\({}^{(\dagger)}\) For instance, the “catastrophic failure of privacy” corresponds to the scenario below, which depicts a possible distribution for \(Z\): \(Z\leq \varepsilon\) with probability \(1-\delta\), but then with probability \(\delta\) we have \(Z\gg 1\).</p>
<p><img src="/images/flavours-delta-fig1.png" width="600" alt="The type of (bad) distribution of Z corresponding to 'our catastrophic failure of privacy'" style="margin:auto;display: block;" /></p>
<p>What we would like is a smoother thing, where even when \(Z>\varepsilon\) is still remains reasonable and doesn’t immediately become large. A nice behaviour of the tails, ideally something like this:</p>
<p><img src="/images/flavours-delta-fig2.png" width="600" alt="A distribution for Z with nice tails, leading to smooth tradeoffs between ε and δ" style="margin:auto;display: block;" /></p>
<p>For instance, if we had a bound on \(\mathbb{E}[|Z|]\), we could use Markov’s inequality to get, well, <em>something</em>. For instance, imagine we had \(\mathbb{E}[|Z|]\leq \varepsilon\delta\): then
\[
\Pr[ |Z| > \varepsilon ] \leq \frac{\mathbb{E}[|Z|]}{\varepsilon }\leq \delta
\]
<em>(great! We have \((\varepsilon,\delta)\)-DP)</em>; but also \(\Pr[ |Z| > 10\varepsilon ] \leq \frac{\delta}{10}\). Privacy violations do not blow up out of proporxtion immediately, we can trade \(\varepsilon\) for \(\delta\). That seems like the type of behaviour we would like our algorithms to exhibit.</p>
<p><img src="/images/flavours-delta-fig3.png" width="600" alt="The type of privacy guarantees a Markov-type tail bound would give" style="margin:auto;display: block;" /></p>
<p>But why stop at Markov’s inequality then, which gives some nice but still weak tail bounds? Why not ask for <em>stronger</em>: Chebyshev’s inequality? Subexponential tail bounds? Hell, <em>subgaussian</em> tail bounds? This is, basically, what some stronger notions of differential privacy than approximate DP give.</p>
<ul>
<li>
<p><strong>Rényi DP</strong> <a href="https://arxiv.org/abs/1702.07476" title="Ilya Mironov. Renyi Differential Privacy. CSF 2017"><strong>[Mironov17]</strong></a>, for instance, is a guarantee on the moment-generating function (MGF) of the privacy random variable \(Z\): it has two parameters, \(\alpha>1\) and \(\tau\), and requires that \(\mathbb{E}[e^{(\alpha-1)Z}] \leq e^{(\alpha-1)\tau}\) for all neighbouring \(D,D'\). In turn, by applying for instance Markov’s inequality to the MGF of \(Z\), we can control the tail bounds, and get a nice, smooth tradeoff in terms of \((\varepsilon,\delta)\)-DP.</p>
</li>
<li>
<p><strong>Concentrated DP</strong> (CDP) <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun and Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016"><strong>[BS16]</strong></a> is an even stronger requirement, which roughly speaking requires the algorithm to be Rényi DP <em>simultaneously</em> for all \(1< \alpha \leq \infty\). More simply, this is “morally” a requirement on the MGF of \(Z\) which asks it to be subgaussian.</p>
</li>
</ul>
<p>The above two examples are not just fun but weird variants of DP: they actually capture the behaviour of many well-known differentially private algorithms, and in particular that of the Gaussian mechanism. While the guarantees they provide are less easy to state and interpret than \(\varepsilon\)-DP or \((\varepsilon,\delta)\)-DP, they are incredibly useful to analyze those algorithms, and enjoy very nice composition properties… and, of course, lead to that smooth tradeoff between \(\varepsilon\) and \(\delta\) for \((\varepsilon,\delta)\)-DP.</p>
<p><strong>To summarize:</strong></p>
<ul>
<li>\(\varepsilon\)-DP gives great guarantees, but is a very stringent requirement. Corresponds to the privacy loss random variable supported on \([-\varepsilon,\varepsilon]\) (no tails!)</li>
<li>\((\varepsilon,\delta)\)-DP gives guarantees easy to parse, but on its face allows for very bad behaviours. Corresponds to the privacy loss random variable in \([-\varepsilon,\varepsilon]\) with probability \(1-\delta\) (but outside, all bets are off!)</li>
<li>Rényi DP and Concentrated DP correspond to something in between, controlling the tails of the privacy loss random variable by a guarantee on its MGF. A bit harder to interpret, but capture the behaviour of many DP building blocks can be converted to \((\varepsilon,\delta)\)-DP (with nice trade-offs between \(\varepsilon\) and \(\delta\).</li>
</ul>
<hr />
<p>\({}^{(\dagger)}\) The astute reader may notice that this is not <em>quite</em> true. Namely, the guarantee \(\Pr[Z > \varepsilon] \leq \delta\) on the privacy loss random variable (PLRV) does imply \((\varepsilon,\delta)\)-differential privacy, but the converse does not hold. See, for instance, Lemma 9 of <a href="https://arxiv.org/abs/2004.00010" title="Clément L. Canonne, Gautam Kamath, Thomas Steinke. The Discrete Gaussian for Differential Privacy. NeurIPS 2020"><strong>[CKS20]</strong></a> for an exact characterization of \((\varepsilon,\delta)\)-DP in terms of the PLRV.</p>
Clément CanonneThu, 11 Mar 2021 21:00:00 -0400
https://differentialprivacy.org/flavoursofdelta/
https://differentialprivacy.org/flavoursofdelta/