Differential Privacy (DP) is a property of a secure aggregation mechanism that ensures that no single input measurement significantly impacts the distribution of the aggregate result. This is a stronger property than what systems like the Distributed Aggregation Protocol (DAP) are designed to achieve. This document describes a variety of DP mechanisms applicable to DAP, and, for a variety of common use cases, lifts DAP to a protocol that also achieves DP.¶
This note is to be removed before publishing as an RFC.¶
The latest revision of this draft can be found at https://wangshan.github.io/draft-wang-ppm-differential-privacy/draft-wang-ppm-differential-privacy.html. Status information for this document may be found at https://datatracker.ietf.org/doc/draft-wang-ppm-differential-privacy/.¶
Discussion of this document takes place on the Privacy Preserving Measurement Working Group mailing list (mailto:ppm@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/ppm/. Subscribe at https://www.ietf.org/mailman/listinfo/ppm/.¶
Source for this draft and an issue tracker can be found at https://github.com/wangshan/draft-wang-ppm-differential-privacy.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 12 December 2024.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
[TO BE REMOVED BY RFC EDITOR: The source for this draft and the reference implementations of mechanisms and policies can be found at https://github.com/wangshan/draft-wang-ppm-differential-privacy.]¶
Multi-party computation systems like the Distributed Aggregation Protocol [DAP] enable secure aggregation of measurements generated by individuals without handling the measurements in the clear. This is made possible by using a Verifiable Distributed Aggregation Function [VDAF], the core cryptographic component of DAP. Execution of a VDAF involves: a large set of "Clients" who produce cryptographically protected measurements, called "reports"; a small number of "Aggregators" who consume reports and produce the cryptographically protected aggregate; and a "Collector" who consumes the plaintext aggregate result. Distributing the computation of the aggregate in this manner ensures that, as long as one Aggregator is honest, no attacker can learn an honest Client's measurement.¶
Depending on the application, protecting the measurements may not be sufficient for privacy, since the aggregate itself can reveal privacy-sensitive information. As an illustrative example, consider using DAP/VDAF to summarize the distribution of the heights of respondents to a survey. If one of the respondents is especially short or tall, then their contribution is likely to skew the summary statistic in a way that reveals their height. Ideally, no individual measurement would have such a significant impact on the aggregate result, but in general such leakage is inevitable for exact aggregates. Adding some carefully chosen noise to the aggregates can however help hide the contribution of one respondent.¶
This intuition can be formalized by the notion of differential privacy [DMNS06]. Differential privacy is a property of an algorithm or protocol that computes some function of a set of measurements. We say the algorithm or protocol is "differentially private", or "DP", if the probability of observing a particular output does not change significantly as a result of removing one of the measurements (or substituting it with another).¶
VDAFs are not DP on their own, but they can be composed with a variety of mechanisms that endow them with this property. All such mechanisms work by introducing noise into the computation that is carefully calibrated for a number of application-specific parameters, including the structure and number of measurements, the desired aggregation function, and the degree of "utility" required. Intuitively, a high degree of privacy can be achieved by adding a lot of noise; but adding too much noise can reduce the usefulness of the aggregate result.¶
Noise can be introduced at various steps of the computation, and by various parties. Depending on the mechanism: the Clients might add noise to their own measurements; and the Aggregators might add noise to their aggregate shares (the values they produce for the Collector).¶
In this document, we shall refer to the composition of DP mechanisms into a scheme that provides (some notion of) DP as a "DP policy". For some policies, noise is added only by the Clients or only by the Aggregators, but for others, both Clients and Aggregators may participate in generating the noise.¶
The primary goal of this document is to specify how DP policies are implemented in DAP. It does so in the following stages:¶
Section 3 describes the notion(s) of DP that are compatible with DAP. Security is defined in a few different "trust models" in which we assume that some fraction of the parties execute the protocol honestly. Of course in reality, whether such assumptions hold is usually outside of our control. Thus our goal is to design DP policies that still provide some degree of protection in more pessimistic trust models. (We call this "hedging".)¶
Section 4 specifies various mechanisms required for building DP systems, including algorithms for sampling from discrete Laplace and Gaussian distributions.¶
Section 5 defines DP policies that are implemented with DP mechanisms, their composition with VDAFs, and their execution semantics for DAP. Section 5.1 then demonstrates how to execute VDAFs with DP policies.¶
Section 6 specifies compositions of concrete VDAFs with concrete DP policies for achieving DP for specific DAP use cases.¶
The following considerations are out-of-scope for this document:¶
DP policies have a few parameters that need to be tuned in order to meet the privacy/utility tradeoff required by the application. This document provides no guidance for this process.¶
This document describes a particular class of narrowly-scoped DP policies. Other, more sophisticated policies are possible. [TODO: Add citations. Here we're thinking of things like DPrio, which may be more appropriate to specify as DAP report extensions.]¶
The mechanisms described in Section 4 are intended for use beyond DAP/VDAF. However, this document does not describe general-purpose DP policies; those described in Section 5 are tailored to specific VDAFs or classes of VDAFs.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document uses Python3 for describing algorithms.¶
Let exp(EPSILON) denote raising the numeric constant e to the power of EPSILON.¶
DP formalizes a property of any randomized algorithm that takes in a sequence of measurements, aggregates them, and outputs an aggregate result. Intuitively, this property guarantees that no single measurement significantly impacts the value of the aggregate result. This intuition is formalized by [DMNS06] as follows.¶
CP: The following might be too jargony for an informative RFC, but for now I think we're all just trying to agree on the definition. Once we have consensus among ourselves, we can punt this to the appendix and leave a less formal description here.¶
DP requires specifying a notion of "neighboring" datasets, which determines what information is being hidden. The most common notion for our setting would be the following:¶
We say that two batches of measurements D1 and D2 are "neighboring" if they are the same length and contain all the same measurements except one (i.e., the symmetric difference between the multisets contains two elements). We denote this definition as "replacement-DP" (or "substitution-DP"). Appendix B.2 discusses other notions of adjacency that may be appropriate in some settings.¶
Let p(S, D, r) denote the probability that randomized algorithm S, on input of measurements D, outputs aggregate result r.¶
A randomized algorithm S is called "EPSILON-DP" if for all neighboring D1 and D2 and all possible aggregate results r, the following inequality holds:¶
p(S, D1, r) <= exp(EPSILON) * p(S, D2, r)¶
In other words, the probability that neighboring inputs produce a given aggregate result differs by at most a constant factor, exp(EPSILON).¶
One can think of EPSILON as a measure of how much information about the measurements is leaked by the aggregate result: the smaller the EPSILON, the less information is leaked by S. For most DP applications, EPSILON will be a small constant, e.g. 0.1 or 0.5. See Appendix B for details.¶
This notion of EPSILON-DP is sometimes referred to as "pure-DP". The following is a relaxation of pure-DP, called "approximate-DP", from [DR14]. A randomized algorithm S is called "(EPSILON, DELTA)-DP" if for all neighboring D1 and D2 and all possible aggregate results r, the following inequality holds:¶
p(S, D1, r) <= exp(EPSILON) * p(S, D2, r) + DELTA¶
Compared to pure-DP, approximate-DP loses an additive factor of DELTA in the bound. DELTA can intuitively be understood as the probability that a piece of information is leaked (e.g. a Client measurement is leaked), so DELTA is typically taken to be polynomially small in the batch size or smaller, i.e., some value much smaller than 1 / batch_size. Allowing for a small DELTA can in many cases allow for much smaller noise compared to pure-DP mechanisms. See Section 4 for details.¶
Other variants of DP are possible; see the literature review in Appendix B for details.¶
Differential privacy noise sometimes needs to be calibrated based on the SENSITIVITY of the aggregation function used to compute the aggregate result over Client measurements. Sensitivity characterizes how much a change in a Client measurement can affect the aggregate result. In this document, we focus on the L1 and L2 sensitivity. We define them as the maximum of some distance between two aggregate results computed from any neighboring datasets:¶
L1 Sensitivity: the distance is defined as the sum of the absolute values of differences at all coordinates of the neighboring aggregate results.¶
L2 Sensitivity: the distance is defined as the square root of the sum of squares of the differences at all coordinates of the neighboring aggregate results.¶
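The following sketch is illustrative only (the helper names are not part of this document) and shows how these two distances could be computed for a pair of aggregate results derived from neighboring datasets.¶

import math

# Illustrative sketch: L1 and L2 distances between two aggregate
# results computed over neighboring datasets. The helper names are
# hypothetical.
def l1_distance(agg1: list[int], agg2: list[int]) -> float:
    # Sum of absolute differences at all coordinates.
    return sum(abs(a - b) for (a, b) in zip(agg1, agg2))

def l2_distance(agg1: list[int], agg2: list[int]) -> float:
    # Square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for (a, b) in zip(agg1, agg2)))

# Replacing one one-hot measurement with another moves a histogram
# aggregate by 2 in L1 distance and by sqrt(2) in L2 distance.
assert l1_distance([5, 7, 3], [6, 6, 3]) == 2
assert abs(l2_distance([5, 7, 3], [6, 6, 3]) - math.sqrt(2)) < 1e-12¶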
When considering whether a given DP policy is sufficient for DAP, it is not enough to consider the mechanisms used in isolation. It is also necessary to consider the process by which the policy is executed. In particular, our threat model for DAP considers an attacker that participates in the upload, aggregation, and collection protocols and may use its resources to attack the underlying cryptographic primitives (VDAF [VDAF], HPKE [RFC9180], and TLS [RFC8446]).¶
To account for such attackers, our goal for DAP is "computational-DP" as described by [MPRV09] (Definition 4, "SIM-CDP"). This formalizes the amount of information a computationally bounded attacker can glean about the measurements generated by honest Clients over the course of its attack on the protocol. We consider an attacker that controls the network and statically corrupts a subset of the parties.¶
We say the protocol is pure-DP (resp. approximate-DP) if the view of any computationally bounded attacker can be efficiently simulated by a simulator that itself is pure-DP (or approximate-DP) as defined above. (The simulator takes the measurements as input and outputs a value that should be computationally indistinguishable from the transcript of the protocol's execution in the presence of the attacker.)¶
Whether this property holds for a given DP policy depends on which parties can be trusted to execute the protocol correctly (i.e., which parties are not corrupted by the attacker). We consider three, increasingly pessimistic trust models.¶
KT(issue#28): Here we seem to be assuming corrupted = malicious. Is there any benefit to a more refined distinction (i.e. honest-but-curious vs malicious). I suspect we would always want secure against malicious, but perhaps there are settings where we are OK with security against bad behavior that is not catchable during an audit.¶
Assume that most Clients and one Aggregator are honest and that the other Aggregator (DAP involves just two Aggregators) and the Collector are controlled by the attacker. When all Clients are honest, this corresponds to the same trust model as the base DAP protocol. The degree of privacy provided (i.e., the value of EPSILON for pure-DP) for most protocols in this setting would degrade gracefully as the number of honest Clients decreases.¶
Assume that at least one Aggregator is honest. The other Aggregator, the Collector, and all but one Client are controlled by the attacker. The goal is to provide protection for the honest Client's measurement.¶
TODO(issue#15) For the simple Aggregator-noise mechanism, if the malicious Aggregator cheats by not adding noise, then the aggregate result is not DP from the point of view of the honest Aggregator, unless the honest Aggregator "forgets" the randomness it used. Describe this "attack" in "Security Considerations" and say why it's irrelevant.¶
Assume that all parties, including all but one Client, both Aggregators, and the Collector, are controlled by the attacker. The best the honest Client can hope for is that its measurement has "local-DP". This property is defined the same way as pure- or approximate-DP, but the bound on EPSILON that we would aim to achieve for local-DP would typically be larger than that in a more optimistic trust model.¶
If a trust model's assumptions turn out to be false in practice, then it is desirable for a DP policy to maintain some degree of privacy in a more pessimistic trust model. For example, a deployment of DAP might provide some mechanism to ensure that all reports that are consumed were generated by trusted Clients (e.g., a Trusted Execution Environment (TEE) at each Client). This gives us confidence that our assumptions in the OAMC trust model hold. But if this mechanism is broken (e.g., a flaw is found in the TEE), then it is desirable if the policy remains DP in the OAOC or OC model, perhaps with a weaker bound.¶
This section describes various mechanisms required for implementing DP policies. The algorithms are designed to securely expand a short, uniform random seed into a sample from the given target distribution.¶
TODO For now we don't actually expand a seed into a sample; we just use coin flips that are local to the relevant algorithm. Update the API so that it takes a random seed instead, so that we can derive test vectors.¶
Each mechanism has internal parameters that determine how much noise will be added to its input data. Note that a mechanism that is initialized with its internal parameters can achieve different combinations of DP parameters, e.g. (EPSILON, DELTA)-DP or (EPSILON', DELTA')-DP, where EPSILON < EPSILON' and DELTA > DELTA', because if we make EPSILON larger (i.e., weaker privacy), we may achieve a smaller DELTA and thereby a smaller overall DP bound (i.e., stronger privacy).¶
A concrete DP mechanism implements the following methods:¶
DpMechanism.add_noise(data: DataType) -> DataType adds noise to the input data (i.e., a measurement or an aggregate share). Some DP mechanisms apply noise based on the input data.¶
DpMechanism.sample_noise(dimension: int) -> DataType samples noise of length dimension with the DP mechanism. The interpretation of dimension generally depends on the data type. It may be called by DpMechanism.add_noise().¶
DpMechanism.debias(data: DataType, meas_count: int) -> DebiasedDataType debiases the noised data based on the number of measurements meas_count. Note that not all mechanisms implement this method; some, such as the mechanism of Section 4.3, do.¶
Putting this together, a DP mechanism is a concrete subclass of DpMechanism, defined below:¶
class DpMechanism:
    # The data type applicable to this `DpMechanism`. The type is the
    # same for the noised data and the un-noised data.
    DataType = None

    # Debiased data type after removing bias added by the noise. For
    # most of the mechanisms, `DebiasedDataType == DataType`.
    DebiasedDataType = None

    def add_noise(self, data: DataType) -> DataType:
        """Add noise to a piece of input data."""
        raise NotImplementedError()

    def sample_noise(self, dimension: int) -> DataType:
        """
        Sample noise with the initialized `DpMechanism`. `dimension`
        is used to determine the length of the output if `DataType`
        is a list.
        """
        raise NotImplementedError()

    def debias(self,
               data: DataType,
               meas_count: int) -> DebiasedDataType:
        """
        Debias the data due to the added noise, based on the number
        of measurements `meas_count`. This doesn't apply to all DP
        mechanisms. Some Client randomization mechanisms need this
        functionality.
        """
        return data¶
This section describes Symmetric RAPPOR, a DP mechanism first proposed in [EPK14], and refined in Appendix C.1 of [MJTB_22]. (The specification here reflects the refined version.) It is initialized with a parameter EPSILON_0. It takes in a bit vector and outputs a noisy version with randomly flipped bits.¶
Symmetric RAPPOR applies the "binary randomized response mechanism" at each coordinate. Binary randomized response takes in a single bit x. The bit is flipped to 1 - x with probability 1 / (exp(EPSILON_0) + 1). For example, if EPSILON_0 is configured to be 3, and the input to binary randomized response is a 0, the bit will be flipped to 1 with probability 1 / (exp(3) + 1); otherwise, it will stay a 0.¶
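The following sketch is for illustration only: it evaluates the flip probability for EPSILON_0 = 3 and checks that the likelihood ratio of reporting a 1 on input 1 versus input 0 is exactly exp(EPSILON_0), i.e., binary randomized response satisfies EPSILON_0-DP under replacement.¶

import math

# Illustrative sketch: binary randomized response with EPSILON_0 = 3.
eps0 = 3.0
p_flip = 1.0 / (math.exp(eps0) + 1.0)  # probability of flipping the bit

# Probability of reporting a 1 when the true bit is 1 vs. when it
# is 0.
p_one_given_one = 1.0 - p_flip
p_one_given_zero = p_flip

# The likelihood ratio is exactly exp(EPSILON_0).
assert abs(p_one_given_one / p_one_given_zero - math.exp(eps0)) < 1e-9¶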
Under the OC trust model, by applying binary randomized response with parameter EPSILON_0 to its measurement, the Client gets EPSILON_0-DP with deletion (Definition II.4 of [EFMR_20], and Definition C.1 of [MJTB_22]). A formal definition of deletion-DP is given in Section 4.3.1.¶
Symmetric RAPPOR generalizes the binary randomized response mechanism by applying it independently at all coordinates of a Client's bit vector. Under the OAMC trust model, it is proven in Appendix C.1.3 of [MJTB_22] that strong (EPSILON, DELTA)-DP can be achieved by aggregating a batch of noisy Client measurements, each of which is a bit vector with exactly one bit set, and is noised with symmetric RAPPOR. The final aggregate result needs to be "debiased" due to the noise added by the Clients. Although the raw aggregate result is a vector of integers, the debiased aggregate result is a vector of floats, because the best estimate of the true sum depends on exp(EPSILON_0), and generally won't be an integer.¶
Since the noise generated by each Client at each coordinate is independent, as the number of Clients n grows, the noise distribution at each coordinate approximates a Gaussian distribution, with mean 0, and standard deviation sqrt(n * exp(EPSILON_0) / (exp(EPSILON_0) - 1)^2), as proved by Theorem C.2 of [MJTB_22].¶
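For intuition, the following sketch (illustrative only; the parameter values are chosen arbitrarily) evaluates this standard deviation for one choice of n and EPSILON_0.¶

import math

# Illustrative sketch: per-coordinate standard deviation of the
# aggregated RAPPOR noise, per Theorem C.2 of [MJTB_22].
def rappor_noise_stddev(n: int, eps0: float) -> float:
    exp_eps = math.exp(eps0)
    return math.sqrt(n * exp_eps / (exp_eps - 1.0) ** 2)

# With n = 100000 Clients and EPSILON_0 = 5.0, the standard
# deviation at each coordinate is roughly 26.1.
print(rappor_noise_stddev(100000, 5.0))¶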
EPSILON_0-DP in Deletion-DP
JC: We only add a definition of deletion-DP here since this is likely the only mechanism that provides deletion-DP in the OC trust model. Putting it in overview might confuse readers early on, because we said only replacement-DP applies to DAP.¶
Let P1(S, D, E) denote the probability that a randomized algorithm S, on an input measurement D, outputs a noisy measurement E. Let P2(R, E) denote the probability that a noisy output from the reference distribution R is equal to E. A randomized algorithm S is said to be EPSILON_0-DP in the deletion-DP model, if there exists a reference distribution R, such that for all possible Client measurements D, and all possible noisy outputs E, we have:¶
-EPSILON_0 <= ln(P1(S, D, E) / P2(R, E)) <= EPSILON_0¶
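As a non-normative check, the sketch below evaluates this inequality for binary randomized response with EPSILON_0 = 3, taking the reference distribution R to be a fair coin. This choice of R is made only for illustration; it is not necessarily the reference distribution used in [EFMR_20] or [MJTB_22].¶

import math

# Illustrative sketch: check the deletion-DP inequality for binary
# randomized response, with the reference distribution R chosen to
# be a fair coin (an arbitrary choice for illustration).
eps0 = 3.0
p_flip = 1.0 / (math.exp(eps0) + 1.0)

def p1(x: int, e: int) -> float:
    # Probability that binary randomized response outputs `e` on
    # input bit `x`.
    return p_flip if e != x else 1.0 - p_flip

def p2(e: int) -> float:
    # Reference distribution R: a fair coin.
    return 0.5

for x in (0, 1):
    for e in (0, 1):
        log_ratio = math.log(p1(x, e) / p2(e))
        assert -eps0 <= log_ratio <= eps0¶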
Intuitively, if we think of the reference distribution R as an average, noisy Client measurement, deletion-DP makes it hard to distinguish the Client measurement D from the measurement of an average Client.¶
Symmetric RAPPOR is specified by the DP mechanism SymmetricRappor below.¶
TODO: We could make the sampler more efficient if we use binomial.¶
import math
import random

class SymmetricRappor(DpMechanism):
    DataType = list[int]

    # Debiasing produces an array of floats.
    DebiasedDataType = list[float]

    def __init__(self, eps0: float):
        self.eps0 = eps0
        self.p = 1.0 / (math.exp(eps0) + 1.0)

    def add_noise(self, data: DataType) -> DataType:
        # Apply binary randomized response at each coordinate, based
        # on Appendix C.1.1 of {{MJTB+22}}.
        return list(map(
            lambda x: 1 - x if self.coin_flip() else x,
            data
        ))

    def sample_noise(self, dimension: int) -> DataType:
        # Sample binary randomized response at each coordinate on an
        # all zero vector.
        return [int(self.coin_flip()) for coord in range(dimension)]

    def debias(self,
               data: DataType,
               meas_count: int) -> DebiasedDataType:
        # Debias the data based on Appendix C.1.2 of {{MJTB+22}}.
        exp_eps = math.exp(self.eps0)
        return list(map(
            lambda x: (x * (exp_eps + 1) / (exp_eps - 1)
                       - meas_count / (exp_eps - 1)),
            data
        ))

    def coin_flip(self):
        return random.random() < self.p¶
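The following usage sketch is illustrative only and assumes the class defined above. It noises a one-hot measurement as a Client would, then simulates a small batch and debiases the raw sum as the Collector would.¶

# Illustrative usage of `SymmetricRappor` (assumes the class above).
rappor = SymmetricRappor(eps0=5.0)

# A Client noises its one-hot measurement before sharding.
noisy = rappor.add_noise([0, 1, 0, 0])

# Simulate 1000 Clients that all hold the same measurement, sum the
# noisy reports coordinate-wise, and debias the raw sum.
meas_count = 1000
agg = [0, 0, 0, 0]
for _ in range(meas_count):
    report = rappor.add_noise([0, 1, 0, 0])
    agg = [a + r for (a, r) in zip(agg, report)]

# The debiased estimate should be close to [0, 1000, 0, 0].
print(rappor.debias(agg, meas_count))¶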
This section defines a generic interface for DP policies for VDAFs. A DP policy composes the following operations with execution of a VDAF:¶
An optional Client randomization mechanism that adds noise to Clients' measurements prior to sharding.¶
An optional Aggregator randomization mechanism that adds noise to an Aggregator's aggregate share prior to unsharding.¶
An optional debiasing step that removes the bias in DP mechanisms (i.e. DpMechanism.debias) after unsharding.¶
The composition of Client and Aggregator randomization mechanisms defines the DP policy for a VDAF, and enforces the DP guarantee. In particular, a concrete DP policy is a subclass of DpPolicy:¶
class DpPolicy:
    # Client measurement type.
    Measurement = None

    # Aggregate share type, owned by an Aggregator.
    AggShare = None

    # Aggregate result type, unsharded result from all Aggregators.
    AggResult = None

    # Debiased aggregate result type.
    DebiasedAggResult = None

    def add_noise_to_measurement(self,
                                 meas: Measurement) -> Measurement:
        """
        Add noise to measurement, if required by the Client
        randomization mechanism. The default implementation is to do
        nothing.
        """
        return meas

    def add_noise_to_agg_share(self,
                               agg_share: AggShare) -> AggShare:
        """
        Add noise to aggregate share, if required by the Aggregator
        randomization mechanism. The default implementation is to do
        nothing.
        """
        return agg_share

    def debias_agg_result(self,
                          agg_result: AggResult,
                          meas_count: int) -> DebiasedAggResult:
        """
        Debias aggregate result, if either of the Client or
        Aggregator randomization mechanism requires this operation,
        based on the number of measurements `meas_count`. The default
        implementation is to do nothing.
        """
        return agg_result¶
The execution of DpPolicy with a Vdaf can thus be summarized by the following computational steps (these are carried out by DAP in a distributed manner):¶
def run_vdaf_with_dp_policy(
        dp_policy: DpPolicy,
        Vdaf: Vdaf,
        agg_param: Vdaf.AggParam,
        measurements: list[DpPolicy.Measurement]):
    """Run the DP policy with VDAF on a list of measurements."""
    nonces = [
        gen_rand(Vdaf.NONCE_SIZE)
        for _ in range(len(measurements))
    ]
    verify_key = gen_rand(Vdaf.VERIFY_KEY_SIZE)

    out_shares = []
    for (nonce, measurement) in zip(nonces, measurements):
        assert len(nonce) == Vdaf.NONCE_SIZE

        # Each Client adds Client randomization noise to its
        # measurement.
        noisy_measurement = \
            dp_policy.add_noise_to_measurement(measurement)

        # Each Client shards its measurement into input shares.
        rand = gen_rand(Vdaf.RAND_SIZE)
        (public_share, input_shares) = \
            Vdaf.shard(noisy_measurement, nonce, rand)

        # Each Aggregator initializes its preparation state.
        prep_states = []
        outbound = []
        for j in range(Vdaf.SHARES):
            (state, share) = Vdaf.prep_init(verify_key, j,
                                            agg_param,
                                            nonce,
                                            public_share,
                                            input_shares[j])
            prep_states.append(state)
            outbound.append(share)

        # Aggregators recover their output shares.
        for i in range(Vdaf.ROUNDS-1):
            prep_msg = Vdaf.prep_shares_to_prep(agg_param, outbound)
            outbound = []
            for j in range(Vdaf.SHARES):
                out = Vdaf.prep_next(prep_states[j], prep_msg)
                (prep_states[j], out) = out
                outbound.append(out)

        # The final outputs of the prepare phase are the output
        # shares.
        prep_msg = Vdaf.prep_shares_to_prep(agg_param, outbound)
        outbound = []
        for j in range(Vdaf.SHARES):
            out_share = Vdaf.prep_next(prep_states[j], prep_msg)
            outbound.append(out_share)
        out_shares.append(outbound)

    num_measurements = len(measurements)

    # Each Aggregator aggregates its output shares into an
    # aggregate share, and adds any Aggregator randomization
    # mechanism to its aggregate share. In a distributed VDAF
    # computation, the aggregate shares are sent over the network.
    agg_shares = []
    for j in range(Vdaf.SHARES):
        out_shares_j = [out[j] for out in out_shares]
        agg_share_j = Vdaf.aggregate(agg_param, out_shares_j)
        # Each Aggregator independently adds noise to its aggregate
        # share.
        noised_agg_share_j = \
            dp_policy.add_noise_to_agg_share(agg_share_j)
        agg_shares.append(noised_agg_share_j)

    # Collector unshards the aggregate.
    agg_result = Vdaf.unshard(agg_param, agg_shares,
                              num_measurements)

    # Debias aggregate result.
    debiased_agg_result = dp_policy.debias_agg_result(
        agg_result, num_measurements
    )
    return debiased_agg_result¶
TODO: Specify integration of a DpPolicy into DAP.¶
Many applications require aggregating histograms in which each Client submits a bit vector with exactly one bit set, also known as a "one-hot vector".¶
We describe two policies that achieve (EPSILON, DELTA)-DP for this use case. The first uses only a Client randomization mechanism and targets the OAMC trust model. The second uses only an Aggregator randomization mechanism and targets the more stringent OAOC trust model. We find that the two policies, under different settings of EPSILON and DELTA, provide comparable utility, except that the policy in the OAOC trust model requires all Aggregators to independently add noise, so we lose some utility when more than one Aggregator is honest.¶
Client randomization allows Clients to protect their privacy by adding noise to their measurements directly, as described in Appendix B.1. Analyses ([FMT20] and [FMT22]) have shown that, in the OAMC trust model, we get good approximate-DP by aggregating noisy Client measurements produced with Client randomization. In this policy, we describe how to achieve approximate-DP with each Client applying symmetric RAPPOR to its measurement.¶
Our target VDAF is Prio3Histogram as specified in [VDAF]. This uses the Histogram circuit to enforce one-hotness of the measurement. Due to the noising mechanism, a less strict circuit is required that tolerates a bounded number of non-0 entries in the vector. We call this Prio3MultiHotHistogram.¶
JC: Specify Prio3MultiHotHistogram. This may end up in the base VDAF draft. In the meantime, there is a reference implementation here: https://github.com/cfrg/draft-irtf-cfrg-vdaf/blob/main/poc/vdaf_prio3.py¶
The Client randomization we will use here is the symmetric RAPPOR mechanism of Section 4.3, which is initialized with an EPSILON_0 parameter. We get (EPSILON, DELTA)-DP in the aggregate result as long as there are at least batch_size honest Clients, each of which applies symmetric RAPPOR to its measurement and contributes the noisy measurement to the batch. The (EPSILON, DELTA)-DP degrades gracefully as the number of honest Clients decreases, i.e., we can still achieve (EPSILON', DELTA)-DP, where EPSILON' is larger than EPSILON.¶
TODO(junyechen1996): Justify why RR with EPSILON_0 + batch_size can achieve (EPSILON, DELTA)-DP in the aggregate result.¶
Because applying symmetric RAPPOR to a one-hot Client measurement can cause the noisy measurement to have multiple bits set, we need to check that the noisy measurement has at most m 1s, per Section 4.5 of [TWMJ_23], to ensure robustness against malicious Clients, who attempt to bias the final histogram by setting many coordinates to 1.¶
Assume the length of the Client measurement is d, and there is exactly one bit set. For the d - 1 coordinates with 0s, the probability p_0 of changing a coordinate from 0 to 1 is 1 / (exp(EPSILON_0) + 1) per Section 4.3, so we can model the number of 1s in the noisy measurement as a binomial random variable C with number of trials d - 1, and probability p_0, plus the one bit that is already set. Our goal is to ensure the probability p of 1 + C exceeding m is small enough, i.e., the false positive rate of a noisy measurement from an honest Client having more than m bits is at most p. This is equivalent to finding m and p, such that the cumulative distribution function (CDF) satisfies Pr(C <= m - 1) >= 1 - p.¶
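One possible way to search for m is sketched below for illustration; the helper name and parameter values are hypothetical and not part of this document.¶

import math

# Illustrative sketch: find the smallest bound `m` such that an
# honest Client's noisy measurement has more than `m` bits set with
# probability at most `p`. The helper name and parameters are
# hypothetical.
def find_multi_hot_bound(d: int, eps0: float, p: float) -> int:
    p0 = 1.0 / (math.exp(eps0) + 1.0)
    # C ~ Binomial(d - 1, p0); find the smallest m with
    # Pr(C <= m - 1) >= 1 - p.
    cdf = 0.0
    for k in range(d):
        cdf += (math.comb(d - 1, k)
                * p0 ** k * (1.0 - p0) ** (d - 1 - k))
        if cdf >= 1.0 - p:
            return k + 1
    return d

# Example: a histogram of length 100, EPSILON_0 = 5.0, and a false
# positive rate of at most 2^-64.
print(find_multi_hot_bound(100, 5.0, 2.0 ** -64))¶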
Once we find m, we will use it to instantiate Prio3MultiHotHistogram to perform verification and aggregation. The final aggregate result is debiased based on the number of measurements according to Section 4.3, in order to reduce the bias introduced during Client randomization.¶
class MultiHotHistogramWithClientRandomization(DpPolicy):
    Field = MultiHotHistogram.Field
    Measurement = MultiHotHistogram.Measurement
    AggShare = list[Field]
    AggResult = MultiHotHistogram.AggResult
    DebiasedAggResult = SymmetricRappor.DebiasedDataType

    def __init__(self, eps0: float):
        # TODO(junyechen1996): Justify how eps0 + batch_size can
        # achieve `(EPSILON, DELTA)`-DP.
        self.rappor = SymmetricRappor(eps0)

    def add_noise_to_measurement(self,
                                 meas: Measurement) -> Measurement:
        return self.rappor.add_noise(meas)

    def debias_agg_result(self,
                          agg_result: AggResult,
                          meas_count: int) -> DebiasedAggResult:
        return self.rappor.debias(agg_result, meas_count)¶
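The usage sketch below is illustrative only. It assumes the class above together with the Prio3MultiHotHistogram definitions it references, which are not yet specified in this document.¶

# Illustrative usage (assumes the class above and the
# Prio3MultiHotHistogram definitions it references).
policy = MultiHotHistogramWithClientRandomization(eps0=5.0)

# Each Client noises its one-hot measurement before sharding.
noisy_measurement = policy.add_noise_to_measurement([0, 1, 0, 0])

# After unsharding, the Collector debiases the aggregate result,
# using the number of measurements in the batch.
debiased = policy.debias_agg_result([12, 980, 9, 14],
                                    meas_count=1000)¶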
As discussed in Section 4.3, as the number of Clients n increases, the noise at each coordinate of the debiased aggregate result approximates a Gaussian distribution with mean 0, and standard deviation sqrt(n * exp(EPSILON_0) / (exp(EPSILON_0) - 1)^2). We will look at the standard deviation generated by symmetric RAPPOR from n Clients, with parameter choices that achieve various combinations of (EPSILON, DELTA)-DP.¶
EPSILON | DELTA | Standard deviation | Internal Parameters |
---|---|---|---|
0.317 | 1e-9 | 26.1337 | n = 100000, EPSILON_0 = 5.0 |
0.906 | 1e-9 | 12.2800 | n = 100000, EPSILON_0 = 6.5 |
1.528 | 1e-9 | 9.5580 | n = 100000, EPSILON_0 = 7.0 |
Aggregator randomization requires Aggregators to add noise to their aggregate shares before outputting them. Under the OAOC trust model, we can achieve good EPSILON-DP, or (EPSILON, DELTA)-DP, as long as at least one of the Aggregators is honest. The amount of noise needed by the Aggregator randomizer typically depends on the target DP parameters EPSILON and DELTA, and also on the SENSITIVITY (Section 3.2) of the aggregation function.¶
In this section, we describe how to achieve (EPSILON, DELTA)-DP for Prio3Histogram (Section 7.4.4 of [VDAF]) by asking each Aggregator to independently add discrete Gaussian noise to its aggregate share.¶
We use the discrete Gaussian mechanism described in Section 4.2, which has a mean of 0, and is initialized with a SIGMA parameter that stands for the standard deviation of the Gaussian distribution. In order to achieve (EPSILON, DELTA)-DP in the OAOC trust model, all Aggregators need to independently add discrete Gaussian noise to all coordinates of their aggregate shares.¶
Theorem 8 in [BW18] shows how to compute the SIGMA parameter for the continuous Gaussian, based on the given (EPSILON, DELTA)-DP parameters and L2-sensitivity, and Theorem 7 in [CKS20] shows a similar result for the discrete Gaussian. For the current use case, the L2-sensitivity is sqrt(2), because transforming one one-hot vector into another affects two coordinates; e.g., transforming the one-hot vector [1, 0] to [0, 1] changes the aggregate by sqrt((1 - 0)^2 + (0 - 1)^2) = sqrt(2) in L2 distance. Algorithm 1 in [BW18] elaborates on how to compute SIGMA, and we will refer to the calculation in the open-source code [AGM].¶
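The sketch below is illustrative only and is not a substitute for the calculation in [AGM]. It shows one way SIGMA could be calibrated: binary-search for the smallest SIGMA satisfying the exact (EPSILON, DELTA)-DP condition for the continuous Gaussian given in Theorem 8 of [BW18]. Per the note that follows, additional care is needed for the discrete Gaussian when SIGMA is small.¶

import math

# Illustrative sketch: calibrate SIGMA for the continuous Gaussian
# mechanism via the exact condition of Theorem 8 in [BW18]. The
# reference code [AGM] should be preferred in practice.
def gaussian_cdf(x: float) -> float:
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def analytic_gaussian_delta(sigma: float, eps: float,
                            sensitivity: float) -> float:
    # Smallest DELTA for which adding N(0, sigma^2) noise to a
    # function with the given L2-sensitivity is (eps, DELTA)-DP.
    a = sensitivity / (2.0 * sigma)
    b = eps * sigma / sensitivity
    return gaussian_cdf(a - b) - math.exp(eps) * gaussian_cdf(-a - b)

def calibrate_sigma(eps: float, delta: float,
                    sensitivity: float) -> float:
    # Binary search for the smallest sigma achieving the target
    # delta.
    lo, hi = 1e-6, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if analytic_gaussian_delta(mid, eps, sensitivity) <= delta:
            hi = mid
        else:
            lo = mid
    return hi

# Example: EPSILON = 0.317, DELTA = 1e-9, L2-sensitivity sqrt(2).
print(calibrate_sigma(0.317, 1e-9, math.sqrt(2.0)))¶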
JC: We will need to provide an explanation of the parameter calculation in the draft itself, instead of merely referring to the code.¶
KT: [CKS20] mentions in Theorem 7, regarding the difference in approximate-DP achieved by the discrete Gaussian [CKS20] and the continuous Gaussian [BW18], that the discrete and continuous Gaussian attain almost identical guarantees for large SIGMA, but the discretization creates a small difference that becomes apparent for small SIGMA. Therefore, if we use Algorithm 1 from [BW18], we will need to be careful when SIGMA is small, i.e., when the desired (EPSILON, DELTA)-DP is weak.¶
JC: It's also worth exploring the utility of discrete Gaussian proven by Definition 3 in [CKS20], which defines the concentrated-DP achieved by discrete Gaussian. Concentrated-DP is then converted to approximate-DP, which is our target here. As a first draft, we won't overwhelm readers with other types of DP.¶
import math

class HistogramWithAggregatorRandomization(DpPolicy):
    Field = Histogram.Field

    # A measurement is an unsigned integer, indicating an index less
    # than `Histogram.length`.
    Measurement = Histogram.Measurement
    AggShare = list[Field]
    AggResult = Histogram.AggResult

    # The final aggregate result should be a vector of signed
    # integers, because discrete Gaussian could produce negative
    # noise that may have a larger absolute value than the count
    # before noise.
    DebiasedAggResult = list[int]

    def __init__(self, epsilon: float, delta: float):
        # TODO(junyechen1996): Consider using fixed precision or
        # large decimal for parameters like `epsilon` and `delta`.
        # (#23)
        #
        # Transforming a one-hot vector into another will affect two
        # coordinates, e.g. from [1, 0, 0] to [0, 1, 0], so the
        # change in L2-sensitivity is `sqrt(1 + 1) = sqrt(2)`.
        dgauss_sigma = agm.calibrateAnalyticGaussianMechanism(
            epsilon, delta, math.sqrt(2.0)
        )
        self.dgauss_mechanism = \
            DiscreteGaussianWithZeroMean(dgauss_sigma)

    def add_noise_to_agg_share(self,
                               agg_share: AggShare) -> AggShare:
        """
        Sample discrete Gaussian noise, and merge it with the
        aggregate share.
        """
        noise_vec = \
            self.dgauss_mechanism.sample_noise(len(agg_share))
        result = []
        for (agg_share_i, noise_vec_i) in zip(agg_share, noise_vec):
            if noise_vec_i < 0:
                noise_vec_i = self.Field.MODULUS + noise_vec_i
            result.append(agg_share_i + self.Field(noise_vec_i))
        return result

    def debias_agg_result(self,
                          agg_result: AggResult,
                          meas_count: int) -> DebiasedAggResult:
        # TODO(junyechen1996): Interpret large unsigned integers as
        # negative values or 0 properly. For now, directly return it,
        # since we haven't fully implemented the discrete Gaussian
        # mechanism (#10).
        return agg_result¶
We will demonstrate the utility of this policy in the table below in terms of the standard deviation SIGMA of the discrete Gaussian needed to achieve various combinations of (EPSILON, DELTA)-DP, based on the open-source code in [AGM].¶
It's worth noting that if more than one Aggregator is honest, we will lose some utility because each Aggregator is independently adding noise. The standard deviation in the aggregate result thus becomes SIGMA * sqrt(c), where c is the number of honest Aggregators. In the table below, the numbers in the "Standard deviation" column assume c = 2. The numbers in the "Standard deviation (OAOC)" column assume only one Aggregator is honest.¶
EPSILON | DELTA | Standard deviation | Standard deviation (OAOC) |
---|---|---|---|
0.317 | 1e-9 | 33.0788 | 23.3903 |
0.906 | 1e-9 | 12.0777 | 8.5402 |
1.528 | 1e-9 | 7.3403 | 5.1904 |
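As a quick check of the relationship between the two columns (illustrative only), the combined noise of c independent Gaussians, each with standard deviation SIGMA, has standard deviation SIGMA * sqrt(c).¶

import math

# Illustrative sketch: combined standard deviation when `c` honest
# Aggregators each add independent Gaussian noise with standard
# deviation SIGMA.
def combined_stddev(sigma: float, c: int) -> float:
    return sigma * math.sqrt(c)

# First row of the table: a per-Aggregator SIGMA of about 23.3903
# (the OAOC column) yields about 33.08 when both Aggregators are
# honest (c = 2).
print(combined_stddev(23.3903, 2))¶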
TODO Security¶
This document has no IANA actions.¶
Pierre Tholoniat, Columbia University, pierre@cs.columbia.edu¶
Differential privacy is a set of techniques used to protect the privacy of individuals when analyzing users' data. It provides a mathematical framework that ensures the analysis of a dataset does not reveal identifiable information about any specific individual. The advantage of differential privacy is that it provides a strong, quantifiable, and composable privacy guarantee. The main idea of differential privacy is to add carefully calibrated noise to the results, which makes it difficult to determine with high certainty whether a specific individual's data was included in the results or not.¶
KT: I think we should distinguish between the randomizer and the DP guarantee. So I have attempted to use Client randomizer and Aggregator randomizer to describe the noise addition by those two, and Local DP and Aggregator DP to refer to the privacy guarantee. The distinction is important because Client randomizer + aggregate gives an aggregator DP guarantee.¶
There are two levels of privacy protection: local differential privacy (local DP) and aggregator differential privacy (aggregator DP).¶
OPEN ISSUE: or call it secure aggregator dp, or central dp.¶
In the local-DP setting, Clients apply noise to their own measurements. In this way, Clients have some control over protecting the privacy of their own data. Any measurement uploaded by a Client will have local DP, so the Client's privacy is protected even if none of the Aggregators is honest (although this protection may be weak). Furthermore, one can analyze the aggregator DP guarantee with privacy amplification by aggregation, assuming each Client has added the required amount of local DP noise and there are at least a minimum batch size number of Clients in the aggregation.¶
In the Aggregator randomization setting, an Aggregator applies noise to the aggregate. This approach relies on the server being secure and trustworthy. Aggregators built using the DAP protocol are ideal for this setting because DAP ensures no server can access any individual's data, only the aggregate.¶
If no local DP noise is added by Clients, the noise added to the aggregate alone provides the privacy guarantee for the aggregate.¶
JC: For now, we have been assuming either Client randomization or Aggregator randomization gives the target DP parameters. Theoretically one can use Aggregator randomization together with Client randomization to achieve DP. For example, if the DP guarantee can be achieved with Client randomization from a batch size number of Clients, and the batch size is not reached when a data collection task expires, each Aggregator can "compensate" for the remaining noise by applying the same Client randomizer for the missing Clients, i.e., the gap between the actual number of Clients and the target batch size.¶
There are primarily two models in the literature for defining two "neighboring batches": deletion (or removal) of one measurement, and replacement (or substitution) of one measurement with another [KM11]. In the DAP setting, the protocol leaks the number of measurements in each batch collected and the appropriate version of deletion-DP considers substitution by a fixed value (e.g. zero). In other words, two batches of measurements D1 and D2 are "neighboring" for deletion-DP if D2 can be obtained from D1 by replacing one measurement by a fixed reference value.¶
In some cases, a weaker notion of adjacency may be appropriate. For example, we may be interested in hiding single coordinates of the measurement, rather than the whole vector of measurements. In this case, neighboring datasets differ in one coordinate of one measurement.¶
TODO: Chris P to fill: user or report, given time¶
There are various types of DP guarantees and budgets that can be enforced. Many applications need to query the Client data multiple times, for example:¶
Federated machine learning applications require multiple aggregates to be computed over the same underlying data, but with different machine learning model parameters.¶
[MJTB_22] describes an interactive approach to building histograms over multiple iterations, and Section 4.3 describes a way to track the Client-side budget when the Client data is queried multiple times.¶
TODO: have citations for machine learning¶
It’s hard for Aggregator(s) to keep track of the privacy budget over time, because different Clients can participate in different data collection tasks, and only Clients know when their data is queried. Therefore, Clients must enforce the privacy budget.¶
There could be multiple ways to compose DP guarantees, based on different DP composition theorems. In the various example DP guarantees below, we describe the following:¶
EPSILON-DP, or (EPSILON, DELTA)-approximate DP
Pure EPSILON-DP was first proposed in [DMNS06], and a formal definition of (EPSILON, DELTA)-DP can be found in Definition 2.4 of [DR14].¶
The EPSILON parameter quantifies the "privacy loss" of observing the outcomes of querying two databases differing by one element. The smaller EPSILON is, the stronger the privacy guarantee is, that is, the outcomes of querying two adjacent databases are more or less the same. The DELTA parameter provides a small probability of the privacy loss exceeding EPSILON.¶
One can compose multiple (EPSILON, DELTA)-approximate DP guarantees, per Theorem 3.4 of [KOV15]. One can also compose the guarantees in another type of guarantee first, such as Rényi DP (Appendix B.5.1), and then convert the composed guarantee to an approximate DP guarantee.¶
(ALPHA, TAU)-Rényi DP
A formal definition of Rényi DP can be found in Definitions 3 and 4 of [Mir17].¶
The intuition behind Rényi-DP is to use the TAU parameter to measure the divergence of the probability distributions of querying two adjacent databases, given the Rényi order parameter ALPHA. The smaller the TAU parameter, the harder it is to distinguish the outputs from querying two adjacent databases, and thus the stronger the privacy guarantee is.¶
One can compose multiple Rényi DP guarantees based on Proposition 1 of [Mir17]. After composition, one can convert the (ALPHA, TAU)-Rényi DP guarantee to (EPSILON, DELTA)-approximate DP, per Proposition 12 of [CKS20].¶
A formal definition of zero Concentrated-DP can be found in Definition 1.1 of [BS16].¶
Zero Concentrated-DP uses different parameters from Rényi-DP, but uses a similar idea to measure the output distribution divergence of querying two adjacent databases.¶
One can compose multiple zCDP guarantees, per Lemma 1.7 of [BS16].¶
A differential privacy guarantee can only be achieved if the correct type of noise is applied to the given data type.¶
TODO: Junye to fill, mention DAP is expected to ensure the right pair of VDAF and DP mechanism¶
TODO: Chris P: we will mention Prio3SumVec because that's what we use to describe aggregator DP with amplification¶