Add noisy_avg_gaussian aggregation #20865
Codenotify: Notifying subscribers in CODENOTIFY files for diff 629070d...a2a3b4f.
39cfd46 to 35e7d87
Need help adding this issue to the backlog: #20869, similar to what was requested in this comment: #20810 (comment)
Just a bunch of nitpicky naming sorts of comments! There are a lot of references to sum rather than avg in the unit tests, so I didn't explicitly flag them all.
...a/com/facebook/presto/operator/aggregation/noisyaggregation/NoisySumAvgAggregationUtils.java
...rc/main/java/com/facebook/presto/operator/aggregation/noisyaggregation/NoisySumAvgState.java
...cebook/presto/operator/aggregation/noisyaggregation/TestNoisyAvgGaussianLongAggregation.java
...resto/operator/aggregation/noisyaggregation/TestNoisyAvgGaussianShortDecimalAggregation.java
35e7d87 to d7778a9
Fixed all the comments.
@mlyublena @pranjalssh please take a look when you have a chance.
This commit adds the `noisy_avg_gaussian` aggregation. It can be used to replace `avg(col)` with `noisy_avg_gaussian(col, noiseScale[, lower, upper][, randomSeed])`. This is one of the aggregations in our effort to add Presto UDFs for noisy aggregations, used as building blocks for differential privacy in Presto.
`col` can be of any of these numeric types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL. Because the noise is of type `double`, all values are converted to `double` before being added to the sum, and the return type is `double`.
When a bound [lower, upper] is provided, each value is clipped to this range before being added to the sum (which is later used to compute the avg).
The optional randomSeed is used to get a fixed value of noise, often for reproducibility purposes. If randomSeed is omitted, SecureRandom is used. If randomSeed is provided, Random is used.
Why we want these functions: the purpose is to help build systems, tools, and frameworks that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. The current implementation involves complicated SQL operations even for the simplest aggregations, increasing development time, complexity, and maintenance and sharing costs, and sometimes completely blocking development of new features.
While these functions on their own do not guarantee 100% differential privacy, they are the building blocks for other systems. That is also why we do not call these functions “differentially private aggregations” but only “noisy aggregations”, to avoid giving the wrong impression that differential privacy is achieved solely by using these functions.
d7778a9 to a2a3b4f
Description
This commit adds the `noisy_avg_gaussian` aggregation. It can be used to replace `avg(col)` with `noisy_avg_gaussian(col, noiseScale[, lower, upper][, randomSeed])`.
This is a continuation of the previous PR(s) supporting `noisy_count_gaussian` and `noisy_sum_gaussian`.
`col` can be of any of these numeric types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.
Because the noise is of type `double`, all values are converted to `double` before being added to the sum that is used to compute the avg, and the return type is `double`.
When a bound [lower, upper] is provided, each value is clipped to this range before being added to the sum, which is used to compute the avg.
The optional randomSeed is used to get a fixed value of noise, often for reproducibility purposes. If randomSeed is omitted, SecureRandom is used. If randomSeed is provided, Random is used.
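To make the behavior described above concrete, here is a minimal standalone sketch in Java. It is not the code from this PR: the class `NoisyAvgSketch` and its method names are hypothetical, the input is a plain `double[]` instead of Presto blocks and accumulator state, and returning an empty result for zero rows is an assumption of the sketch, not documented behavior. It only illustrates the three ideas in the description: clipping to `[lower, upper]` before summing, adding zero-mean Gaussian noise scaled by `noiseScale` to the true average, and choosing `Random` or `SecureRandom` depending on whether a seed is given.

```java
import java.security.SecureRandom;
import java.util.OptionalDouble;
import java.util.OptionalLong;
import java.util.Random;

// Illustrative sketch only. Names are hypothetical and do not mirror the PR's classes.
final class NoisyAvgSketch {
    private NoisyAvgSketch() {}

    // Clip a single value into the closed range [lower, upper].
    static double clip(double value, double lower, double upper) {
        return Math.max(lower, Math.min(upper, value));
    }

    // Average of the clipped values plus Gaussian noise with mean 0 and
    // standard deviation noiseScale. Pass infinite bounds to disable clipping.
    static OptionalDouble noisyAvgGaussian(
            double[] values, double noiseScale, double lower, double upper, OptionalLong randomSeed) {
        if (noiseScale < 0) {
            throw new IllegalArgumentException("noiseScale must be non-negative");
        }
        if (lower > upper) {
            throw new IllegalArgumentException("lower must not exceed upper");
        }
        if (values.length == 0) {
            // Assumption of this sketch: no input rows yields no result.
            return OptionalDouble.empty();
        }

        double sum = 0.0;
        for (double value : values) {
            sum += clip(value, lower, upper);
        }
        double trueAvg = sum / values.length;

        // Seeded Random for reproducibility, SecureRandom otherwise.
        Random random = randomSeed.isPresent()
                ? new Random(randomSeed.getAsLong())
                : new SecureRandom();
        return OptionalDouble.of(trueAvg + random.nextGaussian() * noiseScale);
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, 100};
        // Clipped to [0, 10], the inputs become 1, 2, 3, 10 before averaging.
        System.out.println(noisyAvgGaussian(values, 1.0, 0.0, 10.0, OptionalLong.of(42L)));
        // No bounds and no seed: noise comes from SecureRandom and differs per run.
        System.out.println(noisyAvgGaussian(
                values, 1.0, Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY, OptionalLong.empty()));
    }
}
```

Calling the sketch twice with the same seed returns the same value, which is the reproducibility property that randomSeed is meant to provide. With `noiseScale` set to 0 it reduces to the plain average of the clipped values.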
Motivation and Context
This is one of the aggregations in our effort to add Presto UDFs for noisy aggregations, used as building blocks for differential privacy in Presto.
The purpose is to help build systems, tools, and frameworks that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. The current implementation involves complicated SQL operations even for the simplest aggregations, increasing development time, complexity, and maintenance and sharing costs, and sometimes completely blocking development of new features.
While these functions on their own do not guarantee 100% differential privacy, they are the building blocks for other systems. That is also why we do not call these functions “differentially private aggregations” but only “noisy aggregations” to avoid a wrong impression of achieving differential privacy solely by using these functions.
Impact
This commit adds the `noisy_avg_gaussian(col, noiseScale[, lower, upper][, randomSeed])` aggregation, which calculates the average (arithmetic mean) of all the input values and then adds random Gaussian noise with mean 0 and standard deviation `noiseScale` to the true avg. It also provides options to clip values to a range `[lower, upper]` and to set a random seed for reproducibility.
Test Plan
`noisy_avg_gaussian(col, noiseScale)`, `noisy_avg_gaussian(col, noiseScale, randomSeed)`, `noisy_avg_gaussian(col, noiseScale, lower, upper)`, `noisy_avg_gaussian(col, noiseScale, lower, upper, randomSeed)`: `avg(col)`, and tested when input has 0 rows, with and without GROUP BY
`noisy_avg_gaussian(col, noiseScale)`, `noisy_avg_gaussian(col, noiseScale, randomSeed)` compared to normal AVG:
Contributor checklist
Release Notes