-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add noisy_sum_gaussian function #20810
Conversation
Codenotify: Notifying subscribers in CODENOTIFY files for diff a6eb6bc...f63bcd1.
|
Most of line changes are kinda similar because
Not sure how to make those better but I think unittest is necessary Updated:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll defer to the Presto maintainers for the nitty gritty review of the Presto stuff, but I've left a few comments below, mostly relating to the docs, noise addition, and unit tests.
...java/com/facebook/presto/operator/aggregation/noisyaggregation/NoisySumAggregationUtils.java
Outdated
Show resolved
Hide resolved
...java/com/facebook/presto/operator/aggregation/noisyaggregation/NoisySumAggregationUtils.java
Show resolved
Hide resolved
...book/presto/operator/aggregation/noisyaggregation/TestNoisySumGaussianDoubleAggregation.java
Outdated
Show resolved
Hide resolved
...book/presto/operator/aggregation/noisyaggregation/TestNoisySumGaussianDoubleAggregation.java
Outdated
Show resolved
Hide resolved
...book/presto/operator/aggregation/noisyaggregation/TestNoisySumGaussianDoubleAggregation.java
Outdated
Show resolved
Hide resolved
a1f9e55
to
fd67677
Compare
@duykienvp : Please can you add an issue in https://github.com/orgs/prestodb/projects/3 to track implementing this function in Prestissimo (Presto native engine). |
@aditi-pandit I created 2 issues for this and the previous function noisy_count_gaussian() but cannot edit those issues to assign a project to them. Could you please help me add them to the project? |
fd67677
to
ae00a80
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
2a0ad18
to
b653d07
Compare
...n/src/main/java/com/facebook/presto/operator/aggregation/noisyaggregation/NoisySumState.java
Show resolved
Hide resolved
Please add documentation to the new classes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me on the algo side!
552e4c2
to
ab77184
Compare
@@ -27,6 +32,13 @@ protected QueryRunner createQueryRunner() | |||
return TpchQueryRunnerBuilder.builder().build(); | |||
} | |||
|
|||
@Override | |||
protected QueryRunner createExpectedQueryRunner() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Short answer is to make sure that the actual and expected queries are run with the same runner.
queryRunner is a TpchQueryRunnerBuilder (which I assume would give us TpchQueryRunner)
expectedQueryRunner is by default a H2QueryRunner.
I am not super what their difference is, but I think it is better for them to be the same.
In addition, in my WIP code to implement the next function in this series, noisy_avg_gaussian, without this override to make sure they are the same runner, I got an error where:
SELECT noisy_avg_gaussian(linenumber, 0) FROM lineitem
returns 3.004270876609888SELECT avg(linenumber) FROM lineitem
returns 3.0
I needed this override to get that test working. So I think it is better for them to be the same anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, let's address this in the PR for noisy_avg_gaussian. H2QueryRunner is returning 3.0 for SELECT avg(linenumber) FROM lineitem
which is weird and needs to be investigated. However, we can do that in parallel and not block this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3.0 is fishy, maybe its a type issue that it did a floor division? can you try avg(linenumber+.0)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a type issue, avg(linenumber+.0) returns the correct value. So its ok to use TpchQueryRunner to compare restults, and in the comment we can add that H2QueryRunner is doing averages as ints instead of floats(not sure why).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I tried avg(linenumber + 0.0) or avg(cast(linenumber as double)) and both worked.
Just for the next PR for avg, do you suggest overriding expectedQueryRunner or casting to double like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@duykienvp overriding is fine.
This commit adds NOISY_SUM_GAUSSIAN aggregation and refactors some files related to noisy_count_gaussian function to support this new function. This noisy_sum_gaussian is to replace `SUM(col)` with `NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed])`. This is one of aggregations in our effort to add Presto UDF for noisy aggregations, used as building block for differential privacy in Presto. `col` can be of numerical types: INT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL. Because noise is of type Double, all values are converted to Double before being added to the sum, and the return type is Double. When a bound [lower, upper] is provided, each value is clipped to this range before being added to the sum. Optional randomSeed is used to get a fixed value of noise, often for reproducibility purposes. If randomSeed is omitted, SecureRandom is used. If randomSeed is provided, Random is used. Why we want these functions: The purpose is to help build systems/tools/framework that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. Current implementation involves complicated SQL operation even for simplest aggregations, increasing development time, complexity, maintenance and sharing cost, and sometimes completely blocking development of new features. While these functions on their own do not guarantee 100% differential privacy, they are the building blocks for other systems. That is also why we do not call these functions “differentially private aggregations” but only “noisy aggregations” to avoid a wrong impression of achieving differential privacy solely by using these functions.
ab77184
to
f63bcd1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
This commit adds NOISY_SUM_GAUSSIAN aggregation and refactors some files related to noisy_count_gaussian function to support this new function. This noisy_sum_gaussian is to replace
SUM(col)
withNOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed])
. This is a continuation of the previous PR supporting noisy_count_gaussian.col
can be of numerical types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.Because noise is of type Double, all values are converted to Double before being added to the sum, and the return type is Double.
When a bound [lower, upper] is provided, each value is clipped to this range before being added to the sum. The sum is then processed to make sure the sign of the final sum is consistent with the range. So, if
lower >= 0
, the sum is then equal to the max of the sum and 0; ifupper <= 0
, the sum is then equal to min of the sum and 0.Optional randomSeed is used to get a fixed value of noise, often for reproducibility purposes. If randomSeed is omitted, SecureRandom is used. If randomSeed is provided, Random is used.
Motivation and Context
This is one of aggregations in our effort to add Presto UDF for noisy aggregations, used as building block for differential privacy in Presto.
The purpose is to help build systems/tools/framework that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. Current implementation involves complicated SQL operation even for simplest aggregations, increasing development time, complexity, maintenance and sharing cost, and sometimes completely blocking development of new features.
While these functions on their own do not guarantee 100% differential privacy, they are the building blocks for other systems. That is also why we do not call these functions “differentially private aggregations” but only “noisy aggregations” to avoid a wrong impression of achieving differential privacy solely by using these functions.
Impact
This commit adds
NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed])
aggregation which calculates the sum over the input values, and then adds random Gaussian noise with 0 mean and standard deviation ofnoise_scale
to the true sum. This also provides options to clip values to a range[lower, upper]
and a random seed for reproducibilityTest Plan
NOISY_SUM_GAUSSIAN(col, noiseScale)
,NOISY_SUM_GAUSSIAN(col, noiseScale, randomSeed)
,NOISY_SUM_GAUSSIAN(col, noiseScale, lower, upper)
,NOISY_SUM_GAUSSIAN(col, noiseScale, lower, upper randomSeed)
:NOISY_SUM_GAUSSIAN(col, noiseScale)
,NOISY_SUM_GAUSSIAN(col, noiseScale, randomSeed)
compared to normal SUM:Contributor checklist
Release Notes