Add noisy_sum_gaussian function #20810

duykienvp · 2023-09-08T19:30:22Z

Description

This commit adds the NOISY_SUM_GAUSSIAN aggregation and refactors some files related to the noisy_count_gaussian function to support it. NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed]) is intended as a replacement for SUM(col). This is a continuation of the previous PR, which added noisy_count_gaussian.

col can be of numerical types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.

Because noise is of type Double, all values are converted to Double before being added to the sum, and the return type is Double.

When bounds [lower, upper] are provided, each value is clipped to this range before being added to the sum. The sum is then post-processed so that the sign of the final sum is consistent with the range: if lower >= 0, the result is the max of the sum and 0; if upper <= 0, the result is the min of the sum and 0.
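The clipping and sign post-processing described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, and it uses `long` inputs only as an example of a type that is converted to double before summing.

```java
// Hypothetical sketch of the clipping rules in the description; not the PR's implementation.
final class NoisySumClippingSketch
{
    private NoisySumClippingSketch() {}

    // Clip a single value to the range [lower, upper].
    public static double clip(double value, double lower, double upper)
    {
        return Math.max(lower, Math.min(upper, value));
    }

    // Convert each value to double, clip it, then add it to the sum.
    public static double clippedSum(long[] values, double lower, double upper)
    {
        double sum = 0;
        for (long value : values) {
            sum += clip((double) value, lower, upper);
        }
        return sum;
    }

    // Keep the sign of the final sum consistent with the bounds.
    public static double postProcess(double sum, double lower, double upper)
    {
        if (lower >= 0) {
            return Math.max(sum, 0);
        }
        if (upper <= 0) {
            return Math.min(sum, 0);
        }
        return sum;
    }

    public static void main(String[] args)
    {
        // -5 is clipped to 0 and 20 is clipped to 10, so the clipped sum is 0 + 3 + 10.
        System.out.println(clippedSum(new long[] {-5, 3, 20}, 0, 10));
    }
}
```

With non-negative bounds, a final sum that ends up below zero is reported as 0, and symmetrically for non-positive bounds.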

The optional randomSeed fixes the noise value, typically for reproducibility. If randomSeed is omitted, SecureRandom is used; if it is provided, Random is used.
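A rough sketch of the random source selection and noise addition, assuming the noise is drawn with `nextGaussian()` and scaled by noiseScale. The class and method names are hypothetical, and representing an omitted seed as `null` is an assumption made for illustration only.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical sketch of the seeding behavior in the description; not the PR's implementation.
final class NoiseSourceSketch
{
    private NoiseSourceSketch() {}

    // Assumption: a null seed stands for "randomSeed omitted".
    public static Random chooseRandom(Long randomSeed)
    {
        return (randomSeed == null) ? new SecureRandom() : new Random(randomSeed);
    }

    // Add Gaussian noise with mean 0 and standard deviation noiseScale to the true sum.
    public static double addGaussianNoise(double trueSum, double noiseScale, Long randomSeed)
    {
        return trueSum + chooseRandom(randomSeed).nextGaussian() * noiseScale;
    }

    public static void main(String[] args)
    {
        // The same seed yields the same noisy sum on every call.
        System.out.println(addGaussianNoise(100.0, 2.0, 42L));
        System.out.println(addGaussianNoise(100.0, 2.0, 42L));
        // Without a seed, the result differs between calls.
        System.out.println(addGaussianNoise(100.0, 2.0, null));
    }
}
```

A fixed seed makes the output deterministic, which is convenient in tests but means the noise is predictable.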

Motivation and Context

This is one of the aggregations in our effort to add Presto UDFs for noisy aggregations, which are used as building blocks for differential privacy in Presto.

The purpose is to help build systems, tools, and frameworks that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. The current implementation involves complicated SQL operations even for the simplest aggregations, which increases development time, complexity, and maintenance and sharing costs, and sometimes completely blocks the development of new features.

While these functions do not guarantee differential privacy on their own, they are the building blocks for other systems. That is also why we call these functions “noisy aggregations” rather than “differentially private aggregations”: to avoid giving the wrong impression that differential privacy is achieved solely by using them.

Impact

This commit adds the NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed]) aggregation, which calculates the sum over the input values and then adds random Gaussian noise with mean 0 and standard deviation noiseScale to the true sum. It also provides options to clip values to a range [lower, upper] and to set a random seed for reproducibility.

Test Plan

Unit tests: Tested all of NOISY_SUM_GAUSSIAN(col, noiseScale), NOISY_SUM_GAUSSIAN(col, noiseScale, randomSeed), NOISY_SUM_GAUSSIAN(col, noiseScale, lower, upper), NOISY_SUM_GAUSSIAN(col, noiseScale, lower, upper, randomSeed):
- Available for all standard numerical data types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.
- The following tests cover the BIGINT, REAL, DOUBLE, and DECIMAL types. TINYINT, SMALLINT, and INTEGER were only tested to confirm that the type can be processed, because they are equivalent to BIGINT input
  - Tested with noiseScale < 0, noiseScale = 0, and fixed noiseScale values; for some noiseScale values, checked that the output is within 50 x noiseScale
  - Tested with NULL rows
  - Tested with valid and invalid clipping bounds
  - Tested with random seed
  - Ran local queries to compare the output to SUM(col), and tested inputs with 0 rows, with and without GROUP BY
- Tested on tpch schema for NOISY_SUM_GAUSSIAN(col, noiseScale), NOISY_SUM_GAUSSIAN(col, noiseScale, randomSeed) compared to normal SUM:
  - Clipping was not tested because normal SUM does not offer clipping
  - Only tested INTEGER, BIGINT, DOUBLE. Other types couldn't be found in the schema

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Adds `NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed])` aggregation, which calculates the sum over the input values and then adds random Gaussian noise with mean 0 and standard deviation `noiseScale` to the true sum.
  This also provides options to clip values to a range `[lower, upper]` and a random seed for reproducibility.
  All values are converted to `DOUBLE` before being added to the sum, and the return type is `DOUBLE`.
  When there are no input rows, this function returns `NULL`.
  When generating noise, if `randomSeed` is omitted, SecureRandom is used; otherwise, Random is used.

github-actions · 2023-09-08T19:31:26Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff a6eb6bc...f63bcd1.

Notify	File(s)
@steveburnett	presto-docs/src/main/sphinx/functions/aggregate.rst

duykienvp · 2023-09-08T19:38:22Z

Most of the line changes are similar because:

- The file structure of NoisySumGaussianDoubleAggregation was copied for the other types: Real, Long, Integer, SmallInt, TinyInt.
- The file structure of NoisySumGaussianDecimalAggregation was copied for the other variations: NoisySumGaussianDecimalLowerUpperAggregation, NoisySumGaussianDecimalLowerUpperRandomSeedAggregation, NoisySumGaussianDecimalRandomSeedAggregation.
- The file structure of the unit test TestNoisySumGaussianDoubleAggregation was copied for testing the other types.

Not sure how to make those better, but I think the unit tests are necessary.

Updated:

- Moved the implementations of each function signature for multiple types into a single file (similar to what we did with noisy_count_gaussian).
- The unit tests are still separate, though.

jonhehir

I'll defer to the Presto maintainers for the nitty gritty review of the Presto stuff, but I've left a few comments below, mostly relating to the docs, noise addition, and unit tests.

presto-docs/src/main/sphinx/functions/aggregate.rst

...java/com/facebook/presto/operator/aggregation/noisyaggregation/NoisySumAggregationUtils.java

...book/presto/operator/aggregation/noisyaggregation/TestNoisySumGaussianDoubleAggregation.java

aditi-pandit · 2023-09-11T15:21:14Z

@duykienvp : Please can you add an issue in https://github.com/orgs/prestodb/projects/3 to track implementing this function in Prestissimo (Presto native engine).

duykienvp · 2023-09-11T16:16:36Z

> @duykienvp : Please can you add an issue in https://github.com/orgs/prestodb/projects/3 to track implementing this function in Prestissimo (Presto native engine).

@aditi-pandit I created 2 issues for this and the previous function noisy_count_gaussian() but cannot edit those issues to assign a project to them. Could you please help me add them to the project?
#20826
#20827

steveburnett

LGTM! (docs)

...n/src/main/java/com/facebook/presto/operator/aggregation/noisyaggregation/NoisySumState.java

mlyublena · 2023-09-13T22:56:40Z

Please add documentation to the new classes.
Apart from that, this looks good to me from the Presto perspective. I trust the domain experts to do a more thorough review of the algorithms.

jonhehir

Looks good to me on the algo side!

pranjalssh · 2023-09-14T00:03:59Z

presto-tests/src/test/java/com/facebook/presto/tests/TestNoisyAggregations.java

@@ -27,6 +32,13 @@ protected QueryRunner createQueryRunner()
        return TpchQueryRunnerBuilder.builder().build();
    }

+    @Override
+    protected QueryRunner createExpectedQueryRunner()


Why is this needed?

Thanks. Short answer is to make sure that the actual and expected queries are run with the same runner.

queryRunner is a TpchQueryRunnerBuilder (which I assume would give us TpchQueryRunner)
expectedQueryRunner is by default a H2QueryRunner.

I am not sure what the difference between them is, but I think it is better for them to be the same.

In addition, in my WIP code to implement the next function in this series, noisy_avg_gaussian, without this override to make sure they are the same runner, I got an error where:

SELECT noisy_avg_gaussian(linenumber, 0) FROM lineitem returns 3.004270876609888

SELECT avg(linenumber) FROM lineitem returns 3.0

I needed this override to get that test working. So I think it is better for them to be the same anyway.

Discussed offline, let's address this in the PR for noisy_avg_gaussian. H2QueryRunner is returning 3.0 for SELECT avg(linenumber) FROM lineitem which is weird and needs to be investigated. However, we can do that in parallel and not block this PR

3.0 is fishy, maybe it's a type issue and it did a floor division? Can you try avg(linenumber+.0)?

Looks like a type issue, avg(linenumber+.0) returns the correct value. So it's OK to use TpchQueryRunner to compare results, and in the comment we can add that H2QueryRunner is doing averages as ints instead of floats (not sure why).

yeah, I tried avg(linenumber + 0.0) or avg(cast(linenumber as double)) and both worked.

Just for the next PR for avg, do you suggest overriding expectedQueryRunner or casting to double like this?

@duykienvp overriding is fine.
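The 3.0 vs 3.004270876609888 discrepancy in this thread is consistent with the average being computed in integer arithmetic. A minimal Java sketch of that effect follows; the values are made up, the class is hypothetical, and this is not the actual H2 or Presto code path.

```java
// Hypothetical illustration of integer vs. floating-point averaging.
final class AverageTypeSketch
{
    private AverageTypeSketch() {}

    // Integer division discards the fractional part of the average.
    public static long integerAverage(long[] values)
    {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    // Converting to double first keeps the fractional part,
    // similar in spirit to avg(cast(linenumber as double)).
    public static double doubleAverage(long[] values)
    {
        double sum = 0;
        for (long value : values) {
            sum += (double) value;
        }
        return sum / values.length;
    }

    public static void main(String[] args)
    {
        long[] values = {1, 2, 3, 4, 7};
        System.out.println(integerAverage(values));
        System.out.println(doubleAverage(values));
    }
}
```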

pranjalssh

LGTM

duykienvp requested a review from a team as a code owner September 8, 2023 19:30

duykienvp requested a review from presto-oss September 8, 2023 19:30

jonhehir reviewed Sep 8, 2023

duykienvp force-pushed the noisy_sum_gaussian branch from a1f9e55 to fd67677 Compare September 8, 2023 23:21

duykienvp mentioned this pull request Sep 11, 2023

[Native][Aggregate functions] Implement noisy_sum_gaussian() #20827

Open

duykienvp force-pushed the noisy_sum_gaussian branch from fd67677 to ae00a80 Compare September 11, 2023 17:15

steveburnett approved these changes Sep 12, 2023

duykienvp force-pushed the noisy_sum_gaussian branch 5 times, most recently from 2a0ad18 to b653d07 Compare September 13, 2023 22:26

mlyublena reviewed Sep 13, 2023

...n/src/main/java/com/facebook/presto/operator/aggregation/noisyaggregation/NoisySumState.java

mlyublena approved these changes Sep 13, 2023

jonhehir approved these changes Sep 13, 2023

duykienvp force-pushed the noisy_sum_gaussian branch 2 times, most recently from 552e4c2 to ab77184 Compare September 13, 2023 23:35

pranjalssh reviewed Sep 14, 2023

duykienvp force-pushed the noisy_sum_gaussian branch from ab77184 to f63bcd1 Compare September 14, 2023 01:07

pranjalssh approved these changes Sep 14, 2023

pranjalssh merged commit 1cfda19 into prestodb:master Sep 14, 2023

duykienvp deleted the noisy_sum_gaussian branch September 14, 2023 04:16

duykienvp mentioned this pull request Sep 14, 2023

Add noisy_avg_gaussian aggregation #20865

Merged

6 tasks

wanglinsong mentioned this pull request Oct 3, 2023

Add release notes for 0.284 #21027

Merged

54 tasks
