Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add noisy_sum_gaussian function #20810

Merged
merged 1 commit into from
Sep 14, 2023

Conversation

duykienvp
Copy link
Contributor

@duykienvp duykienvp commented Sep 8, 2023

Description

This commit adds NOISY_SUM_GAUSSIAN aggregation and refactors some files related to noisy_count_gaussian function to support this new function. This noisy_sum_gaussian is to replace SUM(col) with NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed]). This is a continuation of the previous PR supporting noisy_count_gaussian.

col can be of numerical types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.

Because noise is of type Double, all values are converted to Double before being added to the sum, and the return type is Double.

When a bound [lower, upper] is provided, each value is clipped to this range before being added to the sum. The sum is then processed to make sure the sign of the final sum is consistent with the range. So, if lower >= 0, the sum is then equal to the max of the sum and 0; if upper <= 0, the sum is then equal to min of the sum and 0.

Optional randomSeed is used to get a fixed value of noise, often for reproducibility purposes. If randomSeed is omitted, SecureRandom is used. If randomSeed is provided, Random is used.

Motivation and Context

This is one of aggregations in our effort to add Presto UDF for noisy aggregations, used as building block for differential privacy in Presto.

The purpose is to help build systems/tools/framework that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. Current implementation involves complicated SQL operation even for simplest aggregations, increasing development time, complexity, maintenance and sharing cost, and sometimes completely blocking development of new features.

While these functions on their own do not guarantee 100% differential privacy, they are the building blocks for other systems. That is also why we do not call these functions “differentially private aggregations” but only “noisy aggregations” to avoid a wrong impression of achieving differential privacy solely by using these functions.

Impact

This commit adds NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed]) aggregation which calculates the sum over the input values, and then adds random Gaussian noise with 0 mean and standard deviation of noise_scale to the true sum. This also provides options to clip values to a range [lower, upper] and a random seed for reproducibility

Test Plan

  • Unittest: Tested all of NOISY_SUM_GAUSSIAN(col, noiseScale), NOISY_SUM_GAUSSIAN(col, noiseScale, randomSeed), NOISY_SUM_GAUSSIAN(col, noiseScale, lower, upper), NOISY_SUM_GAUSSIAN(col, noiseScale, lower, upper randomSeed):
    • Available for all standard numerical data types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.
    • All following tests are for BIGINT, REAL, DOUBLE, DECIMAL types, while TINYINT, SMALLINT, INTEGER were only tested if such type can be processed because it is equivalent to BIGINT input
      • Tested with noiseScale < 0, noiseScale = 0, noiseScale = fixed value, with some noiseScale the output is within 50 x noiseScale
      • Tested with NULL rows
      • Tested with valid and invalid clipping bounds
      • Tested with random seed
      • Run local queries to test its output compared to SUM(col), and tested when input has 0 rows, with and without GROUP BY
    • Tested on tpch schema for NOISY_SUM_GAUSSIAN(col, noiseScale), NOISY_SUM_GAUSSIAN(col, noiseScale, randomSeed) compared to normal SUM:
      • Clipping was not tested because normal SUM does not offer clipping
      • Only tested INTEGER, BIGINT, DOUBLE. Other types couldn't be found in the schema

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Adds `NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed])` aggregation which calculates the sum over the input values, and then adds random Gaussian noise with 0 mean and standard deviation of ``noise_scale`` to the true sum. 
  This also provides options to clip values to a range `[lower, upper]` and a random seed for reproducibility.
  All values are converted to `DOUBLE` before being added to the sum, and the return type is `DOUBLE`. 
  When there are no input rows, this function returns ``NULL``.
  When generating noise, if randomSeed is omitted, SecureRandom is used; otherwise, Random is used.

@duykienvp duykienvp requested a review from a team as a code owner September 8, 2023 19:30
@duykienvp duykienvp requested a review from presto-oss September 8, 2023 19:30
@github-actions
Copy link

github-actions bot commented Sep 8, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff a6eb6bc...f63bcd1.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/functions/aggregate.rst

@duykienvp
Copy link
Contributor Author

duykienvp commented Sep 8, 2023

Most of line changes are kinda similar because

  • 1 file structure of NoisySumGaussianDoubleAggregation was copied for other types: Real, Long, Integer, SmallInt, TinyInt.
  • 1 file structure of NoisySumGaussianDecimalAggregation was copied for other variations: NoisySumGaussianDecimalLowerUpperAggregation, NoisySumGaussianDecimalLowerUpperRandomSeedAggregation, NoisySumGaussianDecimalRandomSeedAggregation
  • 1 file structure of unittest TestNoisySumGaussianDoubleAggregation was copied for testing other types

Not sure how to make those better but I think unittest is necessary

Updated:

  • Move the implementation of 1 function signature but for multiple types into 1 files (similar to what we did with noisy_count_gaussian)
  • Unittest is still separated though

Copy link
Contributor

@jonhehir jonhehir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer to the Presto maintainers for the nitty gritty review of the Presto stuff, but I've left a few comments below, mostly relating to the docs, noise addition, and unit tests.

@aditi-pandit
Copy link
Contributor

@duykienvp : Please can you add an issue in https://github.com/orgs/prestodb/projects/3 to track implementing this function in Prestissimo (Presto native engine).

@duykienvp
Copy link
Contributor Author

@duykienvp : Please can you add an issue in https://github.com/orgs/prestodb/projects/3 to track implementing this function in Prestissimo (Presto native engine).

@aditi-pandit I created 2 issues for this and the previous function noisy_count_gaussian() but cannot edit those issues to assign a project to them. Could you please help me add them to the project?
#20826
#20827

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

@duykienvp duykienvp force-pushed the noisy_sum_gaussian branch 5 times, most recently from 2a0ad18 to b653d07 Compare September 13, 2023 22:26
@mlyublena
Copy link
Contributor

Please add documentation to the new classes.
Apart from that, looks good to me from Presto perspective, I trust domain experts to do a more thorough review of the algorithms

Copy link
Contributor

@jonhehir jonhehir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me on the algo side!

@duykienvp duykienvp force-pushed the noisy_sum_gaussian branch 2 times, most recently from 552e4c2 to ab77184 Compare September 13, 2023 23:35
@@ -27,6 +32,13 @@ protected QueryRunner createQueryRunner()
return TpchQueryRunnerBuilder.builder().build();
}

@Override
protected QueryRunner createExpectedQueryRunner()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Short answer is to make sure that the actual and expected queries are run with the same runner.

queryRunner is a TpchQueryRunnerBuilder (which I assume would give us TpchQueryRunner)
expectedQueryRunner is by default a H2QueryRunner.

I am not super what their difference is, but I think it is better for them to be the same.

In addition, in my WIP code to implement the next function in this series, noisy_avg_gaussian, without this override to make sure they are the same runner, I got an error where:

  • SELECT noisy_avg_gaussian(linenumber, 0) FROM lineitem returns 3.004270876609888
  • SELECT avg(linenumber) FROM lineitem returns 3.0

I needed this override to get that test working. So I think it is better for them to be the same anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, let's address this in the PR for noisy_avg_gaussian. H2QueryRunner is returning 3.0 for SELECT avg(linenumber) FROM lineitem which is weird and needs to be investigated. However, we can do that in parallel and not block this PR

Copy link
Contributor

@pranjalssh pranjalssh Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.0 is fishy, maybe its a type issue that it did a floor division? can you try avg(linenumber+.0)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a type issue, avg(linenumber+.0) returns the correct value. So its ok to use TpchQueryRunner to compare restults, and in the comment we can add that H2QueryRunner is doing averages as ints instead of floats(not sure why).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I tried avg(linenumber + 0.0) or avg(cast(linenumber as double)) and both worked.

Just for the next PR for avg, do you suggest overriding expectedQueryRunner or casting to double like this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@duykienvp overriding is fine.

This commit adds NOISY_SUM_GAUSSIAN aggregation and refactors some files related to noisy_count_gaussian function to support this new function. This noisy_sum_gaussian is to replace `SUM(col)` with `NOISY_SUM_GAUSSIAN(col, noiseScale[, lower, upper][, randomSeed])`.

This is one of aggregations in our effort to add Presto UDF for noisy aggregations, used as building block for differential privacy in Presto.

`col` can be of numerical types: INT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.

Because noise is of type Double, all values are converted to Double before being added to the sum, and the return type is Double.

When a bound [lower, upper] is provided, each value is clipped to this range before being added to the sum.

Optional randomSeed is used to get a fixed value of noise, often for reproducibility purposes. If randomSeed is omitted, SecureRandom is used. If randomSeed is provided, Random is used.

Why we want these functions:
The purpose is to help build systems/tools/framework that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. Current implementation involves complicated SQL operation even for simplest aggregations, increasing development time, complexity, maintenance and sharing cost, and sometimes completely blocking development of new features.

While these functions on their own do not guarantee 100% differential privacy, they are the building blocks for other systems. That is also why we do not call these functions “differentially private aggregations” but only “noisy aggregations” to avoid a wrong impression of achieving differential privacy solely by using these functions.
Copy link
Contributor

@pranjalssh pranjalssh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pranjalssh pranjalssh merged commit 1cfda19 into prestodb:master Sep 14, 2023
@duykienvp duykienvp deleted the noisy_sum_gaussian branch September 14, 2023 04:16
@wanglinsong wanglinsong mentioned this pull request Oct 3, 2023
54 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants