Add noisy_avg_gaussian aggregation #20865

Merged
merged 1 commit into from
Sep 19, 2023

duykienvp
Contributor

@duykienvp duykienvp commented Sep 14, 2023

Description

This commit adds noisy_avg_gaussian aggregation. It can be used to replace avg(col) with noisy_avg_gaussian(col, noiseScale[, lower, upper][, randomSeed]).
This is a continuation of the previous PR(s) supporting noisy_count_gaussian and noisy_sum_gaussian.

col can be of numerical types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.

Because noise is of type double, all values are converted to double before being added to the sum which is used to compute the avg, and the return type is double.

When a bound [lower, upper] is provided, each value is clipped to this range before being added to the sum, which is used to compute the avg.
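
The clipping step described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the class `ClippedAverageSketch` and the method `clippedAverage` are hypothetical names, and both the NULL-skipping behavior and the rejection of `lower > upper` are assumptions.

```java
// Hypothetical helper for illustration; not the Presto implementation.
final class ClippedAverageSketch {
    private ClippedAverageSketch() {}

    /** Returns the mean of the clipped values, or null when there are no non-null inputs. */
    static Double clippedAverage(Double[] values, double lower, double upper) {
        // Assumption: invalid bounds are rejected up front.
        if (lower > upper) {
            throw new IllegalArgumentException("lower must be <= upper");
        }
        double sum = 0.0;
        long count = 0;
        for (Double value : values) {
            if (value == null) {
                continue; // Assumption: NULL rows are skipped, as with avg().
            }
            // Clip to [lower, upper] before adding to the running sum.
            sum += Math.max(lower, Math.min(upper, value));
            count++;
        }
        return count == 0 ? null : sum / count;
    }
}
```

For inputs 1, 5, and 100 with bounds [2, 10], the clipped values are 2, 5, and 10, so the average is taken over a sum of 17 rather than 106.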

Optional randomSeed is used to get a fixed value of noise, often for reproducibility purposes. If randomSeed is omitted, SecureRandom is used. If randomSeed is provided, Random is used.
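
A minimal sketch of the seed handling described above, assuming the choice is made once per aggregation. The names `NoiseSourceSketch`, `createRandom`, and `gaussianNoise` are hypothetical and do not come from this PR; only `java.security.SecureRandom`, `java.util.Random`, and `nextGaussian()` are standard JDK APIs.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical helper for illustration; not the Presto implementation.
final class NoiseSourceSketch {
    private NoiseSourceSketch() {}

    /** Uses SecureRandom when no seed is given, otherwise a seeded Random. */
    static Random createRandom(Long randomSeed) {
        return randomSeed == null ? new SecureRandom() : new Random(randomSeed);
    }

    /** Draws one Gaussian sample with mean 0 and standard deviation noiseScale. */
    static double gaussianNoise(double noiseScale, Long randomSeed) {
        return createRandom(randomSeed).nextGaussian() * noiseScale;
    }
}
```

With a fixed seed, repeated calls produce the same noise value, which is what makes seeded runs reproducible. Without a seed, each call draws from a fresh `SecureRandom`.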

Motivation and Context

This is one of the aggregations in our effort to add Presto UDFs for noisy aggregations, which serve as building blocks for differential privacy in Presto.

The purpose is to help build systems, tools, and frameworks that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. The current implementation involves complicated SQL operations even for the simplest aggregations, which increases development time, complexity, and maintenance and sharing costs, and sometimes completely blocks development of new features.

While these functions on their own do not guarantee 100% differential privacy, they are the building blocks for other systems. That is also why we do not call these functions “differentially private aggregations” but only “noisy aggregations” to avoid a wrong impression of achieving differential privacy solely by using these functions.

Impact

This commit adds the noisy_avg_gaussian(col, noiseScale[, lower, upper][, randomSeed]) aggregation, which calculates the average (arithmetic mean) of all the input values and then adds random Gaussian noise with mean 0 and standard deviation noiseScale to the true avg. It also provides options to clip values to a range [lower, upper] and to set a random seed for reproducibility.
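
The overall computation can be sketched as below, under stated assumptions. `NoisyAvgGaussianSketch` and `noisyAvg` are hypothetical names, clipping is omitted for brevity, and rejecting a negative `noiseScale` is an assumption (the test plan only says that `noiseScale < 0` was tested). The actual aggregation in this PR is implemented inside Presto's aggregation framework and may differ.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical end-to-end sketch; not the Presto implementation.
final class NoisyAvgGaussianSketch {
    private NoisyAvgGaussianSketch() {}

    /** Returns avg(values) plus Gaussian noise, or null when there are no input rows. */
    static Double noisyAvg(double[] values, double noiseScale, Long randomSeed) {
        // Assumption: a negative standard deviation is treated as invalid input.
        if (noiseScale < 0) {
            throw new IllegalArgumentException("noiseScale must be >= 0");
        }
        if (values.length == 0) {
            return null; // No input rows: the result is NULL.
        }
        double sum = 0.0;
        for (double value : values) {
            sum += value; // Inputs are already doubles here.
        }
        double trueAvg = sum / values.length;
        // SecureRandom when no seed is given; a seeded Random otherwise.
        Random random = randomSeed == null ? new SecureRandom() : new Random(randomSeed);
        return trueAvg + random.nextGaussian() * noiseScale;
    }
}
```

With `noiseScale = 0` the sketch returns the true average exactly, which gives a convenient sanity check against `avg(col)`.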

Test Plan

  • Unit tests: Tested all of noisy_avg_gaussian(col, noiseScale), noisy_avg_gaussian(col, noiseScale, randomSeed), noisy_avg_gaussian(col, noiseScale, lower, upper), noisy_avg_gaussian(col, noiseScale, lower, upper, randomSeed):
    • Available for all standard numerical data types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.
    • All of the following tests are for the BIGINT, REAL, DOUBLE, and DECIMAL types. TINYINT, SMALLINT, and INTEGER were only tested to confirm that such types can be processed, because they are equivalent to BIGINT input
      • Tested with noiseScale < 0, noiseScale = 0, and a fixed noiseScale; for a given noiseScale, checked that the output is within 50 x noiseScale
      • Tested with NULL rows
      • Tested with valid and invalid clipping bounds
      • Tested with random seed
      • Ran local queries to compare its output with avg(col), and tested inputs with 0 rows, with and without GROUP BY
    • Tested on tpch schema for noisy_avg_gaussian(col, noiseScale), noisy_avg_gaussian(col, noiseScale, randomSeed) compared to normal AVG:
      • Clipping was not tested because normal AVG does not offer clipping
      • Only tested INTEGER, BIGINT, DOUBLE. Other types couldn't be found in the schema

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Adds `noisy_avg_gaussian(col, noiseScale[, lower, upper][, randomSeed])` aggregation 
  which calculates the average (arithmetic mean) of all the input values, 
  and then adds random Gaussian noise with 0 mean and standard deviation of ``noiseScale`` to the true avg. 
  This also provides options to clip values to a range `[lower, upper]` and a random seed for reproducibility.
  All values are converted to `double` before being added to the sum, which is used to compute the avg, and the return type is `double`. 
  When there are no input rows, this function returns ``NULL``.
  When generating noise, if randomSeed is omitted, SecureRandom is used; otherwise, Random is used.

@duykienvp duykienvp requested a review from a team as a code owner September 14, 2023 04:38
@github-actions

github-actions bot commented Sep 14, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 629070d...a2a3b4f.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/functions/aggregate.rst

@duykienvp
Contributor Author

Need help adding this issue to the backlog: #20869, similar to what was requested in this comment: #20810 (comment)

Contributor

@jonhehir jonhehir left a comment

Just a bunch of nitpicky naming sorts of comments! There are a lot of references to sum rather than avg in the unit tests, so I didn't explicitly flag them all.

@duykienvp
Contributor Author

Just a bunch of nitpicky naming sorts of comments! There are a lot of references to sum rather than avg in the unit tests, so I didn't explicitly flag them all.

Fixed all the comments

@duykienvp
Contributor Author

@mlyublena @pranjalssh please take a look when you have a chance

This commit adds `noisy_avg_gaussian` aggregation. It can be used to replace `avg(col)` with `noisy_avg_gaussian(col, noiseScale[, lower, upper][, randomSeed])`.

This is one of the aggregations in our effort to add Presto UDFs for noisy aggregations, which serve as building blocks for differential privacy in Presto.

`col` can be of numerical types: TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL.

Because noise is of type `double`, all values are converted to `double` before being added to the sum (which is later used to compute the avg), and the return type is `double`.

When a bound [lower, upper] is provided, each value is clipped to this range before being added to the sum (which is later used to compute the avg).

Optional randomSeed is used to get a fixed value of noise, often for reproducibility purposes. If randomSeed is omitted, SecureRandom is used. If randomSeed is provided, Random is used.

Why we want these functions:
The purpose is to help build systems, tools, and frameworks that provide differential privacy guarantees. Differential privacy has been used by multiple teams within Meta to develop privacy-preserving systems. The current implementation involves complicated SQL operations even for the simplest aggregations, which increases development time, complexity, and maintenance and sharing costs, and sometimes completely blocks development of new features.

While these functions on their own do not guarantee 100% differential privacy, they are the building blocks for other systems. That is also why we do not call these functions “differentially private aggregations” but only “noisy aggregations” to avoid a wrong impression of achieving differential privacy solely by using these functions.
@pranjalssh pranjalssh merged commit 92f51c0 into prestodb:master Sep 19, 2023
@duykienvp duykienvp deleted the noisy_avg_gaussian branch September 19, 2023 17:58
@wanglinsong wanglinsong mentioned this pull request Oct 3, 2023
54 tasks