xref: #10306. TL;DR: random row selection with small sample sizes can produce wildly "wrong" samples, especially when the data is sparse (many NULL values). The linked proposal biases small samples toward rows with more non-NULL values, becoming more statistically neutral as the sample size grows. This makes it easier to see signal among the noise in small samples.
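The idea can be sketched as a weighted draw where each row's weight reflects its fraction of non-NULL fields, and the bias fades as the requested sample size approaches the table size. Everything below (the function name, the blending formula) is an illustrative assumption, not the actual mechanism proposed in #10306:

```python
import random

def biased_sample(rows, k, seed=None):
    """Illustrative sketch only: favor rows with more non-NULL values for
    small k, and converge toward uniform sampling as k approaches len(rows)."""
    rng = random.Random(seed)
    n = len(rows)
    # Bias strength: near 1.0 for small samples, 0.0 when k == n.
    strength = max(0.0, 1.0 - k / n)

    def weight(row):
        non_null = sum(v is not None for v in row) / len(row)
        # Blend a uniform weight (1.0) with the non-NULL fraction.
        return (1.0 - strength) + strength * non_null

    weights = [weight(r) for r in rows]
    # Weighted sampling without replacement via sequential draws.
    chosen, pool = [], list(range(n))
    for _ in range(min(k, n)):
        total = sum(weights[i] for i in pool)
        r = rng.random() * total
        acc = 0.0
        for j, i in enumerate(pool):
            acc += weights[i]
            if acc >= r:
                chosen.append(rows[i])
                pool.pop(j)
                break
    return chosen
```

With `k == len(rows)` the strength is zero and the draw is uniform, matching the "statistically neutral as the sample grows" behavior described above.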
Context
In the existing implementation of profiler sampling, the data sample is computed (in the general case) by assigning a random number between 0 and 1 to every row (`ABS(RANDOM()) * 100 % 100 AS RDN`) and filtering in a `WHERE` clause for `RDN <= <profileSamplePercentage>`. This sample is recomputed for each group of metrics and columns when running the profiler. While this allows sampled profiles to be computed efficiently, it may prevent users from using the sampling functionality when a consistent sample is required. More context can be found in ticket #10633.
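The consistency issue can be reproduced with a small SQLite demo. Note this is a sketch, not the profiler's actual code path: SQLite's `RANDOM()` returns a 64-bit integer rather than a 0-1 float, so `ABS(RANDOM()) % 100` is used here as an adaptation of the quoted expression. The point is that the random column is re-evaluated on every query, so two "10% samples" of the same table differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(1000)])

def sample_ids(pct):
    # Mimics the general-case sampling: assign a fresh pseudo-random value
    # per row, then keep rows whose value falls under the percentage cutoff.
    rows = conn.execute(
        "SELECT id FROM (SELECT id, ABS(RANDOM()) % 100 AS RDN FROM t) "
        "WHERE RDN <= ?", (pct,)
    ).fetchall()
    return {r[0] for r in rows}

a = sample_ids(10)
b = sample_ids(10)
# a and b almost certainly differ: RANDOM() is re-evaluated on each query,
# which is exactly why per-metric-group recomputation yields inconsistent samples.
```

Any downstream computation that expects the same rows across metric groups (row counts vs. column statistics, for example) will observe this drift.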
Problem
Is the current computation of sampling for the profiler a concern/issue for existing users? If so, what approach would be the most acceptable?
Solutions that have been suggested
To address this limitation, two options were initially suggested. One is not applicable, and the other presents some important drawbacks:
How can you participate in this discussion?
You can participate in this discussion in two ways: