xref: #10306. TL;DR: random row selection with small sample sizes can produce wildly "wrong" samples, especially when the data is sparse (many NULL values). The linked proposal biases small samples toward rows with more non-NULL values, becoming more statistically neutral as the sample size grows. This makes it easier to see signal among the noise in small samples.
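The idea can be sketched as a weighted draw where each row's weight reflects its fraction of non-NULL fields, and the bias fades as the requested sample size approaches the table size. Everything below (the function name, the blending formula) is an illustrative assumption, not the actual mechanism proposed in #10306:

```python
import random

def biased_sample(rows, k, seed=None):
    """Illustrative sketch only: favor rows with more non-NULL values for
    small k, and converge toward uniform sampling as k approaches len(rows)."""
    rng = random.Random(seed)
    n = len(rows)
    # Bias strength: near 1.0 for small samples, 0.0 when k == n.
    strength = max(0.0, 1.0 - k / n)

    def weight(row):
        non_null = sum(v is not None for v in row) / len(row)
        # Blend a uniform weight (1.0) with the non-NULL fraction.
        return (1.0 - strength) + strength * non_null

    weights = [weight(r) for r in rows]
    # Weighted sampling without replacement via sequential draws.
    chosen, pool = [], list(range(n))
    for _ in range(min(k, n)):
        total = sum(weights[i] for i in pool)
        r = rng.random() * total
        acc = 0.0
        for j, i in enumerate(pool):
            acc += weights[i]
            if acc >= r:
                chosen.append(rows[i])
                pool.pop(j)
                break
    return chosen
```

With `k == len(rows)` the strength is zero and the draw is uniform, matching the "statistically neutral as the sample grows" behavior described above.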
Context
In the existing implementation of profiler sampling, the data sample is computed (in the general case) by assigning a random number between 0 and 1 to every row (`ABS(RANDOM()) * 100 % 100 AS RDN`) and filtering in a `WHERE` clause for `RDN <= <profileSamplePercentage>`. This sample is recomputed for each group of metrics and columns when running the profiler. While this allows sampled profiles to be computed efficiently, it may prevent users from using the sampling functionality when a consistent sample is required. More context can be found in ticket #10633.
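The consistency issue can be reproduced with a small SQLite demo. Note this is a sketch, not the profiler's actual code path: SQLite's `RANDOM()` returns a 64-bit integer rather than a 0-1 float, so `ABS(RANDOM()) % 100` is used here as an adaptation of the quoted expression. The point is that the random column is re-evaluated on every query, so two "10% samples" of the same table differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(1000)])

def sample_ids(pct):
    # Mimics the general-case sampling: assign a fresh pseudo-random value
    # per row, then keep rows whose value falls under the percentage cutoff.
    rows = conn.execute(
        "SELECT id FROM (SELECT id, ABS(RANDOM()) % 100 AS RDN FROM t) "
        "WHERE RDN <= ?", (pct,)
    ).fetchall()
    return {r[0] for r in rows}

a = sample_ids(10)
b = sample_ids(10)
# a and b almost certainly differ: RANDOM() is re-evaluated on each query,
# which is exactly why per-metric-group recomputation yields inconsistent samples.
```

Any downstream computation that expects the same rows across metric groups (row counts vs. column statistics, for example) will observe this drift.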
Problem
Is the current computation of sampling for the profiler a concern/issue for existing users? If so, what approach would be the most acceptable?
Solutions that have been suggested
To address this limitation, two options were initially suggested. One is not applicable, and the other presents some important drawbacks:
How can you participate in this discussion?
You can participate in this discussion in two ways: