-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: remove default exist condition for groupby values in favour of … #5420
base: develop
Are you sure you want to change the base?
Conversation
…query correctness
LGTM, can you make the changes for traces as well? |
have removed it from traces as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
The PR should have included these 2 points too IMO
|
It was an incorrect assumption when the original code was written that group by keys must exist #2571 (comment). However, it returns results that don't make sense. For example, if we group by
These columns mainly exist to support the |
can produce incorrect results for legitimate zero values as it was origianally intended for
I see two things we can do it address the problem. This is only an issue for
This will ensure we don't overcount things for legitimate zero-value scenarios and at the same time don't produce misleading results when the group key doesn't exist.
|
Inefficient?
Would be still problematic for logs? What about having a default integer value and removing that in the where clause of the query? |
@ankitnayan this is why we didn't go with default values last time |
I summarized it below. Here's a scenario where our current query logic falls short. Imagine an application service that logs data with various attributes, including code and service resource information. The service always includes a 'service.name' by default, but 'code.function' only appears in certain parts of the codebase. Now, when a user tries to
The problem? The WHERE clause filters out all logs missing either So, does it make sense to exclude logs without certain attribute keys? Not really. This default behaviour skews results in common use cases, and there's no easy workaround except writing a custom ClickHouse query. Would it be better to include these logs under a special Is this always the right approach? Probably not. There might be cases where a missing attribute means the log isn't relevant. In such scenarios, we'd need a way to explicitly exclude these logs using a Now, let's consider numeric and boolean attributes. Take HTTP server logs with an Clearly, we can't make a one-size-fits-all decision. The context and nature of the data being queried matter. So, removing the default But simply removing these conditions changes our query to:
The consequence? For logs without a key in the attribute array, we get zero-values: an empty string for strings, 0 for numbers, and false for booleans. This creates a new problem: legitimate zero values in user attributes (like Can we just add our own default values for missing keys? Nope, because we risk colliding with users' actual values for numbers or booleans. We can't predict what values people might use in their attributes. One of the suggestions was to run two queries and the concern is it reads twice the data. We can avoid two queries by doing conditional counting. Something like. We should check if there is any performance change b/w this and the two queries approach.
How do we make this backwards compatible? Since people have already set up dashboards/alerts 'we mustn't change their existing behaviour. We need to write a data migration that adds a filter item to the panel WHERE clause for each group by key. For every new panel, users with more context make the judgement and add the Overall, I believe the fix in general should be implemented but how we handle the legitimate zero-values is the only concern. |
Fixes #5136
it removes the default exists caluses added to the group by values.