Check if can set a default value if some value is not present in materialized column #2594
We can use the `if` condition for setting a default value. If I want to create a materialized column for bytes, this will be the code to do it:
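A sketch of what that statement looks like, assuming the `attributes_float64_key`/`attributes_float64_value` array columns used later in this thread and the `99999999` placeholder discussed below:

```sql
-- Sketch: materialize `bytes` from the attribute arrays, falling back to a
-- placeholder value when the key is absent (indexOf returns 0 for "not found").
ALTER TABLE logs
ADD COLUMN `bytes` Float64 MATERIALIZED if(
    indexOf(attributes_float64_key, 'bytes') != 0,
    attributes_float64_value[indexOf(attributes_float64_key, 'bytes')],
    99999999
) CODEC(LZ4);
```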
The `SHOW CREATE TABLE` command returns:
@nityanandagohain We are still applying a condition on the bytes value == 0, but zero can be an intended value set by the user.
No @makeavish, we are checking if the key exists in the array.
No, it's checking if the key existed in the array and using it; otherwise it defaults to 99999999. It's good to know we can do that, but we should also get an idea of how the nullable column performs.
Here are some results; there are perf issues with nullable columns. The performance drops by half with nullable columns.

Non-nullable column perf

Nullable column perf

As you can see, apart from the first query, the nullable column is slower. The ClickHouse team also uses placeholder values for their table, according to their comment last year:
https://groups.google.com/g/clickhouse/c/AP2FbQ-uoj8/m/SKHRSgIYBwAJ
https://groups.google.com/g/clickhouse/c/AP2FbQ-uoj8
https://www.notion.so/signoz/Nullable-columns-in-Clickhouse-ad40c1cb2f6046a18eb73713480f72ee?pvs=4
Check out results for the column, and also perf on
Steps performed (a sketch of the schema and queries follows these steps):

Create the table
Insert the data
Group by normal (B)
Group by nullable, all values filled (nonil)
Group by nullable, but half values are null(halfnil)
Group by mostnil, most of the values are null(mostnil)
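A sketch of the schema and queries behind these steps, mirroring the non-null benchmark shown later in the thread; the exact insert expressions are an assumption:

```sql
-- Sketch of the nullable-column benchmark (column names taken from the
-- queries later in the thread; exact expressions are assumed).
CREATE TABLE test
(
    `A` Int64,
    `B` Int64,
    `nonil` Nullable(Int64),
    `halfnil` Nullable(Int64),
    `mostnil` Nullable(Int64)
)
ENGINE = MergeTree
ORDER BY A;

INSERT INTO test
SELECT
    number AS id,
    id % 9973 AS x,
    x,                              -- always filled
    if((x % 2) = 0, x, NULL),       -- roughly half NULL
    if((rand(1) % 5) = 0, x, NULL)  -- mostly NULL
FROM numbers(300000000);

-- Group-by queries of the form:
SELECT count(), B FROM test GROUP BY B FORMAT Null;
SELECT count(), nonil FROM test GROUP BY nonil FORMAT Null;
SELECT count(), halfnil FROM test GROUP BY halfnil FORMAT Null;
SELECT count(), mostnil FROM test GROUP BY mostnil FORMAT Null;
```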
Conclusion: From the above results, we can see that the non-null column is faster than the nullable one, regardless of how much data is filled in the non-null columns. ClickHouse also suggests not using nullable columns: https://clickhouse.com/docs/en/optimize/avoid-nullable-columns
Can you create an associated boolean column (`bytes_key_exists`) and try that out?
@srikanthccv sorry, I missed your comment. I didn't get what exactly you are asking me to try out.
I was saying create an associated column for the regular column. That translates to the following, for a clearer example:

```sql
ALTER TABLE logs
ADD COLUMN `bytes` Nullable(Float64) MATERIALIZED if(indexOf(attributes_float64_key, 'bytes') != 0, attributes_float64_value[indexOf(attributes_float64_key, 'bytes')], NULL) CODEC(LZ4);

ALTER TABLE logs
ADD COLUMN `bytes_key_exists` Bool MATERIALIZED if(indexOf(attributes_float64_key, 'bytes') != 0, true, false) CODEC(LZ4);
```

Now when we write a query, we will filter out those with null using the `*_key_exists` column:

```sql
select count(), halfnil from test where halfnil_key_exists != false group by halfnil format Null;
```

I wanted to see query time and storage results for this way of doing things.
Here are the results for the two queries:

```sql
select count(), halfnil from test where isNotNull(halfnil) group by halfnil format Null settings min_bytes_to_use_direct_io=1;

select count(), halfnil from test where exist != false group by halfnil format Null;
```

Here are the results, @srikanthccv.
Thanks, I wanted to see if there is a better way to avoid the magic values for nullable rows, because what you consider magic could be a real value for the user. You never know how people use stuff in the wild. Have you decided how to handle this for all data types?
Are you asking what max values we have considered, since we are not going with nullable columns?
The max values for the data types are the max values listed here: https://clickhouse.com/docs/en/sql-reference/data-types/int-uint. For string I am using a placeholder value. For boolean we can't have a sensible default, but as of now we don't handle boolean separately in logs; we just store it as a string. Though we might need to find a solution for boolean.
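For illustration, a placeholder for an Int64 attribute would be the Int64 max from that page. This is only a sketch: the `attributes_int64_key`/`attributes_int64_value` columns and the attribute name are assumptions here, mirroring the float64 example above.

```sql
-- Hypothetical Int64 attribute column using the type's max value as the
-- "key absent" placeholder (9223372036854775807 is the Int64 max).
ALTER TABLE logs
ADD COLUMN `quantity` Int64 MATERIALIZED if(
    indexOf(attributes_int64_key, 'quantity') != 0,
    attributes_int64_value[indexOf(attributes_int64_key, 'quantity')],
    9223372036854775807
) CODEC(LZ4);
```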
I still believe the magic values for each data type (and I see that boolean is not taken care of) are not a good solution. How would you check if something exists in a map/array in Go (or any other language for that matter)? There is usually some membership check involved. And developers use that membership check rather than a magic value.
Ah, so you are saying that if we have array and map columns in the future, what kind of default values will we set, right? Because an empty map/array won't tell us if it was actually there or not.
No, the map/array part I was referring to is the attributes map/array of the log itself.
Ahh okay, would you still recommend nullable columns in that case?
I was not convinced how the non-nil column was faster, so I ran it again.
Comparing the relative timing for the same queries, I don't see how a non-nil column is faster regardless of how much data is filled.

```sql
CREATE TABLE test
(
    `A` Int64,
    `B` Int64,
    `nonil` Int64,
    `nonil_key_exists` bool,
    `halfnil` Int64,
    `halfnil_key_exists` bool,
    `mostnil` Int64,
    `mostnil_key_exists` bool
)
ENGINE = MergeTree
ORDER BY A;

INSERT INTO test
SELECT
    number AS id,
    id % 9973 AS x,
    x,
    true,
    if((x % 2) = 0, x, 0),
    if((x % 2) = 0, true, false),
    if((rand(1) % 5) = 0, x, 0),
    if((rand(1) % 5) = 0, true, false)
FROM numbers(300000000);
```

Inserted data
Group by normal (B)
Group by nullable, all values filled (nonil)
Group by nullable, but half values are null(halfnil)
Group by mostnil, most of the values are null(mostnil)
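For reference, the group-by queries would have been of roughly this form (a sketch based on the queries shared earlier in the thread; the exact `*_key_exists` filters are an assumption):

```sql
SELECT count(), B FROM test GROUP BY B FORMAT Null;
SELECT count(), nonil FROM test WHERE nonil_key_exists != false GROUP BY nonil FORMAT Null;
SELECT count(), halfnil FROM test WHERE halfnil_key_exists != false GROUP BY halfnil FORMAT Null;
SELECT count(), mostnil FROM test WHERE mostnil_key_exists != false GROUP BY mostnil FORMAT Null;
```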
Interesting, your results seem to be different from mine. I performed it on my Mac though. What configuration machine did you test this on? Thanks for running this again.
A 4-core, 16 GB VM. And for that reason, I am only comparing the relative timings for different types of queries. I always do this kind of work on VMs because you can't trust a personal Mac for this.
I will test this on another machine as well and will get back.
I have tested this and found similar results; using a boolean column seems to be the way to go. But let me think about other cases that we might need to handle. @srikanthccv do you think something might come up with ingestion performance?
If the total number of columns is in the range of 1000, then the ClickHouse system defaults work fine.
The perf impact shared by @srikanthccv is non-trivial (almost 2x). My concern is storage cost, but I am guessing it should not be large. Let's assume we are going forward with having an extra column. Some questions follow:
Maybe the users do not run queries with the exists check themselves; will we add it for them?
So, the reason is that we know before a query is run whether the column is indexed or not; if it is indexed, we can modify the query to include the `*_key_exists` check. I will get back on the storage part.
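For illustration, the rewrite would look roughly like this (a sketch; the actual query-builder behaviour may differ):

```sql
-- What the user writes / what the query builder receives:
SELECT count() FROM logs WHERE bytes > 1000;

-- What gets executed when `bytes` is a selected (materialized) field,
-- so that placeholder/default rows are excluded:
SELECT count() FROM logs WHERE bytes_key_exists != false AND bytes > 1000;
```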
So I compared the ingestion speed: there is little to no difference in ingestion speed due to the addition of the boolean column. The extra storage used by the bool column is also very small.

Without extra column

```sql
create table non_empty_exists(A Int64, B Int64, b_exists bool) Engine=MergeTree order by A;
insert into non_empty_exists select number id, id % 9973, true from numbers (300000000);
```

```
0 rows in set. Elapsed: 60.136 sec. Processed 300.93 million rows, 2.41 GB (5.00 million rows/s., 40.03 MB/s.)
0 rows in set. Elapsed: 57.210 sec. Processed 300.93 million rows, 2.41 GB (5.26 million rows/s., 42.08 MB/s.)
```

Size

```
┌─database─┬─table────────────┬─column───┬─compressed─┬─uncompressed─┬─compr_ratio─┬──rows_cnt─┬─avg_row_size─┐
│ test     │ non_empty_exists │ A        │ 1.12 GiB   │ 2.23 GiB     │           2 │ 300000000 │            8 │
│ test     │ non_empty_exists │ B        │ 1.12 GiB   │ 2.23 GiB     │           2 │ 300000000 │            8 │
│ test     │ non_empty_exists │ b_exists │ 1.27 MiB   │ 285.99 MiB   │      224.44 │ 300000000 │            1 │
└──────────┴──────────────────┴──────────┴────────────┴──────────────┴─────────────┴───────────┴──────────────┘
```

With extra column

```sql
create table non_empty_exists(A Int64, B Int64, b_exists bool) Engine=MergeTree order by A;
insert into non_empty_exists select number id, id % 9973, true from numbers (300000000);
```

```
0 rows in set. Elapsed: 60.136 sec. Processed 300.93 million rows, 2.41 GB (5.00 million rows/s., 40.03 MB/s.)
0 rows in set. Elapsed: 57.210 sec. Processed 300.93 million rows, 2.41 GB (5.26 million rows/s., 42.08 MB/s.)
```

Size

```
┌─database─┬─table────────────┬─column───┬─compressed─┬─uncompressed─┬─compr_ratio─┬──rows_cnt─┬─avg_row_size─┐
│ test     │ non_empty_exists │ A        │ 1.12 GiB   │ 2.23 GiB     │           2 │ 300000000 │            8 │
│ test     │ non_empty_exists │ B        │ 1.12 GiB   │ 2.23 GiB     │           2 │ 300000000 │            8 │
│ test     │ non_empty_exists │ b_exists │ 1.27 MiB   │ 285.99 MiB   │      224.44 │ 300000000 │            1 │
└──────────┴──────────────────┴──────────┴────────────┴──────────────┴─────────────┴───────────┴──────────────┘
```
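The per-column size numbers above look like the output of the usual `system.parts_columns` size query; a sketch of how they can be produced (assuming that is how it was measured):

```sql
SELECT
    database,
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS compr_ratio,
    sum(rows) AS rows_cnt,
    round(sum(column_data_uncompressed_bytes) / sum(rows)) AS avg_row_size
FROM system.parts_columns
WHERE active AND (table = 'non_empty_exists')
GROUP BY database, table, column
ORDER BY column ASC;
```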
Please go ahead with the implementation, and ask in the ClickHouse community why nullable columns are this much slower.
@ankitnayan @srikanthccv I need a suggestion. For the top-level columns, the default values are already added by the SDK/Otel collector, and we won't have a default column for them. In our current implementation we don't have an index for all these top-level columns, and users are allowed to convert them to selected fields by creating an index in the backend. Should we also add an index for them in our migration by default? The reason is that these fields will always be there regardless of the log line and will be filled with default values sent by the SDK/collector. I think we can add an appropriate index for these and then add it in the migration. Also, we can disable update fields for these top-level keys, i.e. adding/removing an index. What are your thoughts?
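As a rough illustration, an index added in the migration for a top-level column could look like this. The index types and columns here are hypothetical; the actual choices are in the PR linked below.

```sql
-- Hypothetical: set index on a low-cardinality top-level field.
ALTER TABLE logs ADD INDEX severity_text_idx severity_text TYPE set(25) GRANULARITY 4;

-- Hypothetical: token bloom filter on the log body for token search.
ALTER TABLE logs ADD INDEX body_idx body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;
```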
Your thoughts LGTM. cc: @rkssisodiya
Added default index for top-level keys in SigNoz/signoz-otel-collector#164, but have not added a skip index for trace_id and span_id, as there is no point in adding them: they are always unique and there is no pattern to be used with some kind of bloom filter.
Right now, if a key is not present, the default value is set by ClickHouse.
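A minimal illustration of that behaviour (`indexOf` returns 0 when the key is missing, and element 0 of an array is the type's default value); the keys and values here are made up:

```sql
-- 'missing' is not in the keys array, so indexOf returns 0 and the array
-- access yields Float64's default value (0), indistinguishable from a real 0.
SELECT
    indexOf(['a', 'b'], 'missing') AS idx,   -- 0
    arrayElement([1.5, 2.5], idx) AS value;  -- 0
```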