Remove redundant dag_id index on log table #42376
(cherry picked from commit 388b60fa3b745289f5ab7c8692752b40ebef32f8)
airflow/migrations/versions/0033_3_0_0_remove_redundant_index.py
This index is not needed because there's another index on the table that leads with dag_id.
I was reading about the recommendations for multi-column vs. single-column indexes, and it looks like it isn't as simple as I had initially thought: a multi-column index that leads with a given column does not remove the need for a separate single-column index in all cases. See https://www.postgresql.org/docs/current/indexes-bitmap-scans.html.
Interesting, TIL. Removing them is probably still a good general approach, but yeah, in some cases it could make sense to keep them around. This table specifically is so write-heavy that it wouldn't be worth it :)
Yeah, I was fixing some migration issues and found similar indexes. I thought I'd remove them, then got curious about what the recommendation is, and backtracked on mixing that with some of the changes I will have in a PR.
I did not really see anything in there suggesting that it's not true. If you have an index on (x, y), queries on x are covered by that index. That's true. Now, sure, an index on x alone would be smaller.
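To make the "queries on x are covered" point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is an illustration only: SQLite's planner is not PostgreSQL's, the table is a simplified stand-in for the real `log` table, and the index names `idx_log_dag_only` and `idx_log_dag_task` are hypothetical. It drops the single-column index and asks the planner how it would run a lookup on `dag_id` alone.

```python
import sqlite3


def build_demo():
    """Create a simplified stand-in for the log table with both indexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log (id INTEGER PRIMARY KEY, dag_id TEXT, task_id TEXT, event TEXT)"
    )
    # Hypothetical names: a single-column index and a multi-column index
    # that leads with the same column.
    conn.execute("CREATE INDEX idx_log_dag_only ON log (dag_id)")
    conn.execute("CREATE INDEX idx_log_dag_task ON log (dag_id, task_id)")
    return conn


def plan_for_dag_lookup(conn):
    """Return the planner's description of a lookup on dag_id alone."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT event FROM log WHERE dag_id = ?", ("example",)
    ).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[-1] for row in rows)


if __name__ == "__main__":
    conn = build_demo()
    conn.execute("DROP INDEX idx_log_dag_only")
    # With the single-column index gone, the lookup can still be served by
    # the multi-column index through its leading column.
    print(plan_for_dag_lookup(conn))
```

On PostgreSQL the equivalent check would be an `EXPLAIN` of the same query before and after dropping the index; how much slower the larger index is in practice still depends on the data.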
This sentence is the key there: "For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone." How much slower? That depends on the dataset and access pattern. In the example above, a multi-column index on (x, y) technically covers queries on x alone, but the additional storage and scanning costs can affect performance, especially for high-traffic columns. A single-column index on x is smaller, so a scan reads fewer pages, which often improves I/O and cache usage and can make a measurable difference on large datasets.
Right, yeah, I saw that. I think the default choice should be: don't add the redundant index unless you have a compelling reason. It isn't surprising that a somewhat larger index would be measurably slower, but that doesn't mean the redundant index is worth adding. I think the side proposing to add the redundant index bears the burden of demonstrating the need for it. The fact that one index can serve multiple query patterns is a great thing, and we should take advantage of that where we can.
The implication being: if you see this sort of thing, it's probably best to remove it, don't you think?
Agreed
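As a rough sketch of how "this sort of thing" could be spotted, the helper below flags any index whose columns are a strict leading prefix of another index on the same table. The function name and the example index names are hypothetical, and the check is deliberately naive: it assumes plain, non-unique, non-partial indexes, so a flagged index is only a candidate for removal, not a verdict.

```python
def find_redundant_indexes(indexes):
    """Map each prefix-redundant index name to the index that covers it.

    `indexes` maps an index name to its ordered tuple of column names,
    all for a single table.
    """
    redundant = {}
    for name, cols in indexes.items():
        cols = tuple(cols)
        for other, other_cols in indexes.items():
            other_cols = tuple(other_cols)
            if other == name:
                continue
            # A strictly longer index that starts with the same columns can
            # serve the same lookups through its leading columns.
            if len(other_cols) > len(cols) and other_cols[: len(cols)] == cols:
                redundant[name] = other
                break
    return redundant


if __name__ == "__main__":
    example = {
        "idx_dag_only": ("dag_id",),
        "idx_dag_task": ("dag_id", "task_id"),
        "idx_event": ("event",),
    }
    print(find_redundant_indexes(example))
```

Per the discussion above, a flagged index may still be worth keeping on a read-heavy table where the smaller index measurably helps; on a write-heavy table the extra maintenance cost usually argues for removal.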
This index is redundant since there's another one with dag_id as leading column.