Remove redundant dag_id index on log table #42376

dstandish · 2024-09-20T16:25:33Z

This index is redundant since there's another one with dag_id as leading column.

(cherry picked from commit 388b60fa3b745289f5ab7c8692752b40ebef32f8)

airflow/migrations/versions/0033_3_0_0_remove_redundant_index.py

This index is not needed because there's another index on the table that leads with dag_id.
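For illustration, the shape of such a migration can be sketched as a reversible pair of steps. This is only a stand-in: the real migration file is an Alembic revision, whereas the sketch below uses Python's built-in sqlite3 so it runs without dependencies. The simplified `log` schema and the index names `idx_log_dag` and `idx_log_task_instance` are assumptions made for the example, not taken from this PR.

```python
import sqlite3

# Hypothetical, simplified schema: the real log table has more columns, and
# the index names below are assumptions made for illustration.
SCHEMA = """
CREATE TABLE log (id INTEGER PRIMARY KEY, dag_id TEXT, task_id TEXT, event TEXT);
CREATE INDEX idx_log_dag ON log (dag_id);
CREATE INDEX idx_log_task_instance ON log (dag_id, task_id);
"""


def upgrade(conn):
    """Drop the single-column index that the composite index makes redundant."""
    conn.execute("DROP INDEX idx_log_dag")


def downgrade(conn):
    """Recreate the single-column index so the change can be reversed."""
    conn.execute("CREATE INDEX idx_log_dag ON log (dag_id)")


def index_names(conn):
    """Return the names of the indexes currently defined on the log table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'log'"
    )
    return sorted(name for (name,) in rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

before = index_names(conn)
upgrade(conn)
after_upgrade = index_names(conn)
downgrade(conn)
after_downgrade = index_names(conn)

print(before)
print(after_upgrade)
print(after_downgrade)
```

The point of the pairing is that `downgrade` restores exactly what `upgrade` removed, so the schema after a round trip matches the starting state.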

kaxil · 2024-10-30T20:33:35Z

I was reading about the recommendations on multi-column vs. single-column indexes, and it looks like it isn't as simple as I had initially thought: a multi-column index with a given leading column does not remove the need for a separate single-column index in all cases!

From https://www.postgresql.org/docs/current/indexes-bitmap-scans.html:

In all but the simplest applications, there are various combinations of indexes that might be useful, and the database developer must make trade-offs to decide which indexes to provide. Sometimes multicolumn indexes are best, but sometimes it's better to create separate indexes and rely on the index-combination feature. For example, if your workload includes a mix of queries that sometimes involve only column x, sometimes only column y, and sometimes both columns, you might choose to create two separate indexes on x and y, relying on index combination to process the queries that use both columns. You could also create a multicolumn index on (x, y). This index would typically be more efficient than index combination for queries involving both columns, but as discussed in Section 11.3, it would be almost useless for queries involving only y, so it should not be the only index. A combination of the multicolumn index and a separate index on y would serve reasonably well. For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone. The last alternative is to create all three indexes, but this is probably only reasonable if the table is searched much more often than it is updated and all three types of query are common. If one of the types of query is much less common than the others, you'd probably settle for creating just the two indexes that best match the common types.
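The leading-column behaviour described in the quoted passage can be checked with a small experiment. The sketch below is an assumption-laden stand-in: it uses SQLite (because it ships with Python) rather than PostgreSQL, and the table `t`, the index `idx_xy`, and the helper `plan` are hypothetical names. Planner behaviour in PostgreSQL or MySQL may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t (id INTEGER PRIMARY KEY, x INTEGER, y INTEGER, z TEXT);
    CREATE INDEX idx_xy ON t (x, y);
    """
)
conn.executemany(
    "INSERT INTO t (x, y, z) VALUES (?, ?, ?)",
    [(i % 100, i % 7, "payload") for i in range(10_000)],
)


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


# Query on the leading column only: the composite index is usable.
plan_x = plan("SELECT * FROM t WHERE x = 1")
# Query on the trailing column only: the composite index does not help here.
plan_y = plan("SELECT * FROM t WHERE y = 1")
# Query on both columns: the composite index is usable.
plan_xy = plan("SELECT * FROM t WHERE x = 1 AND y = 1")

print("x only :", plan_x)
print("y only :", plan_y)
print("x and y:", plan_xy)
```

In this setup the x-only and x-and-y queries are served by `idx_xy`, while the y-only query falls back to scanning the table, which mirrors the "almost useless for queries involving only y" remark.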

jedcunningham · 2024-10-30T21:06:59Z

Interesting, TIL. Probably still a good general approach though, but yeah in some cases it could make sense to keep them around.

This table specifically is so write heavy, wouldn't be worth it :)

kaxil · 2024-10-30T21:34:27Z

> Interesting, TIL. Probably still a good general approach though, but yeah in some cases it could make sense to keep them around.
>
> This table specifically is so write heavy, wouldn't be worth it :)

Yeah, I was fixing some migration issues, found similar indexes, and thought I'd remove them. Then I got curious about what the recommendation is, and backtracked on mixing that with some of the changes I will have in a PR.

dstandish · 2024-10-30T22:40:34Z

> I was reading about the recommendations on multi-column vs. single-column indexes, and it looks like it isn't as simple as I had initially thought: a multi-column index with a given leading column does not remove the need for a separate single-column index in all cases!

I did not really see anything in there that suggested that it's not true.

If you have an index on (x, y), queries on x are covered by that index. That's true. Now, sure, an index on x alone would be smaller. Yes. But it remains to be shown what difference that makes.

kaxil · 2024-10-30T22:48:19Z

This sentence is the key there: "For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone."

How much slower? That depends on the dataset and access pattern.

In the example above, a multi-column index on (x, y) technically covers queries on x alone, but the additional storage and scanning costs can impact performance, especially for high-traffic columns. A single-column index on x is smaller, so a scan touches fewer pages, which often improves I/O and cache usage and can make a measurable difference for large datasets. Again, how much difference depends on the dataset and query patterns.
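Since the size of the effect depends on the data, one way to settle it is to measure. The following is a minimal timing harness, again using SQLite purely as a dependency-free stand-in. The table, the index names `idx_xy` and `idx_x`, the row counts, and the helpers `build` and `time_x_lookups` are all assumptions for the sketch. Timings from an in-memory SQLite database say little about a production PostgreSQL or MySQL deployment, so treat this as a template for a real benchmark rather than evidence either way.

```python
import sqlite3
import time


def build(with_single_column_index):
    """Create a populated table with the composite index, optionally adding idx_x."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x INTEGER, y TEXT, z TEXT)")
    conn.executemany(
        "INSERT INTO t (x, y, z) VALUES (?, ?, ?)",
        [(i % 500, f"y-{i:08d}", "payload") for i in range(50_000)],
    )
    conn.execute("CREATE INDEX idx_xy ON t (x, y)")
    if with_single_column_index:
        conn.execute("CREATE INDEX idx_x ON t (x)")
    return conn


def time_x_lookups(conn, repeats=2_000):
    """Run x-only lookups; return (total matching rows, elapsed seconds)."""
    start = time.perf_counter()
    total = 0
    for i in range(repeats):
        # count(z) forces a table lookup per match, since z is in neither index.
        (n,) = conn.execute(
            "SELECT count(z) FROM t WHERE x = ?", (i % 500,)
        ).fetchone()
        total += n
    return total, time.perf_counter() - start


rows_composite, secs_composite = time_x_lookups(build(with_single_column_index=False))
rows_both, secs_both = time_x_lookups(build(with_single_column_index=True))

print(f"composite index only: {rows_composite} rows in {secs_composite:.4f}s")
print(f"with extra idx_x    : {rows_both} rows in {secs_both:.4f}s")
```

Both configurations return the same rows; only the elapsed time can differ. A fair comparison would also time inserts, since every extra index adds write cost, which matters for a write-heavy table.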

dstandish · 2024-10-30T23:09:36Z

right yeah i saw that. i think that the default choice would be, don't add the redundant index unless you had a compelling reason. that being a little bigger would make it measurably slower isn't surprising but that doesn't mean it's worth adding the redundant index. i think the side proposing to add the redundant index would bear the burden to demonstrate the need for it. the fact that one index can serve multiple query patterns is a great thing and we should take advantage of that where we can i think.

dstandish · 2024-10-30T23:15:38Z

implication of that being, if you see this sort of thing, probably best to remove, don't you think?

kaxil · 2024-10-30T23:27:48Z

Agreed

Remove redundant dag_id index on log table

523060a

(cherry picked from commit 388b60fa3b745289f5ab7c8692752b40ebef32f8)

dstandish requested review from kaxil, XD-DENG and ashb as code owners September 20, 2024 16:25

dstandish marked this pull request as draft September 20, 2024 16:25

dstandish added 3 commits September 20, 2024 09:40

add migration

4d3ed11

fixup! add migration

7684ad1

fixup! fixup! add migration

fd5ecb0

dstandish marked this pull request as ready for review September 20, 2024 16:54

dstandish requested a review from potiuk as a code owner September 20, 2024 16:54

vincbeck approved these changes Sep 20, 2024

jedcunningham approved these changes Sep 20, 2024

fix static

a409909

ephraimbuddy reviewed Sep 20, 2024

airflow/migrations/versions/0033_3_0_0_remove_redundant_index.py (outdated)

fixup

97d37e5

dstandish force-pushed the remove-redundant-dag_id-index-on-log-table branch from beeb6fb to 97d37e5 Compare September 20, 2024 17:24

fixup! fixup

49dd65e

dstandish merged commit 0f64f32 into apache:main Sep 20, 2024
51 checks passed

dstandish deleted the remove-redundant-dag_id-index-on-log-table branch September 20, 2024 19:06

joaopamaral pushed a commit to joaopamaral/airflow that referenced this pull request Oct 21, 2024

Remove redundant dag_id index on log table (apache#42376)

d6425d1

This index is not needed because there's another index on the table that leads with dag_id.
