Remove redundant dag_id index on log table #42376

Merged

Conversation

dstandish
Contributor

This index is redundant since there's another one with dag_id as leading column.

(cherry picked from commit 388b60fa3b745289f5ab7c8692752b40ebef32f8)
@dstandish dstandish marked this pull request as draft September 20, 2024 16:25
@dstandish dstandish marked this pull request as ready for review September 20, 2024 16:54
@dstandish dstandish force-pushed the remove-redundant-dag_id-index-on-log-table branch from beeb6fb to 97d37e5 Compare September 20, 2024 17:24
@dstandish dstandish merged commit 0f64f32 into apache:main Sep 20, 2024
51 checks passed
@dstandish dstandish deleted the remove-redundant-dag_id-index-on-log-table branch September 20, 2024 19:06
joaopamaral pushed a commit to joaopamaral/airflow that referenced this pull request Oct 21, 2024
This index is not needed because there's another index on the table that leads with dag_id.
@kaxil
Member

kaxil commented Oct 30, 2024

I was reading about what the recommendation is around multi-column vs. single-column indexes, and it looks like it isn't as simple as I had initially thought: a multi-column index with a given leading column doesn't remove the need for a separate "single-col" index in all cases!

From https://www.postgresql.org/docs/current/indexes-bitmap-scans.html:

In all but the simplest applications, there are various combinations of indexes that might be useful, and the database developer must make trade-offs to decide which indexes to provide. Sometimes multicolumn indexes are best, but sometimes it's better to create separate indexes and rely on the index-combination feature. For example, if your workload includes a mix of queries that sometimes involve only column x, sometimes only column y, and sometimes both columns, you might choose to create two separate indexes on x and y, relying on index combination to process the queries that use both columns. You could also create a multicolumn index on (x, y). This index would typically be more efficient than index combination for queries involving both columns, but as discussed in Section 11.3, it would be almost useless for queries involving only y, so it should not be the only index. A combination of the multicolumn index and a separate index on y would serve reasonably well. For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone. The last alternative is to create all three indexes, but this is probably only reasonable if the table is searched much more often than it is updated and all three types of query are common. If one of the types of query is much less common than the others, you'd probably settle for creating just the two indexes that best match the common types.

@jedcunningham
Member

Interesting, TIL. Probably still a good general approach though, but yeah in some cases it could make sense to keep them around.

This table specifically is so write heavy, wouldn't be worth it :)

@kaxil
Member

kaxil commented Oct 30, 2024

> Interesting, TIL. Probably still a good general approach though, but yeah in some cases it could make sense to keep them around.
>
> This table specifically is so write heavy, wouldn't be worth it :)

Yeah, I was fixing some migration issues, found similar indexes, and thought I'd remove them. Then I got curious about what the recommendation is, and backtracked on mixing that with some of the changes I will have in a PR.

@dstandish
Contributor Author

> I was reading about what the recommendation is around multi-column vs. single-column indexes, and it looks like it isn't as simple as I had initially thought: a multi-column index with a given leading column doesn't remove the need for a separate "single-col" index in all cases!

I did not really see anything in there that suggested that it's not true.

If you have an index on (x, y), queries on x are covered by that index. That's true. Now, sure, an index on x alone would be smaller.
Yes. But it remains to be shown what difference that makes.
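
For anyone who wants to check the leading-column behavior locally, here is a minimal sketch using Python's built-in `sqlite3` module. The table `t`, its columns, and the index name `idx_x_y` are made up for illustration, and SQLite's planner is not PostgreSQL's or MySQL's, so treat this as a demonstration of the general B-tree prefix rule rather than evidence about Airflow's `log` table.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the 'detail' text of EXPLAIN QUERY PLAN rows as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of every plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical table: 'payload' keeps the index from covering SELECT *.
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, x TEXT, y TEXT, payload TEXT)"
)
# Only one index exists: the composite one that leads with x.
conn.execute("CREATE INDEX idx_x_y ON t (x, y)")
conn.executemany(
    "INSERT INTO t (x, y, payload) VALUES (?, ?, ?)",
    [(f"x{i % 50}", f"y{i % 70}", "p") for i in range(5000)],
)

# Filtering on the leading column can use the composite index.
plan_x = plan(conn, "SELECT * FROM t WHERE x = ?", ("x1",))
# Filtering on the second column alone cannot seek into that index.
plan_y = plan(conn, "SELECT * FROM t WHERE y = ?", ("y1",))

print("x only:", plan_x)
print("y only:", plan_y)
```

This only shows which index the planner picks. It says nothing about how much slower the wider index is for x-only lookups, which is the part that would need measuring on real data.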

@kaxil
Member

kaxil commented Oct 30, 2024

This sentence is the key there: "For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone. "

How much slower? That depends on the dataset and access pattern.

In the above, a multi-column index on (x, y) technically covers queries on x alone, but the additional storage and scanning costs can impact performance, especially for high-traffic columns. A single-column index on x is smaller and offers a narrower scan range, which often improves I/O and cache usage, making a measurable difference in efficiency for large datasets. Again, how much difference depends on the dataset and query patterns.

@dstandish
Contributor Author

dstandish commented Oct 30, 2024

right yeah i saw that. i think that the default choice would be, don't add the redundant index unless you had a compelling reason. that being a little bigger would make it measurably slower isn't surprising but that doesn't mean it's worth adding the redundant index. i think the side proposing to add the redundant index would bear the burden to demonstrate the need for it. the fact that one index can serve multiple query patterns is a great thing and we should take advantage of that where we can i think.

@dstandish
Contributor Author

implication of that being, if you see this sort of thing, probably best to remove, don't you think?
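
A rough sketch of how one might spot "this sort of thing" mechanically: flag any index whose column list is a strict leading prefix of another index on the same table. The function name `find_prefix_redundant` and the index names below are hypothetical, not Airflow's actual schema, and the check deliberately ignores unique constraints, partial indexes, and sort order, any of which could make a prefix index non-redundant.

```python
from typing import Dict, List, Tuple


def find_prefix_redundant(
    indexes: Dict[str, Tuple[str, ...]]
) -> List[Tuple[str, str]]:
    """Return (candidate, covering) pairs of index names.

    An index is a removal candidate when its columns are a strict
    leading prefix of another index's columns on the same table.
    """
    pairs = []
    for name, cols in indexes.items():
        for other, other_cols in indexes.items():
            if name == other:
                continue
            # Strict prefix: shorter, and matches the other's leading columns.
            if len(cols) < len(other_cols) and other_cols[: len(cols)] == cols:
                pairs.append((name, other))
    return pairs


# Hypothetical index definitions for a single table (illustrative names only).
indexes = {
    "ix_log_dag_id": ("dag_id",),
    "ix_log_dag_id_task_id": ("dag_id", "task_id"),
    "ix_log_event": ("event",),
}

candidates = find_prefix_redundant(indexes)
for candidate, covering in candidates:
    print(f"{candidate} is a leading prefix of {covering}")
```

Anything this flags is only a candidate. Per the discussion above, whether to keep the narrower index still depends on the workload, and for a write-heavy table the default would be to drop it.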

@kaxil
Member

kaxil commented Oct 30, 2024

Agreed

5 participants