Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vreplication timeout query optimizer hints #13840

Merged
merged 16 commits into from
Oct 13, 2023

Conversation

olyazavr
Copy link
Contributor

@olyazavr olyazavr commented Aug 23, 2023

Description

When we try to run online ddl operations using the Vitess strategy, we see certain migrations on specific keyspace/tables fail due to timeout. The suggestion in slack was to increase mysql's max_execution_time, which did the trick, but we would like to limit this to only vreplication, not affecting any other query.

To do so, I have used the vreplication_copy_phase_duration flag and added /*+ MAX_EXECUTION_TIME(x) */ where x is the value of vreplication_copy_phase_duration to the vreplication queries. This involved moving the vreplication_copy_phase_duration duration flag to avoid circular dependencies

I tested this in our own environment with the previously failing schema migration and saw it make progress and complete with a 1hr setting for this flag.

Update: This has now been running in production and we no longer see online ddl schema migrations fail because of query timeouts. We see large/busy keyspaceshards able to run migrations where they previously were unable with ghost or vitess

Related Issue(s)

Fixes #14081

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

@vitess-bot
Copy link
Contributor

vitess-bot bot commented Aug 23, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Aug 23, 2023
@github-actions github-actions bot added this to the v18.0.0 milestone Aug 23, 2023
@olyazavr olyazavr marked this pull request as ready for review September 22, 2023 15:25
@frouioui frouioui modified the milestones: v18.0.0, v19.0.0 Sep 29, 2023
@olyazavr
Copy link
Contributor Author

olyazavr commented Oct 4, 2023

I got feedback to not move the flag to another package, but I get circular dependencies otherwise, not sure how to proceed here

@rohit-nayak-ps
Copy link
Contributor

I got feedback to not move the flag to another package, but I get circular dependencies otherwise, not sure how to proceed here

Sorry, meant to take a look at this earlier. My thought was to move the flag to tablet-level. I will take a look first thing tomorrow and push to this PR if that is OK.

I can take a look at the failing unit tests too then.

@rohit-nayak-ps rohit-nayak-ps added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request labels Oct 4, 2023
@rohit-nayak-ps
Copy link
Contributor

Sorry, meant to take a look at this earlier. My thought was to move the flag to tablet-level. I will take a look first thing tomorrow and push to this PR if that is OK.

This commit moves the flag to the vttablet level.

I can take a look at the failing unit tests too then.

Didn't look at the failing tests since I saw you had some recent commits to try to address them. Let me know if you want me to take a look at them.

@olyazavr
Copy link
Contributor Author

olyazavr commented Oct 5, 2023

Thanks for that commit, I've pulled it in! I took a stab at the tests, but I would love some help with the remaining ones

@rohit-nayak-ps
Copy link
Contributor

Thanks for that commit, I've pulled it in! I took a stab at the tests, but I would love some help with the remaining ones

Only one other test related to noblob seems to be failing. Here is the commit that fixes it: 99e1dfd. For noblob there is another set of queries that need the timeout hint, which I have added in that commit.

Can you pull that and let's see if all tests pass? Thanks.

@olyazavr
Copy link
Contributor Author

olyazavr commented Oct 6, 2023

Tests are all now passing! I'm also trying to find someone with permissions to update the hubspot repo settings so others can actually push to my branches 😅

@rohit-nayak-ps rohit-nayak-ps removed the NeedsWebsiteDocsUpdate What it says label Oct 6, 2023
@rohit-nayak-ps
Copy link
Contributor

@olyazavr, great, good to see that all tests are passing.

Following up on the slack thread regarding this issue: did this PR fix your existing issue? Especially the fact that we are adding the timeout hints on the target for the insert/updates. Usually we have been seeing issues on the source while selecting rows in the copy phase. So just wanted to confirm that it does work.

Can you also add your specific findings to the description in the issue, just so that we have it all in one place? Thanks!

@olyazavr
Copy link
Contributor Author

olyazavr commented Oct 6, 2023

Yes, it did fix the issue, and we are now able to run migrations on larger/busier keyspaceshards that previously could not run schema migrations on ghost or vitess (:

If the question is do we need the timeout hints for anything other than the selects - I confess I didn't try this out, and added the timeout hints in all places to be consistent, and because it seemed like there was no downside to having all vreplication queries be allowed to run longer

I did not test this with any other usage of vreplication other than online ddl schema migrations

@@ -726,6 +727,7 @@ func (tpb *tablePlanBuilder) generateValuesPart(buf *sqlparser.TrackedBuffer, bv
func (tpb *tablePlanBuilder) generateSelectPart(buf *sqlparser.TrackedBuffer, bvf *bindvarFormatter) *sqlparser.ParsedQuery {
bvf.mode = bvAfter
buf.WriteString(" select ")
buf.WriteString(vstreamer.GetVReplicationMaxExecutionTimeQueryHint())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olyazavr we don't need this as well. We are essentially selecting static values which are used in inserts. So the tabletmanager\vreplication package doesn't need the optimizer hints anywhere.

Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Thanks for this @olyazavr, this should be very useful to resolve issues that we have seen occassionally with users with large clusters.

Olga Shestopalova and others added 15 commits October 11, 2023 10:40
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
…r and tabletserver use it.

Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
@olyazavr
Copy link
Contributor Author

Rebased and dealt with merge conflicts

@olyazavr
Copy link
Contributor Author

Looks like all checks are passing, anything else for me to do here?

@deepthi deepthi merged commit 0adaf78 into vitessio:main Oct 13, 2023
115 checks passed
frouioui pushed a commit to planetscale/vitess that referenced this pull request Nov 21, 2023
…n timeout query optimizer hints (vitessio#3410)

* backport of 3391

* resolved conflict

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
frouioui pushed a commit to planetscale/vitess that referenced this pull request Mar 26, 2024
…n timeout query optimizer hints (vitessio#3409)

* backport of 3391

* resolved conflict

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* fix endtoend

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* fix endtoend

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* fix vtcombo endtoend test

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@mattlord
Copy link
Contributor

I will open a PR to update the docs here: https://vitess.io/docs/19.0/reference/vreplication/flags/#vreplication_copy_phase_duration

After this PR the flag is applicable on BOTH the source and target keyspaces:

  1. On the target as it is used for the context used in the RPC to the source tablet by the vcopier:
    ctx, cancel := context.WithTimeout(ctx, vttablet.CopyPhaseDuration)
    defer cancel()
    var lastpkpb *querypb.QueryResult
    if lastpkqr := copyState[tableName]; lastpkqr != nil {
    lastpkpb = sqltypes.ResultToProto3(lastpkqr)
    }
    rowsCopiedTicker := time.NewTicker(rowsCopiedUpdateInterval)
    defer rowsCopiedTicker.Stop()
    copyStateGCTicker := time.NewTicker(copyStateGCInterval)
    defer copyStateGCTicker.Stop()
    parallelism := getInsertParallelism()
    copyWorkerFactory := vc.newCopyWorkerFactory(parallelism)
    copyWorkQueue := vc.newCopyWorkQueue(parallelism, copyWorkerFactory)
    defer copyWorkQueue.close()
    // Allocate a result channel to collect results from tasks.
    resultCh := make(chan *vcopierCopyTaskResult, parallelism*4)
    defer close(resultCh)
    var lastpk *querypb.Row
    var pkfields []*querypb.Field
    // Use this for task sequencing.
    var prevCh <-chan *vcopierCopyTaskResult
    serr := vc.vr.sourceVStreamer.VStreamRows(ctx, initialPlan.SendRule.Filter, lastpkpb, func(rows *binlogdatapb.VStreamRowsResponse) error {
  2. On the source as it is used when building the query by the row streamer that runs there:
    func GetVReplicationMaxExecutionTimeQueryHint() string {
    return fmt.Sprintf("/*+ MAX_EXECUTION_TIME(%v) */ ", vttablet.CopyPhaseDuration.Milliseconds())
    }

mattlord added a commit to vitessio/website that referenced this pull request Jul 26, 2024


Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Copy link
Contributor

vitessio/website#1797

mattlord added a commit to vitessio/website that referenced this pull request Jul 26, 2024
 (#1797)

Signed-off-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: set higher timeout for vreplication queries
5 participants