CI upgrade/downgrade tests for Online DDL / throttler / vreplication flow #16017

Merged

@shlomi-noach shlomi-noach commented May 27, 2024

Description

Adding upgrade/downgrade tests as follows:

  • The tests run a VTTablet primary at version X with a VTTablet replica at version Y. See breakdown below.
  • Enable tablet throttler.
  • Create and populate some table.
  • Create a load. The load takes throttling into account so that it does not overwhelm the cluster.
  • Run Online DDL with vitess strategy, thereby running VReplication that is also using the throttler.
  • Throttle and unthrottle the migration.
  • Validate migration behavior while under load. Complete the migration.
  • Expect successful completion of the migration, with cut-over as graceful as possible (but will resort to forced cut-over if unable to complete in a timely fashion).

See more details in https://github.com/vitessio/vitess/pull/16017/files#diff-dc6e3889488266071984d6da1f44795b7b06bf5ff6504c8e9aa359598918230aR17-R37

This tests cross-version compatibility of the throttler, VReplication, and Online DDL. It runs onlineddl_flow in the following setups:

  • Primary at version N (current) and replica at N-1
  • Primary at version N-1 and replica at N
  • Primary at version N+1 and replica at N
  • Primary at version N and replica at N+1

Each such flow runs for about 1 minute, which is why I chose to aggregate them all in the same workflow test (as opposed to splitting into "new" and "old").
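The four setups can be written down as data, together with one possible way of picking a tablet binary by suffix. This is only a sketch under assumptions: the `setup` type, the `setups` and `binaryName` helpers, and the "base-suffix" naming rule are hypothetical. Only the `REPLICA_TABLET_BINARY_SUFFIX` variable name and a `vttablet-next` file name appear in this PR.

```go
package main

import (
	"fmt"
	"os"
)

// setup describes which version runs on the primary and which on the replica.
type setup struct {
	Primary string
	Replica string
}

// setups lists the four combinations from the PR description.
func setups() []setup {
	return []setup{
		{Primary: "N", Replica: "N-1"},
		{Primary: "N-1", Replica: "N"},
		{Primary: "N+1", Replica: "N"},
		{Primary: "N", Replica: "N+1"},
	}
}

// binaryName appends a suffix to a binary's base name.
// Assumption for this sketch: an empty suffix means the current build.
func binaryName(base, suffix string) string {
	if suffix == "" {
		return base
	}
	return base + "-" + suffix
}

func main() {
	for _, s := range setups() {
		fmt.Printf("primary=%s replica=%s\n", s.Primary, s.Replica)
	}
	// The environment variable name is taken from the test's log line.
	fmt.Println(binaryName("vttablet", os.Getenv("REPLICA_TABLET_BINARY_SUFFIX")))
}
```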

EDIT: sample execution flow: https://github.com/vitessio/vitess/actions/runs/9348746367/job/25728525330?pr=16017

Related Issue(s)

No specific issue, but with #15988 in mind.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…ctionality)


vitess-bot bot commented May 27, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added the NeedsBackportReason, NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels May 27, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone May 27, 2024
…ve more chance, terminate the workload and expect it to complete

…ironment variable values to vttablet and mysqlctl binary paths

@shlomi-noach shlomi-noach requested a review from a team June 3, 2024 09:43

codecov bot commented Jun 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.22%. Comparing base (faa2e2e) to head (79256c7).
Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16017      +/-   ##
==========================================
- Coverage   68.23%   68.22%   -0.02%     
==========================================
  Files        1541     1541              
  Lines      197234   197330      +96     
==========================================
+ Hits       134592   134634      +42     
- Misses      62642    62696      +54     



shlomi-noach commented Jun 3, 2024

So weird. The workflow onlineddl_flow doesn't run now. It ran just fine when I was testing with the rest of the workflows removed (e.g. 42692c2), but now that I returned all the original workflows, I don't see it on the CI checks list.

@shlomi-noach
Contributor Author

OK, it's just reappeared. Fine.

Member

@frouioui frouioui left a comment

Some nits, otherwise it looks good to me. We will have to remember to make this new workflow required on main once we merge the PR.

Comment on lines +60 to +79
- name: Check for changes in relevant files
  if: steps.skip-workflow.outputs.skip-workflow == 'false'
  uses: dorny/paths-filter@v3.0.1
  id: changes
  with:
    token: ''
    filters: |
      end_to_end:
        - 'go/**'
        - 'go/**/*.go'
        - 'test.go'
        - 'Makefile'
        - 'build.env'
        - 'go.sum'
        - 'go.mod'
        - 'proto/*.proto'
        - 'tools/**'
        - 'config/**'
        - 'bootstrap.sh'
        - '.github/workflows/upgrade_downgrade_test_onlineddl_flow.yml'
Member

nit: this can be moved two steps up, after checking out

Contributor Author

moved.

go/test/endtoend/onlineddl/flow/onlineddl_flow_test.go (outdated, resolved)
shlomi-noach and others added 4 commits June 3, 2024 17:57
Co-authored-by: Florent Poinsard <35779988+frouioui@users.noreply.github.com>
@shlomi-noach shlomi-noach mentioned this pull request Jun 3, 2024
49 tasks
Contributor

@mattlord mattlord left a comment

Thank you for working on this! ❤️

mkdir -p /tmp/vitess-build-current/
cp -R bin /tmp/vitess-build-current/

# Swap the binaries. Use vtctl version n and keep vttablet at version n-1
Contributor

This should be checking the vtctldclient version rather than vtctl.

Contributor Author

Oh we don't use vtctl and that comment is leftover.

Contributor Author

fixed

cp /tmp/vitess-build-next/bin/vttablet $PWD/bin/vttablet-next
cp /tmp/vitess-build-next/bin/mysqlctl $PWD/bin/mysqlctl-next
cp /tmp/vitess-build-next/bin/mysqlctld $PWD/bin/mysqlctld-next
vtctl --version
Contributor

Same here, should be vtctldclient --version. The tests are using vtctldclient (cluster.VtctldClientProcess) so not a big deal.

Contributor Author

Should actually be vttablet, this is again an oversight. Fixing in next push.

Contributor Author

fixed

t.Logf("Using REPLICA_TABLET_BINARY_SUFFIX: %s", binarySuffix)
}

shards = clusterInstance.Keyspaces[0].Shards
Contributor

Probably worth a require call before here too:

require.Greater(t, len(clusterInstance.Keyspaces), 0) 

Contributor Author

fixed

Comment on lines 306 to 312
t.Run("additional wait", func(t *testing.T) {
	select {
	case <-time.After(3 * time.Second):
	case <-ctx.Done():
		require.Fail(t, "context cancelled")
	}
})
Contributor

Curious what we're waiting for here? If nothing else, it will give us a clue if this becomes flaky. Are we waiting for the schema migration to reach the ready-to-complete stage? If we're just waiting for load generation then it's fine. Although we could also wait for the rows in a table to be greater than X. We have helpers for that too.

Contributor Author

Waiting just so that we generate more DMLs, and give migration/vreplication more "opportunities" to throttle or to make progress. Adding clarification.

Contributor Author

Added.

Comment on lines 326 to 332
select {
case <-time.After(10 * time.Second):
case <-ctx.Done():
	require.Fail(t, "context cancelled")
}

status := onlineddl.WaitForMigrationStatus(t, &vtParams, shards, uuid, migrationWaitTimeout, schema.OnlineDDLStatusRunning, schema.OnlineDDLStatusComplete)
Contributor

I'm curious why in this spot and similar ones we're not instead just adding 10 seconds to the migrationWaitTimeout we pass in?

Contributor Author

Huh. You're right and this is not needed. Removing altogether.

Contributor Author

fixed

Comment on lines 382 to 385
row := onlineddl.VtgateExecDDL(t, &vtParams, ddlStrategy, alterStatement, "").Named().Row()
if row != nil {
	uuid = row.AsString("uuid", "")
}
Contributor

Should we not use require/assert to fail here if we don't get a row and/or uuid?

Contributor Author

Yes. This is copy+paste from other tests where a nil result is in fact possible (where we expect an error in submission).

Contributor Author

fixed
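The fail-fast idea from this thread, shown without any Vitess types: treat a missing row or an empty UUID as an error instead of silently continuing. This is a standard-library sketch; `uuidFromRow` and the `map[string]string` row shape are assumptions made for illustration, and the actual fix in the test presumably uses `require` assertions on the real row type.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// uuidFromRow extracts the migration UUID from a result row.
// A nil row or an empty uuid is reported as an error so that the
// caller can fail the test right away.
func uuidFromRow(row map[string]string) (string, error) {
	if row == nil {
		return "", errors.New("no row returned for migration submission")
	}
	uuid := strings.TrimSpace(row["uuid"])
	if uuid == "" {
		return "", errors.New("row has no uuid")
	}
	return uuid, nil
}

func main() {
	// The UUID value below is made up for the example.
	uuid, err := uuidFromRow(map[string]string{"uuid": "a1b2c3d4_0000_0000_0000_000000000000"})
	fmt.Println(uuid, err)

	_, err = uuidFromRow(nil)
	fmt.Println(err)
}
```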

Comment on lines +474 to +479
if expected {
	// if migration is ready to complete, the timestamp should be non-null
	assert.False(t, row["ready_to_complete_timestamp"].IsNull())
} else {
	assert.True(t, row["ready_to_complete_timestamp"].IsNull())
}
Contributor

Seems like we could condense this to:

assert.Equal(t, expected, row["ready_to_complete_timestamp"].IsNull())

Contributor Author

It's the negation actually (IsNull() should be !expected) which is why I broke it into explicit if-else, as I find negation so confusing sometimes.
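To make the negation concrete: the if-else asserts that the timestamp is non-null exactly when `expected` is true, so a condensed form has to compare `expected` against `!IsNull()`. The sketch below enumerates all four cases with plain booleans; `readyToCompleteOK` is a hypothetical helper written for this illustration and is not part of the test.

```go
package main

import "fmt"

// readyToCompleteOK reports whether the observed NULL-ness of the
// ready_to_complete_timestamp column agrees with the expectation:
// the timestamp must be non-null exactly when expected is true.
func readyToCompleteOK(expected, isNull bool) bool {
	return expected == !isNull
}

func main() {
	for _, expected := range []bool{true, false} {
		for _, isNull := range []bool{true, false} {
			fmt.Printf("expected=%-5v isNull=%-5v ok=%v\n", expected, isNull, readyToCompleteOK(expected, isNull))
		}
	}
}
```

With testify, the equivalent single assertion would be `assert.Equal(t, expected, !row["ready_to_complete_timestamp"].IsNull())`. Whether that reads better than the explicit if-else is a matter of taste, as noted above.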

@shlomi-noach
Contributor Author

We wanted this to be in time for v20 RC1 code freeze, but did not make it. Seeing that this is a tests-only PR, it complies with code freeze guidelines. There is an advantage to having this in release-20.0 branch and so once merged we will backport this PR to both release-20.0 and release-20.0-rc.

@shlomi-noach shlomi-noach merged commit 2df3545 into vitessio:main Jun 5, 2024
93 checks passed
@shlomi-noach shlomi-noach deleted the ci-upgrade-downgrade-onlineddl branch June 5, 2024 03:58
shlomi-noach added a commit to planetscale/vitess that referenced this pull request Jun 5, 2024
…flow (vitessio#16017)
