Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the race condition during vttablet startup #15731

Merged

Conversation

dbussink
Copy link
Contributor

This avoids the problem where the connection pool is poisoned when we check for the MySQL port, by avoiding to use the pool in the first place. We only ever run this once at startup, so we can create a new connection here and then dispose of it once we've retrieved the port.

That way we know the connection pool is still clean and doesn't have any problems.

Related Issue(s)

Fixes #15730

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

This avoids the problem where the connection pool is poisoned when we
check for the MySQL port, by avoiding to use the pool in the first
place. We only ever run this once at startup, so we can create a new
connection here and then dispose of it once we've retrieved the port.

That way we know the connection pool is still clean and doesn't have any
problems.

Fixes vitessio#15730

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Copy link
Contributor

vitess-bot bot commented Apr 17, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Apr 17, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone Apr 17, 2024
timer := time.NewTimer(waitTime)
ctx, cancel := context.WithTimeout(ctx, waitTime)
defer cancel()
for {
conn, connErr := dbconnpool.NewDBConnection(ctx, mysqld.dbcfgs.DbaConnector())
conn, connErr := mysql.Connect(ctx, params)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opted to also refactor this case slightly to avoid using the pool entirely so for any future readers it's also more obvious no pooling is actually used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good. When I was looking at this yesterday, I had to navigate through the code to see that NewDBConnection returns a non-pooled connection.

// during MySQL startup when we still might be loading things like grants.
// This means we need to use an isolated connection to avoid poisoning the
// DBA connection pool for further queries.
params, err := mysqld.dbcfgs.DbaConnector().MysqlParams()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing it here to use a connection without pooling which avoids the poisoning. I also fixed the context.TODO() while being in here, making sure it's explicitly passed in.

@dbussink dbussink removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Apr 17, 2024
@GrahamCampbell
Copy link
Contributor

Can this be backported to v19?

@dbussink
Copy link
Contributor Author

Can this be backported to v19?

@GrahamCampbell Are you affected by this bug? We haven't backported previous fixes either mentioned in the linked issue. It only happens when you first initialize a Vitess cluster, often when using k8s where things spin up in a fairly random order.

Copy link

codecov bot commented Apr 17, 2024

Codecov Report

Attention: Patch coverage is 86.36364% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 68.39%. Comparing base (4519c8f) to head (f4609f8).

Files Patch % Lines
go/vt/mysqlctl/replication.go 77.77% 2 Missing ⚠️
go/vt/mysqlctl/mysqld.go 75.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #15731   +/-   ##
=======================================
  Coverage   68.38%   68.39%           
=======================================
  Files        1556     1556           
  Lines      195347   195361   +14     
=======================================
+ Hits       133593   133608   +15     
+ Misses      61754    61753    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

timer := time.NewTimer(waitTime)
ctx, cancel := context.WithTimeout(ctx, waitTime)
defer cancel()
for {
conn, connErr := dbconnpool.NewDBConnection(ctx, mysqld.dbcfgs.DbaConnector())
conn, connErr := mysql.Connect(ctx, params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good. When I was looking at this yesterday, I had to navigate through the code to see that NewDBConnection returns a non-pooled connection.

@@ -730,10 +730,18 @@ func (tm *TabletManager) checkMysql() error {
return nil
}

const portCheckTimeout = 5 * time.Second

func (tm *TabletManager) getMysqlPort() (int32, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems overkill to define a function that is used exactly once. Why can't all this be folded into findMysqlPort? It will grow from 10 lines to 15 lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepthi Because it's setting up a context and you can't defer that properly in a for loop. So hence the separate function so it's not needed to manually cancel that which is more error prone.

return 0, err
}
defer conn.Close()
qr, err := conn.ExecuteFetch("SHOW VARIABLES LIKE 'port'", 1, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not super important here, but should the context be passed down to the actual query? We've seen various cases where MySQL ends up being "stuck" for whatever reason not replying to incoming queries, and this would cause the GetMysqlPort function to hang indefinitely (instead of returning an error once ctx expires).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arthurschreiber Do you mean as a general feature / refactor? It's not possible to pass in a context at the moment in the MySQL connection handling that Vitess does.

That might be useful as a separate feature / change, but I think that's independent of what we're doing here?

Copy link
Member

@GuptaManan100 GuptaManan100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@dbussink dbussink merged commit 178e6e8 into vitessio:main Apr 17, 2024
106 of 113 checks passed
@dbussink dbussink deleted the dbussink/fix-vttablet-port-check-race branch April 17, 2024 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: vttablet DBA pool can be poisoned with broken connection during startup
5 participants