Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slack-vitess-r15.0.5: backport v16 vtorc fixes #258

Merged

Conversation

timvaillancourt
Copy link
Member

@timvaillancourt timvaillancourt commented Mar 15, 2024

Description

This PR backports v16 fixes for vtorc -> v15. Most of these PRs backported with no conflicts at all or very minor ones

The backports chosen are the only v16 vtorc fixes I can see that are not in v15 too

Related PRs

  1. VTOrc running PRS when database_instance empty bug fix. vitessio/vitess#12019
  2. Timeout Fixes and VTOrc Improvement vitessio/vitess#11881
  3. Also log error on a failure in DiscoverInstance vitessio/vitess#11936
  4. VTOrc Code Cleanup - generate_base, replace cluster_name with keyspace and shard. vitessio/vitess#12012
  5. Fix insert query of blocked_recovery table in VTOrc vitessio/vitess#12091
  6. Fix: VTOrc forgetting old instances vitessio/vitess#12089
  7. Move vtorc from go-sqlite3 to modernc.org/sqlite vitessio/vitess#12214

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

* feat: convert join with database_instance to a left join and prevent fixes from running if the information from database_instance is unavailable

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add tests to verify the fix works

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
* refactor: move tests out of newfeaturestest so that they run on upgrade-downgrade tests too

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add failing ers test for handling multiple vttablet failures with default values of flags

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add a new lock-timeout flag and use that instead of remote-operation-timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: augment DownPrimary test to reproduce the issue of VTOrc not handling multiple failures

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove LockShardTimeout configuration from VTOrc and add parallelism to refresh of tablets

Signed-off-by: Manan Gupta <manan@planetscale.com>

* log: add more logging lines around ers in vtorc

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: get the test to work

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix usage of wait for replicas timeout

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags expected output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix race in test now that the function is called in parallel multiple times

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix default of onCloseTimeout to 1 second

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add failing unit test to refreshTabletsInKeyspaceShard

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix vtorc to not forget a tablet which has been deleted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix backward compatibility, add tests and release notes docs

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flags output

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: use disable-replication-manager instead of disable-active-reparents to allow vttablets to setup replication when restarted

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix flaky test by not checking for an error

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: handle the case of empty hostname in tablet initialization

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: update onclose timeout to 10 seconds

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix unit test

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: address review comments

Signed-off-by: Manan Gupta <manan@planetscale.com>

* docs: add comments explaining the test functions

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add summary docs for 'lock-shard-timeout' deprecation

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
…fied (vitessio#11936)

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
…e and shard. (vitessio#12012)

* feat: refactor generate commands of VTOrc to be in a single file

Signed-off-by: Manan Gupta <manan@planetscale.com>

* refactor: cleanup create table formatting

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: cleanup the usage of IsSQLite and IsMySQL

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unused minimal instance

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unused table cluster_domain_name

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix vtorc database to store keyspace and shard instead of cluster

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unused attributes

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unused cluster domain

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: change GetClusterName to GetKeyspaceAndShardName

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix insertion into database_instance

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix SnapshotTopologies

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove inject unseen primary and inject seed

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove ClusterName from Instance

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix Audit operations

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add Keyspace and Shard to cluster information to replace ClusterName

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix attempt failure detection registeration

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix blocked topology recoveries

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix topology recovery

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: reading recovery instances

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix get replication and analysis

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix bug in query

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add tests to check that filtering by keyspace works for APIs

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove remaining usages of ClusterName

Signed-off-by: Manan Gupta <manan@planetscale.com>

* refactor: fix comment explaining sleep in the test

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add code to prevent filtering just by shard and add tests for it

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
* feat: add failing test and fix the query of insertion

Signed-off-by: Manan Gupta <manan@planetscale.com>

* empty-commit

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
* test: add a failing test for the case where the port changes for a tablet

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix the issue by adding alias as a unique field

Signed-off-by: Manan Gupta <manan@planetscale.com>

* empty-commit

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
Copy link

Thanks for the contribution! Before we can merge this, we need @GuptaManan100 to sign the Salesforce Inc. Contributor License Agreement.

@github-actions github-actions bot added this to the v15.0.5 milestone Mar 15, 2024
@timvaillancourt timvaillancourt changed the title Bp v16 vtorc fixes slack vitess r15.0.5 slack-vitess-r15.0.5: backport v16 vtorc fixes Mar 15, 2024
* Move vtorc from go-sqlite3 to modernc.org/sqlite

This moves vtorc from the go-sqlite3 library that uses CGO, to use
modernc.org/sqlite which is a pure Go implementation.

vtorc is the only component we have to build with CGO but it's causing
pain for releases since we need to build it against an old Linux for
linking against glibc.

Using modernc.org/sqlite allows for using Go only again and makes all
Vitess components buildable without CGO.

In
https://datastation.multiprocess.io/blog/2022-05-12-sqlite-in-go-with-and-without-cgo.html
someone ran some basic benchmarks. It shows that the pure Go version can
be twice as slow, but the usage of vtorc is very limited and we operate
on small datasets, so I think the performance impact purely of a
somewhat slower sqlite implementation is negligable.

None of this is in a hot query serving path or anything like that, so I
have little concern performance wise.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Fix error handling in RowToArray

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

---------

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Copy link

Thanks for the contribution! Before we can merge this, we need @GuptaManan100 @dbussink to sign the Salesforce Inc. Contributor License Agreement.

@timvaillancourt
Copy link
Member Author

This PR fails reliably on reparenting with our slack-vitess-r14.0.5 release. It doesn't seem to be a "flaky CI" problem

The 2 failures are:

  1. Run Upgrade Downgrade Test - Reparent Old Vtctl:
I0315 21:40:08.269356   82204 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21406 Validate -- --ping-tablets=true (attempt 1 of 10)
I0315 21:40:08.283520   82204 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21406 GetTablet zone1-0000000101 (attempt 1 of 10)
I0315 21:40:08.292355   82204 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21406 GetTablet -- zone1-0000000101 (attempt 1 of 10)
I0315 21:40:08.298516   82204 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21406 Validate (attempt 1 of 10)
I0315 21:40:08.811477   82204 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21406 ShardReplicationPositions ks/0 (attempt 1 of 10)
I0315 21:40:08.820775   82204 utils.go:580] Positions:
I0315 21:40:08.820787   82204 utils.go:582] 	zone1-0000000101 ks 0 primary localhost:21407 localhost:21409 [] 2024-03-15T21:40:08Z MySQL56/94b66df1-e314-11ee-97dd-002248b56b4c:1-16 0
I0315 21:40:08.820790   82204 utils.go:582] 	zone1-0000000102 ks 0 replica localhost:21410 localhost:21412 [] <null> MySQL56/94b66df1-e314-11ee-97dd-002248b56b4c:1-16 0
I0315 21:40:08.820793   82204 utils.go:582] 	zone1-0000000103 ks 0 replica localhost:21413 localhost:21415 [] <null> MySQL56/94b66df1-e314-11ee-97dd-002248b56b4c:1-16 0
I0315 21:40:08.820795   82204 utils.go:582] 	zone2-0000000201 ks 0 replica localhost:21416 localhost:21418 [] <null> MySQL56/94b66df1-e314-11ee-97dd-002248b56b4c:1-16 0
I0315 21:40:09.123646   82204 vtctldclient_process.go:67] Executing vtctldclient with command: vtctldclient --server localhost:21406 GetFullStatus zone1-0000000101 (attempt 1 of 10)
    reparent_test.go:446: 
        	Error Trace:	reparent_test.go:446
        	Error:      	Received unexpected error:
        	            	proto: syntax error (line 1:1): invalid value Executes
        	Test:       	TestFullStatus
--- FAIL: TestFullStatus (18.41s)
  1. Run Upgrade Downgrade Test - Reparent Old VTTablet:
I0315 22:54:48.242508   82784 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21206 PlannedReparentShard -- --keyspace_shard ks/0 --wait_replicas_timeout 31s --new_primary zone1-101 (attempt 1 of 10)
I0315 22:54:48.391025   82784 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21206 Validate -- --ping-tablets=true (attempt 1 of 10)
I0315 22:54:48.407181   82784 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21206 GetTablet zone1-0000000101 (attempt 1 of 10)
I0315 22:54:48.415258   82784 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21206 GetTablet -- zone1-0000000101 (attempt 1 of 10)
I0315 22:54:48.421470   82784 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21206 Validate (attempt 1 of 10)
I0315 22:54:48.933712   82784 vtctlclient_process.go:214] Executing vtctlclient with command: vtctlclient --server localhost:21206 ShardReplicationPositions ks/0 (attempt 1 of 10)
I0315 22:54:48.944119   82784 utils.go:580] Positions:
I0315 22:54:48.944138   82784 utils.go:582] 	zone1-0000000101 ks 0 primary localhost:21207 localhost:21209 [] 2024-03-15T22:54:48Z MySQL56/0312c62d-e31f-11ee-99b8-000d3ae4740e:1-16 0
I0315 22:54:48.944146   82784 utils.go:582] 	zone1-0000000103 ks 0 replica localhost:21213 localhost:21215 [] <null> MySQL56/0312c62d-e31f-11ee-99b8-000d3ae4740e:1-16 0
I0315 22:54:48.944149   82784 utils.go:582] 	zone1-0000000102 ks 0 replica localhost:21210 localhost:21212 [] <null> MySQL56/0312c62d-e31f-11ee-99b8-000d3ae4740e:1-16 0
I0315 22:54:48.944152   82784 utils.go:582] 	zone2-0000000201 ks 0 replica localhost:21216 localhost:21218 [] <null> MySQL56/0312c62d-e31f-11ee-99b8-000d3ae4740e:1-16 0
    utils.go:489: 
        	Error Trace:	utils.go:489
        	            				reparent_test.go:404
        	Error:      	Not equal: 
        	            	expected: "ON"
        	            	actual  : "OFF"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-ON
        	            	+OFF
        	Test:       	TestCrossCellDurability
--- FAIL: TestCrossCellDurability (16.06s)

It could just be coincidence but I would like to rule out that #177 introduced this, because I believe this passed on upstream v14.0.5 before I fixed v15 to use our v14 build

cc @tanjinx / @vmogilev

@timvaillancourt timvaillancourt marked this pull request as ready for review March 26, 2024 22:28
@timvaillancourt timvaillancourt requested a review from a team as a code owner March 26, 2024 22:28
@timvaillancourt timvaillancourt merged commit 36d8315 into slack-vitess-r15.0.5 May 16, 2024
189 of 197 checks passed
@timvaillancourt timvaillancourt deleted the bp-v16-vtorc-fixes-slack-vitess-r15.0.5 branch May 16, 2024 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants