Ensure BinlogEventBuffer Channel is initialized before running the BinlogStreamer #331

Open · Scalvando opened this issue Feb 16, 2022 · 0 comments
Labels: Bug (Something isn't working)


Scalvando commented Feb 16, 2022

Problem

During a run of copydb we found a problem where the binlog streamer lag kept increasing; the lag was almost identical to the run time of the copy.

Tracing and Investigation

A huge thanks to @shuhaowu for helping me investigate this problem and find the bug.

Initially I looked at the logs for the copydb container:

time="2022-02-11T16:17:51Z" level=info msg="starting binlog streaming" file=mysql-bin.001433 host=/app/tmp/mysql_connection.sock port=0 position=9494960 tag=source_binlog_streamer
...
time="2022-02-11T16:17:52Z" level=info msg="starting binlog streamer" tag=source_binlog_streamer
time="2022-02-11T16:17:52Z" level=info msg="binlog file rotated" last_filename=mysql-bin.001433 last_position=9494960 new_filename=mysql-bin.001433 new_position=9494960 tag=source_binlog_streamer

This was the last mention of the binlog streamer in the logs, so at this point I assumed the streamer had stopped for one reason or another.

To investigate further, we used the pprof endpoint on the copydb instance to gather more information.
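
For context, here is a minimal sketch of how a Go service can expose the pprof endpoints used below; copydb's actual wiring may differ, and the address here is a placeholder:

// Minimal sketch (assumption: copydb's real pprof setup may differ).
// Importing net/http/pprof registers handlers under /debug/pprof/ on the
// default mux: /debug/pprof/trace?seconds=30 returns an execution trace
// usable with `go tool trace`, and /debug/pprof/goroutine?debug=2 dumps
// the stacks of all running goroutines.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}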

First, we took a 30-second trace from the copydb instance. The blocking information showed that the main program thread's time was spent almost entirely on the inline verifier's PeriodicallyVerifyBinlogEvents.

[Screenshot: trace blocking profile dominated by PeriodicallyVerifyBinlogEvents]

There wasn't much more useful information from the trace, aside from the fact that the binlog streamer did not appear in it. We looked to see if anything was deadlocked or waiting on a mutex, but nothing showed up.

We went back to the pprof endpoint for copydb and looked at the goroutine stack dump, which takes a snapshot of the running goroutines. After eliminating the goroutines that weren't relevant, we were left with the ones related to binlogs.

Of the remaining stack dumps, these two were of importance:

Goroutine Stack Dump

goroutine 34 [chan receive, 6019 minutes]:
github.com/Shopify/ghostferry.(*BinlogWriter).Run(0xc00028da00)
	/tmp/ghostferry/binlog_writer.go:36 +0x355
github.com/Shopify/ghostferry.(*Ferry).Run.func8(0xc0004781f0, 0xc0000c00f0)
	/tmp/ghostferry/ferry.go:743 +0x57
created by github.com/Shopify/ghostferry.(*Ferry).Run
	/tmp/ghostferry/ferry.go:741 +0x2ed

goroutine 35 [chan send (nil chan), 6019 minutes]:
github.com/Shopify/ghostferry.(*BinlogWriter).BufferBinlogEvents(0xc00028da00, 0xc00048c2f0, 0x1, 0x1, 0x3ff0000000000000, 0x1)
	/tmp/ghostferry/binlog_writer.go:75 +0x7a
github.com/Shopify/ghostferry.(*BinlogStreamer).handleRowsEvent(0xc0002a05a0, 0xc0002ef290, 0x0, 0x0, 0x0, 0xc00047af00, 0x16)
	/tmp/ghostferry/binlog_streamer.go:403 +0x6aa
github.com/Shopify/ghostferry.(*BinlogStreamer).Run(0xc0002a05a0)
	/tmp/ghostferry/binlog_streamer.go:234 +0xc5e
github.com/Shopify/ghostferry.(*Ferry).Run.func9(0xc0004781f0, 0xc0000c00f0)
	/tmp/ghostferry/ferry.go:750 +0x5b
created by github.com/Shopify/ghostferry.(*Ferry).Run
	/tmp/ghostferry/ferry.go:747 +0x339

Here we can see that while the BinlogWriter was reading from the channel, the BinlogStreamer was sending the events into a nil channel. We refreshed a number of times to make sure it wasn't an anomaly.

The blocked durations (6019 minutes) also corresponded with the lag and run time of the copy, showing that this had been happening almost since the beginning of the run.

Bug

When ferry.go creates a new BinlogWriter, it does not create the event buffer channel. That happens in BinlogWriter.Run().
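
Here is a minimal, self-contained sketch of that pattern (simplified types and field names, not the actual ghostferry code) showing why this leaves a window where events can be sent on a nil channel:

// Sketch only: the buffer channel is created inside Run(), so a
// BufferBinlogEvents call that races ahead of Run() sends on a nil
// channel and blocks forever.
package main

import "time"

type event struct{} // stand-in for ghostferry's binlog event type

type binlogWriter struct {
	buffer chan event // nil until Run() initializes it
}

func (w *binlogWriter) Run() {
	w.buffer = make(chan event, 100) // channel only exists once Run() starts
	for range w.buffer {
		// write the event to the target DB...
	}
}

func (w *binlogWriter) BufferBinlogEvents(events []event) {
	for _, ev := range events {
		w.buffer <- ev // if Run() hasn't assigned buffer yet, this is a send on a nil chan
	}
}

func main() {
	w := &binlogWriter{}
	go w.BufferBinlogEvents([]event{{}}) // the streamer's event listener can win the race
	go w.Run()                           // the channel is created too late
	time.Sleep(time.Second)              // the send above stays blocked forever
}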

The problem is in ghostferry/ferry.go, lines 741 to 752 (at abcda3a):

go func() {
	defer binlogWg.Done()
	f.BinlogWriter.Run()
}()
binlogWg.Add(1)
go func() {
	defer binlogWg.Done()
	f.BinlogStreamer.Run()
	f.BinlogWriter.Stop()
}()

Here there is a chance that BinlogStreamer.Run() runs before BinlogWriter.Run() has had the chance to create the channel. This can result in the binlog streamer's event listener (BinlogWriter.BufferBinlogEvents) trying to write binlog events into the uninitialized BinlogEventBuffer channel.

Here's an example of a similar scenario we used to reproduce the stack dump:

//Taken from https://golangbyexample.com/send-receive-nil-channel-go/
package main

import (
	"fmt"
	"time"
)

func main() {
	var ch chan int
	go send(ch)
	ch = make(chan int)
	<-ch
	time.Sleep(time.Second * 1)
}

func send(ch chan int) {
	fmt.Println("Sending value to channel start")
	ch <- 1
	fmt.Println("Sending value to channel finish")
}

The resulting goroutine dump is very similar to what we saw in ghostferry.

Sending value to channel start
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [chan receive]:
main.main()
	/tmp/sandbox3043519188/prog.go:13 +0x90

goroutine 6 [chan send (nil chan)]:
main.send(0x0)
	/tmp/sandbox3043519188/prog.go:19 +0x6c
created by main.main
	/tmp/sandbox3043519188/prog.go:11 +0x7b

Program exited.

Possible Solution

Currently, NewBinlogWriter in ferry.go just creates a new BinlogWriter but doesn't initialize the channel:

ghostferry/ferry.go, lines 163 to 178 (at abcda3a):

func (f *Ferry) NewBinlogWriter() *BinlogWriter {
	f.ensureInitialized()
	return &BinlogWriter{
		DB:               f.TargetDB,
		DatabaseRewrites: f.Config.DatabaseRewrites,
		TableRewrites:    f.Config.TableRewrites,
		Throttler:        f.Throttler,
		BatchSize:        f.Config.BinlogEventBatchSize,
		WriteRetries:     f.Config.DBWriteRetries,
		ErrorHandler:     f.ErrorHandler,
		StateTracker:     f.StateTracker,
	}
}

Instead of immediately returning the writer, NewBinlogWriter could initialize the buffer channel on the writer in ferry.go and then return it.

This would be similar to what we do in NewBatchWriter:

ghostferry/ferry.go, lines 186 to 202 (at abcda3a):

func (f *Ferry) NewBatchWriter() *BatchWriter {
	f.ensureInitialized()
	batchWriter := &BatchWriter{
		DB:                 f.TargetDB,
		StateTracker:       f.StateTracker,
		DatabaseRewrites:   f.Config.DatabaseRewrites,
		TableRewrites:      f.Config.TableRewrites,
		WriteRetries:       f.Config.DBWriteRetries,
		enableRowBatchSize: f.Config.EnableRowBatchSize,
	}
	batchWriter.Initialize()
	return batchWriter
}
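
A hedged sketch of what that could look like for NewBinlogWriter follows; the Initialize method and its behaviour are assumptions for illustration, not existing ghostferry API:

func (f *Ferry) NewBinlogWriter() *BinlogWriter {
	f.ensureInitialized()
	binlogWriter := &BinlogWriter{
		DB:               f.TargetDB,
		DatabaseRewrites: f.Config.DatabaseRewrites,
		TableRewrites:    f.Config.TableRewrites,
		Throttler:        f.Throttler,
		BatchSize:        f.Config.BinlogEventBatchSize,
		WriteRetries:     f.Config.DBWriteRetries,
		ErrorHandler:     f.ErrorHandler,
		StateTracker:     f.StateTracker,
	}
	// Hypothetical Initialize() that creates the BinlogEventBuffer channel
	// before the writer is handed to the streamer, mirroring NewBatchWriter.
	binlogWriter.Initialize()
	return binlogWriter
}

With the channel created at construction time, the relative ordering of BinlogWriter.Run() and BinlogStreamer.Run() in ferry.go would no longer matter for BufferBinlogEvents.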
