Skip to content

Commit

Permalink
Merge branch 'fix-infinite-cutover-loop' of github.com:github/gh-ost …
Browse files Browse the repository at this point in the history
…into fix-infinite-cutover-loop
  • Loading branch information
Shlomi Noach committed Mar 7, 2017
2 parents 7f72872 + 7517238 commit fc268ab
Show file tree
Hide file tree
Showing 175 changed files with 26,128 additions and 418 deletions.
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,9 +58,10 @@ More tips:
Also see:

- [requirements and limitations](doc/requirements-and-limitations.md)
- [common questions](doc/questions.md)
- [what if?](doc/what-if.md)
- [the fine print](doc/the-fine-print.md)
- [Questions](https://github.com/github/gh-ost/issues?q=label%3Aquestion)
- [Community questions](https://github.com/github/gh-ost/issues?q=label%3Aquestion)

## What's in a name?

Expand Down
6 changes: 5 additions & 1 deletion doc/cheatsheet.md
Original file line number Diff line number Diff line change
Expand Up @@ -146,8 +146,12 @@ gh-ost --allow-master-master --assume-master-host=a.specific.master.com

Topologies using _tungsten replicator_ are peculiar in that the participating servers are not actually aware they are replicating. The _tungsten replicator_ looks just like another app issuing queries on those hosts. `gh-ost` is unable to identify that a server participates in a _tungsten_ topology.

If you choose to migrate directly on master (see above), there's nothing special you need to do. If you choose to migrate via replica, then you must supply the identity of the master, and indicate this is a tungsten setup, as follows:
If you choose to migrate directly on master (see above), there's nothing special you need to do.

If you choose to migrate via replica, then you need to make sure Tungsten is configured with log-slave-updates parameter (note this is different from MySQL's own log-slave-updates parameter), otherwise changes will not be in the replica's binlog, causing data to be corrupted after table swap. You must also supply the identity of the master, and indicate this is a tungsten setup, as follows:

```
gh-ost --tungsten --assume-master-host=the.topology.master.com
```

Also note that `--switch-to-rbr` does not work for a Tungsten setup as the replication process is external, so you need to make sure `binlog_format` is set to ROW before Tungsten Replicator connects to the server and starts applying events from the master.
4 changes: 4 additions & 0 deletions doc/command-line-flags.md
Original file line number Diff line number Diff line change
Expand Up @@ -130,3 +130,7 @@ See `approve-renamed-columns`
### test-on-replica

Issue the migration on a replica; do not modify data on master. Useful for validating, testing and benchmarking. See [testing-on-replica](testing-on-replica.md)

### timestamp-old-table

Makes the _old_ table include a timestamp value. The _old_ table is what the original table is renamed to at the end of a successful migration. For example, if the table is `gh_ost_test`, then the _old_ table would normally be `_gh_ost_test_del`. With `--timestamp-old-table` it would be, for example, `_gh_ost_test_20170221103147_del`.
15 changes: 12 additions & 3 deletions doc/interactive-commands.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,9 @@ Both interfaces may serve at the same time. Both respond to simple text command,
- The `critical-load` format must be: `some_status=<numeric-threshold>[,some_status=<numeric-threshold>...]`'
- For example: `Threads_running=1000,threads_connected=5000`, and you would then write/echo `critical-load=Threads_running=1000,threads_connected=5000` to the socket.
- `nice-ratio=<ratio>`: change _nice_ ratio: 0 for aggressive (not nice, not sleeping), positive integer `n`:
- For any `1ms` spent copying rows, spend `n*1ms` units of time sleeping.
- Examples: assume a single rows chunk copy takes `100ms` to complete.
- `nice-ratio=0.5` will cause `gh-ost` to sleep for `50ms` immediately following.
- For any `1ms` spent copying rows, spend `n*1ms` units of time sleeping.
- Examples: assume a single rows chunk copy takes `100ms` to complete.
- `nice-ratio=0.5` will cause `gh-ost` to sleep for `50ms` immediately following.
- `nice-ratio=1` will cause `gh-ost` to sleep for `100ms`, effectively doubling runtime
- value of `2` will effectively triple the runtime; etc.
- `throttle-query`: change throttle query
Expand All @@ -38,6 +38,10 @@ Both interfaces may serve at the same time. Both respond to simple text command,
- `unpostpone`: at a time where `gh-ost` is postponing the [cut-over](cut-over.md) phase, instruct `gh-ost` to stop postponing and proceed immediately to cut-over.
- `panic`: immediately panic and abort operation

### Querying for data

For commands that accept an argumetn as value, pass `?` (question mark) to _get_ current value rather than _set_ a new one.

### Examples

While migration is running:
Expand All @@ -63,6 +67,11 @@ $ echo "chunk-size=250" | nc -U /tmp/gh-ost.test.sample_data_0.sock
# Serving on TCP port: 10001
```

```shell
$ echo "chunk-size=?" | nc -U /tmp/gh-ost.test.sample_data_0.sock
250
```

```shell
$ echo throttle | nc -U /tmp/gh-ost.test.sample_data_0.sock

Expand Down
26 changes: 26 additions & 0 deletions doc/questions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# How?

### How does the cut-over work? Is it really atomic?

The cut-over phase, where the original table is swapped away, and the _ghost_ table takes its place, is an atomic, blocking, controlled operation.

- Atomic: the tables are swapped together. There is no gap where your table does not exist.
- Blocking: all app queries involving the migrated (original) table are either operate on the original table, or are blocked, or proceed to operate on the _new_ table (formerly the _ghost_ table, now swapped in).
- Controlled: the cut-over times out at pre-defined threshold, and is atomically aborted, then re-attempted. Cut-over only takes place when no lags are present, and otherwise no throttling reason is found. Cut-over step itself gets high priority and is never throttled.

Read more on [cut-over](cut-over.md) and on the [cut-over design Issue](https://github.com/github/gh-ost/issues/82)


# Is it possible to?

### Is it possible to add a UNIQUE KEY?

Adding a `UNIQUE KEY` is possible, in the condition that no violation will occur. That is, you must make sure there aren't any violating rows on your table before, and during the migration.

At this time there is no equivalent to `ALTER IGNORE`, where duplicates are implicitly and silently thrown away. The MySQL `5.7` docs say:

> As of MySQL 5.7.4, the IGNORE clause for ALTER TABLE is removed and its use produces an error.
It is therefore unlikely that `gh-ost` will support this behavior.

# Why
20 changes: 14 additions & 6 deletions go/base/context.go
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,7 @@ type MigrationContext struct {
OkToDropTable bool
InitiallyDropOldTable bool
InitiallyDropGhostTable bool
TimestampOldTable bool // Should old table name include a timestamp
CutOverType CutOver
ReplicaServerId uint

Expand All @@ -135,7 +136,9 @@ type MigrationContext struct {
OriginalBinlogFormat string
OriginalBinlogRowImage string
InspectorConnectionConfig *mysql.ConnectionConfig
InspectorMySQLVersion string
ApplierConnectionConfig *mysql.ConnectionConfig
ApplierMySQLVersion string
StartTime time.Time
RowCopyStartTime time.Time
RowCopyEndTime time.Time
Expand Down Expand Up @@ -232,11 +235,12 @@ func (this *MigrationContext) GetGhostTableName() string {

// GetOldTableName generates the name of the "old" table, into which the original table is renamed.
func (this *MigrationContext) GetOldTableName() string {
if this.TestOnReplica {
return fmt.Sprintf("_%s_ght", this.OriginalTableName)
}
if this.MigrateOnReplica {
return fmt.Sprintf("_%s_ghr", this.OriginalTableName)
if this.TimestampOldTable {
t := this.StartTime
timestamp := fmt.Sprintf("%d%02d%02d%02d%02d%02d",
t.Year(), t.Month(), t.Day(),
t.Hour(), t.Minute(), t.Second())
return fmt.Sprintf("_%s_%s_del", this.OriginalTableName, timestamp)
}
return fmt.Sprintf("_%s_del", this.OriginalTableName)
}
Expand Down Expand Up @@ -559,7 +563,11 @@ func (this *MigrationContext) GetControlReplicasLagResult() mysql.ReplicationLag
func (this *MigrationContext) SetControlReplicasLagResult(lagResult *mysql.ReplicationLagResult) {
this.throttleMutex.Lock()
defer this.throttleMutex.Unlock()
this.controlReplicasLagResult = *lagResult
if lagResult == nil {
this.controlReplicasLagResult = *mysql.NewNoReplicationLagResult()
} else {
this.controlReplicasLagResult = *lagResult
}
}

func (this *MigrationContext) GetThrottleControlReplicaKeys() *mysql.InstanceKeyMap {
Expand Down
18 changes: 12 additions & 6 deletions go/binlog/gomysql_reader.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ import (
"github.com/outbrain/golib/log"
gomysql "github.com/siddontang/go-mysql/mysql"
"github.com/siddontang/go-mysql/replication"
"golang.org/x/net/context"
)

type GoMySQLReader struct {
Expand All @@ -39,7 +40,16 @@ func NewGoMySQLReader(connectionConfig *mysql.ConnectionConfig) (binlogReader *G
}

serverId := uint32(binlogReader.MigrationContext.ReplicaServerId)
binlogReader.binlogSyncer = replication.NewBinlogSyncer(serverId, "mysql")

binlogSyncerConfig := &replication.BinlogSyncerConfig{
ServerID: serverId,
Flavor: "mysql",
Host: connectionConfig.Key.Hostname,
Port: uint16(connectionConfig.Key.Port),
User: connectionConfig.User,
Password: connectionConfig.Password,
}
binlogReader.binlogSyncer = replication.NewBinlogSyncer(binlogSyncerConfig)

return binlogReader, err
}
Expand All @@ -49,10 +59,6 @@ func (this *GoMySQLReader) ConnectBinlogStreamer(coordinates mysql.BinlogCoordin
if coordinates.IsEmpty() {
return log.Errorf("Emptry coordinates at ConnectBinlogStreamer()")
}
log.Infof("Registering replica at %+v:%+v", this.connectionConfig.Key.Hostname, uint16(this.connectionConfig.Key.Port))
if err := this.binlogSyncer.RegisterSlave(this.connectionConfig.Key.Hostname, uint16(this.connectionConfig.Key.Port), this.connectionConfig.User, this.connectionConfig.Password); err != nil {
return err
}

this.currentCoordinates = coordinates
log.Infof("Connecting binlog streamer at %+v", this.currentCoordinates)
Expand Down Expand Up @@ -126,7 +132,7 @@ func (this *GoMySQLReader) StreamEvents(canStopStreaming func() bool, entriesCha
if canStopStreaming() {
break
}
ev, err := this.binlogStreamer.GetEvent()
ev, err := this.binlogStreamer.GetEvent(context.Background())
if err != nil {
return err
}
Expand Down
3 changes: 2 additions & 1 deletion go/cmd/gh-ost/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,7 @@ func main() {
flag.BoolVar(&migrationContext.OkToDropTable, "ok-to-drop-table", false, "Shall the tool drop the old table at end of operation. DROPping tables can be a long locking operation, which is why I'm not doing it by default. I'm an online tool, yes?")
flag.BoolVar(&migrationContext.InitiallyDropOldTable, "initially-drop-old-table", false, "Drop a possibly existing OLD table (remains from a previous run?) before beginning operation. Default is to panic and abort if such table exists")
flag.BoolVar(&migrationContext.InitiallyDropGhostTable, "initially-drop-ghost-table", false, "Drop a possibly existing Ghost table (remains from a previous run?) before beginning operation. Default is to panic and abort if such table exists")
flag.BoolVar(&migrationContext.TimestampOldTable, "timestamp-old-table", false, "Use a timestamp in old table name. This makes old table names unique and non conflicting cross migrations")
cutOver := flag.String("cut-over", "atomic", "choose cut-over type (default|atomic, two-step)")
flag.BoolVar(&migrationContext.ForceNamedCutOverCommand, "force-named-cut-over", false, "When true, the 'unpostpone|cut-over' interactive command must name the migrated table")

Expand Down Expand Up @@ -242,5 +243,5 @@ func main() {
migrator.ExecOnFailureHook()
log.Fatale(err)
}
log.Info("Done")
fmt.Fprintf(os.Stdout, "# Done\n")
}
9 changes: 7 additions & 2 deletions go/logic/applier.go
Original file line number Diff line number Diff line change
Expand Up @@ -70,14 +70,15 @@ func (this *Applier) InitDBConnections() (err error) {
if err := this.readTableColumns(); err != nil {
return err
}
log.Infof("Applier initiated on %+v, version %+v", this.connectionConfig.ImpliedKey, this.migrationContext.ApplierMySQLVersion)
return nil
}

// validateConnection issues a simple can-connect to MySQL
func (this *Applier) validateConnection(db *gosql.DB) error {
query := `select @@global.port`
query := `select @@global.port, @@global.version`
var port int
if err := db.QueryRow(query).Scan(&port); err != nil {
if err := db.QueryRow(query).Scan(&port, &this.migrationContext.ApplierMySQLVersion); err != nil {
return err
}
if port != this.connectionConfig.Key.Port {
Expand Down Expand Up @@ -141,6 +142,10 @@ func (this *Applier) ValidateOrDropExistingTables() error {
return err
}
}
if len(this.migrationContext.GetOldTableName()) > mysql.MaxTableNameLength {
log.Fatalf("--timestamp-old-table defined, but resulting table name (%s) is too long (only %d characters allowed)", this.migrationContext.GetOldTableName(), mysql.MaxTableNameLength)
}

if this.tableExists(this.migrationContext.GetOldTableName()) {
return fmt.Errorf("Table %s already exists. Panicking. Use --initially-drop-old-table to force dropping it, though I really prefer that you drop it or rename it away", sql.EscapeName(this.migrationContext.GetOldTableName()))
}
Expand Down
9 changes: 5 additions & 4 deletions go/logic/inspect.go
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@ func (this *Inspector) InitDBConnections() (err error) {
if err := this.applyBinlogFormat(); err != nil {
return err
}
log.Infof("Inspector initiated on %+v, version %+v", this.connectionConfig.ImpliedKey, this.migrationContext.InspectorMySQLVersion)
return nil
}

Expand Down Expand Up @@ -168,9 +169,9 @@ func (this *Inspector) inspectOriginalAndGhostTables() (err error) {

// validateConnection issues a simple can-connect to MySQL
func (this *Inspector) validateConnection() error {
query := `select @@global.port`
query := `select @@global.port, @@global.version`
var port int
if err := this.db.QueryRow(query).Scan(&port); err != nil {
if err := this.db.QueryRow(query).Scan(&port, &this.migrationContext.InspectorMySQLVersion); err != nil {
return err
}
if port != this.connectionConfig.Key.Port {
Expand Down Expand Up @@ -347,7 +348,7 @@ func (this *Inspector) validateLogSlaveUpdates() error {
}

if this.migrationContext.IsTungsten {
log.Warning("log_slave_updates not found on %s:%d, but --tungsten provided, so I'm proceeding", this.connectionConfig.Key.Hostname, this.connectionConfig.Key.Port)
log.Warningf("log_slave_updates not found on %s:%d, but --tungsten provided, so I'm proceeding", this.connectionConfig.Key.Hostname, this.connectionConfig.Key.Port)
return nil
}

Expand All @@ -356,7 +357,7 @@ func (this *Inspector) validateLogSlaveUpdates() error {
}

if this.migrationContext.InspectorIsAlsoApplier() {
log.Warning("log_slave_updates not found on %s:%d, but executing directly on master, so I'm proceeeding", this.connectionConfig.Key.Hostname, this.connectionConfig.Key.Port)
log.Warningf("log_slave_updates not found on %s:%d, but executing directly on master, so I'm proceeeding", this.connectionConfig.Key.Hostname, this.connectionConfig.Key.Port)
return nil
}

Expand Down
25 changes: 23 additions & 2 deletions go/logic/migrator.go
Original file line number Diff line number Diff line change
Expand Up @@ -791,6 +791,12 @@ func (this *Migrator) printMigrationStatusHint(writers ...io.Writer) {
throttleQuery,
))
}
if throttleControlReplicaKeys := this.migrationContext.GetThrottleControlReplicaKeys(); throttleControlReplicaKeys.Len() > 0 {
fmt.Fprintln(w, fmt.Sprintf("# throttle-control-replicas count: %+v",
throttleControlReplicaKeys.Len(),
))
}

if this.migrationContext.PostponeCutOverFlagFile != "" {
setIndicator := ""
if base.FileExists(this.migrationContext.PostponeCutOverFlagFile) {
Expand Down Expand Up @@ -970,7 +976,9 @@ func (this *Migrator) initiateThrottler() error {

go this.throttler.initiateThrottlerCollection(this.firstThrottlingCollected)
log.Infof("Waiting for first throttle metrics to be collected")
<-this.firstThrottlingCollected
<-this.firstThrottlingCollected // replication lag
<-this.firstThrottlingCollected // other metrics
log.Infof("First throttle metrics collected")
go this.throttler.initiateThrottlerChecks()

return nil
Expand Down Expand Up @@ -1022,11 +1030,13 @@ func (this *Migrator) iterateChunks() error {
for {
if atomic.LoadInt64(&this.rowCopyCompleteFlag) == 1 {
// Done
// There's another such check down the line
return nil
}
copyRowsFunc := func() error {
if atomic.LoadInt64(&this.rowCopyCompleteFlag) == 1 {
// Done
// Done.
// There's another such check down the line
return nil
}
hasFurtherRange, err := this.applier.CalculateNextIterationRangeEndValues()
Expand All @@ -1038,6 +1048,17 @@ func (this *Migrator) iterateChunks() error {
}
// Copy task:
applyCopyRowsFunc := func() error {
if atomic.LoadInt64(&this.rowCopyCompleteFlag) == 1 {
// No need for more writes.
// This is the de-facto place where we avoid writing in the event of completed cut-over.
// There could _still_ be a race condition, but that's as close as we can get.
// What about the race condition? Well, there's actually no data integrity issue.
// when rowCopyCompleteFlag==1 that means **guaranteed** all necessary rows have been copied.
// But some are still then collected at the binary log, and these are the ones we're trying to
// not apply here. If the race condition wins over us, then we just attempt to apply onto the
// _ghost_ table, which no longer exists. So, bothering error messages and all, but no damage.
return nil
}
_, rowsAffected, _, err := this.applier.ApplyIterationInsertQuery()
if err != nil {
return terminateRowIteration(err)
Expand Down
Loading

0 comments on commit fc268ab

Please sign in to comment.