Stop swarm services by scaling down and up #328

Closed · wants to merge 1 commit
4 changes: 2 additions & 2 deletions cmd/backup/main.go
@@ -44,12 +44,12 @@ func main() {
}()

s.must(s.withLabeledCommands(lifecyclePhaseArchive, func() error {
restartContainers, err := s.stopContainers()
restartContainersOrServices, err := s.stopContainersOrServices()
// The mechanism for restarting containers is not using hooks as it
// should happen as soon as possible (i.e. before uploading backups or
// similar).
defer func() {
s.must(restartContainers())
s.must(restartContainersOrServices())
}()
if err != nil {
return err
134 changes: 99 additions & 35 deletions cmd/backup/script.go
@@ -30,10 +30,10 @@ import (
"github.com/ProtonMail/go-crypto/openpgp"
"github.com/containrrr/shoutrrr"
"github.com/containrrr/shoutrrr/pkg/router"
"github.com/docker/cli/cli/command/service/progress"
"github.com/docker/docker/api/types"
ctr "github.com/docker/docker/api/types/container"
"github.com/docker/docker/api/types/filters"
"github.com/docker/docker/api/types/swarm"
"github.com/docker/docker/client"
"github.com/leekchan/timeutil"
"github.com/offen/envconfig"
@@ -318,14 +318,107 @@ func newScript() (*script, error) {
return s, nil
}

// stopContainers stops all Docker containers that are marked as to being
// stopped during the backup and returns a function that can be called to
// restart everything that has been stopped.
func (s *script) stopContainers() (func() error, error) {
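// noopWriteCloser wraps an io.Writer so it satisfies the io.WriteCloser
// interface expected by progress.ServiceProgress; Close is a no-op.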
type noopWriteCloser struct {
io.Writer
}

func (noopWriteCloser) Close() error {
return nil
}

func (s *script) stopContainersOrServices() (func() error, error) {
if s.cli == nil {
return noop, nil
}

dockerInfo, err := s.cli.Info(context.Background())
if err != nil {
return noop, fmt.Errorf("stopContainers: error getting docker info: %w", err)
}
isDockerSwarm := dockerInfo.Swarm.LocalNodeState != "inactive"
if isDockerSwarm {
Contributor commented:

I would recommend making it stop both the services and the containers.
It might be a bit of an odd approach, but it is possible to create containers in a docker swarm that are not bound to services.

@m90 (Member Author) commented on Jan 24, 2024:

> I would recommend making it stop both the services and the containers.

I'm not entirely sure I understand. If I stop containers in Swarm Mode, Swarm will try to restart them, unless the service has an on-failure restart policy. This is what's currently happening when you use this image with Swarm. From what I understand you cannot really "stop" a service; you can a. temporarily scale it down to 0 or b. delete and redeploy.

This is currently not reflected in the code, but I was thinking the best (also least breaking) approach here would be to do the following:

  • if the label is on the container(s) (i.e. not in deploy), the container will be stopped and restarted, keeping the existing behavior
  • if the label is put on the service (i.e. in deploy), the service will be scaled down and up again

If you want to help out here, that's more than welcome, although I would still like to understand the pros and cons of all approaches (stopping containers, removing services, scaling services) better so we can pick the best one.
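
A rough sketch of how that split could be wired up, assuming stopContainers keeps its current behavior and a new, hypothetical scaleDownServices helper handles services labeled via deploy (the names are placeholders, not part of this diff):

func (s *script) stopContainersAndServices() (func() error, error) {
	// Containers that carry the label directly keep the existing stop/restart behavior.
	restartContainers, err := s.stopContainers()
	if err != nil {
		return noop, err
	}
	// Services labeled via deploy are scaled down to zero replicas; the returned
	// function scales them back up to their previous replica count.
	scaleUpServices, err := s.scaleDownServices()
	if err != nil {
		// Best effort: bring already stopped containers back up before bailing out.
		_ = restartContainers()
		return noop, err
	}
	return func() error {
		if err := restartContainers(); err != nil {
			return err
		}
		return scaleUpServices()
	}, nil
}

That way existing container-label setups keep working, while Swarm users can opt into the deploy-label behavior.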

Contributor commented:

Sorry, I was not clear about the exact issue with not stopping the containers too.

Even if a node is part of a Docker Swarm, while unorthodox, it is still possible to create standalone containers on it by calling docker run.
Given the following scenario:

  1. Node is part of a Docker Swarm
  2. User for some reason calls docker run on the specific node
  3. Applies docker-volume-backup.stop-during-backup labels on the container
  4. Starts the backup process
  5. At this point, the backup will skip stopping containers, since it believes the containers are all part of a service, which is a false assumption.

To avoid this happening, I would recommend stopping both services and containers. The only error case here would be if a container has the label and the service that it belongs to also has it, so a container would be stopped twice (which is a configuration error, but can be ignored by the backup if needed).

Regarding your understanding of the services, that is correct. You cannot stop a service, only scale it down to 0.

The main difference between stopping all containers and scaling down is where the containers will be scheduled after the backup. In the first case they will stay on the same node; there is a guarantee that the container will not move to a new one. For scaling, it depends on how the service is configured: if the placement requirements allow, it is possible to schedule it on another node. This might cause issues if the service's configuration is faulty (and the volumes are local).

The advantage of the option to configure on service labels, at least for me, is reducing the chances of messing up the configuration. For containers, one has to apply the on-failure restart policy and put the labels on the containers, not the services. The latter is a bit counterintuitive, since they are usually on the services, e.g. for Traefik.

Removing and recreating the services sounds a bit too intrusive for my taste and I am not totally sure if you can save the state of the service/stack for redeployment. As a user of this product, I would prefer the scaling alternative instead.

I like the idea you have explained; it is also something that I would implement. It would not break previous configurations upon upgrade, but allows the tool to be configured in a more Swarm-friendly way. Also, having both options would allow the user to pick based on their exact scenario.

@m90 (Member Author) commented:

Ah, now I get it; it also makes sense seeing that you commented on this particular line. I think that should give us a pretty straightforward plan of what's to add (scale services to zero and up again when they are labeled). I'll see if I can try this out in the next couple of days and will keep you posted here. Thanks for your input.

return s.stopServices()
}
return s.stopContainers()
}

func (s *script) stopServices() (func() error, error) {
serviceLabel := fmt.Sprintf(
"docker-volume-backup.stop-during-backup=%s",
s.c.BackupStopContainerLabel,
)
allServices, err := s.cli.ServiceList(context.Background(), types.ServiceListOptions{})
if err != nil {
return noop, fmt.Errorf("stopServices: error querying services: %w", err)
}

matchingServices, err := s.cli.ServiceList(context.Background(), types.ServiceListOptions{
Filters: filters.NewArgs(filters.KeyValuePair{
Key: "label",
Value: serviceLabel,
}),
})
if err != nil {
return noop, fmt.Errorf("stopServices: error querying services: %w", err)
}

s.logger.Info(
fmt.Sprintf(
"Scaling down %d services(s) labeled `%s` out of %d running services(s).",
len(matchingServices),
serviceLabel,
len(allServices),
),
)

prevReplicas := make([]uint64, len(matchingServices))
for idx, service := range matchingServices {
var zero uint64
prevReplicas[idx] = *service.Spec.Mode.Replicated.Replicas
service.Spec.Mode.Replicated.Replicas = &zero
_, err := s.cli.ServiceUpdate(context.Background(), service.ID, service.Version, service.Spec, types.ServiceUpdateOptions{})
if err != nil {
return noop, fmt.Errorf("stopServices: error scaling down services: %w", err)
}
if err := progress.ServiceProgress(context.Background(), s.cli, service.ID, &noopWriteCloser{io.Discard}); err != nil {
return noop, fmt.Errorf("stopServices: error converging service: %w", err)
}
}

s.stats.Containers = ContainersStats{
All: uint(len(allServices)),
ToStop: uint(len(matchingServices)),
Stopped: uint(len(matchingServices)),
}

return func() error {
for idx, service := range matchingServices {
// Look the service up again: its version has changed since the scale
// down above, and passing the stale version would be rejected by the
// Docker API as out of sequence.
current, _, err := s.cli.ServiceInspectWithRaw(context.Background(), service.ID, types.ServiceInspectOptions{})
if err != nil {
return fmt.Errorf("stopServices: error looking up service %s: %w", service.Spec.Name, err)
}
current.Spec.Mode.Replicated.Replicas = &prevReplicas[idx]
_, err = s.cli.ServiceUpdate(context.Background(), current.ID, current.Version, current.Spec, types.ServiceUpdateOptions{})
if err != nil {
return fmt.Errorf("stopServices: error scaling up services: %w", err)
}
if err := progress.ServiceProgress(context.Background(), s.cli, current.ID, &noopWriteCloser{io.Discard}); err != nil {
return fmt.Errorf("stopServices: error converging service: %w", err)
}
}
s.logger.Info(
fmt.Sprintf(
"Restarted %d services(s).",
len(matchingServices),
),
)
return nil
}, nil
}

// stopContainers stops all Docker containers that are marked to be
// stopped during the backup and returns a function that can be called to
// restart everything that has been stopped.
func (s *script) stopContainers() (func() error, error) {
allContainers, err := s.cli.ContainerList(context.Background(), types.ContainerListOptions{})
if err != nil {
return noop, fmt.Errorf("stopContainers: error querying for containers: %w", err)
@@ -385,42 +478,13 @@ func (s *script) stopContainers() (func() error, error) {
}

return func() error {
servicesRequiringUpdate := map[string]struct{}{}

var restartErrors []error
for _, container := range stoppedContainers {
if swarmServiceName, ok := container.Labels["com.docker.swarm.service.name"]; ok {
servicesRequiringUpdate[swarmServiceName] = struct{}{}
continue
}
if err := s.cli.ContainerStart(context.Background(), container.ID, types.ContainerStartOptions{}); err != nil {
restartErrors = append(restartErrors, err)
}
}

if len(servicesRequiringUpdate) != 0 {
services, _ := s.cli.ServiceList(context.Background(), types.ServiceListOptions{})
for serviceName := range servicesRequiringUpdate {
var serviceMatch swarm.Service
for _, service := range services {
if service.Spec.Name == serviceName {
serviceMatch = service
break
}
}
if serviceMatch.ID == "" {
return fmt.Errorf("stopContainers: couldn't find service with name %s", serviceName)
}
serviceMatch.Spec.TaskTemplate.ForceUpdate += 1
if _, err := s.cli.ServiceUpdate(
context.Background(), serviceMatch.ID,
serviceMatch.Version, serviceMatch.Spec, types.ServiceUpdateOptions{},
); err != nil {
restartErrors = append(restartErrors, err)
}
}
}

if len(restartErrors) != 0 {
return fmt.Errorf(
"stopContainers: %d error(s) restarting containers and services: %w",
@@ -430,7 +494,7 @@ func (s *script) stopContainers() (func() error, error) {
}
s.logger.Info(
fmt.Sprintf(
"Restarted %d container(s) and the matching service(s).",
"Restarted %d container(s).",
len(stoppedContainers),
),
)
3 changes: 3 additions & 0 deletions go.mod
@@ -22,9 +22,11 @@ require (
)

require (
github.com/Azure/go-ansiterm v0.0.0-20170929234023-d6e3b3328b78 // indirect
github.com/cloudflare/circl v1.3.7 // indirect
github.com/golang-jwt/jwt/v5 v5.0.0 // indirect
github.com/golang/protobuf v1.5.3 // indirect
golang.org/x/time v0.0.0-20220609170525-579cf78fd858 // indirect
google.golang.org/appengine v1.6.7 // indirect
google.golang.org/protobuf v1.31.0 // indirect
)
@@ -35,6 +37,7 @@ require (
github.com/AzureAD/microsoft-authentication-library-for-go v1.1.1 // indirect
github.com/Microsoft/go-winio v0.5.2 // indirect
github.com/ProtonMail/go-crypto v0.0.0-20230717121422-5aa5874ade95
github.com/docker/cli v24.0.7+incompatible
github.com/docker/distribution v2.8.2+incompatible // indirect
github.com/docker/go-connections v0.4.0 // indirect
github.com/docker/go-units v0.4.0 // indirect
3 changes: 3 additions & 0 deletions go.sum
@@ -255,6 +255,8 @@ github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/dnaeon/go-vcr v1.2.0 h1:zHCHvJYTMh1N7xnV7zf1m1GPBF9Ad0Jk/whtQ1663qI=
github.com/dnaeon/go-vcr v1.2.0/go.mod h1:R4UdLID7HZT3taECzJs4YgbbH6PIGXB6W/sc5OLb6RQ=
github.com/docker/cli v24.0.7+incompatible h1:wa/nIwYFW7BVTGa7SWPVyyXU9lgORqUb1xfI36MSkFg=
github.com/docker/cli v24.0.7+incompatible/go.mod h1:JLrzqnKDaYBop7H2jaqPtU4hHvMKP+vjCwu2uszcLI8=
github.com/docker/distribution v2.8.2+incompatible h1:T3de5rq0dB1j30rp0sA2rER+m322EBzniBPB6ZIzuh8=
github.com/docker/distribution v2.8.2+incompatible/go.mod h1:J2gT2udsDAN96Uj4KfcMRqY0/ypR+oyYUYmja8H+y+w=
github.com/docker/docker v24.0.7+incompatible h1:Wo6l37AuwP3JaMnZa226lzVXGA3F9Ig1seQen0cKYlM=
@@ -1259,6 +1261,7 @@ gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C
gopkg.in/yaml.v3 v3.0.0-20210107192922-496545a6307b/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gotest.tools v2.2.0+incompatible h1:VsBPFP1AI068pPrMxtb/S8Zkgf9xEmTLJjfM+P5UIEo=
gotest.tools v2.2.0+incompatible/go.mod h1:DsYFclhRJ6vuDpmuTbkuFWG+y2sxOXAzmJt81HFBacw=
gotest.tools/v3 v3.0.2/go.mod h1:3SzNCllyD9/Y+b5r9JIKQ474KzkZyqLqEfYqMsX94Bk=
gotest.tools/v3 v3.0.3 h1:4AuOwCGf4lLR9u3YOe2awrHygurzhO/HeQ6laiA6Sx0=
8 changes: 4 additions & 4 deletions test/swarm/docker-compose.yml
@@ -41,11 +41,11 @@ services:

offen:
image: offen/offen:latest
labels:
- docker-volume-backup.stop-during-backup=true
healthcheck:
disable: true
deploy:
labels:
- docker-volume-backup.stop-during-backup=true
replicas: 2
restart_policy:
condition: on-failure
@@ -54,11 +54,11 @@
image: postgres:14-alpine
environment:
POSTGRES_PASSWORD: example
labels:
- docker-volume-backup.stop-during-backup=true
volumes:
- pg_data:/var/lib/postgresql/data
deploy:
labels:
- docker-volume-backup.stop-during-backup=true
restart_policy:
condition: on-failure
