"failed to create volume", Concourse running in docker-compose on Linux #42

Open
barrucadu opened this issue Apr 12, 2019 · 15 comments

Comments

@barrucadu

barrucadu commented Apr 12, 2019

I've got Concourse running on a NixOS 18.03 VPS inside docker-compose, and this is working fine. I'm now trying to deploy exactly the same Concourse configuration to another NixOS 18.03 machine, but am not having any luck. I'm using the same docker-compose file and the same pipelines.

The new machine gives errors about being unable to create volumes:

Apr 12 21:55:49 nyarlathotep docker-compose[26088]: concourse_1  | {"timestamp":"2019-04-12T20:55:49.753780802Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"af97f489-2d27-4007-57b4-e5cb9c43e659","error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1.3","team":"main","volume":"e843e1a7-4122-494b-5397-d0a94294e418"}}
Apr 12 21:55:49 nyarlathotep docker-compose[26088]: concourse_1  | {"timestamp":"2019-04-12T20:55:49.793734883Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.failed-to-fetch-image-for-container","data":{"container":"af97f489-2d27-4007-57b4-e5cb9c43e659","error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1","team":"main"}}
Apr 12 21:55:49 nyarlathotep docker-compose[26088]: concourse_1  | {"timestamp":"2019-04-12T20:55:49.794088237Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.failed-to-initialize-new-container","data":{"error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1","team":"main"}}

The concoursefiles-git resource that it fails to create a volume for is a normal git resource. The other resources in the pipeline fail with the same error.

The pipeline is here: https://github.com/barrucadu/concoursefiles/blob/master/pipelines/ci.yml

This is the docker-compose file:

version: '3'

services:
  concourse:
    image: concourse/concourse
    command: quickstart
    privileged: true
    depends_on: [postgres, registry]
    ports: ["3003:8080"]
    environment:
      CONCOURSE_POSTGRES_HOST: postgres
      CONCOURSE_POSTGRES_USER: concourse
      CONCOURSE_POSTGRES_PASSWORD: concourse
      CONCOURSE_POSTGRES_DATABASE: concourse
      CONCOURSE_EXTERNAL_URL: "https://ci.nyarlathotep.barrucadu.co.uk"
      CONCOURSE_MAIN_TEAM_GITHUB_USER: "barrucadu"
      CONCOURSE_GITHUB_CLIENT_ID: "<omitted>"
      CONCOURSE_GITHUB_CLIENT_SECRET: "<omitted>"
      CONCOURSE_LOG_LEVEL: error
      CONCOURSE_GARDEN_LOG_LEVEL: error
    networks:
      - ci

  postgres:
    image: postgres
    environment:
      POSTGRES_DB: concourse
      POSTGRES_PASSWORD: concourse
      POSTGRES_USER: concourse
      PGDATA: /database
    networks:
      - ci
    volumes:
      - pgdata:/database

  registry:
    image: registry
    networks:
      ci:
        ipv4_address: "172.21.0.254"
        aliases: [ci-registry]
    volumes:
      - regdata:/var/lib/registry

networks:
  ci:
    ipam:
      driver: default
      config:
        - subnet: 172.21.0.0/16

volumes:
  pgdata:
  regdata:

I'm using the latest concourse/concourse image, as I set this up today. The version of docker is 18.09.2 (build 62479626f213818ba5b4565105a05277308587d5). What can I look at to help debug this?

@vito
Member

vito commented May 14, 2019

Are there any baggageclaim logs with more information?

@barrucadu
Author

Here's the log from the systemd unit running docker-compose: https://misc.barrucadu.co.uk/forever/e4355f6a-9b9e-449b-8263-196cc1222161/concourseci.log

There are a few baggageclaim errors:

{"timestamp":"2019-06-08T16:52:06.743477477Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1","session":"3.1.29.1"}}
{"timestamp":"2019-06-08T16:52:06.743542809Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1","privileged":true,"session":"3.1.29","strategy":{"type":"cow","volume":"50edbcad-c379-4e9f-4c2b-362041bcec32"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:06.743579749Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"ad59841f-1ce1-4d90-6b70-8700467701dd","session":"3.1.34.1"}}
{"timestamp":"2019-06-08T16:52:06.743608924Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"ad59841f-1ce1-4d90-6b70-8700467701dd","privileged":true,"session":"3.1.34","strategy":{"type":"cow","volume":"d1ad2edf-38b9-40f9-4048-da5300b5d0ab"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:07.151643149Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"281d33e4-c50a-408e-5895-b70dcddcfade","error":"failed to create volume","pipeline":"ci","resource":"ci-base-image","session":"18.1.2.1.1.3","team":"main","volume":"ad59841f-1ce1-4d90-6b70-8700467701dd"}}
{"timestamp":"2019-06-08T16:52:07.162617819Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"ec58bb4f-214b-4377-5d96-4c37462eab68","error":"failed to create volume","pipeline":"ci","resource":"ci-agent-image","session":"18.1.1.1.1.3","team":"main","volume":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1"}}
{"timestamp":"2019-06-08T16:52:07.280776831Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"86e8680d-a0cb-48fc-4906-53894aa351c6","session":"3.1.42.1"}}
{"timestamp":"2019-06-08T16:52:07.280816616Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"86e8680d-a0cb-48fc-4906-53894aa351c6","privileged":true,"session":"3.1.42","strategy":{"type":"cow","volume":"50edbcad-c379-4e9f-4c2b-362041bcec32"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:07.619068627Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"81b2354f-6914-4cff-613a-89616cded84a","error":"failed to create volume","pipeline":"ci","resource":"ci-resource-rsync-image","session":"18.1.3.1.1.3","team":"main","volume":"86e8680d-a0cb-48fc-4906-53894aa351c6"}}
{"timestamp":"2019-06-08T16:52:10.371909263Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"f9c2581b-8f75-4713-7346-4fa7ec29b455","session":"3.1.50.1"}}
{"timestamp":"2019-06-08T16:52:10.371946002Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"f9c2581b-8f75-4713-7346-4fa7ec29b455","privileged":false,"session":"3.1.50","strategy":{"type":"cow","volume":"05e99eb0-95a6-439b-6e08-9215476f7cc7"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:10.373695769Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"40452054-8f69-4694-7b9a-483c97c6ded6","error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1.3","team":"main","volume":"f9c2581b-8f75-4713-7346-4fa7ec29b455"}}
{"timestamp":"2019-06-08T16:52:18.362785417Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"ddfabe8d-4b23-44f7-598b-a9c30853eef3","session":"3.1.56.1"}}
{"timestamp":"2019-06-08T16:52:18.362824190Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"ddfabe8d-4b23-44f7-598b-a9c30853eef3","privileged":false,"session":"3.1.56","strategy":{"type":"cow","volume":"05e99eb0-95a6-439b-6e08-9215476f7cc7"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:18.987394396Z","level":"error","source":"atc","message":"atc.check-resource.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"713ee08b-487a-4158-6deb-69f48de6e58d","error":"failed to create volume","session":"367.3","volume":"ddfabe8d-4b23-44f7-598b-a9c30853eef3"}}

@vito
Member

vito commented Jun 12, 2019

Looks like a pretty low-level failure, possibly from an incompatibility with your kernel/OS stack - we haven't tested NixOS. 🤔 To get to the bottom of the 'invalid argument' error you'll probably need to run strace against the concourse worker process. Sorry the logs aren't super useful.

@caiges

caiges commented Jun 19, 2019

I'm getting this on Arch running in compose as well. Tried running the worker with strace but I didn't see anything that stood out.

@caiges

caiges commented Jun 19, 2019

Switching my Docker storage driver to vfs allows me to work around this issue but I don't think that's a solution. I haven't fully taken a dive into what's actually happening here.

EDIT: For some background, I'm building docker images as part of my pipeline.

@caiges

caiges commented Jun 19, 2019

@barrucadu setting:

CONCOURSE_WORK_DIR=/worker-state
CONCOURSE_WORKER_WORK_DIR=/worker-state

and adding a volume for the /worker-state directory in my worker's service configuration was necessary for baggageclaim to create volumes.

@barrucadu
Author

I tried setting CONCOURSE_WORKER_WORK_DIR, after adding a worker container (rather than using the quickstart command), giving this docker-compose file, but had the original problem.

I then tried switching to the overlay2 storage driver, but docker doesn't seem to support overlay2 on zfs (do you also use zfs, @caiges?):

Error starting daemon: error initializing graphdriver: backing file system is unsupported for this graph driver

Then I tried switching to the vfs storage driver, but still had the original problem.

@caiges

caiges commented Jun 21, 2019

I did a cursory search and couldn't find that CONCOURSE_WORKER_WORK_DIR is referenced anywhere. CONCOURSE_WORK_DIR does appear to be used.

I don't use ZFS but you could configure docker to use a different partition for its storage that supports "overlay2".

@vito
Member

vito commented Jun 26, 2019

> I'm getting this on Arch running in compose as well. Tried running the worker with strace but I didn't see anything that stood out.

FWIW I think you'd want to grep the output for EINVAL.

Here's a snippet that'll strip out a lot of noise:

strace -f -p (worker pid) -e '!futex,restart_syscall,epoll_wait,select,getdents64,close,sched_yield,epoll_ctl,accept4,setsockopt,getsockname'
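
If the trace is saved to a file (for example with strace's `-o` option), the EINVAL lines can also be pulled out with a few lines of Python instead of grep. This is a rough sketch that assumes strace's usual `name(args) = -1 ERRNO (description)` output format; the helper name `failed_calls` is hypothetical.

```python
import re

# Matches either a normal call, e.g. mount(...), or the tail of a call that was
# split across lines by -f, e.g. "<... mount resumed>".
SYSCALL = re.compile(r"(?:<\.\.\. (\w+) resumed>|\b(\w+)\()")


def failed_calls(lines, errno="EINVAL"):
    """Yield (syscall_name, line) for every strace line that failed with errno."""
    marker = f"= -1 {errno} "
    for line in lines:
        if marker not in line:
            continue
        match = SYSCALL.search(line)
        name = (match.group(1) or match.group(2)) if match else None
        yield name, line.rstrip("\n")
```

For example, `for name, line in failed_calls(open("trace.txt")): print(name, line)` would list each failing syscall with its arguments, where `trace.txt` is whatever file the trace was written to.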

@mikroskeem
Member

I'm running into the same issue - NixOS 19.09 and ZFS. I'll try debugging this...

@mikroskeem
Member

[ 395.180725] overlayfs: filesystem on '/workdir/overlays/14745864-d72c-4d46-4dd3-e03ffb3a8585' not supported as upperdir

So I assume that the worker strictly attempts to use overlayfs. I'm not entirely sure how Concourse works internally yet, but I'll try to feed the worker an ext4-based workdir hosted on a ZFS zvol instead.

@mikroskeem
Member

mikroskeem commented Nov 18, 2019

Yeah that seems to work.

  1. Create zvol with ext4

zfs create -V 10g rpool/concourse-workdir0-ext4
mkfs.ext4 /dev/zvol/rpool/concourse-workdir0-ext4

  2. Configure NixOS to mount it at /mnt/concourse-workdir0

Into configuration.nix, add:

fileSystems."/mnt/concourse-workdir0" = {
  device = "/dev/zvol/rpool/concourse-workdir0-ext4";
  fsType = "ext4";
};

  3. Configure worker to use given workdir
  • (bind?) mount host's /mnt/concourse-workdir0 to worker container's /workdir
  • set CONCOURSE_WORK_DIR to /workdir
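
To confirm which filesystem actually backs the work directory before and after a change like this, the mount table can be inspected. Below is a minimal sketch, assuming a Linux `/proc/mounts`-style table; the helper names `parse_mounts` and `filesystem_for` are made up for illustration, and mount points containing escaped spaces are not handled.

```python
import os


def parse_mounts(text):
    """Parse /proc/mounts-style text into a list of (mount_point, fs_type) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def filesystem_for(path, mounts):
    """Return (mount_point, fs_type) for the deepest mount point containing path."""
    path = os.path.normpath(path)
    best = None
    for mount_point, fs_type in mounts:
        mp = os.path.normpath(mount_point)
        inside = path == mp or path.startswith(mp.rstrip("/") + "/")
        # ">=" lets a later mount on the same mount point shadow an earlier one.
        if inside and (best is None or len(mp) >= len(best[0])):
            best = (mp, fs_type)
    return best
```

Running `filesystem_for("/workdir", parse_mounts(open("/proc/mounts").read()))` inside the worker container should then report `ext4` rather than `zfs` for the work dir, assuming the bind mount from the last step is in place.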

@trevormarshall

We are seeing this error very frequently in the Spring Boot builds. We are running v5.7.2 on bosh-vsphere-esxi-ubuntu-xenial-go_agent 621.29 stemcell, using the overlay driver.

In web.stdout.log we have:

{"timestamp":"2020-01-23T15:35:59.923944142Z","level":"error","source":"atc","message":"atc.tracker.track.task-step.find-or-create-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"build":104720,"error":"failed to create volume","job":"build-pull-requests","job-id":2744,"pipeline":"spring-boot-2.3.x","session":"19.62686.7.31","step-name":"build-project","volume":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1"}}

In worker.stdout.log we see the baggageclaim error:

{"timestamp":"2020-01-23T15:35:55.212121511Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"exit status 1","handle":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1","session":"3.1.999394.1"}}
{"timestamp":"2020-01-23T15:35:55.299415431Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"exit status 1","handle":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1","privileged":false,"session":"3.1.999394","strategy":{"type":"import","path":"/var/vcap/data/worker/work/volumes/live/24ae1aac-852c-4c5c-414d-29088119c8a3/volume","follow_symlinks":false}}}

The error arrives at the end of builds. The pipelines use https://concourse-ci.org/tasks.html#task-caches to cache dependencies between runs.
https://github.com/spring-projects/spring-boot/blob/89237634c7931f275ddbddba176c7a826b1667cb/ci/tasks/build-project.yml#L7
When we query the volumes table by handle, we can confirm no record was created for 999ba5a8-f8a1-4e5d-5087-c5e3974e15e1.
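
The pairing of the two logs above can be automated: the `volume` field of the ATC error carries the same handle as the `handle` field of the baggageclaim errors. The following is a rough sketch under that assumption; `json_entries` and `correlate` are hypothetical helper names, not Concourse tooling.

```python
import json
from collections import defaultdict


def json_entries(lines):
    """Yield the JSON object found on each log line, skipping lines without one."""
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue
        try:
            yield json.loads(line[start:])
        except json.JSONDecodeError:
            continue


def correlate(web_lines, worker_lines):
    """Map each volume handle the ATC failed on to (error, strategy type) pairs
    reported by baggageclaim for that same handle."""
    failed = {
        entry["data"]["volume"]
        for entry in json_entries(web_lines)
        if entry.get("source") == "atc" and "volume" in entry.get("data", {})
    }
    by_handle = defaultdict(list)
    for entry in json_entries(worker_lines):
        data = entry.get("data", {})
        if entry.get("source") == "baggageclaim" and data.get("handle") in failed:
            strategy = data.get("strategy", {}).get("type")
            by_handle[data["handle"]].append((data.get("error"), strategy))
    return dict(by_handle)
```

Fed with the lines of web.stdout.log and worker.stdout.log, this returns one entry per failed volume, which shows at a glance whether the failures are all of one strategy type (here `import`) and carry the same underlying error.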

We considered underlying server load, so enabled container-placement-strategy-limit-active-tasks, which distributed things nicely (thank you!). Now that load seems fine, it is mainly the Spring Boot pipelines that have this issue in our multi-tenant https://ci.spring.io.

We can recreate all of the workers to make the issue go away for a few days, but it eventually comes back. We see a clear pattern of the error resurfacing after a number of green builds. This was reported in #concourse-operations.

@inkblot

inkblot commented Nov 26, 2020

I started seeing this error after upgrading from concourse 6.1.0 to 6.7.1. I have downgraded back to 6.1.0.

Only resources using custom resource types are affected. I am running my workers on Flatcar Linux (successor of the defunct CoreOS) as a docker container started by a systemd unit. I have tried setting the baggageclaim driver to overlay and naive with the same results as the default value. I have tried mounting a volume in the container and using it as the work directory with the same results.

The kernel is 5.4.77-flatcar. The filesystem is ext4 and there is plenty of space. Docker is version 19.03.12, running with defaults plus a registry mirror. Here is the systemd unit that I use to start the worker container:

[Unit]
Description=concourse-worker
After=network-online.target
After=docker.service
After=coreos-metadata.service
Requires=docker.service
Requires=coreos-metadata.service

[Service]
TimeoutStartSec=0
Restart=always
EnvironmentFile=/run/metadata/flatcar
ExecStartPre=-/usr/bin/docker stop -t 100 concourse-worker
ExecStartPre=-/usr/bin/docker rm concourse-worker
ExecStartPre=/usr/bin/docker pull concourse/concourse:6.7.1
ExecStartPre=/usr/bin/docker volume create worker-scratch
ExecStart=/usr/bin/docker run \
  --privileged \
  --name concourse-worker \
  --volume /stuff/concourse/worker:/concourse-keys:ro \
  --volume worker-scratch:/work \
  concourse/concourse:6.7.1 \
  worker \
  --tsa-host concourse-tsa.movealong.internal:2222 \
  --tsa-public-key /concourse-keys/tsa_host_key.pub \
  --tsa-worker-private-key /concourse-keys/worker_key \
  --work-dir /work \
  --ephemeral
ExecStop=/usr/bin/docker stop -t 100 concourse-worker
ExecStop=/usr/bin/docker volume rm worker-scratch

[Install]
WantedBy=multi-user.target

@inkblot

inkblot commented Nov 27, 2020

I've isolated the problem to the upgrade from 6.6.0 to 6.7.1. All Concourse minor versions from 6.1.0 through 6.6.0 are able to process resources with a custom resource type correctly.
