After update to 6.2.0: failed to retrieve kernel parameter #5711

Closed
kramerul opened this issue Jun 3, 2020 · 8 comments

@kramerul

kramerul commented Jun 3, 2020

Summary

After updating Concourse from 6.1.0 to 6.2.0, we get the following error message:

failed to retrieve kernel parameter "net.ipv4.tcp_keepalive_time": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory

In the worker logs, we can find:

{"timestamp":"2020-06-03T08:22:15.787525864Z","level":"error","source":"guardian","message":"guardian.create.containerizer-create.bundle-generate-failed","data":{"error":"failed to retrieve kernel parameter \"net.ipv4.tcp_keepalive_time\": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory","handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.3"}}
{"timestamp":"2020-06-03T08:22:15.787634290Z","level":"info","source":"guardian","message":"guardian.create.containerizer-create.finished","data":{"handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.3"}}
{"timestamp":"2020-06-03T08:22:15.787655597Z","level":"info","source":"guardian","message":"guardian.create.create-failed-cleaningup.start","data":{"cause":"failed to retrieve kernel parameter \"net.ipv4.tcp_keepalive_time\": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory","handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.4"}}
{"timestamp":"2020-06-03T08:22:15.787670221Z","level":"info","source":"guardian","message":"guardian.create.create-failed-cleaningup.destroy.started","data":{"cause":"failed to retrieve kernel parameter \"net.ipv4.tcp_keepalive_time\": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory","handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.4.1"}}
{"timestamp":"2020-06-03T08:22:15.787682661Z","level":"info","source":"guardian","message":"guardian.create.create-failed-cleaningup.destroy.delete.started","data":{"cause":"failed to retrieve kernel parameter \"net.ipv4.tcp_keepalive_time\": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory","handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.4.1.1"}}
{"timestamp":"2020-06-03T08:22:15.809938776Z","level":"info","source":"guardian","message":"guardian.create.create-failed-cleaningup.destroy.delete.state-failed-skipping-delete","data":{"cause":"failed to retrieve kernel parameter \"net.ipv4.tcp_keepalive_time\": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory","error":"runc state: runc: exit status 1: container \"da525953-d55c-4b88-643a-9ab36070049c\" does not exist","handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.4.1.1"}}
{"timestamp":"2020-06-03T08:22:15.809978760Z","level":"info","source":"guardian","message":"guardian.create.create-failed-cleaningup.destroy.delete.finished","data":{"cause":"failed to retrieve kernel parameter \"net.ipv4.tcp_keepalive_time\": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory","handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.4.1.1"}}
{"timestamp":"2020-06-03T08:22:15.809990217Z","level":"info","source":"guardian","message":"guardian.create.create-failed-cleaningup.destroy.finished","data":{"cause":"failed to retrieve kernel parameter \"net.ipv4.tcp_keepalive_time\": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory","handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.4.1"}}
{"timestamp":"2020-06-03T08:22:15.810004624Z","level":"error","source":"guardian","message":"guardian.create.create-failed-cleaningup.no-properties-for-container-skipping-destroy-network","data":{"cause":"failed to retrieve kernel parameter \"net.ipv4.tcp_keepalive_time\": open /proc/sys/net/ipv4/tcp_keepalive_time: no such file or directory","error":"property not found: kawasaki.host-interface","handle":"da525953-d55c-4b88-643a-9ab36070049c","session":"36.4"}}

This happens when starting a task.

Steps to reproduce

  • Update to 6.2.0
  • Run a task

Expected results

Task runs without errors.

Actual results

Additional context

Triaging info

  • Concourse version:
  • Browser (if applicable):
  • Did this used to work?
@kramerul kramerul added the bug label Jun 3, 2020
@kramerul
Author

kramerul commented Jun 3, 2020

After switching back to version 6.1.0, the problem still occurs.

@kramerul
Author

kramerul commented Jun 3, 2020

See also concourse/concourse-docker#61

@jamieklassen
Member

Can you give some more details about your system -- what kind of packaging are you using to deploy concourse (binary/systemd/bosh/helm/docker/etc)? What is the underlying OS/distribution? Any kernel configuration could be useful too.

While we're at it, can you provide the task you were using as well? You never know what details might be helpful.

@kramerul
Author

kramerul commented Jun 3, 2020

We are using docker-compose to run this Concourse instance on AWS.
The Docker node was set up using docker-machine (Ubuntu 16.04, Docker version 19.03.11):

docker-machine create -d amazonec2 \
    --amazonec2-region <redacted> --amazonec2-zone <redacted> --amazonec2-instance-type t3.xlarge \
    --amazonec2-root-size 500 \
    --amazonec2-security-group <redacted> \
    concourse

We used the following docker-compose file:

version: '3'

services:
  nginx:
    image: linuxserver/letsencrypt
    ports: ["443:443","80:80"]
    environment:
      URL:   <redacted>
      EMAIL: <redacted>
      VALIDATION: http

  db:
    image: postgres:11
    environment:
      POSTGRES_DB: <redacted>
      POSTGRES_USER: <redacted>
      POSTGRES_PASSWORD: <redacted>
    volumes: ["pgdata:/var/lib/postgresql/data"]

  web:
    image: concourse/concourse
    command: web
    links: [db]
    depends_on: [db]
    volumes:
      - "concourse-keys-web:/concourse-keys"
      - "users:/users"
    environment:
      CONCOURSE_EXTERNAL_URL: <redacted>
      CONCOURSE_POSTGRES_HOST: <redacted>
      CONCOURSE_POSTGRES_USER: <redacted>
      CONCOURSE_POSTGRES_PASSWORD: <redacted>
      CONCOURSE_POSTGRES_DATABASE: <redacted>
      CONCOURSE_ADD_LOCAL_USER: <redacted>
      CONCOURSE_MAIN_TEAM_CONFIG: /users/users.yml
      CONCOURSE_AWS_SECRETSMANAGER_ACCESS_KEY: <redacted>
      CONCOURSE_AWS_SECRETSMANAGER_SECRET_KEY: <redacted>
      CONCOURSE_AWS_SECRETSMANAGER_REGION: <redacted>
      CONCOURSE_GITHUB_CLIENT_ID: <redacted>
      CONCOURSE_GITHUB_CLIENT_SECRET: <redacted>
      CONCOURSE_GITHUB_HOST: <redacted>

  worker_1:
    image: concourse/concourse
    command: worker
    privileged: true
    depends_on: [web]
    volumes: ["concourse-keys-worker:/concourse-keys"]
    links: [web]
    stop_signal: SIGUSR2
    environment:
      CONCOURSE_TSA_HOST: web:2222

volumes:
  concourse-keys-worker:
    external: true
  concourse-keys-web:
    external: true
  pgdata:
  users:

@alexdulin

We ran into the same issue on AWS running Ubuntu 16.04 with Linux kernel version 4.4.0 (I believe that was the version). Upgrading to kernel version 4.15.0 solved the problem. You can check which kernel version you are running with uname -a.
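A minimal sketch for checking a host's kernel release against a minimum version. The helper name kernel_at_least is hypothetical, and the 4.5 threshold is an assumption taken from the Linux 4.5 release notes quoted further down in this thread.

```shell
# Hypothetical helper: succeeds (exit 0) if the kernel release is at least
# MAJOR.MINOR. Usage: kernel_at_least MAJOR MINOR [RELEASE]
# RELEASE defaults to the output of `uname -r`, e.g. "4.4.0-179-generic".
kernel_at_least() {
    want_major=$1
    want_minor=$2
    release=${3:-$(uname -r)}
    # Split off the leading "MAJOR.MINOR" and ignore everything after it.
    have_major=${release%%.*}
    rest=${release#*.}
    have_minor=${rest%%[!0-9]*}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

if kernel_at_least 4 5; then
    echo "kernel $(uname -r): 4.5 or newer"
else
    echo "kernel $(uname -r): older than 4.5"
fi
```

Run it on the Docker host itself: containers share the host's kernel, so the result is the same inside a container.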

@kramerul
Author

kramerul commented Jun 3, 2020

Thanks @alexdulin,

switching to Ubuntu 18.04 also fixed the problem on our side.

@kramerul kramerul closed this as completed Jun 3, 2020
@jamieklassen
Member

I can tell a slightly more detailed story here.

$ docker run --rm --entrypoint /usr/local/concourse/bin/gdn concourse/concourse:6.0.0 -v
90961f153e3c4eccf6c461e9efa5165ac454f47c
$ docker run --rm --entrypoint /usr/local/concourse/bin/gdn concourse/concourse:6.1.0 -v
51480bc73a282c02f827dde4851cc12265774272

The difference arises because Concourse v6.0.0 packaged garden-runc-release v1.19.10, which depends on cloudfoundry/guardian@90961f1 (https://github.com/cloudfoundry/garden-runc-release/tree/v1.19.10/src), whereas Concourse v6.1.0 and v6.2.0 packaged garden-runc-release v1.19.12, which depends on cloudfoundry/guardian@51480bc (https://github.com/cloudfoundry/garden-runc-release/tree/v1.19.12/src).

An important difference between v1.19.10 and v1.19.12 of garden-runc-release can be seen in the release notes for v1.19.11:

We now set net.ipv4.tcp_keepalive_time and related parameters automatically based on the host configuration, and allow overriding them via bosh properties (see cloudfoundry/guardian#165 (comment), and thanks @h0nIg, @arjenw, and @krumts for the suggestion).
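To illustrate how a parameter name relates to the path in the error message, here is a minimal shell sketch. It is not guardian's actual code: the helper names sysctl_path and read_sysctl are hypothetical, and the dot-to-slash mapping is simply the usual sysctl naming convention.

```shell
# Hypothetical helpers for illustration; not taken from guardian.

# Map a sysctl name such as net.ipv4.tcp_keepalive_time to its procfs path.
# Assumes no component of the name itself contains a dot.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print the current value of a sysctl; fails if the procfs entry is absent.
read_sysctl() {
    cat "$(sysctl_path "$1")"
}

name=net.ipv4.tcp_keepalive_time
if value=$(read_sysctl "$name" 2>/dev/null); then
    echo "$name = $value"
else
    echo "failed to retrieve kernel parameter \"$name\": cannot read $(sysctl_path "$name")"
fi
```

When the procfs entry is missing, the read fails in the same way as the "no such file or directory" error reported at the top of this issue.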

Looking at discussions like https://discuss.linuxcontainers.org/t/why-is-there-no-tcp-keepalive-under-lxd/891/5, I learned that the procfs entry for tcp_keepalive_time (and a few other ipv4/tcp settings) did not exist inside non-initial network namespaces, such as those of containers, in older versions of the Linux kernel. In fact, the release notes for Linux 4.5 list:

IPv4: Make TCP keepalive settings per-namespace commit, tcp_keepalive_probes sysctl knob commit, tcp_keepalive_intvl sysctl knob commit

Putting these together, garden-runc-release v1.19.11+ won't run correctly inside a container's network namespace on kernels older than Linux 4.5. Both @kramerul and @alexdulin mention that their docker hosts run Ubuntu 16.04, an LTS release that ships Linux 4.4: https://wiki.ubuntu.com/XenialXerus/ReleaseNotes#Linux_kernel_4.4.

We can therefore conclude that a Concourse v6.1.0+ worker will not run correctly inside a separate network namespace (most notably, a docker container) on kernel versions pre-4.5 -- and Ubuntu 16.04 LTS is a striking example of this. This should probably be called out in our release notes, and it might be useful for something similar to be declared on the guardian or garden-runc-release repos.
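A minimal sketch for checking whether the keepalive sysctls are visible where the worker runs (for example, inside the worker container). The function name missing_keepalive_sysctls and its optional procfs-root argument are hypothetical. The three parameter names are the ones mentioned in the release notes above, and it is an assumption that these are exactly the entries gdn reads.

```shell
# Hypothetical helper: prints the name of each TCP keepalive sysctl that has
# no procfs entry. The optional argument overrides the procfs sys root, which
# defaults to /proc/sys.
missing_keepalive_sysctls() {
    sys_root=${1:-/proc/sys}
    for name in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
        if [ ! -e "$sys_root/net/ipv4/$name" ]; then
            echo "net.ipv4.$name"
        fi
    done
}

missing=$(missing_keepalive_sysctls)
if [ -n "$missing" ]; then
    echo "missing TCP keepalive sysctls:"
    echo "$missing"
else
    echo "all TCP keepalive sysctls are visible"
fi
```

If any names are printed when this runs inside the worker container, container creation is likely to fail with the error reported in this issue.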

@jamieklassen
Member

I have updated the release notes for v5.5.11, v6.1.0, v6.2.0, and v6.3.0, which all bundle versions of gdn with the breaking change.
