Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race when creating azure-vnet lock directory with permissions #2818

Open
QxBytes opened this issue Jun 26, 2024 · 12 comments · Fixed by #2819
Open

Race when creating azure-vnet lock directory with permissions #2818

QxBytes opened this issue Jun 26, 2024 · 12 comments · Fixed by #2819
Assignees
Labels
bug cni Related to CNI. exempt-stale Keep this fresh needs-backport Change needs to be backported to previous release trains

Comments

@QxBytes
Copy link
Contributor

QxBytes commented Jun 26, 2024

What happened:
On the first boot, no CNI binary is on the node, and so k8s creates the /var/run/azure-vnet directory with 0755 permissions automatically because it is a mount part of the azure-cns daemonset. Then the CNI is deployed.
The /var/run directory is not preserved between reboots.
Then, when the VM reboots, the CNI binary may run before k8s creates the /var/run/azure-vnet directory. When the CNI binary runs first, it creates the directory with 0644 permissions. This causes permission denied errors for the cns. Even if k8s creates/mounts the /var/run/azure-vnet directory later, it will see it already exists and won't recreate the directory with the 0755 permissions.

What you expected to happen:
The CNI binary should create the directory with 0755 permissions.

How to reproduce it:
Reboot the VM with the cns capabilities security context dropping all capabilities (so it doesn't bypass permission checks). There is a chance that the azure-cns pod will get stuck in crash loop backoff.

Orchestrator and Version (e.g. Kubernetes, Docker):

Operating System (Linux/Windows):

Kernel (e.g. uanme -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows):

Anything else we need to know?:
[Miscellaneous information that will assist in solving the issue.]

@QxBytes QxBytes added bug cni Related to CNI. labels Jun 26, 2024
@QxBytes QxBytes self-assigned this Jun 26, 2024
@QxBytes QxBytes added the needs-backport Change needs to be backported to previous release trains label Jun 26, 2024
@rbtr
Copy link
Contributor

rbtr commented Jun 27, 2024

  • @jpayne3506 can you update the CNS specs that we use in CI with
securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_ADMIN # only necessary for delegated IPAM/Cilium
      - NET_RAW # only necessary for delegated IPAM/Cilium

and make sure that we have a test that's rebooting the Node and verifying CNS functionality afterwards

@rbtr
Copy link
Contributor

rbtr commented Jun 27, 2024

https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods sounds like it would let the k8s chmod the directory when the CNS Pod mounts it as a Volume?

Copy link

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jul 17, 2024
@QxBytes QxBytes removed the stale Stale due to inactivity. label Jul 17, 2024
Copy link

github-actions bot commented Aug 1, 2024

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Aug 1, 2024
@QxBytes QxBytes removed the stale Stale due to inactivity. label Aug 3, 2024
Copy link

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Aug 18, 2024
@QxBytes
Copy link
Contributor Author

QxBytes commented Aug 20, 2024

Node Rebooting Test added in: #2901

@QxBytes QxBytes closed this as completed Aug 20, 2024
@jpayne3506 jpayne3506 reopened this Aug 20, 2024
@jpayne3506
Copy link
Contributor

jpayne3506 commented Aug 20, 2024

Backport PR can be created for release/v1.5, but the validate package within release/v1.4 needs to be updated to consume the work. Will leave this open until both have been merged

@github-actions github-actions bot removed the stale Stale due to inactivity. label Aug 21, 2024
Copy link

github-actions bot commented Sep 4, 2024

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Sep 4, 2024
@QxBytes QxBytes removed the stale Stale due to inactivity. label Sep 5, 2024
@rbtr
Copy link
Contributor

rbtr commented Sep 17, 2024

@QxBytes looks like this was fully backported, is it good to resolve?

@jpayne3506
Copy link
Contributor

jpayne3506 commented Sep 17, 2024

#2901 did not get backported to release/v1.4 as the testing relies on the validation package being updated. Leaving this open until that work is complete and all E2E within this release train can validate Linux properly.

- script: |
kubectl get pods -A -o wide
echo "Deploying test pods"
cd test/integration/load
go test -count 1 -timeout 30m -tags load -run ^TestLoad$ -tags=load -iterations=2 -scaleup=${{ parameters.scaleup }} -os=${{ parameters.os }}
cd ../../..
# Remove this once we have cniv1 support for validating the test cluster
echo "Validate State skipped for linux cniv1 for now"
if [ "${{parameters.os}}" == "windows" ]; then
make test-validate-state OS=${{ parameters.os }}
fi
kubectl delete ns load-test
displayName: "Validate State"
retryCountOnTaskFailure: 3

Copy link

github-actions bot commented Oct 2, 2024

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Oct 2, 2024
Copy link

github-actions bot commented Oct 9, 2024

Issue closed due to inactivity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 9, 2024
@rbtr rbtr reopened this Oct 9, 2024
@rbtr rbtr removed the stale Stale due to inactivity. label Oct 9, 2024
@QxBytes QxBytes added the exempt-stale Keep this fresh label Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug cni Related to CNI. exempt-stale Keep this fresh needs-backport Change needs to be backported to previous release trains
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants