AKS: edge case in Azure NPM policy enforcement #2786

Open
oOraph opened this issue Jun 13, 2024 · 8 comments
Labels
linux npm Related to NPM.

Comments

@oOraph

oOraph commented Jun 13, 2024

What happened:
Azure Network Policy Manager (NPM) does not enforce the defined network policies for traffic from a pod to the node it runs on.

For example, if you define a network policy that filters out all egress traffic from the pod, traffic toward the local node's private IP (not the public one, if any) is not filtered.

Consequently, any service listening on the node's private IP can be reached from the pod (containerd, kubelet, ssh…).
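
As a rough way to check which node services a pod can reach, here is a minimal sketch of a TCP probe. It assumes Python 3 is available in the pod image (it is not installed in `ubuntu:latest` by default), and the function name `tcp_port_open` is made up for illustration; it is not part of NPM or any Azure tooling.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: all count as "not reachable".
        return False
```

From inside the pod, `tcp_port_open("10.224.0.4", 10250)` and `tcp_port_open("10.224.0.4", 22)` would probe the kubelet and ssh ports on the node IP used in this report. A `True` result only means the TCP handshake completed; it says nothing about authentication on the service.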

What you expected to happen:

All specified traffic should be filtered out properly, with no exceptions (other than the ones requested by the customer).

How to reproduce it:

  • Spawn an AKS cluster with Azure CNI and Azure Network Policy Manager for policy enforcement
  • Once the cluster is up, connect to it and apply the following two manifests
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: np1
  namespace: default 
spec:
  egress:
  - ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
    to:
    - ipBlock:
        cidr: 0.0.0.0/0
  - ports:
    - port: 80
      protocol: TCP
    - port: 443
      protocol: TCP
    - port: 22
      protocol: TCP
    - endPort: 65535
      port: 1024
      protocol: TCP
    - endPort: 65535
      port: 1024
      protocol: UDP
    to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.0.0.0/8
        - 172.16.0.0/12
        - 192.168.0.0/16
        - 169.254.169.254/32
  podSelector:
    matchExpressions:
    - key: test
      operator: Exists
  policyTypes:
  - Egress
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: "true"
  name: test 
  namespace: default
spec:
  containers:
  - image: ubuntu:latest
    imagePullPolicy: Always
    command:
    - sleep
    - infinity
    name: main
  terminationGracePeriodSeconds: 0
  • Get the node's private IP
$ k get pods -o wide
NAME        READY   STATUS    RESTARTS   AGE     IP             NODE                                NOMINATED NODE   READINESS GATES
test   1/1     Running   0          9m28s   10.224.0.110   aks-agentpool-31351106-vmss000000   <none>           <none>
$ k get node aks-agentpool-31351106-vmss000000 -o wide
NAME                                STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-agentpool-31351106-vmss000000   Ready    agent   13m   v1.28.9   10.224.0.4    20.231.2.119   Ubuntu 22.04.4 LTS   5.15.0-1064-azure   containerd://1.7.15-1
  • Go into the pod and verify that traffic toward the local node's private IP is let through
$ k exec -it test -- /bin/bash
# apt-get update && apt-get install -y curl netcat-openbsd
# curl --insecure https://10.224.0.4:10250/pods
Unauthorized
# curl --insecure https://10.224.0.4:10250
404 page not found
# nc 10.224.0.4 22
  • Repeat the same steps with the Calico network policy plugin instead, to verify that the policy is well defined and filters egress correctly
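
To make the expected verdicts explicit, here is a small sketch that evaluates the TCP rules of `np1` by hand. It is not how NPM or Calico implement enforcement; it only restates the manifest's two egress rules, and the names `ip_block_matches` and `np1_allows_tcp` are hypothetical.

```python
import ipaddress

# Destination block copied from the second egress rule of np1.
ALLOW_CIDR = "0.0.0.0/0"
EXCEPT_CIDRS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "169.254.169.254/32"]
# TCP ports of that rule: three single ports plus the 1024-65535 range.
TCP_PORT_RANGES = [(80, 80), (443, 443), (22, 22), (1024, 65535)]


def ip_block_matches(dest: str, cidr: str, excepts: list) -> bool:
    """True if dest is inside cidr and outside every except range."""
    addr = ipaddress.ip_address(dest)
    if addr not in ipaddress.ip_network(cidr):
        return False
    return not any(addr in ipaddress.ip_network(e) for e in excepts)


def np1_allows_tcp(dest: str, port: int) -> bool:
    """True if np1 should allow TCP egress from a selected pod to dest:port."""
    # First rule: port 53 is allowed to any IPv4 address.
    if port == 53:
        return True
    # Second rule: listed ports, but only to destinations outside the excepts.
    in_ports = any(lo <= port <= hi for lo, hi in TCP_PORT_RANGES)
    return in_ports and ip_block_matches(dest, ALLOW_CIDR, EXCEPT_CIDRS)
```

Under this reading, `np1_allows_tcp("10.224.0.4", 10250)` and `np1_allows_tcp("10.224.0.4", 22)` are both `False`, because the node's private IP falls inside the `10.0.0.0/8` exception. The `curl` and `nc` connections above should therefore have been dropped.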

Kubernetes Version:

The default version offered by AKS at the time of reporting the issue (1.28 or so)

Kernel (e.g. uname -a):

The one shipped on AKS nodes

@huntergregory
Contributor

Hi @oOraph, thanks for authoring this issue. In general, NPM does enforce policies properly, but it sounds like you discovered an edge case with NPM. Trying to decipher this scenario: it seems like we can reduce the problem to a NetworkPolicy allowing egress to all IPs/ports except your Node's private IP? So the NetworkPolicy should drop traffic from the Pod to its Node? Please let me know if I misinterpreted.

@oOraph
Author

oOraph commented Jun 17, 2024

You're right. Any policy that allows everything except something on the local node will show the issue: the policy won't be enforced for pods deployed on that node, while for pods on other nodes the filtering is effective.
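
A minimal sketch of such a reduced policy, generated as JSON so it can be piped to `kubectl apply -f -`. The policy name `allow-all-but-node` and the helper `reduced_policy` are made up for illustration; the pod selector is reused from `np1`, and the node IP is the one from this report.

```python
import ipaddress
import json


def reduced_policy(node_ip: str, name: str = "allow-all-but-node",
                   namespace: str = "default") -> dict:
    """Build a NetworkPolicy allowing all egress except to node_ip."""
    ipaddress.IPv4Address(node_ip)  # raises ValueError on a malformed address
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Same selector as np1: any pod carrying a "test" label.
            "podSelector": {
                "matchExpressions": [{"key": "test", "operator": "Exists"}]
            },
            "policyTypes": ["Egress"],
            # No "ports" key, so the rule covers every port and protocol.
            "egress": [
                {"to": [{"ipBlock": {"cidr": "0.0.0.0/0",
                                     "except": [f"{node_ip}/32"]}}]}
            ],
        },
    }


print(json.dumps(reduced_policy("10.224.0.4"), indent=2))
```

With this policy applied, a pod on the node with private IP 10.224.0.4 should no longer reach that IP. According to this report, with Azure NPM it still can.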


github-actions bot commented Jul 2, 2024

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jul 2, 2024
@oOraph
Author

oOraph commented Jul 2, 2024

Anti-stale comment.

@rbtr rbtr removed the stale Stale due to inactivity. label Jul 8, 2024
@huntergregory
Contributor

Hi @oOraph, would you be able to validate your scenario on an AKS cluster with Cilium? If that solves your problem, we would recommend using Cilium to enforce your network policies going forward.

@huntergregory huntergregory changed the title from "AKS: npm not enforcing policies properly" to "AKS: edge case in Azure NPM policy enforcement" Jul 10, 2024
@huntergregory huntergregory added npm Related to NPM. linux labels Jul 11, 2024
@oOraph
Author

oOraph commented Jul 25, 2024

@huntergregory I tested with Calico and could not reproduce the issue. I did not test with Cilium, but I would bet it is not affected either, as many people use it with Kubernetes for policy enforcement. Also note that switching the network policy manager on an existing AKS cluster is not possible. One needs to remove it first (leaving the cluster with no policy enforcement during the migration), then select the new one, with no node pool rolling upgrade, causing workload downtime...

@huntergregory
Contributor

> Also note that switching the np manager on an existing aks cluster is not possible

Please reference this documentation: Upgrade an existing cluster to Azure CNI Powered by Cilium

@oOraph
Author

oOraph commented Jul 26, 2024

> > Also note that switching the np manager on an existing aks cluster is not possible
>
> Please reference this documentation: Upgrade an existing cluster to Azure CNI Powered by Cilium

I should have specified "not possible without workload downtime and, worse, policy enforcement downtime" (see the note and warning on the doc page you point to).
