Validate cluster access and bootstrapping

Now that your bootstrapping images have passed through quarantine and the AKS cluster has been deployed, the next step is to validate cluster access and bootstrapping results.

Expected results

Jump box access is validated

Because the cluster is private, it cannot be accessed directly from your workstation. You'll validate controlled jump box access for direct operations against the cluster when needed.

Bootstrapping results validation

Your cluster was deployed with Azure Policy and the Flux GitOps extension. You'll execute some commands to show how those are manifesting themselves in your cluster.

Steps

  1. Connect to a jump box node via Azure Bastion.

    If this is the first time you've used Azure Bastion, here is a detailed walkthrough of this process.

    1. Open the Azure Portal.
    2. Navigate to the rg-bu0001a0005-centralus resource group.
    3. Click on the Virtual Machine Scale Set resource named vmss-jumpboxes.
    4. Click Instances.
    5. Click the name of either of the two listed instances, such as vmss-jumpboxes_0.
    6. Click Connect -> Bastion -> Use Bastion.
    7. Fill in the username field with one of the users from your customized jumpBoxCloudInit.yml file, such as opsuser01.
    8. Select SSH Private Key from Local File and select your private key file (such as opsuser01.key) for that specific user.
    9. Provide your SSH passphrase in SSH Passphrase if your private key is protected with one.
    10. Click Connect.
    11. For enhanced "copy-on-select" and "paste-on-right-click" support, your browser may request your permission to enable those features. It's recommended that you Allow them. If you don't, you'll need to use the >> flyout on the screen to perform copy and paste actions.
    12. Welcome to your jump box!

    ⚠️ The jump box deployed in this walkthrough has only ephemeral disks attached, so content written to disk will not survive planned or unplanned restarts of the host. Never store anything of value on these jump boxes. They are expected to be fully ephemeral in nature, and in fact could be scaled to zero when not in use.
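
    If you prefer a terminal over the portal, the bastion extension for the Azure CLI can open an equivalent SSH session from your workstation, provided your Bastion host's SKU supports native client connections. The following is only a sketch; the Bastion host name and resource group placeholders, as well as the reuse of opsuser01 and its key file, are assumptions based on the values above.

    # Sketch: connect to the first jump box instance over Bastion from a local shell.
    # Requires the Azure CLI bastion extension and a Bastion SKU that supports
    # native client connections. Placeholders are assumptions, not deployed names.
    VMSS_INSTANCE_ID=$(az vmss list-instances -g rg-bu0001a0005-centralus -n vmss-jumpboxes --query '[0].id' -o tsv)
    az network bastion ssh --name <your-bastion-host> -g <your-bastion-resource-group> \
        --target-resource-id $VMSS_INSTANCE_ID --auth-type ssh-key \
        --username opsuser01 --ssh-key opsuser01.key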

  2. From your Azure Bastion connection, sign in to your Azure RBAC tenant and select your subscription.

    The following command will perform a device login. Ensure you're logging in with the Microsoft Entra user that has access to your AKS resources (that is, the one you performed your deployment with).

    az login
    # This will give you a link to https://microsoft.com/devicelogin where you can enter
    # the provided code and perform authentication.
    
    # Ensure you're on the correct subscription
    az account show
    
    # If not, select the correct subscription
    # az account set -s <subscription name or id>

    ⚠️ Your organization may have conditional access policies in place that forbid access to Azure resources [from non corporate-managed devices](https://learn.microsoft.com/entra/identity/conditional-access/concept-conditional-access-grant). This jump box, as deployed in these steps, might trigger that policy. If that is the case, you'll need to work with your IT security organization to provide an alternative access mechanism or temporary solution.
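
    If you want to double-check which identity and subscription the jump box session ended up with before continuing, the standard Azure CLI account commands can confirm it:

    # Confirm the signed-in user and the active subscription
    az ad signed-in-user show --query userPrincipalName -o tsv
    az account show --query '{name:name, id:id, tenantId:tenantId}' -o table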

  3. From your Azure Bastion connection, get your AKS credentials and set your kubectl context to your cluster.

    AKS_CLUSTER_NAME=$(az deployment group show -g rg-bu0001a0005-centralus -n cluster-stamp --query properties.outputs.aksClusterName.value -o tsv)
    
    az aks get-credentials -g rg-bu0001a0005-centralus -n $AKS_CLUSTER_NAME
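
    As an optional sanity check, you can confirm that the credentials were merged and that your kubectl context now points at the cluster:

    # Optional: confirm the merged kubeconfig now targets your AKS cluster
    kubectl config current-context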
  4. From your Azure Bastion connection, test cluster access and authenticate as a cluster admin user.

    The following command will force you to authenticate against your AKS cluster's control plane, which starts yet another device login flow. For this one (Azure Kubernetes Service Microsoft Entra Client), log in with a user that is a member of your cluster admin group in the Microsoft Entra tenant you selected for Kubernetes Cluster API RBAC. This is also where any Microsoft Entra Conditional Access policies would take effect if they had been applied, and ideally you would have first used PIM JIT access to be assigned to the admin group. Remember, the identity you log in with here is the identity you perform cluster control plane (Cluster API) management commands (such as kubectl) as.

    kubectl get nodes

    If all is successful you should see something like:

    NAME                                  STATUS   ROLES   AGE   VERSION
    aks-npinscope01-26621167-vmss000000   Ready    agent   20m   v1.26.x
    aks-npinscope01-26621167-vmss000001   Ready    agent   20m   v1.26.x
    aks-npooscope01-26621167-vmss000000   Ready    agent   20m   v1.26.x
    aks-npooscope01-26621167-vmss000001   Ready    agent   20m   v1.26.x
    aks-npsystem-26621167-vmss000000      Ready    agent   20m   v1.26.x
    aks-npsystem-26621167-vmss000001      Ready    agent   20m   v1.26.x
    aks-npsystem-26621167-vmss000002      Ready    agent   20m   v1.26.x
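
    To see which node pool each node belongs to, you can surface the agentpool label that AKS applies to its nodes:

    # Show node pool membership via the standard AKS agentpool label
    kubectl get nodes -L agentpool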
    

    ⌚ The access tokens obtained in the prior two steps are subject to a Microsoft identity platform TTL (usually six hours). If your az or kubectl commands start erroring out after hours of usage with a message related to permission/authorization, you'll need to re-execute az login and az aks get-credentials (overwriting your context) to refresh those tokens, as sketched below.
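
    A minimal refresh sequence looks like the following; --overwrite-existing replaces the stale context that az aks get-credentials previously wrote.

    # Refresh expired tokens and replace the stale kubectl context
    az login
    az aks get-credentials -g rg-bu0001a0005-centralus -n $AKS_CLUSTER_NAME --overwrite-existing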

  5. From your Azure Bastion connection, confirm admission policies are applied to the AKS cluster.

    Azure Policy was configured in the cluster deployment with a set of starter policies, and your cluster's pods are covered by the Azure Policy add-on for AKS. Some of these policies might result in the denial of a specific Kubernetes API request operation to ensure the pod's specification is compliant with your organization's security best practices. Azure Policy also generates data that helps the app team assess the current compliance state of the AKS cluster. You've assigned the Azure Policy for Kubernetes built-in restricted initiative as well as ten more built-in individual Azure policies that require pods to declare resource requests, define trusted container registries, allow root filesystem access only in read-only mode, enforce the usage of internal load balancers, and enforce HTTPS-only Kubernetes Ingress objects.

    kubectl get ConstraintTemplate

    Output similar to the one shown below should be returned:

    NAME                                     AGE
    k8sazureallowedcapabilities              21m
    k8sazureallowedseccomp                   21m
    … more …
    k8sazureserviceallowedports              21m
    k8sazurevolumetypes                      21m
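
    To see how one of these templates is being applied, you can inspect the template and the constraints instantiated from it. These are standard Gatekeeper resources, and Gatekeeper registers its constraint kinds under a shared category, so kubectl get constraints lists them all; the template name below is just one from the listing above.

    # Inspect one template and list the constraints created from the templates
    kubectl describe constrainttemplate k8sazureallowedcapabilities
    kubectl get constraints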
    
  6. From your Azure Bastion connection, validate bootstrapping.

    Validate that the Flux extension has bootstrapped your cluster.

    kubectl get namespaces

    You should see namespaces created as described in your manifest files, such as a0005-i.

    NAME                        STATUS   AGE
    a0005-i                     Active   79m
    a0005-o                     Active   79m
    cluster-baseline-settings   Active   79m
    default                     Active   107m
    falco-system                Active   79m
    flux-system                 Active   87m
    gatekeeper-system           Active   107m
    ingress-nginx               Active   87m
    kube-node-lease             Active   107m
    kube-public                 Active   107m
    kube-system                 Active   107m
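
    Beyond the namespaces themselves, you can check the Flux extension's own reconciliation status. The resource kinds below are the standard GitOps toolkit CRDs; which controllers (and therefore which kinds) exist depends on how the extension was configured.

    # Confirm that Flux's source and kustomization reconciliations are healthy
    kubectl get gitrepositories,kustomizations -A

    # Or view the Flux configuration from the Azure control plane
    az k8s-configuration flux list -g rg-bu0001a0005-centralus --cluster-name $AKS_CLUSTER_NAME --cluster-type managedClusters -o table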
    

Configuration needs outside of GitOps

There is a subnet allocated in this reference implementation called snet-management-agents specifically for your build agents to perform unattended, "last mile" deployment and configuration tasks for your cluster. There is no compute deployed to that subnet, but typically this is where you'd put a Virtual Machine Scale Set acting as a GitHub Actions self-hosted runner or Azure DevOps self-hosted agent pool. This compute should be a hardened, minimal installation, and it should be monitored. Just like a jump box, this compute spans two distinct security zones; in this case, unattended, externally managed GitHub workflow definitions and your cluster.

In summary, there should be a stated objective in your team that routine processes (especially bootstrapping new clusters) are never performed by a human via direct access tools like kubectl -- zero kubectl starting from Day 0. Any ad hoc human interaction with the cluster introduces risk and audit trail concerns. Obviously, live-site issues will require humans to perform out-of-the-norm practices to triage and address critical issues. Reserve your usage of direct access tooling like this for those times.

GitOps configuration

The GitOps implementation in this reference architecture is intentionally simplistic. Flux is configured to simply monitor manifests in ALL namespaces. It doesn't account for concepts like:

  • Multitenancy

  • Private GitHub repos

  • Kustomization under/overlays

  • Additional Flux controllers, such as the Helm controller. These are explicitly disabled to reduce the extension's surface area in the regulated cluster. Enable only those controllers you need from within the mcFlux_extension resource in your ARM template. A CLI-based sketch of an equivalent configuration follows this list.
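
As a point of reference only, the same kind of Flux configuration the ARM template defines could be expressed with the Azure CLI. The repository URL, branch, and kustomization path below are placeholders, not the values used by this implementation.

    # Sketch: defining a Flux (GitOps) configuration via the CLI instead of ARM.
    # URL, branch, and path are placeholders for your own repository layout.
    az k8s-configuration flux create \
        -g rg-bu0001a0005-centralus \
        --cluster-name $AKS_CLUSTER_NAME \
        --cluster-type managedClusters \
        --name bootstrap \
        --namespace flux-system \
        --url https://github.com/<your-org>/<your-repo> \
        --branch main \
        --kustomization name=unified path=./cluster-manifests prune=true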

This reference implementation isn't going to dive into the nuances of Git manifest organization. A lot of that depends on your namespacing, multitenant needs, multistage (dev, preproduction, production) deployment needs, multicluster needs, and so on. The key takeaway here is to ensure that you're managing your Kubernetes resources in a declarative manner with a reconcile loop, to achieve desired state configuration within your cluster. Ensuring your cluster's internals are managed by a single, appropriately privileged, observable pipeline will aid in compliance. You'll have a Git trail that aligns with a log trail from your GitOps toolkit.

Public dependencies

As with any dependency your cluster or workload has, you'll want to minimize or eliminate reliance on services for which you do not have an SLO or that do not meet your observability/compliance requirements. Your cluster's GitOps operators should rely on a Git repository that satisfies your reliability and compliance requirements. Consider using a Git mirroring approach to make your cluster dependencies "network local" and provide a fault-tolerant syncing mechanism from centralized working repositories (like your organization's GitHub Enterprise private repositories). Following an approach like this effectively air gaps Git repositories as an external dependency, at the cost of added complexity.
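
A mirroring job, run on compute inside your network, could be as simple as the following; both URLs are placeholders for your organization's repositories and are not part of this implementation.

    # Sketch: maintain a network-local, read-only mirror of an upstream Git repo.
    # Both URLs are placeholders for your organization's repositories.
    git clone --mirror https://github.com/<your-org>/<upstream-repo>.git
    cd <upstream-repo>.git
    git remote set-url --push origin https://<your-internal-git-host>/<mirror-repo>.git
    git fetch --prune origin
    git push --mirror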

Security tooling

While Azure Kubernetes Service, Microsoft Defender for Containers, and Azure Policy offer a secure platform foundation, the inner workings of your cluster are more of a relationship between you and Kubernetes than between you and Azure. To that end, most customers bring their own security solutions that address their specific compliance and organizational requirements within their clusters. They often bring in holistic ISV solutions like Aqua Security, Prisma Cloud Compute, StackRox, Sysdig, or Tigera Enterprise, to name a few. These solutions offer a suite of added security and reporting controls for your platform, but they also come with their own licensing and support agreements.

Common features offered in ISV solutions like these:

  • File integrity monitoring (FIM)
  • Container-aware Anti-Virus (AV) solutions
  • CVE Detection against admission requests and already executing images
  • CVE Detection on Kubernetes configuration
  • Advanced network segmentation
  • Dangerous runtime container activity detection
  • Workload level CIS benchmark reporting
  • Managed network isolation and enhanced observability features (such as network flow visualizers)

Your dependency on, or choice of, in-cluster tooling to achieve your compliance needs cannot be reduced to a "one-size-fits-all" suggestion in this reference implementation. However, as a reminder of the need to solve for these, the Flux bootstrapping above deployed a placeholder FIM and AV solution. They do not function as a real FIM or AV; they are simply a visual reminder that you will likely need to bring a suitable solution to address compliance concerns.

This reference implementation includes Microsoft Defender for Containers as a solution for helping you secure your containers and runtime environment. Microsoft Defender for Containers provides security alerts at the cluster level and for the underlying cluster nodes by monitoring both API server activity and the behavior of the workloads themselves.

You can also consider installing third-party solutions to supplement Microsoft Defender for Containers as a defense-in-depth strategy. As an example, this reference implementation also installs a very basic deployment of Falco. It is not configured for alerts, nor tuned to any specific needs; it uses the default rules as they were defined when its manifests were generated. It is installed for illustrative purposes, and you're encouraged to evaluate whether a third-party solution like Falco is relevant to your situation. If so, in your final implementation, review and tune its deployment to fit your needs (such as adding custom rules for CVE detection, sudo usage, basic FIM, SSH connection monitoring, and NGINX containment). This tooling, like most security tooling, is highly privileged within your cluster, usually running as DaemonSets with access to the underlying node in a manner well beyond that of any typical workload. Remember to consider the runtime compute and networking requirements of your security tooling when sizing your cluster, as these are often overlooked during initial cluster sizing conversations.
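
If you want to see where those illustrative components landed after bootstrapping, a quick look at their workloads from your jump box is enough; the exact resource names depend on the manifests Flux synced into the falco-system and cluster-baseline-settings namespaces.

    # Inspect the illustrative security tooling deployed during bootstrapping.
    # Exact workload names depend on the manifests synced by Flux.
    kubectl get daemonsets -n falco-system
    kubectl get daemonsets,deployments -n cluster-baseline-settings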

It's worth noting that some customers with regulated workloads bring ISV or open-source security solutions to their clusters in addition to Microsoft Defender for Containers for maximum (and/or redundant) compliance coverage. Azure Kubernetes Service is a managed Kubernetes platform, but that does not imply that you will exclusively use Microsoft products or solutions to meet your requirements. For the most part, after the deployment of the infrastructure and some out-of-the-box add-ons (like Azure Policy, Azure Monitor, Microsoft Entra Workload ID, Open Service Mesh), you're in charge of what you choose to run in your hosted Kubernetes platform. Bring the business- and compliance-solving solutions you need to the cluster from the vast and ever-growing Kubernetes and CNCF ecosystem.

You should ensure all necessary tooling and related reporting/alerting is applied as part of your initial bootstrapping process, so that coverage exists immediately after cluster creation.

📓 For more information, see Azure Architecture Center guidance for PCI-DSS 3.2.1 Requirement 5 & 6 in AKS.

Next step

▶️ Deploy your workload