
GKE creation failure requires manual cleanup #2700

Open
rshade opened this issue Nov 27, 2024 · 9 comments
Assignees
Labels
awaiting/bridge The issue cannot be resolved without action in pulumi-terraform-bridge. kind/bug Some behavior is incorrect or out of spec

Comments

@rshade (Contributor) commented Nov 27, 2024

Hello!

  • Vote on this issue by adding a 👍 reaction
  • If you want to implement this feature, comment to let us know (we'll work with you on design, scheduling, etc.)

Issue details

Currently, when a resource like a GKE cluster fails to create, it is marked as tainted and not saved to state. You have to go into the cloud console, delete the resource, and run `pulumi up` again. We should support tainted resources by saving them to state, marking them for replacement, and setting them to `DeleteBeforeReplace`.
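The proposed handling can be sketched as a small, self-contained simulation. All names here are hypothetical, not actual Pulumi engine code: the point is that on a partial creation failure the resource is persisted to state and flagged for delete-before-replace rather than dropped.

```python
# Hypothetical sketch of the proposed behavior (not actual Pulumi engine
# code): when a create fails after the cloud resource already has an ID,
# persist it to state and mark it delete-before-replace instead of
# dropping it, which would force manual cleanup in the cloud console.

def create_resource(provider_create, state):
    """Attempt a create; on partial failure, record the resource anyway."""
    resource_id, error = provider_create()
    if resource_id is not None and error is not None:
        # Partial creation: the resource exists in the cloud. Save it so
        # the next `pulumi up` can replace it automatically.
        state[resource_id] = {
            "status": "tainted",
            "pending_replace": True,
            "delete_before_replace": True,
        }
    elif error is None:
        state[resource_id] = {"status": "created"}
    return resource_id, error

# Simulated provider call mirroring the GKE CIDR-conflict failure:
# an ID is assigned, but creation ultimately fails.
def failing_create():
    return "clusters/gke-fail-2", "master_ipv4_cidr_block conflict"

state = {}
rid, err = create_resource(failing_create, state)
print(state[rid])  # the failed cluster is kept in state, not lost
```

With this behavior, the failed cluster stays in state and the next `pulumi up` can delete and recreate it without a trip to the cloud console.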

Affected area/feature

@rshade rshade added kind/enhancement Improvements or new features needs-triage Needs attention from the triage team labels Nov 27, 2024
@Frassle Frassle added area/providers and removed needs-triage Needs attention from the triage team labels Nov 27, 2024
@Frassle (Member) commented Nov 27, 2024

they are marked as tainted

Is this tfbridge-specific behavior?

@VenelinMartinov (Contributor)

I think this is a TF feature, TF's variant of partial-creation errors. @rshade, can you please raise an issue with the specific use case you have here? I do not believe Pulumi should interact with TF taint in any way, as taint is only present in the TF state.

@VenelinMartinov VenelinMartinov transferred this issue from pulumi/pulumi Dec 2, 2024
@VenelinMartinov VenelinMartinov changed the title Support for Tainted Resources GKE creation failure requires manual cleanup Dec 2, 2024
@VenelinMartinov (Contributor)

@rshade, can you please add a repro here, as well as corresponding TF code showing that TF does not have the same problem?

@VenelinMartinov VenelinMartinov added needs-repro Needs repro steps before it can be triaged or fixed awaiting-feedback Blocked on input from the author labels Dec 2, 2024
@rshade (Contributor, Author) commented Dec 2, 2024

TF State:

    {
      "mode": "managed",
      "type": "google_container_cluster",
      "name": "cluster2",
      "provider": "provider[\"registry.terraform.io/hashicorp/google\"]",
      "instances": [
        {
          "status": "tainted",
          "schema_version": 2,
          "attributes": {
            "addons_config": null,
            "allow_net_admin": null,
            "authenticator_groups_config": null,
            "binary_authorization": [],
            "cluster_autoscaling": null,
            "cluster_ipv4_cidr": null,
            "confidential_nodes": null,
            "control_plane_endpoints_config": null,
            "cost_management_config": null,
            "database_encryption": null,
            "datapath_provider": null,
            "default_max_pods_per_node": null,
            "default_snat_status": null,
            "deletion_protection": false,
            "description": "GKE Cluster for testing Terraform failure mode",
            "dns_config": [],
            "effective_labels": {
              "goog-terraform-provisioned": "true"
            },
            "enable_autopilot": null,
            "enable_cilium_clusterwide_network_policy": false,
            "enable_intranode_visibility": null,
            "enable_k8s_beta_apis": [],
            "enable_kubernetes_alpha": false,
            "enable_l4_ilb_subsetting": false,
            "enable_legacy_abac": false,
            "enable_multi_networking": false,
            "enable_shielded_nodes": true,
            "enable_tpu": null,
            "endpoint": null,
            "fleet": [],
            "gateway_api_config": null,
            "id": "projects/pulumi-ce-team/locations/us-east1-b/clusters/huckstream-sbx-gke-fail-2",
            "identity_service_config": null,
            "initial_node_count": 2,
            "ip_allocation_policy": null,
            "label_fingerprint": null,
            "location": "us-east1-b",
            "logging_config": null,
            "logging_service": null,
            "maintenance_policy": [],
            "master_auth": null,
            "master_authorized_networks_config": null,
            "master_version": null,
            "mesh_certificates": null,
            "min_master_version": "1.30.5-gke.1014003",
            "monitoring_config": null,
            "monitoring_service": null,
            "name": "huckstream-sbx-gke-fail-2",
            "network": "default",
            "network_policy": [],
            "networking_mode": "VPC_NATIVE",
            "node_config": [
              {
                "advanced_machine_features": [],
                "boot_disk_kms_key": "",
                "confidential_nodes": [],
                "containerd_config": [],
                "disk_size_gb": 0,
                "disk_type": "",
                "effective_taints": [],
                "enable_confidential_storage": false,
                "ephemeral_storage_local_ssd_config": [],
                "fast_socket": [],
                "gcfs_config": [],
                "guest_accelerator": [],
                "gvnic": [],
                "host_maintenance_policy": [],
                "image_type": "",
                "kubelet_config": [],
                "labels": {},
                "linux_node_config": [],
                "local_nvme_ssd_block_config": [],
                "local_ssd_count": 0,
                "logging_variant": "",
                "machine_type": "e2-micro",
                "metadata": {},
                "min_cpu_platform": "",
                "node_group": "",
                "oauth_scopes": [
                  "https://www.googleapis.com/auth/cloud-platform"
                ],
                "preemptible": false,
                "reservation_affinity": [],
                "resource_labels": null,
                "resource_manager_tags": null,
                "secondary_boot_disks": [],
                "service_account": "",
                "shielded_instance_config": [],
                "sole_tenant_config": [],
                "spot": false,
                "storage_pools": null,
                "tags": null,
                "taint": [],
                "workload_metadata_config": []
              }
            ],
            "node_locations": [],
            "node_pool": null,
            "node_pool_auto_config": null,
            "node_pool_defaults": null,
            "node_version": null,
            "notification_config": null,
            "operation": null,
            "private_cluster_config": [
              {
                "enable_private_endpoint": false,
                "enable_private_nodes": true,
                "master_global_access_config": [
                  {
                    "enabled": true
                  }
                ],
                "master_ipv4_cidr_block": "172.16.0.0/28",
                "peering_name": "",
                "private_endpoint": "",
                "private_endpoint_subnetwork": "",
                "public_endpoint": ""
              }
            ],
            "private_ipv6_google_access": null,
            "project": null,
            "release_channel": null,
            "remove_default_node_pool": null,
            "resource_labels": null,
            "resource_usage_export_config": [],
            "secret_manager_config": [],
            "security_posture_config": null,
            "self_link": null,
            "service_external_ips_config": null,
            "services_ipv4_cidr": null,
            "subnetwork": null,
            "terraform_labels": {
              "goog-terraform-provisioned": "true"
            },
            "timeouts": null,
            "tpu_ipv4_cidr_block": null,
            "user_managed_keys_config": [],
            "vertical_pod_autoscaling": null,
            "workload_identity_config": null
          },
          "sensitive_attributes": [],
          "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjoyNDAwMDAwMDAwMDAwLCJkZWxldGUiOjI0MDAwMDAwMDAwMDAsInJlYWQiOjI0MDAwMDAwMDAwMDAsInVwZGF0ZSI6MzYwMDAwMDAwMDAwMH0sInNjaGVtYV92ZXJzaW9uIjoiMiJ9",
          "dependencies": [
            "google_container_cluster.cluster1"
          ]
        }
      ]
    }

@rshade (Contributor, Author) commented Dec 2, 2024

Terraform Code:

provider "google" {
  project = "PROJECT_ID" # Replace with your GCP project ID
  region  = "us-east1"
}

variable "namespace" {}
variable "environment" {}
variable "name" {}

locals {
  resource_name = "${var.namespace}-${var.environment}-${var.name}"
}

# First GKE Cluster
resource "google_container_cluster" "cluster1" {
  name        = "${local.resource_name}-1"
  location    = "us-east1-b"
  description = "GKE Cluster for testing Terraform failure mode"

  initial_node_count = 2
  min_master_version = "1.30.5-gke.1014003"

  node_config {
    machine_type = "e2-micro"
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  private_cluster_config {
    enable_private_nodes = true
    master_global_access_config {
      enabled = true
    }
    master_ipv4_cidr_block = "172.16.0.0/28"
  }

  deletion_protection = false
}

# Second GKE Cluster (with IP range conflict)
resource "google_container_cluster" "cluster2" {
  name        = "${local.resource_name}-2"
  location    = "us-east1-b"
  description = "GKE Cluster for testing Terraform failure mode"

  initial_node_count = 2
  min_master_version = "1.30.5-gke.1014003"

  node_config {
    machine_type = "e2-micro"
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  private_cluster_config {
    enable_private_nodes = true
    master_global_access_config {
      enabled = true
    }
    master_ipv4_cidr_block = "172.16.0.0/28" # Conflict CIDR Block
    # Uncomment the next line to repair the conflict
    # master_ipv4_cidr_block = "172.16.0.16/28"
  }

  deletion_protection = false

  # Ensure dependency on the first cluster
  depends_on = [google_container_cluster.cluster1]
}

output "endpoint_1" {
  value = google_container_cluster.cluster1.endpoint
}

output "endpoint_2" {
  value = google_container_cluster.cluster2.endpoint
}

@rshade (Contributor, Author) commented Dec 2, 2024

import pulumi
import pulumi_gcp as gcp

# Get configuration values
config = pulumi.Config()
namespace = config.require("namespace")
environment = config.require("environment")
name = config.require("name")

# Create the base resource name
resource_name = f"{namespace}-{environment}-{name}"

# Create a first GKE cluster as usual
cluster1 = gcp.container.Cluster(
    f"{resource_name}-1",
    name=f"{resource_name}-1",
    description="GKE Cluster for testing Pulumi failure mode",
    initial_node_count=2,
    min_master_version="1.30.5-gke.1014003",
    location="us-east1-b",
    node_config=gcp.container.ClusterNodeConfigArgs(
        machine_type="e2-micro",
        oauth_scopes=[
            "https://www.googleapis.com/auth/cloud-platform",
        ],
    ),
    private_cluster_config={
        "enable_private_nodes": True,
        "master_global_access_config": {
            "enabled": True,
        },
        "master_ipv4_cidr_block": "172.16.0.0/28",
    },
    deletion_protection=False,
)

# Create a second GKE cluster with IP range conflict to trigger failure mode
cluster2 = gcp.container.Cluster(
    f"{resource_name}-2",
    name=f"{resource_name}-2",
    description="GKE Cluster for testing Pulumi failure mode",
    initial_node_count=2,
    min_master_version="1.30.5-gke.1014003",
    location="us-east1-b",
    node_config=gcp.container.ClusterNodeConfigArgs(
        machine_type="e2-micro",
        oauth_scopes=[
            "https://www.googleapis.com/auth/cloud-platform",
        ],
    ),
    private_cluster_config={
        "enable_private_nodes": True,
        "master_global_access_config": {
            "enabled": True,
        },
        "master_ipv4_cidr_block": "172.16.0.0/28",  # Intentionally create IP range conflict - Comment out to attempt repair after failure
        # "master_ipv4_cidr_block": "172.16.0.16/28",  # Fix CIDR range to repair conflict - Uncomment to attempt to repair after failure
    },
    deletion_protection=False,
    # Force dependency to ensure that first cluster is up and running for conflict to occur
    opts=pulumi.ResourceOptions(depends_on=[cluster1]),
)


# Export the cluster endpoint and kubeconfig
pulumi.export("endpoint_1", cluster1.endpoint)
pulumi.export("endpoint_2", cluster2.endpoint)

@rshade rshade removed needs-repro Needs repro steps before it can be triaged or fixed awaiting-feedback Blocked on input from the author labels Dec 2, 2024
@VenelinMartinov VenelinMartinov added needs-triage Needs attention from the triage team kind/bug Some behavior is incorrect or out of spec and removed kind/enhancement Improvements or new features needs-triage Needs attention from the triage team labels Dec 2, 2024
@VenelinMartinov (Contributor)

Did some digging here. What is happening is that the TF Create call returns both an ID for the resource and a failure. In Pulumi land this is an initialization error, and I believe we have some handling of that.

https://github.com/pulumi/pulumi-terraform-bridge/blob/1c9c3e01c42fe6f91b0cdc56b66340f582a5fdbf/pkg/tfbridge/provider.go#L1387

and https://github.com/pulumi/pulumi-terraform-bridge/blob/1c9c3e01c42fe6f91b0cdc56b66340f582a5fdbf/pkg/tfbridge/provider.go#L1940

It's possible, however, that something is going wrong there. We need to do a bit more digging to root-cause this.
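As a rough illustration of that handling (hypothetical names, heavily simplified; not the actual bridge or engine code), a Create that returns both an ID and an error can be recorded as an initialization error, so the next `up` retries an update instead of losing the resource:

```python
# Simplified sketch (hypothetical names) of initialization-error handling:
# if Create returns an ID alongside an error, the engine keeps the
# partially created resource in state with the error attached, instead of
# discarding it.

class Engine:
    def __init__(self):
        self.state = {}

    def create(self, urn, provider_create):
        resource_id, props, err = provider_create()
        if resource_id is not None:
            # Record state even on failure; attach the error as an
            # initialization error so a later `up` can retry an update.
            self.state[urn] = {
                "id": resource_id,
                "props": props,
                "init_errors": [err] if err else [],
            }
        return err

engine = Engine()
err = engine.create(
    "urn:cluster2",
    lambda: ("cluster-2-id", {"status": "DEGRADED"}, "create operation failed"),
)
```

The bug report below suggests the bridge was not actually forwarding the state in this situation, so the engine never saw the partially created resource.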

@VenelinMartinov (Contributor)

This is a bridge issue: pulumi/pulumi-terraform-bridge#2696

There is also an engine issue if this is run with --refresh: pulumi/pulumi#17949

@VenelinMartinov VenelinMartinov added the awaiting/bridge The issue cannot be resolved without action in pulumi-terraform-bridge. label Dec 6, 2024
@VenelinMartinov VenelinMartinov self-assigned this Dec 6, 2024
VenelinMartinov added a commit to pulumi/pulumi-terraform-bridge that referenced this issue Dec 9, 2024
In the SDKv2 bridge under PlanResourceChange we are not passing any
state we receive during TF Apply back to the engine if we also received
an error. This causes us to incorrectly miss any resources which were
created but encountered errors during the creation process. The engine
should see these as `ResourceInitError`, which allows the engine to
attempt to update the partially created resource on the next `up`.

This PR fixes the issue by passing the state down to the engine in the
case when we receive an error and a non-nil state from TF during Apply.

related to pulumi/pulumi-gcp#2700
related to pulumi/pulumi-aws#4759

fixes #2696
VenelinMartinov added a commit to pulumi/pulumi-terraform-bridge that referenced this issue Dec 11, 2024
In the SDKv2 bridge under PlanResourceChange we are not passing any
state we receive during TF Apply back to the engine if we also received
an error. This causes us to incorrectly miss any resources which were
created but encountered errors during the creation process. The engine
should see these as ResourceInitError, which allows the engine to
attempt to update the partially created resource on the next up.

This PR fixes the issue by passing the state down to the engine in the
case when we receive an error and a non-nil state from TF during Apply.

This is the second attempt at this. The first was
#2695 but was
reverted because it caused a different panic:
#2706. We added
a regression test for that in
#2710

The reason for that panic was that we were now creating a non-nil `InstanceState` with a nil `stateValue`, which causes the `ID` function to panic. This PR fixes both issues by not allowing non-nil states with nil `stateValue`s and by preventing the panic in `ID`.

There was also a bit of fun with Go nil interfaces along the way, which is why `ApplyResourceChange` now returns a `shim.InstanceState` interface instead of a `*v2InstanceState2`. Otherwise we end up creating a non-nil interface with a nil value.

related to pulumi/pulumi-gcp#2700
related to pulumi/pulumi-aws#4759

fixes #2696
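The commit above describes the fix as passing any non-nil state through even when Apply errors. A minimal sketch of that pattern, in Python for illustration (the actual bridge code is Go, and these helper names are hypothetical):

```python
# Sketch of the bug and the fix described in the commit message above.

def apply_buggy(tf_apply):
    # Bug: state is discarded whenever Apply returns an error, so a
    # partially created resource never reaches the engine.
    state, err = tf_apply()
    if err is not None:
        return None, err
    return state, err

def apply_fixed(tf_apply):
    # Fix: pass any non-nil state back alongside the error; the engine
    # records it as a resource with an initialization error.
    state, err = tf_apply()
    return state, err

# A TF Apply that returns both partial state and an error, as in the
# GKE CIDR-conflict repro.
partial_apply = lambda: ({"id": "cluster-2"}, "create failed")

lost_state, _ = apply_buggy(partial_apply)   # state is dropped
kept_state, _ = apply_fixed(partial_apply)   # state reaches the engine
```

Returning the state together with the error is what lets the engine surface a `ResourceInitError` instead of leaving an orphaned cloud resource.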
@VenelinMartinov (Contributor)

Confirmed that this is fixed in the bridge in pulumi/pulumi-terraform-bridge#2713. Will be fixed here with the next bridge release.
