
GKE creation failure requires manual cleanup #2700

Open
rshade opened this issue Nov 27, 2024 · 9 comments
Assignees
Labels
awaiting/bridge The issue cannot be resolved without action in pulumi-terraform-bridge. kind/bug Some behavior is incorrect or out of spec

Comments

@rshade (Contributor) commented Nov 27, 2024

Hello!

  • Vote on this issue by adding a 👍 reaction
  • If you want to implement this feature, comment to let us know (we'll work with you on design, scheduling, etc.)

Issue details

Currently, when a resource like a GKE cluster fails to create, it is marked as tainted and not saved to state. You have to go into the cloud console, delete the resource, and run `pulumi up` again. We should support tainted resources by saving them to state, marking them for replacement, and setting them to `DeleteBeforeReplace`.
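The proposed handling can be sketched as a small, self-contained simulation. All names here are hypothetical, not actual Pulumi engine code: the point is that on a partial creation failure the resource is persisted to state and flagged for delete-before-replace rather than dropped.

```python
# Hypothetical sketch of the proposed behavior (not actual Pulumi engine
# code): when a create fails after the cloud resource already has an ID,
# persist it to state and mark it delete-before-replace instead of
# dropping it, which would force manual cleanup in the cloud console.

def create_resource(provider_create, state):
    """Attempt a create; on partial failure, record the resource anyway."""
    resource_id, error = provider_create()
    if resource_id is not None and error is not None:
        # Partial creation: the resource exists in the cloud. Save it so
        # the next `pulumi up` can replace it automatically.
        state[resource_id] = {
            "status": "tainted",
            "pending_replace": True,
            "delete_before_replace": True,
        }
    elif error is None:
        state[resource_id] = {"status": "created"}
    return resource_id, error

# Simulated provider call mirroring the GKE CIDR-conflict failure:
# an ID is assigned, but creation ultimately fails.
def failing_create():
    return "clusters/gke-fail-2", "master_ipv4_cidr_block conflict"

state = {}
rid, err = create_resource(failing_create, state)
print(state[rid])  # the failed cluster is kept in state, not lost
```

With this behavior, the failed cluster stays in state and the next `pulumi up` can delete and recreate it without a trip to the cloud console.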

Affected area/feature

@rshade rshade added kind/enhancement Improvements or new features needs-triage Needs attention from the triage team labels Nov 27, 2024
@Frassle Frassle added area/providers and removed needs-triage Needs attention from the triage team labels Nov 27, 2024
@Frassle (Member) commented Nov 27, 2024

they are marked as tainted

Is this tfbridge-specific behavior?

@VenelinMartinov (Contributor)

I think this is a TF feature, TF's variant of partial-creation errors. @rshade, can you please raise an issue with the specific use case you have here? I do not believe Pulumi should interact with TF taint in any way, as taint is only present in the TF state.

@VenelinMartinov VenelinMartinov transferred this issue from pulumi/pulumi Dec 2, 2024
@VenelinMartinov VenelinMartinov changed the title Support for Tainted Resources GKE creation failure requires manual cleanup Dec 2, 2024
@VenelinMartinov (Contributor)

@rshade, can you please add a repro here, as well as corresponding TF code showing that TF does not have the same problem?

@VenelinMartinov VenelinMartinov added needs-repro Needs repro steps before it can be triaged or fixed awaiting-feedback Blocked on input from the author labels Dec 2, 2024
@rshade (Contributor, Author) commented Dec 2, 2024

TF State:

    {
      "mode": "managed",
      "type": "google_container_cluster",
      "name": "cluster2",
      "provider": "provider[\"registry.terraform.io/hashicorp/google\"]",
      "instances": [
        {
          "status": "tainted",
          "schema_version": 2,
          "attributes": {
            "addons_config": null,
            "allow_net_admin": null,
            "authenticator_groups_config": null,
            "binary_authorization": [],
            "cluster_autoscaling": null,
            "cluster_ipv4_cidr": null,
            "confidential_nodes": null,
            "control_plane_endpoints_config": null,
            "cost_management_config": null,
            "database_encryption": null,
            "datapath_provider": null,
            "default_max_pods_per_node": null,
            "default_snat_status": null,
            "deletion_protection": false,
            "description": "GKE Cluster for testing Terraform failure mode",
            "dns_config": [],
            "effective_labels": {
              "goog-terraform-provisioned": "true"
            },
            "enable_autopilot": null,
            "enable_cilium_clusterwide_network_policy": false,
            "enable_intranode_visibility": null,
            "enable_k8s_beta_apis": [],
            "enable_kubernetes_alpha": false,
            "enable_l4_ilb_subsetting": false,
            "enable_legacy_abac": false,
            "enable_multi_networking": false,
            "enable_shielded_nodes": true,
            "enable_tpu": null,
            "endpoint": null,
            "fleet": [],
            "gateway_api_config": null,
            "id": "projects/pulumi-ce-team/locations/us-east1-b/clusters/huckstream-sbx-gke-fail-2",
            "identity_service_config": null,
            "initial_node_count": 2,
            "ip_allocation_policy": null,
            "label_fingerprint": null,
            "location": "us-east1-b",
            "logging_config": null,
            "logging_service": null,
            "maintenance_policy": [],
            "master_auth": null,
            "master_authorized_networks_config": null,
            "master_version": null,
            "mesh_certificates": null,
            "min_master_version": "1.30.5-gke.1014003",
            "monitoring_config": null,
            "monitoring_service": null,
            "name": "huckstream-sbx-gke-fail-2",
            "network": "default",
            "network_policy": [],
            "networking_mode": "VPC_NATIVE",
            "node_config": [
              {
                "advanced_machine_features": [],
                "boot_disk_kms_key": "",
                "confidential_nodes": [],
                "containerd_config": [],
                "disk_size_gb": 0,
                "disk_type": "",
                "effective_taints": [],
                "enable_confidential_storage": false,
                "ephemeral_storage_local_ssd_config": [],
                "fast_socket": [],
                "gcfs_config": [],
                "guest_accelerator": [],
                "gvnic": [],
                "host_maintenance_policy": [],
                "image_type": "",
                "kubelet_config": [],
                "labels": {},
                "linux_node_config": [],
                "local_nvme_ssd_block_config": [],
                "local_ssd_count": 0,
                "logging_variant": "",
                "machine_type": "e2-micro",
                "metadata": {},
                "min_cpu_platform": "",
                "node_group": "",
                "oauth_scopes": [
                  "https://www.googleapis.com/auth/cloud-platform"
                ],
                "preemptible": false,
                "reservation_affinity": [],
                "resource_labels": null,
                "resource_manager_tags": null,
                "secondary_boot_disks": [],
                "service_account": "",
                "shielded_instance_config": [],
                "sole_tenant_config": [],
                "spot": false,
                "storage_pools": null,
                "tags": null,
                "taint": [],
                "workload_metadata_config": []
              }
            ],
            "node_locations": [],
            "node_pool": null,
            "node_pool_auto_config": null,
            "node_pool_defaults": null,
            "node_version": null,
            "notification_config": null,
            "operation": null,
            "private_cluster_config": [
              {
                "enable_private_endpoint": false,
                "enable_private_nodes": true,
                "master_global_access_config": [
                  {
                    "enabled": true
                  }
                ],
                "master_ipv4_cidr_block": "172.16.0.0/28",
                "peering_name": "",
                "private_endpoint": "",
                "private_endpoint_subnetwork": "",
                "public_endpoint": ""
              }
            ],
            "private_ipv6_google_access": null,
            "project": null,
            "release_channel": null,
            "remove_default_node_pool": null,
            "resource_labels": null,
            "resource_usage_export_config": [],
            "secret_manager_config": [],
            "security_posture_config": null,
            "self_link": null,
            "service_external_ips_config": null,
            "services_ipv4_cidr": null,
            "subnetwork": null,
            "terraform_labels": {
              "goog-terraform-provisioned": "true"
            },
            "timeouts": null,
            "tpu_ipv4_cidr_block": null,
            "user_managed_keys_config": [],
            "vertical_pod_autoscaling": null,
            "workload_identity_config": null
          },
          "sensitive_attributes": [],
          "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjoyNDAwMDAwMDAwMDAwLCJkZWxldGUiOjI0MDAwMDAwMDAwMDAsInJlYWQiOjI0MDAwMDAwMDAwMDAsInVwZGF0ZSI6MzYwMDAwMDAwMDAwMH0sInNjaGVtYV92ZXJzaW9uIjoiMiJ9",
          "dependencies": [
            "google_container_cluster.cluster1"
          ]
        }
      ]
    }

@rshade (Contributor, Author) commented Dec 2, 2024

Terraform Code:

provider "google" {
  project = "PROJECT_ID" # Replace with your GCP project ID
  region  = "us-east1"
}

variable "namespace" {}
variable "environment" {}
variable "name" {}

locals {
  resource_name = "${var.namespace}-${var.environment}-${var.name}"
}

# First GKE Cluster
resource "google_container_cluster" "cluster1" {
  name        = "${local.resource_name}-1"
  location    = "us-east1-b"
  description = "GKE Cluster for testing Terraform failure mode"

  initial_node_count = 2
  min_master_version = "1.30.5-gke.1014003"

  node_config {
    machine_type = "e2-micro"
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  private_cluster_config {
    enable_private_nodes = true
    master_global_access_config {
      enabled = true
    }
    master_ipv4_cidr_block = "172.16.0.0/28"
  }

  deletion_protection = false
}

# Second GKE Cluster (with IP range conflict)
resource "google_container_cluster" "cluster2" {
  name        = "${local.resource_name}-2"
  location    = "us-east1-b"
  description = "GKE Cluster for testing Terraform failure mode"

  initial_node_count = 2
  min_master_version = "1.30.5-gke.1014003"

  node_config {
    machine_type = "e2-micro"
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  private_cluster_config {
    enable_private_nodes = true
    master_global_access_config {
      enabled = true
    }
    master_ipv4_cidr_block = "172.16.0.0/28" # Conflict CIDR Block
    # Uncomment the next line to repair the conflict
    # master_ipv4_cidr_block = "172.16.0.16/28"
  }

  deletion_protection = false

  # Ensure dependency on the first cluster
  depends_on = [google_container_cluster.cluster1]
}

output "endpoint_1" {
  value = google_container_cluster.cluster1.endpoint
}

output "endpoint_2" {
  value = google_container_cluster.cluster2.endpoint
}

@rshade (Contributor, Author) commented Dec 2, 2024

import pulumi
import pulumi_gcp as gcp

# Get configuration values
config = pulumi.Config()
namespace = config.require("namespace")
environment = config.require("environment")
name = config.require("name")

# Create the base resource name
resource_name = f"{namespace}-{environment}-{name}"

# Create a first GKE cluster as usual
cluster1 = gcp.container.Cluster(
    f"{resource_name}-1",
    name=f"{resource_name}-1",
    description="GKE Cluster for testing Pulumi failure mode",
    initial_node_count=2,
    min_master_version="1.30.5-gke.1014003",
    location="us-east1-b",
    node_config=gcp.container.ClusterNodeConfigArgs(
        machine_type="e2-micro",
        oauth_scopes=[
            "https://www.googleapis.com/auth/cloud-platform",
        ],
    ),
    private_cluster_config={
        "enable_private_nodes": True,
        "master_global_access_config": {
            "enabled": True,
        },
        "master_ipv4_cidr_block": "172.16.0.0/28",
    },
    deletion_protection=False,
)

# Create a second GKE cluster with IP range conflict to trigger failure mode
cluster2 = gcp.container.Cluster(
    f"{resource_name}-2",
    name=f"{resource_name}-2",
    description="GKE Cluster for testing Pulumi failure mode",
    initial_node_count=2,
    min_master_version="1.30.5-gke.1014003",
    location="us-east1-b",
    node_config=gcp.container.ClusterNodeConfigArgs(
        machine_type="e2-micro",
        oauth_scopes=[
            "https://www.googleapis.com/auth/cloud-platform",
        ],
    ),
    private_cluster_config={
        "enable_private_nodes": True,
        "master_global_access_config": {
            "enabled": True,
        },
        "master_ipv4_cidr_block": "172.16.0.0/28",  # Intentionally create IP range conflict - Comment out to attempt repair after failure
        # "master_ipv4_cidr_block": "172.16.0.16/28",  # Fix CIDR range to repair conflict - Uncomment to attempt to repair after failure
    },
    deletion_protection=False,
    # Force dependency to ensure that first cluster is up and running for conflict to occur
    opts=pulumi.ResourceOptions(depends_on=[cluster1]),
)


# Export the cluster endpoint and kubeconfig
pulumi.export("endpoint_1", cluster1.endpoint)
pulumi.export("endpoint_2", cluster2.endpoint)

@rshade rshade removed needs-repro Needs repro steps before it can be triaged or fixed awaiting-feedback Blocked on input from the author labels Dec 2, 2024
@VenelinMartinov VenelinMartinov added needs-triage Needs attention from the triage team kind/bug Some behavior is incorrect or out of spec and removed kind/enhancement Improvements or new features needs-triage Needs attention from the triage team labels Dec 2, 2024
@VenelinMartinov (Contributor)

Did some digging here. What is happening is that the TF Create call returns both an ID for the resource and a failure. In Pulumi land this is an initialization error, and I believe we have some handling of that.

https://github.com/pulumi/pulumi-terraform-bridge/blob/1c9c3e01c42fe6f91b0cdc56b66340f582a5fdbf/pkg/tfbridge/provider.go#L1387

and https://github.com/pulumi/pulumi-terraform-bridge/blob/1c9c3e01c42fe6f91b0cdc56b66340f582a5fdbf/pkg/tfbridge/provider.go#L1940

It's possible, however, that something is going wrong there. We need to do a bit more digging to root-cause this.
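As a rough illustration of that handling (hypothetical names, heavily simplified; not the actual bridge or engine code), a Create that returns both an ID and an error can be recorded as an initialization error, so the next `up` retries an update instead of losing the resource:

```python
# Simplified sketch (hypothetical names) of initialization-error handling:
# if Create returns an ID alongside an error, the engine keeps the
# partially created resource in state with the error attached, instead of
# discarding it.

class Engine:
    def __init__(self):
        self.state = {}

    def create(self, urn, provider_create):
        resource_id, props, err = provider_create()
        if resource_id is not None:
            # Record state even on failure; attach the error as an
            # initialization error so a later `up` can retry an update.
            self.state[urn] = {
                "id": resource_id,
                "props": props,
                "init_errors": [err] if err else [],
            }
        return err

engine = Engine()
err = engine.create(
    "urn:cluster2",
    lambda: ("cluster-2-id", {"status": "DEGRADED"}, "create operation failed"),
)
```

The bug report below suggests the bridge was not actually forwarding the state in this situation, so the engine never saw the partially created resource.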

@VenelinMartinov (Contributor)

This is a bridge issue: pulumi/pulumi-terraform-bridge#2696

There is also an engine issue if this is run with --refresh: pulumi/pulumi#17949

@VenelinMartinov VenelinMartinov added the awaiting/bridge The issue cannot be resolved without action in pulumi-terraform-bridge. label Dec 6, 2024
@VenelinMartinov VenelinMartinov self-assigned this Dec 6, 2024
VenelinMartinov added a commit to pulumi/pulumi-terraform-bridge that referenced this issue Dec 9, 2024
In the SDKv2 bridge under PlanResourceChange we are not passing any
state we receive during TF Apply back to the engine if we also received
an error. This causes us to incorrectly miss any resources which were
created but encountered errors during the creation process. The engine
should see these as `ResourceInitError`, which allows the engine to
attempt to update the partially created resource on the next `up`.

This PR fixes the issue by passing the state down to the engine in the
case when we receive an error and a non-nil state from TF during Apply.

related to pulumi/pulumi-gcp#2700
related to pulumi/pulumi-aws#4759

fixes #2696
VenelinMartinov added a commit to pulumi/pulumi-terraform-bridge that referenced this issue Dec 11, 2024
In the SDKv2 bridge under PlanResourceChange we are not passing any
state we receive during TF Apply back to the engine if we also received
an error. This causes us to incorrectly miss any resources which were
created but encountered errors during the creation process. The engine
should see these as ResourceInitError, which allows the engine to
attempt to update the partially created resource on the next up.

This PR fixes the issue by passing the state down to the engine in the
case when we receive an error and a non-nil state from TF during Apply.

This is the second attempt at this. The first was
#2695 but was
reverted because it caused a different panic:
#2706. We added
a regression test for that in
#2710

The reason for that panic was that we were now creating a non-nil `InstanceState` with a nil `stateValue`, which causes the `ID` function to panic. This PR fixes both issues by not allowing non-nil states with nil `stateValue`s and by preventing the panic in `ID`.

There was also a bit of fun with Go nil interfaces along the way, which is why `ApplyResourceChange` now returns a `shim.InstanceState` interface instead of a `*v2InstanceState2`. Otherwise we end up creating a non-nil interface with a nil value.

related to pulumi/pulumi-gcp#2700
related to pulumi/pulumi-aws#4759

fixes #2696
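The commit above describes the fix as passing any non-nil state through even when Apply errors. A minimal sketch of that pattern, in Python for illustration (the actual bridge code is Go, and these helper names are hypothetical):

```python
# Sketch of the bug and the fix described in the commit message above.

def apply_buggy(tf_apply):
    # Bug: state is discarded whenever Apply returns an error, so a
    # partially created resource never reaches the engine.
    state, err = tf_apply()
    if err is not None:
        return None, err
    return state, err

def apply_fixed(tf_apply):
    # Fix: pass any non-nil state back alongside the error; the engine
    # records it as a resource with an initialization error.
    state, err = tf_apply()
    return state, err

# A TF Apply that returns both partial state and an error, as in the
# GKE CIDR-conflict repro.
partial_apply = lambda: ({"id": "cluster-2"}, "create failed")

lost_state, _ = apply_buggy(partial_apply)   # state is dropped
kept_state, _ = apply_fixed(partial_apply)   # state reaches the engine
```

Returning the state together with the error is what lets the engine surface a `ResourceInitError` instead of leaving an orphaned cloud resource.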
@VenelinMartinov (Contributor)

Confirmed that this is fixed in the bridge in pulumi/pulumi-terraform-bridge#2713. Will be fixed here with the next bridge release.
