GKE creation failure requires manual cleanup #2700
Is this tfbridge-specific behavior?
I think this is a TF feature, TF's variant of partial-creation errors. @rshade, can you please raise an issue with the specific use case you have here? I do not believe Pulumi should interact with TF taint in any way, as taint is only present in the TF state.
@rshade, can you please add a repro here, as well as the corresponding TF code showing that TF does not have the same problem?
TF state:
{
"mode": "managed",
"type": "google_container_cluster",
"name": "cluster2",
"provider": "provider[\"registry.terraform.io/hashicorp/google\"]",
"instances": [
{
"status": "tainted",
"schema_version": 2,
"attributes": {
"addons_config": null,
"allow_net_admin": null,
"authenticator_groups_config": null,
"binary_authorization": [],
"cluster_autoscaling": null,
"cluster_ipv4_cidr": null,
"confidential_nodes": null,
"control_plane_endpoints_config": null,
"cost_management_config": null,
"database_encryption": null,
"datapath_provider": null,
"default_max_pods_per_node": null,
"default_snat_status": null,
"deletion_protection": false,
"description": "GKE Cluster for testing Terraform failure mode",
"dns_config": [],
"effective_labels": {
"goog-terraform-provisioned": "true"
},
"enable_autopilot": null,
"enable_cilium_clusterwide_network_policy": false,
"enable_intranode_visibility": null,
"enable_k8s_beta_apis": [],
"enable_kubernetes_alpha": false,
"enable_l4_ilb_subsetting": false,
"enable_legacy_abac": false,
"enable_multi_networking": false,
"enable_shielded_nodes": true,
"enable_tpu": null,
"endpoint": null,
"fleet": [],
"gateway_api_config": null,
"id": "projects/pulumi-ce-team/locations/us-east1-b/clusters/huckstream-sbx-gke-fail-2",
"identity_service_config": null,
"initial_node_count": 2,
"ip_allocation_policy": null,
"label_fingerprint": null,
"location": "us-east1-b",
"logging_config": null,
"logging_service": null,
"maintenance_policy": [],
"master_auth": null,
"master_authorized_networks_config": null,
"master_version": null,
"mesh_certificates": null,
"min_master_version": "1.30.5-gke.1014003",
"monitoring_config": null,
"monitoring_service": null,
"name": "huckstream-sbx-gke-fail-2",
"network": "default",
"network_policy": [],
"networking_mode": "VPC_NATIVE",
"node_config": [
{
"advanced_machine_features": [],
"boot_disk_kms_key": "",
"confidential_nodes": [],
"containerd_config": [],
"disk_size_gb": 0,
"disk_type": "",
"effective_taints": [],
"enable_confidential_storage": false,
"ephemeral_storage_local_ssd_config": [],
"fast_socket": [],
"gcfs_config": [],
"guest_accelerator": [],
"gvnic": [],
"host_maintenance_policy": [],
"image_type": "",
"kubelet_config": [],
"labels": {},
"linux_node_config": [],
"local_nvme_ssd_block_config": [],
"local_ssd_count": 0,
"logging_variant": "",
"machine_type": "e2-micro",
"metadata": {},
"min_cpu_platform": "",
"node_group": "",
"oauth_scopes": [
"https://www.googleapis.com/auth/cloud-platform"
],
"preemptible": false,
"reservation_affinity": [],
"resource_labels": null,
"resource_manager_tags": null,
"secondary_boot_disks": [],
"service_account": "",
"shielded_instance_config": [],
"sole_tenant_config": [],
"spot": false,
"storage_pools": null,
"tags": null,
"taint": [],
"workload_metadata_config": []
}
],
"node_locations": [],
"node_pool": null,
"node_pool_auto_config": null,
"node_pool_defaults": null,
"node_version": null,
"notification_config": null,
"operation": null,
"private_cluster_config": [
{
"enable_private_endpoint": false,
"enable_private_nodes": true,
"master_global_access_config": [
{
"enabled": true
}
],
"master_ipv4_cidr_block": "172.16.0.0/28",
"peering_name": "",
"private_endpoint": "",
"private_endpoint_subnetwork": "",
"public_endpoint": ""
}
],
"private_ipv6_google_access": null,
"project": null,
"release_channel": null,
"remove_default_node_pool": null,
"resource_labels": null,
"resource_usage_export_config": [],
"secret_manager_config": [],
"security_posture_config": null,
"self_link": null,
"service_external_ips_config": null,
"services_ipv4_cidr": null,
"subnetwork": null,
"terraform_labels": {
"goog-terraform-provisioned": "true"
},
"timeouts": null,
"tpu_ipv4_cidr_block": null,
"user_managed_keys_config": [],
"vertical_pod_autoscaling": null,
"workload_identity_config": null
},
"sensitive_attributes": [],
"private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjoyNDAwMDAwMDAwMDAwLCJkZWxldGUiOjI0MDAwMDAwMDAwMDAsInJlYWQiOjI0MDAwMDAwMDAwMDAsInVwZGF0ZSI6MzYwMDAwMDAwMDAwMH0sInNjaGVtYV92ZXJzaW9uIjoiMiJ9",
"dependencies": [
"google_container_cluster.cluster1"
]
}
]
}
Terraform code:
provider "google" {
project = "$PROJECT_ID" # Replace with your GCP project ID
region = "us-east1"
}
variable "namespace" {}
variable "environment" {}
variable "name" {}
locals {
resource_name = "${var.namespace}-${var.environment}-${var.name}"
}
# First GKE Cluster
resource "google_container_cluster" "cluster1" {
name = "${local.resource_name}-1"
location = "us-east1-b"
description = "GKE Cluster for testing Terraform failure mode"
initial_node_count = 2
min_master_version = "1.30.5-gke.1014003"
node_config {
machine_type = "e2-micro"
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
}
private_cluster_config {
enable_private_nodes = true
master_global_access_config {
enabled = true
}
master_ipv4_cidr_block = "172.16.0.0/28"
}
deletion_protection = false
}
# Second GKE Cluster (with IP range conflict)
resource "google_container_cluster" "cluster2" {
name = "${local.resource_name}-2"
location = "us-east1-b"
description = "GKE Cluster for testing Terraform failure mode"
initial_node_count = 2
min_master_version = "1.30.5-gke.1014003"
node_config {
machine_type = "e2-micro"
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
}
private_cluster_config {
enable_private_nodes = true
master_global_access_config {
enabled = true
}
master_ipv4_cidr_block = "172.16.0.0/28" # Conflict CIDR Block
# Uncomment the next line to repair the conflict
# master_ipv4_cidr_block = "172.16.0.16/28"
}
deletion_protection = false
# Ensure dependency on the first cluster
depends_on = [google_container_cluster.cluster1]
}
output "endpoint_1" {
value = google_container_cluster.cluster1.endpoint
}
output "endpoint_2" {
value = google_container_cluster.cluster2.endpoint
}
Pulumi code (Python):
import pulumi
import pulumi_gcp as gcp
# Get configuration values
config = pulumi.Config()
namespace = config.require("namespace")
environment = config.require("environment")
name = config.require("name")
# Create the base resource name
resource_name = f"{namespace}-{environment}-{name}"
# Create a first GKE cluster as usual
cluster1 = gcp.container.Cluster(
f"{resource_name}-1",
name=f"{resource_name}-1",
description="GKE Cluster for testing Pulumi failure mode",
initial_node_count=2,
min_master_version="1.30.5-gke.1014003",
location="us-east1-b",
node_config=gcp.container.ClusterNodeConfigArgs(
machine_type="e2-micro",
oauth_scopes=[
"https://www.googleapis.com/auth/cloud-platform",
],
),
private_cluster_config={
"enable_private_nodes": True,
"master_global_access_config": {
"enabled": True,
},
"master_ipv4_cidr_block": "172.16.0.0/28",
},
deletion_protection=False,
)
# Create a second GKE cluster with IP range conflict to trigger failure mode
cluster2 = gcp.container.Cluster(
f"{resource_name}-2",
name=f"{resource_name}-2",
description="GKE Cluster for testing Pulumi failure mode",
initial_node_count=2,
min_master_version="1.30.5-gke.1014003",
location="us-east1-b",
node_config=gcp.container.ClusterNodeConfigArgs(
machine_type="e2-micro",
oauth_scopes=[
"https://www.googleapis.com/auth/cloud-platform",
],
),
private_cluster_config={
"enable_private_nodes": True,
"master_global_access_config": {
"enabled": True,
},
"master_ipv4_cidr_block": "172.16.0.0/28", # Intentionally create IP range conflict - Comment out to attempt repair after failure
# "master_ipv4_cidr_block": "172.16.0.16/28", # Fix CIDR range to repair conflict - Uncomment to attempt to repair after failure
},
deletion_protection=False,
# Force dependency to ensure that first cluster is up and running for conflict to occur
opts=pulumi.ResourceOptions(depends_on=[cluster1]),
)
# Export the cluster endpoint and kubeconfig
pulumi.export("endpoint_1", cluster1.endpoint)
pulumi.export("endpoint_2", cluster2.endpoint) |
Did some digging here. What is happening is that the TF Create call returns both an error and a partial state. It's possible, however, that something is going wrong there. We need to do a bit more digging to root-cause.
This is a bridge issue: pulumi/pulumi-terraform-bridge#2696. There is also an engine issue if this is run with […]
In the SDKv2 bridge, under PlanResourceChange, we are not passing any state we receive during TF Apply back to the engine if we also received an error. This causes us to drop any resources which were created but encountered errors during the creation process. The engine should see these as `ResourceInitError`, which allows it to attempt to update the partially created resource on the next `up`. This PR fixes the issue by passing the state down to the engine when we receive an error and a non-nil state from TF during Apply.
Related to pulumi/pulumi-gcp#2700. Related to pulumi/pulumi-aws#4759. Fixes #2696.
In the SDKv2 bridge, under PlanResourceChange, we are not passing any state we receive during TF Apply back to the engine if we also received an error. This causes us to drop any resources which were created but encountered errors during the creation process. The engine should see these as `ResourceInitError`, which allows it to attempt to update the partially created resource on the next `up`. This PR fixes the issue by passing the state down to the engine when we receive an error and a non-nil state from TF during Apply.
This is the second attempt at this. The first was #2695, but it was reverted because it caused a different panic: #2706. We added a regression test for that in #2710. The reason for that panic was that we were now creating a non-nil `InstanceState` with a nil `stateValue`, which causes the `ID` function to panic. This PR fixes both issues by not allowing non-nil states with nil `stateValue`s and by preventing the panic in `ID`. There was also a bit of fun with Go nil interfaces along the way, which is why `ApplyResourceChange` now returns a `shim.InstanceState` interface instead of a `*v2InstanceState2`; otherwise we end up creating a non-nil interface with a nil value.
Related to pulumi/pulumi-gcp#2700. Related to pulumi/pulumi-aws#4759. Fixes #2696.
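The bridge itself is written in Go; the following Python sketch is only an illustration of the behavior change described in these PR notes (pass a partial state to the engine even when Apply also returns an error), with hypothetical names, not the actual bridge code.

```python
# Illustration only: the shape of the old vs. new Apply handling.
# "state" stands in for the Terraform state returned by Apply; all names are hypothetical.
from typing import Optional, Tuple

def apply_old(state: Optional[dict], err: Optional[Exception]) -> Tuple[Optional[dict], Optional[Exception]]:
    # Old behavior: any error discarded the state, so the engine never learned
    # that a resource had been partially created.
    if err is not None:
        return None, err
    return state, None

def apply_new(state: Optional[dict], err: Optional[Exception]) -> Tuple[Optional[dict], Optional[Exception]]:
    # New behavior: return the partial state alongside the error, so the engine can
    # record the resource with an initialization error and retry on the next `up`.
    return state, err
```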
Confirmed that this is fixed in the bridge in pulumi/pulumi-terraform-bridge#2713. It will be fixed here with the next bridge release.
Hello!
Issue details
Currently, when a resource like a GKE cluster fails during creation, it is marked as tainted and not saved to state. You need to go into the cloud console, delete the resource, and run `pulumi up` again. We should support tainted resources by saving them to state, marking them for replacement, and setting them to DeleteBeforeReplace (see the sketch below).
Affected area/feature
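For illustration, this is roughly what opting a resource into delete-before-replace looks like with the existing user-facing option in the Pulumi Python SDK; the resource name and arguments are placeholders, and the proposal above is about the engine doing this automatically for tainted resources rather than the user writing it by hand.

```python
# Sketch: the existing delete_before_replace resource option in the Python SDK.
# If a tainted resource were saved to state and marked for replacement, this is the
# ordering the proposal asks for: delete the broken resource, then create its replacement.
import pulumi
import pulumi_gcp as gcp

cluster = gcp.container.Cluster(
    "example-cluster",  # hypothetical resource name
    location="us-east1-b",
    initial_node_count=1,
    deletion_protection=False,
    opts=pulumi.ResourceOptions(delete_before_replace=True),
)
```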