[helm] Cannot start replication pods when using a service account not named airbyte-admin (Airbyte OSS 1.3.1) #49938

Open
AcidFlow opened this issue Dec 19, 2024 · 5 comments
Labels
area/platform, community, team/deployments, type/bug

@AcidFlow

Helm Chart Version

1.3.1

What step the error happened?

During the Sync

Relevant information

Summary

After upgrading Airbyte to 1.3.1 (OSS), workload-launcher issues a PATCH call for each replication pod it tries to launch. That PATCH call seems to use airbyte-admin as its ServiceAccount even though a different service account name is specified in values.yaml. The pod being started uses the correct service account and has JOB_KUBE_SERVICEACCOUNT set from the values file.

Steps to reproduce

  • Airbyte 1.3.1 (OSS) deployed using helm chart
  • Use a ServiceAccount name that's different from airbyte-admin but has the necessary role bindings
  • Start a sync for a connection
  • When the pod starts, you should see the following exception in the workload-launcher pod logs:
Caused by: io.airbyte.workers.exception.KubeClientException: Failed to create pod postgres-discover-<REDACTED>.
    at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:227)
    at io.airbyte.workload.launcher.pods.KubePodClient.launchDiscover(KubePodClient.kt:185)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:50)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24)
    at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42)
    ... 53 common frames omitted
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://<IP_REDACTED>:<PORT_REDACTED>/api/v1/namespaces/airbyte/pods/<POD_ID_REDACTED>?fieldManager=fabric8. Message: pods "<POD_ID_REDACTED>" is forbidden: error looking up service account airbyte/airbyte-admin: serviceaccount "airbyte-admin" not found. 

Let me know if you need any further information, I'll be happy to help.
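
For readers reproducing this, the second step might look like the following values.yaml override. This is only a sketch: the name `airbyte` matches the service account posted later in this thread, and the `global.serviceAccountName` key is an assumption about the chart layout that should be checked against the chart's published values. Because YAML aliases such as `*service-account-name` are resolved when the chart's own values file is parsed, an override file may need to set `serviceAccount.name` explicitly as well.

```yaml
# values.yaml override (sketch; key names are assumptions, verify against the chart)
global:
  # Assumed key: the service account name used by the Airbyte components
  serviceAccountName: airbyte

serviceAccount:
  # Set explicitly, since an alias in the chart defaults will not follow an override
  name: airbyte
```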

Additional information

  • Kubernetes cluster is running on EKS

Note

The issue is not happening on 1.2.0.

Workaround

As a workaround, I created another service account named airbyte-admin with the same role binding as the service account we normally use.
With that in place, workload-launcher is able to start the replication pod.
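
The workaround could be expressed roughly as the manifest below. It is a sketch under assumptions: the namespace `airbyte` is taken from the error message, the Role name `airbyte-role` from the definitions shared later in this thread, and the RoleBinding name `airbyte-admin-binding` is hypothetical.

```yaml
---
# Extra ServiceAccount with the name the PATCH call looks up
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airbyte-admin
  namespace: airbyte   # namespace seen in the error message
---
# Bind it to the same Role the custom service account already uses
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airbyte-admin-binding   # hypothetical name
  namespace: airbyte
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: airbyte-role
subjects:
  - kind: ServiceAccount
    name: airbyte-admin
    namespace: airbyte
```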

Relevant log output

2024-12-19 09:32:02,134 [Activity Executor taskQueue="workload_default", namespace="default": 2]    INFO    i.a.w.l.p.s.m.Stage(apply):39 - APPLY Stage: BUILD — (workloadId = <REDACTED_WORKLOAD_ID>_check) — (dataplaneId = local)
2024-12-19 09:32:02,143 [Activity Executor taskQueue="workload_default", namespace="default": 2]    INFO    i.a.w.l.p.s.m.Stage(apply):39 - APPLY Stage: MUTEX — (workloadId = <REDACTED_WORKLOAD_ID>_check) — (dataplaneId = local)
2024-12-19 09:32:02,144 [Activity Executor taskQueue="workload_default", namespace="default": 2]    INFO    i.a.w.l.p.s.EnforceMutexStage(applyStage):50 - No mutex key specified for workload: <REDACTED_WORKLOAD_ID>_check. Continuing...
2024-12-19 09:32:02,144 [Activity Executor taskQueue="workload_default", namespace="default": 2]    INFO    i.a.w.l.p.s.m.Stage(apply):39 - APPLY Stage: LAUNCH — (workloadId = <REDACTED_WORKLOAD_ID>_check) — (dataplaneId = local)
2024-12-19 09:32:02,147 [Activity Executor taskQueue="workload_default", namespace="default": 2]    INFO    i.a.w.l.p.f.InitContainerFactory(create):42 - [initContainer] image: airbyte/workload-init-container:1.3.1 resources: ResourceRequirements(claims=[], limits={}, requests={}, additionalProperties={})
    at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Intercepted.apply(Unknown Source)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.apply(LaunchPodStage.kt:24)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:132)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158)
    at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2571)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.set(Operators.java:2367)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onSubscribe(FluxOnErrorResume.java:74)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117)
    at reactor.core.publisher.FluxFlatMap.trySubscribeScalarMap(FluxFlatMap.java:193)
    at reactor.core.publisher.MonoFlatMap.subscribeOrReturn(MonoFlatMap.java:53)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4560)
    at reactor.core.publisher.MonoSubscribeOn$SubscribeOnSubscriber.run(MonoSubscribeOn.java:126)
    at reactor.core.scheduler.ImmediateScheduler$ImmediateSchedulerWorker.schedule(ImmediateScheduler.java:84)
    at reactor.core.publisher.MonoSubscribeOn.subscribeOrReturn(MonoSubscribeOn.java:55)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4560)
    at reactor.core.publisher.Mono.subscribeWith(Mono.java:4642)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4403)
    at io.airbyte.workload.launcher.pipeline.LaunchPipeline.accept(LaunchPipeline.kt:50)
    at io.airbyte.workload.launcher.pipeline.consumer.LauncherMessageConsumer.consume(LauncherMessageConsumer.kt:28)
    at io.airbyte.workload.launcher.pipeline.consumer.LauncherMessageConsumer.consume(LauncherMessageConsumer.kt:12)
    at io.airbyte.commons.temporal.queue.QueueActivityImpl.consume(Internal.kt:87)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at io.temporal.internal.activity.RootActivityInboundCallsInterceptor$POJOActivityInboundCallsInterceptor.executeActivity(RootActivityInboundCallsInterceptor.java:64)
    at io.temporal.internal.activity.RootActivityInboundCallsInterceptor.execute(RootActivityInboundCallsInterceptor.java:43)
    at io.temporal.common.interceptors.ActivityInboundCallsInterceptorBase.execute(ActivityInboundCallsInterceptorBase.java:39)
    at io.temporal.opentracing.internal.OpenTracingActivityInboundCallsInterceptor.execute(OpenTracingActivityInboundCallsInterceptor.java:78)
    at io.temporal.internal.activity.ActivityTaskExecutors$BaseActivityTaskExecutor.execute(ActivityTaskExecutors.java:107)
    at io.temporal.internal.activity.ActivityTaskHandlerImpl.handle(ActivityTaskHandlerImpl.java:124)
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handleActivity(ActivityWorker.java:290)
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:254)
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:217)
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: io.airbyte.workers.exception.KubeClientException: Failed to create pod <REDACTED_POD_ID>.
    at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:227)
    at io.airbyte.workload.launcher.pods.KubePodClient.launchCheck(KubePodClient.kt:167)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:49)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24)
    at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42)
    ... 53 common frames omitted
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://<REDACTED_IP>:<REDACTED_PORT>/api/v1/namespaces/airbyte/pods/<REDACTED_POD_ID>?fieldManager=fabric8. Message: pods "<REDACTED_POD_ID>" is forbidden: error looking up service account airbyte/airbyte-admin: serviceaccount "airbyte-admin" not found. Received status: Status(apiVersion=v1, code=403, details=StatusDetails(causes=[], group=null, kind=pods, name=<REDACTED_POD_ID>, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "<REDACTED_POD_ID>" is forbidden: error looking up service account airbyte/airbyte-admin: serviceaccount "airbyte-admin" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Forbidden, status=Failure, additionalProperties={}).
    at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handlePatch(OperationSupport.java:419)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handlePatch(OperationSupport.java:397)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handlePatch(BaseOperation.java:764)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$patch$2(HasMetadataOperation.java:231)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:236)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:251)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:1179)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:98)
    at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:56)
    at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:51)
    at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand$lambda$2(KubePodLauncher.kt:322)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand(KubePodLauncher.kt:322)
    at io.airbyte.workload.launcher.pods.KubePodLauncher.create(KubePodLauncher.kt:51)
    at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:224)
    ... 57 common frames omitted
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://<REDACTED_IP>:<REDACTED_PORT>/api/v1/namespaces/airbyte/pods/<REDACTED_POD_ID>?fieldManager=fabric8. Message: pods "<REDACTED_POD_ID>" is forbidden: error looking up service account airbyte/airbyte-admin: serviceaccount "airbyte-admin" not found. Received status: Status(apiVersion=v1, code=403, details=StatusDetails(causes=[], group=null, kind=pods, name=<REDACTED_POD_ID>, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "<REDACTED_POD_ID>" is forbidden: error looking up service account airbyte/airbyte-admin: serviceaccount "airbyte-admin" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Forbidden, status=Failure, additionalProperties={}).
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:640)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:589)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
    at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:142)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
    at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:51)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:136)
    ... 3 common frames omitted

2024-12-19 09:32:02,543 [Activity Executor taskQueue="workload_default", namespace="default": 2]    INFO    i.a.w.l.c.WorkloadApiClient(updateStatusToFailed):54 - Attempting to update workload: <REDACTED_WORKLOAD_ID>_check to FAILED.
2024-12-19 09:32:02,562 [Activity Executor taskQueue="workload_default", namespace="default": 2]    INFO    i.a.w.l.p.h.FailureHandler(apply):62 - Pipeline aborted after error for workload: <REDACTED_WORKLOAD_ID>_check.
@AcidFlow
Author

To add more insight: it seems the replication pod first starts with the custom service account name, which the PATCH call might then change to airbyte-admin.

This then caused other issues, for instance when accessing AWS Secrets Manager from the pod. We rely on IRSA to let the pod's service account assume a specific AWS role to access the secrets, and airbyte-admin was not allowed to assume that role because we had assumed only our own service account name would be used.
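
With IRSA, the airbyte-admin workaround service account would also need the role annotation, and the IAM role's trust policy would have to accept its subject. A sketch, reusing the redacted ARN format from the manifest posted in this thread; the namespace `airbyte` is taken from the error message in the issue description:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airbyte-admin
  namespace: airbyte
  annotations:
    # Placeholder ARN; the role's trust policy must also allow the subject
    # system:serviceaccount:airbyte:airbyte-admin
    eks.amazonaws.com/role-arn: arn:aws:iam::<REDACTED_AWS_ACCOUNT>:role/<REDACTED_ROLE_NAME>
```

Pods that are already running keep the credentials they started with, so they would likely need to be recreated after the annotation is added.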

@marcosmarxm
Member

Hello @AcidFlow,

Thank you for bringing this to our attention. It appears that despite specifying a different service account in your values.yaml, the system defaults to airbyte-admin during replication pod launches.

This behavior may be due to hardcoded references within the workload launcher component.

Do you mind sharing the permissions or the configs you used to create the custom service account?

@marcosmarxm marcosmarxm changed the title Cannot start replication pods when using a service account not named airbyte-admin (Airbyte OSS 1.3.1) [helm] Cannot start replication pods when using a service account not named airbyte-admin (Airbyte OSS 1.3.1) Dec 20, 2024
@AcidFlow
Author

Hello @marcosmarxm !

Thanks for the quick reply.

We copied the service account definition from the official helm chart and added an annotation to let it assume our AWS roles through EKS IRSA.

Here are the redacted service account and binding definitions:

---
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: true
metadata:
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "-10"
    eks.amazonaws.com/role-arn: arn:aws:iam::<REDACTED_AWS_ACCOUNT>:role/<REDACTED_ROLE_NAME>
  name: "airbyte"
  namespace: "<REDACTED>"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airbyte-role
  namespace: "<REDACTED>"
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "-5"
rules:
  - apiGroups: ["*"]
    resources: ["jobs", "pods", "pods/log", "pods/exec", "pods/attach", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] # over-permission for now
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airbyte-binding
  namespace: "<REDACTED>"
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "-3"
roleRef:
  apiGroup: ""
  kind: Role
  name: airbyte-role
subjects:
  - kind: ServiceAccount
    name: airbyte

@nullniverse

nullniverse commented Dec 23, 2024

We were facing the same issue. What worked for us: with the chart already installed and running, change "create": true to "create": false, then restart the airbyte-workload-launcher pods. The relevant values are (https://artifacthub.io/packages/helm/airbyte/airbyte?modal=values):

  # -- Specifies whether a ServiceAccount should be created
  create: true
  # -- Annotations for service account. Evaluated as a template. Only used if `create` is `true`.
  annotations: {}
  # -- Name of the service account to use. If not set and create is true, a name is generated using the fullname template.
  name: *service-account-name
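
As an override file, the change described above could be as small as the fragment below. It assumes the quoted keys sit under a top-level `serviceAccount` block in the chart's values, which the quoted excerpt does not show.

```yaml
# Sketch of the override; the parent key `serviceAccount` is an assumption
serviceAccount:
  # Stop the chart from creating the ServiceAccount itself
  create: false
```

After applying it with a Helm upgrade, the airbyte-workload-launcher pods still need a restart, as noted above.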

@PaxonF

PaxonF commented Jan 3, 2025

Can't update to 1.3.1 due to this issue
