Trouble with Connectors after Upgrading Airbyte Version

Summary

After upgrading Airbyte to the latest version, connectors fail to sync data because of Kubernetes pod initialization timeouts. The user suspects the problem is related to Kubernetes even though they never configured it themselves, and is seeking insights on how to resolve the issue.


Question

Hello guys, I'm new to Airbyte, and my team and I are having some trouble at work after we upgraded to the latest version.

We used to have an Airbyte (v0.50.35) deployment running on GCP via docker-compose, and it was working fine. We decided to upgrade to the latest version (v0.63.14), and now none of our connectors work and we can't sync any data.
To upgrade our installation, we followed the official upgrade guide (https://docs.airbyte.com/using-airbyte/getting-started/oss-quickstart#migrating-from-docker-compose-optional) and ran this command:

```
abctl local install --migrate
```

We got a warning message saying that some Kubernetes pod(s) had failed to initialize due to a timeout. The command automatically retried a few times and eventually succeeded, so we didn't give it much attention.

The problem is that we get a very similar error when our connections start synchronizing data from our sources, and we have no clue what we should do. In the sync logs of one of our connectors (we didn't change its definition after the upgrade), I see this error message:

```
message='io.airbyte.workers.exception.WorkloadLauncherException: io.airbyte.workload.launcher.pipeline.stages.model.StageError: io.airbyte.workload.launcher.pods.KubeClientException: Init container of orchestrator pod failed to start within allotted timeout of 900 seconds. (Timed out waiting for [900000] milliseconds for [Pod] with name:[orchestrator-repl-job-2016-attempt-1] in namespace [airbyte-abctl].)
    at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:46)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.apply(LaunchPodStage.kt:38)
    at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Intercepted.$$access$$apply(Unknown Source)
    at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Exec.dispatch(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:456)
    at io.micronaut.aop.chain.MethodInterceptorChain.proceed(MethodInterceptorChain.java:129)
    at io.airbyte.metrics.interceptors.InstrumentInterceptorBase.doIntercept(InstrumentInterceptorBase.kt:61)
    at io.airbyte.metrics.interceptors.InstrumentInterceptorBase.intercept(InstrumentInterceptorBase.kt:44)
    at io.micronaut.aop.chain.MethodInterceptorChain.proceed(MethodInterceptorChain.java:138)
    at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Intercepted.apply(Unknown Source)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.apply(LaunchPodStage.kt:24)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:132)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158)
    at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2571)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.set(Operators.java:2367)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onSubscribe(FluxOnErrorResume.java:74)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117)
    at reactor.core.publisher.FluxFlatMap.trySubscribeScalarMap(FluxFlatMap.java:193)
    at reactor.core.publisher.MonoFlatMap.subscribeOrReturn(MonoFlatMap.java:53)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4552)
    at reactor.core.publisher.MonoSubscribeOn$SubscribeOnSubscriber.run(MonoSubscribeOn.java:126)
    at reactor.core.scheduler.ImmediateScheduler$ImmediateSchedulerWorker.schedule(ImmediateScheduler.java:84)
    at reactor.core.publisher.MonoSubscribeOn.subscribeOrReturn(MonoSubscribeOn.java:55)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4552)
    at reactor.core.publisher.Mono.subscribeWith(Mono.java:4634)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4395)
    at io.airbyte.workload.launcher.pipeline.LaunchPipeline.accept(LaunchPipeline.kt:50)
    at io.airbyte.workload.launcher.pipeline.consumer.LauncherMessageConsumer.consume(LauncherMessageConsumer.kt:28)
    at io.airbyte.workload.launcher.pipeline.consumer.LauncherMessageConsumer.consume(LauncherMessageConsumer.kt:12)
    at io.airbyte.commons.temporal.queue.QueueActivityImpl.consume(Internal.kt:87)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at io.temporal.internal.activity.RootActivityInboundCallsInterceptor$POJOActivityInboundCallsInterceptor.executeActivity(RootActivityInboundCallsInterceptor.java:64)
    at io.temporal.internal.activity.RootActivityInboundCallsInterceptor.execute(RootActivityInboundCallsInterceptor.java:43)
    at io.temporal.common.interceptors.ActivityInboundCallsInterceptorBase.execute(ActivityInboundCallsInterceptorBase.java:39)
    at io.temporal.opentracing.internal.OpenTracingActivityInboundCallsInterceptor.execute(OpenTracingActivityInboundCallsInterceptor.java:78)
    at io.temporal.internal.activity.ActivityTaskExecutors$BaseActivityTaskExecutor.execute(ActivityTaskExecutors.java:107)
    at io.temporal.internal.activity.ActivityTaskHandlerImpl.handle(ActivityTaskHandlerImpl.java:124)
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handleActivity(ActivityWorker.java:278)
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:243)
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:216)
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:105)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: io.airbyte.workload.launcher.pods.KubeClientException: Init container of orchestrator pod failed to start within allotted timeout of 900 seconds. (Timed out waiting for [900000] milliseconds for [Pod] with name:[orchestrator-repl-job-2016-attempt-1] in namespace [airbyte-abctl].)
    at io.airbyte.workload.launcher.pods.KubePodClient.waitOrchestratorPodInit(KubePodClient.kt:118)
    at io.airbyte.workload.launcher.pods.KubePodClient.launchReplication(KubePodClient.kt:97)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:43)
    at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24)
    at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42)
    ... 53 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [900000] milliseconds for [Pod] with name:[orchestrator-repl-job-2016-attempt-1] in namespace [airbyte-abctl].
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:944)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:98)
    at io.fabric8.kubernetes.client.extension.ResourceAdapter.waitUntilCondition(ResourceAdapter.java:175)
    at io.airbyte.workload.launcher.pods.KubePodLauncher$waitForPodInitCustomCheck$initializedPod$1.invoke(KubePodLauncher.kt:108)
    at io.airbyte.workload.launcher.pods.KubePodLauncher$waitForPodInitCustomCheck$initializedPod$1.invoke(KubePodLauncher.kt:104)
    at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand$lambda$0(KubePodLauncher.kt:253)
    at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
    at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
    at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
    at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187)
    at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
    at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
    at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand(KubePodLauncher.kt:253)
    at io.airbyte.workload.launcher.pods.KubePodLauncher.waitForPodInitCustomCheck(KubePodLauncher.kt:104)
    at io.airbyte.workload.launcher.pods.KubePodLauncher.waitForPodInit(KubePodLauncher.kt:66)
    at io.airbyte.workload.launcher.pods.KubePodClient.waitOrchestratorPodInit(KubePodClient.kt:115)
    ... 57 more
', type='java.lang.RuntimeException', nonRetryable=false
```
We didn't configure anything related to Kubernetes (we don't even have kubectl installed), but from the log message it seems safe to say this is related to Kubernetes failing to spin up the pods that execute our workloads.

I'm running Airbyte on a Debian 11 VM on GCP.

Can someone offer any insights on how to solve this problem? My first guess would be to increase the timeout period, but this is my first time working with Airbyte and I don’t really know what I’m doing. I was just following the documentation’s instructions when the problem started.

Any ideas?



This topic has been created from a Slack thread to give it more visibility. It is in read-only mode here.


abctl uses a kind cluster (https://kind.sigs.k8s.io/) under the hood:

> kind (https://sigs.k8s.io/kind) is a tool for running local Kubernetes clusters using Docker container “nodes”.
> kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
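
So even though you never set up Kubernetes yourself, abctl created a small cluster for you inside Docker. If you want to convince yourself it is there, something like this should show it (the "airbyte-abctl" name is an assumption based on the kube context used below):

```
# The kind "node" is just a Docker container on your VM;
# kind names it <cluster-name>-control-plane, so it should match "airbyte-abctl"
docker ps --filter "name=airbyte-abctl"

# If you also install the kind CLI, it can list the cluster directly
kind get clusters
```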

If you want to debug, here are some instructions:

• install kubectl (https://kubernetes.io/docs/tasks/tools/#kubectl) and kubectx + kubens (https://github.com/ahmetb/kubectx)
• also install k9s (https://k9scli.io/)

then run these commands:

```
KUBECONFIG=~/.airbyte/abctl/abctl.kubeconfig kubectl config view --flatten > ~/.kube/config
kubectx kind-airbyte-abctl
kubens airbyte-abctl
```
Now you can check the pods with `kubectl describe pods` or with `k9s`.
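
For example, to look at the orchestrator pod from your error (pod name copied from your log; it changes with every job and attempt), a rough sequence would be:

```
# Point kubectl at the cluster abctl created
export KUBECONFIG=~/.airbyte/abctl/abctl.kubeconfig

# List all pods in the Airbyte namespace and their current status
kubectl get pods -n airbyte-abctl

# Describe the failing orchestrator pod; the Events section at the bottom is usually the interesting part
kubectl describe pod orchestrator-repl-job-2016-attempt-1 -n airbyte-abctl

# Recent events often reveal the root cause (ImagePullBackOff, FailedScheduling, ...)
kubectl get events -n airbyte-abctl --sort-by=.lastTimestamp | tail -n 30
```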

You'd need to dig more into the code to find out how to increase this value: https://github.com/airbytehq/airbyte-platform/blob/9f4b277766e0fc3675dbe0d23cdbc90756209460/airbyte-workload-launcher/src/main/kotlin/pods/KubePodClient.kt#L119 (KubePodClient.kt#L119)
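
Before patching anything, it might be worth checking whether the deployed workload-launcher already exposes something timeout-related as an environment variable. I don't know the exact variable names, so this just dumps whatever is set; the deployment name below is a guess, use whatever the first command prints:

```
export KUBECONFIG=~/.airbyte/abctl/abctl.kubeconfig

# Find the workload-launcher deployment; the exact name depends on the chart/release
kubectl get deployments -n airbyte-abctl | grep -i launcher

# Dump its environment and grep for anything timeout-related
kubectl set env deployment/airbyte-abctl-workload-launcher -n airbyte-abctl --list | grep -i timeout
```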

Maybe there is a problem with pulling the Docker image, or with CPU/memory limits; it's hard to tell without checking the pod details.

I’ll take a look into it. Thanks!

Update: I upgraded the memory of the instance running Airbyte and tried synchronizing again. This time the error message changed slightly, and now I'm facing an issue similar to this one: https://github.com/airbytehq/airbyte/issues/39512

I get an error message saying:

```
io.airbyte.workers.exception.WorkerException: Failed to create pod for write step
```

A few things I'd do (concrete commands below):
• start one terminal session with k9s running to monitor pods
• start a second terminal session with `stern --tail 0 .` (https://github.com/stern/stern)
• in a third session, run `kubectl get events` a few times during the synchronization to spot issues like FailedScheduling or other problems
When k9s is running, you select pods with the ↑/↓ arrows and press `d` to describe a pod's details.
There might again be issues with CPU/memory limits, or pulling the image and creating the pod may be taking too long.
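
Concretely, the three terminals could look like this (assuming the kubeconfig/kubens setup from earlier and the airbyte-abctl namespace):

```
# Terminal 1: interactive pod view
k9s -n airbyte-abctl

# Terminal 2: stream logs from every new pod in the namespace
stern --tail 0 -n airbyte-abctl .

# Terminal 3: watch events as the sync runs
kubectl get events -n airbyte-abctl --watch
```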

How much memory/CPU does your machine have? How many synchronizations are you running at the same time?
If you are using the helm chart's default values, it might be useful to set resource requests/limits (example below).
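
For instance, a values.yaml along these lines could cap what each job pod requests. The key names here are an assumption based on the Airbyte helm chart's job resource settings, so double-check them against the chart's default values before applying:

```
# Hypothetical values.yaml (key names assumed, verify against the chart defaults)
cat > values.yaml <<'EOF'
global:
  jobs:
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "2Gi"
EOF

abctl local install --values values.yaml
```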

My instance has 16GB of RAM and 4 vCPUs. I'm running one synchronization with 4 streams. All settings are at their default values.

This is the output of `top` while running a synchronization:

When running `kubectl describe pods`, I saw a FailedScheduling error. The message was “0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.” I'll try increasing the number of CPU cores available and try again.
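
To see how the scheduler arrives at “Insufficient cpu”, one way is to compare the node's allocatable CPU with what the existing pods already request (a sketch, assuming the airbyte-abctl namespace):

```
# How much CPU/memory is allocatable on the node vs. already requested
kubectl describe node | grep -A 10 "Allocated resources"

# Per-pod requests in the Airbyte namespace
kubectl get pods -n airbyte-abctl \
  -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```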

I switched to a helm installation now, so I’m no longer using abctl. Now I’m having issues with setting up a destination (I’m using the latest version of the Snowflake connector - 3.11.5).
When I try to create the destination using Airbyte’s web UI, I get an error saying “Airbyte is temporarily unavailable. Please try again (HTTP 502)”.
When I check the logs of the pod that was created for this task, I see this:

```
2024-08-09 14:11:17 WARN c.a.l.CommonsLog(warn):113 - JAXB is unavailable. Will fallback to SDK implementation which may be less performant.If you are using Java 9+, you will need to include javax.xml.bind:jaxb-api as a dependency.
2024-08-09 14:20:16 WARN i.a.c.ConnectorWatcher(run):74 - Failed to find output files from connector within timeout 9 minute(s). Is the connector still running?
2024-08-09 14:20:16 INFO i.a.c.ConnectorWatcher(failWorkload):277 - Failing workload 424892c4-daac-4491-b35d-c6688ba547ba_93c985f5-0dde-474a-b7d3-a1e2343dfc78_0_check.
2024-08-09 14:20:16 INFO i.a.c.ConnectorWatcher(exitFileNotFound):201 - Deliberately exiting process with code 2.
2024-08-09 14:20:16 WARN c.v.l.l.Log4j2Appender(close):108 - Already shutting down. Cannot remove shutdown hook.
```
Needless to say, I tried again many times and got the same problem every time. Any ideas?

Now I'm using an instance with 30GB of memory and 8 vCPUs, and since I'm not synchronizing any data (yet), I don't understand how this could be a resource limitation issue.

Do you get "Airbyte is temporarily unavailable. Please try again (HTTP 502)" right away or after some delay?

The message “Failed to find output files from connector within timeout 9 minute(s). Is the connector still running?” comes from the airbyte-connector-sidecar (https://github.com/airbytehq/airbyte-platform/blob/0080471591d34176aed3cb331bf73a07cd1f3e36/airbyte-connector-sidecar/src/main/kotlin/io/airbyte/connectorSidecar/ConnectorWatcher.kt#L74). 🤔 I've never seen this before.

There is a delay. The destination test takes a long time and ultimately fails with this message.
When checking the server pod logs, I see an error related to Temporal:

```
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268) ~[grpc-stub-1.64.0.jar:1.64.0]
```

Update: when checking the Temporal pod logs, I see this:

```
unable to health check "temporal.api.workflowservice.v1.WorkflowService" service: connection error: desc = "transport: Error while dialing: dial tcp 10.42.0.19:7233: connect: connection refused"
```
When running `kubectl get services`, I see that the temporal service is actually running on 10.43.112.175:7233. In fact, all services are running in the 10.43.x.x range. I think this might be the cause of the whole issue.
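
One way to check that theory is to compare the Temporal service's ClusterIP and endpoints with the address being dialed. The 10.42.x.x vs 10.43.x.x split matches k3s's default pod and service CIDRs, so the dialed address may simply be stale. The namespace below is a placeholder for wherever the helm chart was installed:

```
# Which ClusterIP does the Temporal service have, and which pod IPs back it?
kubectl get svc,endpoints -n <airbyte-namespace> | grep -i temporal

# If a pod cached a stale address, restarting the deployments sometimes clears it up
kubectl rollout restart deployment -n <airbyte-namespace>
```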

With k9s you can check how much CPU/memory each pod consumes. It might be useful to set resource requests/limits in values.yaml and pass it with abctl local install --values values.yaml. That should help with scheduling the pods.
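
If you'd rather check resource usage from the command line than in k9s, kubectl can do it too, provided metrics-server is running in the cluster (it isn't always installed by default in kind):

```
# Per-pod CPU/memory usage (requires metrics-server)
kubectl top pods -n airbyte-abctl

# Per-node usage, to see how close the node is to capacity
kubectl top nodes
```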

I am having a similar error. It looks like there is a Kubernetes network issue.
```
Caused by: io.airbyte.workload.launcher.pods.KubeClientException: Init container of orchestrator pod failed to start within allotted timeout of 900 seconds. (Timed out waiting for [900000] milliseconds for [Pod] with name:[orchestrator-repl-job-1-attempt-1] in namespace [airbyte-abctl].)
```

Hey <@U07AY8L61HT>, I managed to get it running. I was experiencing this issue when deploying with k3s; when I switched to minikube, things started working. There's probably a way to make it work with k3s too, but I'm a Kubernetes noob and didn't manage to figure it out.
I basically followed this tutorial: https://docs.airbyte.com/deploying-airbyte/on-kubernetes-via-helm and managed to sync the data I needed.
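
For reference, the Helm path from that guide boils down to roughly the following (release name, namespace, and values file are whatever you choose):

```
# Add the Airbyte Helm repository and install the chart into its own namespace
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm install airbyte airbyte/airbyte --namespace airbyte --create-namespace

# Later upgrades (e.g. with custom resource settings in values.yaml)
helm upgrade --install airbyte airbyte/airbyte -n airbyte -f values.yaml
```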