Troubleshooting Unexpected Error in Airbyte Platform

Summary

The error message indicates a failure to create a pod for the check step in the Airbyte platform. The user is seeking direction for troubleshooting.


Question

I keep seeing this error and simply can't figure out why. Short of destroying our installation, I have no clue how to debug this one. I've taken multiple steps in Rancher to isolate the issue, but nothing stands out. There are two related tickets, but no clear answer as to what is happening: https://github.com/airbytehq/airbyte/issues/35346 and https://github.com/airbytehq/airbyte/discussions/35301

```
io.airbyte.workers.exception.WorkerException: Failed to create pod for check step
        at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:197) ~[io.airbyte-airbyte-commons-worker-0.53.0.jar:?]
        at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:149) ~[io.airbyte-airbyte-commons-worker-0.53.0.jar:?]
        at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:71) ~[io.airbyte-airbyte-commons-worker-0.53.0.jar:?]
```
Any direction will be appreciated.


---

This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1710424150247589) if you want to access the original thread.


There is too little information to suggest anything.
Do you use Helm + Kubernetes? If yes, does the check pod start? Can you show its logs?

No, the pod doesn't start; it simply hangs and gets terminated.

And yes, I use the official Helm chart; the only custom part is an ingress added to the webapp with SSL.

```
  database:
    secretName: ''
    secretValue: ''
  deploymentMode: oss
  edition: community
  env_vars: {}
  jobs:
    kube:
      annotations: {}
      images:
        busybox: ''
        curl: ''
        socat: ''
      labels: {}
      main_container_image_pull_secret: ''
      nodeSelector: {}
      tolerations: []
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 0.5
        memory: 128Mi
```

This app has been working for 200 days (this is not a fresh install).

It stopped working all of a sudden 2 days ago; we have not upgraded or changed anything server-side.

Do you see any Warnings in kubernetes events?

> the pod doesn't start, no; it simply hangs and gets terminated
Do you get an init error similar to this?

btw, this screenshot is from [k9s](https://github.com/derailed/k9s), a cool tool for k8s management
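For the question about Warnings in the Kubernetes events, one rough way to pull them out without a UI is to filter the JSON that `kubectl get events -o json` prints. The sketch below is only an illustration: the function names are made up for this example, and it assumes `kubectl` is on the PATH and already pointed at the right cluster.

```python
import json
import subprocess


def warning_events(events_json: str) -> list[dict]:
    """Summarise every event of type Warning from `kubectl get events -o json` output."""
    items = json.loads(events_json).get("items", [])
    summaries = []
    for item in items:
        if item.get("type") != "Warning":
            continue  # skip Normal events
        involved = item.get("involvedObject", {})
        summaries.append({
            "object": f"{involved.get('kind', '?')}/{involved.get('name', '?')}",
            "reason": item.get("reason", ""),
            "message": item.get("message", ""),
        })
    return summaries


def fetch_events(namespace: str) -> str:
    """Ask kubectl for all events in a namespace as JSON (needs kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "get", "events", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

With cluster access, `warning_events(fetch_events("airbyte"))` would return one dict per Warning event; `"airbyte"` is an assumed namespace name, so substitute whichever namespace the chart was installed into.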

I'll quickly check, but yes, it does end like that.

From what I can tell, it's struggling to copy a config file in under 60 seconds.

Can you check the logs of the failed pod?
Maybe there is something like this (it's my usual problem with Airbyte).

That's exactly my error, yes.

Cleanup ran, so I've started a fresh sync.

> from what I can tell it's struggling to copy a config file in under 60 seconds
OK, then I know this problem 🙂
There are two pieces of news: the good and the bad.

The good:
[This commit](https://github.com/airbytehq/airbyte-platform/commit/315fb087504bc935e530d37b951a6a1ad96e5f63) added retries to config copying, so it should help a little if you are running an older platform version.

The bad:
I don't know an ideal solution for this problem.
We regularly get the same errors.
Airbyte uses kubectl cp to transfer configs to new pods:
https://github.com/airbytehq/airbyte-platform/blame/87a6e979dd3de2776a98c17bcdaff99a444ee6a7/airbyte-commons-worker/src/main/java/io/airbyte/workers/process/KubePodProcess.java#L339

https://github.com/airbytehq/airbyte-platform/blob/87a6e979dd3de2776a98c17bcdaff99a444ee6a7/airbyte-commons-worker/src/main/java/io/airbyte/workers/process/AsyncOrchestratorPodProcess.java#L626

And kubectl cp is not a very reliable tool
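To make the "retries around config copying" idea concrete, here is a minimal sketch of retrying a flaky copy with exponential backoff. It is not Airbyte's implementation (that lives in the Java files linked above); `copy_with_retries` and `kubectl_cp` are hypothetical names, and the 60-second default timeout just mirrors the figure mentioned earlier in the thread.

```python
import subprocess
import time
from typing import Callable


def copy_with_retries(
    copy_once: Callable[[], None],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call copy_once until it succeeds and return the attempt number that worked.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            copy_once()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def kubectl_cp(local_path: str, namespace: str, pod: str, remote_path: str,
               timeout_s: float = 60.0) -> None:
    """Hypothetical helper: one `kubectl cp` attempt that raises on failure or timeout."""
    subprocess.run(
        ["kubectl", "cp", local_path, f"{namespace}/{pod}:{remote_path}"],
        check=True, timeout=timeout_s,
    )
```

A caller would wrap a single copy in a lambda, for example `copy_with_retries(lambda: kubectl_cp(src, ns, pod, dst))`, where `src`, `ns`, `pod` and `dst` are whatever values apply to your cluster. Retries only paper over an unreliable copy; they do not explain why it fails.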

Is there anything that can be done to slim down the config or make it faster again?

I don't think the problem is with the config sizes (but I can be wrong).
From what I saw, the configs are quite small.

A good solution would be to move from kubectl cp to a more reliable tool (just my opinion; I'm not part of the Airbyte team, just a user).

Meanwhile, you can inspect your Kubernetes cluster. If kubectl cp fails regularly, there may be problems with the network or the cluster controller.
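One hedged way to check whether copies into the cluster really do fail "regularly" is to repeat a small copy many times and record how often it fails and how slow the worst run is. The sketch below only does the counting and timing; `probe` and `ProbeReport` are names invented for this example, and the operation to run (for instance a `kubectl cp` of a tiny file into a throwaway pod) is left to the caller.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeReport:
    attempts: int
    failures: int
    slowest_s: float

    @property
    def failure_rate(self) -> float:
        """Fraction of attempts that raised an exception."""
        return self.failures / self.attempts if self.attempts else 0.0


def probe(operation: Callable[[], None], attempts: int = 20,
          clock: Callable[[], float] = time.monotonic) -> ProbeReport:
    """Run operation repeatedly, counting exceptions and tracking the slowest run."""
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        started = clock()
        try:
            operation()
        except Exception:
            failures += 1
        # Failed runs are timed too, since a hang before failure is informative.
        slowest = max(slowest, clock() - started)
    return ProbeReport(attempts=attempts, failures=failures, slowest_s=slowest)
```

A non-zero `failure_rate`, or a `slowest_s` close to the 60 seconds mentioned earlier, on a file of a few bytes would point at the network or control plane rather than at the size of Airbyte's configs.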

We are busy upgrading from 1.25 to 1.27 in the hope that it helps.

But yes, this is completely random.