Airbyte Orchestrator Job Scheduling Timeout with Karpenter.sh on EKS

Summary

The Airbyte orchestrator terminates job pods before Karpenter finishes provisioning new nodes, causing sync failures. The user wants to extend the orchestrator's job scheduling timeout or find other ways to run Airbyte with Karpenter.sh on EKS.


Question

I’m running Airbyte Community 0.63.5 on an EKS 1.29 cluster that uses Karpenter.sh 0.36 for autoscaling. Because Karpenter provisions nodes just in time to support workloads, launching pods is slightly slow. We keep running into an issue where the Airbyte orchestrator attempts to launch a batch of pods that cannot be scheduled immediately; Karpenter notices the pending pods and brings up a new node, but before that node comes online and becomes healthy (this takes about 25 seconds), Airbyte decides the pods are never going to launch and terminates them. Airbyte seems to give up on a pod becoming healthy after 30–45 seconds, so all my jobs fail with "Failed to create pod".

When I manually scale up the EKS nodes, my jobs all work, but I don’t want to leave excess capacity around all the time, and manually scaling these nodes is not really an option. I think the easy solution is to make the orchestrator’s job scheduling timeout longer, but there might be other tricks to fix this. Does anyone use Karpenter.sh and Airbyte on EKS? Can I configure the orchestrator to be more patient?
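
One mitigation on the Karpenter side, independent of any Airbyte setting, is to stop Karpenter from tearing freshly provisioned nodes down as soon as they go empty, so a retried attempt or the next sync can reuse a node that is already healthy instead of waiting through another provisioning cycle. A minimal sketch, assuming Karpenter 0.36's v1beta1 NodePool API; the pool name, label, and 30-minute window are placeholders, not values from this thread:

```yaml
# Hypothetical NodePool for Airbyte job pods (names and labels are placeholders).
# consolidationPolicy: WhenEmpty plus consolidateAfter keeps an empty node alive
# for a while, so a retried or follow-on sync does not wait for another
# just-in-time provisioning cycle.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: airbyte-jobs
spec:
  template:
    metadata:
      labels:
        workload: airbyte-jobs          # match this from the node selectors on Airbyte job pods
    spec:
      nodeClassRef:
        name: airbyte-jobs              # an EC2NodeClass; see the sketch further down the thread
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m               # keep empty nodes around for 30 minutes before scale-down
```

This does not make the orchestrator itself more patient, but it narrows the window in which a sync has to wait on node provisioning.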



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here. Click here if you want
to access the original thread.


["airbyte-orchestrator", "karpenter.sh", "EKS", "job-scheduling-timeout", "pod-failure"]

Searching Slack, I found: https://airbytehq.slack.com/archives/C021JANJ6TY/p1722389058180809
and https://airbytehq.slack.com/archives/C021JANJ6TY/p1707940805804549

https://github.com/airbytehq/airbyte/discussions/35301 seems to have no answers, just a lot of people asking the same question.

We use cluster-autoscaler instead of Karpenter on EKS, and we keep one Airbyte node up so the first job starts immediately; only subsequent jobs that don’t fit on that node need just-in-time node provisioning.

In the past month the maximum start time we have seen for a job is about 2 minutes.
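
The same "keep one node warm" idea can be approximated with Karpenter by running a low-priority headroom Deployment: pause pods hold roughly one node's worth of capacity, and the scheduler preempts them the moment real Airbyte job pods need the room. This is a generic Kubernetes over-provisioning pattern, not an Airbyte feature; all names and resource sizes below are placeholders:

```yaml
# Negative-priority class so the headroom pods are evicted first when real pods arrive.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: airbyte-headroom
value: -10
globalDefault: false
description: "Placeholder capacity for Airbyte job pods"
---
# Pause pods sized roughly like one node's worth of Airbyte job pods (adjust requests to taste).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airbyte-headroom
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airbyte-headroom
  template:
    metadata:
      labels:
        app: airbyte-headroom
    spec:
      priorityClassName: airbyte-headroom
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"        # placeholder; size to roughly match your job pods' requests
              memory: 4Gi
```

The trade-off is that about one node's worth of spare capacity stays up, but real job pods displace the pause pods immediately instead of waiting out the provisioning delay.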

The orchestrator launches a few pods successfully; it’s just when it tries to launch the read and write pods for the sync that it gives up too soon.

Do all 5 attempts of the sync fail?

Actually, I’m looking at some IAM issues; it’s possible my Karpenter nodes don’t have the EBS/EFS policies that they need.

Yeah, if the Karpenter-provisioned nodes don’t have the IAM permissions they need for EBS and EFS, that could be an issue.

I was just reading some of the troubleshooting steps on that GitHub discussion, and someone mentioned EBS permissions. I checked, and it looks like my ASG instance role includes the EBS/EFS/EFSEC2 policies but my Karpenter role doesn’t, so I’m trying to tell Karpenter to provision nodes with the same instance role the ASG uses.

Makes sense. We’ve talked about moving to Karpenter but haven’t had a good use case for it yet
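
For the instance-role fix discussed above: in Karpenter 0.36 the node IAM role comes from the EC2NodeClass, so pointing it at the same role the ASG nodes use (or at a role with the same EBS/EFS CSI policies attached) looks roughly like the sketch below. The node class name, role name, and discovery tags are placeholders from a hypothetical cluster, not values from this thread:

```yaml
# Hypothetical EC2NodeClass; "role" is the IAM role name (not ARN) that
# Karpenter-provisioned nodes will use via a generated instance profile.
# Attach the same EBS/EFS policies here that the ASG instance role carries.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: airbyte-jobs
spec:
  amiFamily: AL2
  role: eks-node-instance-role               # placeholder: same role as the ASG nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

Note that if the EBS/EFS CSI drivers authenticate via IRSA instead of node roles, the permissions live on the drivers' service-account roles, and changing the node role alone won't help.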

Hello :wave:
Same problem here :confused:
Have you found any solution?
Thanks :pray: