Summary
The Airbyte orchestrator terminates pods before Karpenter can provision new nodes, causing job failures. The user wants to extend the orchestrator's pod scheduling timeout, or find other approaches for running Airbyte with Karpenter.sh on EKS.
Question
I'm running Airbyte Community 0.63.5 on an EKS 1.29 cluster that uses Karpenter.sh 0.36 for autoscaling. Because Karpenter provisions nodes just in time to support workloads, launching pods is slightly slow. We keep hitting an issue where the Airbyte orchestrator launches a batch of pods that cannot be scheduled immediately; Karpenter notices the pending pods and provisions a new node, but before that node comes online and becomes healthy (this takes about 25 seconds), Airbyte decides the pods are never going to launch and terminates them. Airbyte seems to give up on a pod becoming healthy after 30-45 seconds, so all my jobs fail with `Failed to create pod`.
When I manually scale up the EKS nodes ahead of time, my jobs all succeed, but I don't want to leave excess capacity around all the time, and manually scaling these nodes is not really an option. I think the easy solution is to make the orchestrator's job scheduling timeout longer, but there might be other tricks to fix this? Does anyone use Karpenter.sh and Airbyte on EKS? Can I configure the orchestrator to be more patient?
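The Airbyte-side timeout is set via worker environment variables whose names have changed across releases, so it's safer to check the values shipped with your Helm chart version than to guess here. Independently of that, a common Kubernetes-side workaround for slow just-in-time provisioning is the overprovisioning (headroom) pattern: run a low-priority "pause" Deployment that holds roughly one node's worth of capacity warm, so real job pods preempt it and schedule immediately while Karpenter backfills a replacement node. The sketch below is a hypothetical example of that pattern; the namespace, resource requests, and replica count are placeholders you would size to one Airbyte sync's footprint.

```yaml
# Placeholder pods reserving warm capacity. Real workloads (priority 0 by
# default) preempt these negative-priority pods, so Airbyte job pods land
# instantly; Karpenter then provisions a new node for the evicted pause pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10            # below the default priority (0), so anything preempts it
globalDefault: false
description: "Headroom pods that keep a warm node available for Airbyte jobs"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airbyte-headroom      # hypothetical name
  namespace: airbyte          # adjust to your install
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airbyte-headroom
  template:
    metadata:
      labels:
        app: airbyte-headroom
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"        # size to roughly one sync's requests
              memory: 2Gi
```

This trades a small amount of always-on capacity for predictable scheduling latency, which may be an acceptable middle ground between "no spare nodes" and manually scaling the node group.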
This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here; the original thread remains on Slack.