Airbyte Orchestrator Job Scheduling Timeout with Karpenter.sh on EKS

Summary

The Airbyte orchestrator terminates job pods before Karpenter finishes provisioning new nodes, causing sync failures. The user wants to extend the orchestrator's job scheduling timeout or find other ways to run Airbyte with Karpenter.sh on EKS.


Question

I’m running Airbyte Community 0.63.5 on an EKS 1.29 cluster that uses Karpenter.sh 0.36 for autoscaling. Because Karpenter provisions nodes just in time to support workloads, launching pods is slightly slow. We keep running into an issue where the Airbyte orchestrator attempts to launch a batch of pods that cannot be scheduled immediately; Karpenter notices the pending pods and brings up a new node, but before that node comes online and becomes healthy (this takes about 25 seconds), Airbyte decides the pods are never going to launch and terminates them. Airbyte seems to give up on a pod becoming healthy after 30–45 seconds, so all my jobs fail with "Failed to create pod".

When I manually scale up the EKS nodes, my jobs all work, but I don’t want to leave excess capacity around all the time, and manually scaling these nodes is not really an option. I think the easy solution is to make the orchestrator’s job scheduling timeout longer, but there might be other tricks to fix this. Does anyone use Karpenter.sh and Airbyte on EKS? Can I configure the orchestrator to be more patient?
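
One mitigation on the Karpenter side, independent of any Airbyte setting, is to stop Karpenter from tearing freshly provisioned nodes down as soon as they go empty, so a retried attempt or the next sync can reuse a node that is already healthy instead of waiting through another provisioning cycle. A minimal sketch, assuming Karpenter 0.36's v1beta1 NodePool API; the pool name, label, and 30-minute window are placeholders, not values from this thread:

```yaml
# Hypothetical NodePool for Airbyte job pods (names and labels are placeholders).
# consolidationPolicy: WhenEmpty plus consolidateAfter keeps an empty node alive
# for a while, so a retried or follow-on sync does not wait for another
# just-in-time provisioning cycle.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: airbyte-jobs
spec:
  template:
    metadata:
      labels:
        workload: airbyte-jobs          # match this from the node selectors on Airbyte job pods
    spec:
      nodeClassRef:
        name: airbyte-jobs              # an EC2NodeClass; see the sketch further down the thread
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m               # keep empty nodes around for 30 minutes before scale-down
```

This does not make the orchestrator itself more patient, but it narrows the window in which a sync has to wait on node provisioning.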



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here. Click here if you want
to access the original thread.


["airbyte-orchestrator", "karpenter.sh", "EKS", "job-scheduling-timeout", "pod-failure"]

Searching Slack, I found: https://airbytehq.slack.com/archives/C021JANJ6TY/p1722389058180809
and https://airbytehq.slack.com/archives/C021JANJ6TY/p1707940805804549

https://github.com/airbytehq/airbyte/discussions/35301 seems to have no answers, just a lot of people asking the same question.

We use cluster-autoscaler instead of Karpenter on EKS, and we keep one Airbyte node up so the first job starts immediately; only subsequent jobs that don’t fit on that node need just-in-time node provisioning.

In the past month the maximum start time we have seen for a job is about 2 minutes.
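
The same "keep one node warm" idea can be approximated with Karpenter by running a low-priority headroom Deployment: pause pods hold roughly one node's worth of capacity, and the scheduler preempts them the moment real Airbyte job pods need the room. This is a generic Kubernetes over-provisioning pattern, not an Airbyte feature; all names and resource sizes below are placeholders:

```yaml
# Negative-priority class so the headroom pods are evicted first when real pods arrive.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: airbyte-headroom
value: -10
globalDefault: false
description: "Placeholder capacity for Airbyte job pods"
---
# Pause pods sized roughly like one node's worth of Airbyte job pods (adjust requests to taste).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airbyte-headroom
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airbyte-headroom
  template:
    metadata:
      labels:
        app: airbyte-headroom
    spec:
      priorityClassName: airbyte-headroom
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"        # placeholder; size to roughly match your job pods' requests
              memory: 4Gi
```

The trade-off is that about one node's worth of spare capacity stays up, but real job pods displace the pause pods immediately instead of waiting out the provisioning delay.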

The orchestrator launches a few pods successfully; it’s just when it tries to launch the read and write pods for the sync that it gives up too soon.

Do all 5 attempts of the sync fail?

Actually, I’m looking at some IAM issues; it’s possible my Karpenter nodes don’t have the EBS/EFS policies that they need.

Yeah, if the Karpenter-provisioned nodes don’t have the IAM permissions they need for EBS and EFS, that could be an issue.

I was just reading some of the troubleshooting steps on that GitHub discussion, and someone mentioned EBS permissions. I checked, and it looks like my ASG instance role includes the EBS/EFS/EFSEC2 policies but my Karpenter role doesn’t, so I’m trying to tell Karpenter to provision nodes with the same instance role the ASG uses.

Makes sense. We’ve talked about moving to Karpenter but haven’t had a good use case for it yet
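
For the instance-role fix discussed above: in Karpenter 0.36 the node IAM role comes from the EC2NodeClass, so pointing it at the same role the ASG nodes use (or at a role with the same EBS/EFS CSI policies attached) looks roughly like the sketch below. The node class name, role name, and discovery tags are placeholders from a hypothetical cluster, not values from this thread:

```yaml
# Hypothetical EC2NodeClass; "role" is the IAM role name (not ARN) that
# Karpenter-provisioned nodes will use via a generated instance profile.
# Attach the same EBS/EFS policies here that the ASG instance role carries.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: airbyte-jobs
spec:
  amiFamily: AL2
  role: eks-node-instance-role               # placeholder: same role as the ASG nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

Note that if the EBS/EFS CSI drivers authenticate via IRSA instead of node roles, the permissions live on the drivers' service-account roles, and changing the node role alone won't help.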

Hello :wave:
Same problem here :confused:
Have you found any solution?
Thanks :pray: