Handling Severed Nodes/Pods in Airbyte on Kubernetes Cluster

Summary

A question about modifying the heartbeat and timeout values in the Airbyte Helm chart so that severed nodes/pods on a Kubernetes cluster are handled more gracefully.


Question

Hey all, I’m running Airbyte on a (high availability, on-prem) Kubernetes cluster. We’re having occasional issues where a node goes down unexpectedly, and we can’t cancel any sync jobs that were using pods on that node. I looked over the docs on the heartbeat (3 hr for source pods) and timeout (24 hr for destination pods) values. I’m wondering if anyone’s found a way to modify those values in the Helm chart, or if there’s any way to configure Airbyte to handle severed nodes/pods more gracefully?



This topic has been created from a Slack thread to give it more visibility. It is in read-only mode here; the original thread remains available in the Airbyte Slack.


Tags: airbyte, kubernetes-cluster, helm-chart, heartbeat, timeout, severed-nodes, pods

What version are you using? We switched architectures for controlling job pods a while back, such that we now have a controller service (the workload-api-server) that is the source of truth for job pods. Assuming you are using the new architecture, you should be able to cancel through the UI/API even while the nodes themselves are down.
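
For context on "cancel through the UI/API", here is a rough sketch of cancelling a stuck job by id against the internal Configuration API. The service name, port-forward target, and the exact endpoint path and payload are assumptions to verify against your deployment and Airbyte version (newer versions may also require auth headers):

```bash
# Port-forward the Airbyte server API (service name depends on your Helm release name)
kubectl port-forward svc/airbyte-airbyte-server-svc 8001:8001 &

# Cancel a specific job by id via the Configuration API
# (endpoint path and payload shape assumed; verify against your Airbyte version's API reference)
curl -X POST http://localhost:8001/api/v1/jobs/cancel \
  -H "Content-Type: application/json" \
  -d '{"id": 12345}'
```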

Setting the following env vars will enable this architecture:

WORKLOAD_LAUNCHER_ENABLED: true
WORKLOAD_API_SERVER_ENABLED: true
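
As a rough sketch of how those env vars could be wired in through the Helm chart, the snippet below uses per-component extraEnv entries in values.yaml. The key names, the components that need the vars, and the chart/release names are assumptions rather than confirmed chart schema, so check the values.yaml of the chart version you're running:

```yaml
# values.yaml (illustrative; which components need these vars varies by chart version)
server:
  extraEnv:
    - name: WORKLOAD_LAUNCHER_ENABLED
      value: "true"
    - name: WORKLOAD_API_SERVER_ENABLED
      value: "true"
worker:
  extraEnv:
    - name: WORKLOAD_LAUNCHER_ENABLED
      value: "true"
    - name: WORKLOAD_API_SERVER_ENABLED
      value: "true"
```

This would then be applied with something like helm upgrade airbyte airbyte/airbyte -f values.yaml (release and chart names assumed here).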