Troubleshooting Airbyte v0.57.3 setup on Kubernetes with orchestrator pod timeout

Summary

When CONTAINER_ORCHESTRATOR_ENABLED is set to true in an Airbyte v0.57.3 deployment on Kubernetes, the orchestrator pod times out waiting for a connection. The poster is asking which other env variables the orchestrator needs in order to work.


Question

Hello! I could use some help with an Airbyte v0.57.3 setup on Kubernetes. I’m deploying Airbyte on k8s without using Helm charts and have a partially working setup that connects to an external PostgreSQL database and an external MinIO object storage instance.

When the env variable CONTAINER_ORCHESTRATOR_ENABLED is set to false, I can successfully create a connection and perform a sync. However, the worker pod writes no logs to MinIO (I understand this is by design).

When CONTAINER_ORCHESTRATOR_ENABLED is set to true, a new orchestrator pod is provisioned when a sync is triggered, but the pod times out waiting for a connection with this error:

Attempt 0 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException

(I’ll leave the full error in the first comment of this thread so as to avoid spamming the channel.)

Are there any other known env variables that need to be set for the orchestrator to work? This issue is also similar to the one mentioned here (https://airbytehq.slack.com/archives/C021JANJ6TY/p1708542621743269) a couple of weeks ago.



This topic has been created from a Slack thread to give it more visibility.

["airbyte-v0.57.3", "kubernetes", "orchestrator-pod", "env-variables", "container-orchestrator-enabled"]

Out of curiosity, is there a reason you’re avoiding deploying via Helm?

This is the full error from the logs of the orchestrator pod:

Unsetting empty environment variable 'DATA_PLANE_SERVICE_ACCOUNT_CREDENTIALS_PATH'
Unsetting empty environment variable 'FEATURE_FLAG_CLIENT'
Unsetting empty environment variable 'DD_AGENT_HOST'
Unsetting empty environment variable 'OTEL_COLLECTOR_ENDPOINT'
Unsetting empty environment variable 'JAVA_OPTS'
Unsetting empty environment variable 'DATA_PLANE_SERVICE_ACCOUNT_EMAIL'
Unsetting empty environment variable 'FIELD_SELECTION_WORKSPACES'
Unsetting empty environment variable 'METRIC_CLIENT'
Unsetting empty environment variable 'LAUNCHDARKLY_KEY'
Unsetting empty environment variable 'AIRBYTE_API_AUTH_HEADER_NAME'
Unsetting empty environment variable 'CONTROL_PLANE_AUTH_ENDPOINT'
Unsetting empty environment variable 'DD_DOGSTATSD_PORT'
Unsetting empty environment variable 'AIRBYTE_API_AUTH_HEADER_VALUE'
INFO StatusConsoleListener Loading mask data from '/seed/specs_secrets_mask.yaml

    ___    _      __          __
   /   |  (_)____/ /_  __  __/ /____
  / /| | / / ___/ __ \/ / / / __/ _ \
 / ___ |/ / /  / /_/ / /_/ / /_/  __/
/_/  |_/_/_/  /_.___/\__, /\__/\___/
                    /____/
 : airbyte-container-orchestrator :

2024-05-30 12:30:54 INFO i.m.c.e.DefaultEnvironment(<init>):168 - Established active environments: [k8s, cloud, control-plane]
INFO StatusConsoleListener Loading mask data from '/seed/specs_secrets_mask.yaml
2024-05-30 12:30:56 INFO i.m.r.Micronaut(start):100 - Startup completed in 3989ms. Server Running: http://orchestrator-repl-job-41-attempt-0:9000
2024-05-30 12:30:56 INFO i.a.f.ConfigFileClient(<init>):105 - path /flags does not exist, will return default flag values
2024-05-30 12:30:57 WARN i.a.m.l.MetricClientFactory(initialize):72 - MetricClient was not recognized or not provided. Accepted values are `datadog` or `otel`.
2024-05-30 12:30:58 replication-orchestrator > Writing async status INITIALIZING for KubePodInfo[namespace=dev-airbyte, name=orchestrator-repl-job-41-attempt-0, mainContainerInfo=KubeContainerInfo[image=airbyte/container-orchestrator:0.57.3, pullPolicy=IfNotPresent]]...
2024-05-30 12:30:58 replication-orchestrator > sourceLauncherConfig is: io.airbyte.persistence.job.models.IntegrationLauncherConfig@1d9bd1d6[jobId=41,attemptId=0,connectionId=62502f07-5e5d-45a8-b000-c04f7d926459,workspaceId=b77c6db2-99c5-4319-a576-7dac713d29e4,dockerImage=airbyte/source-greenhouse:0.5.1,normalizationDockerImage=<null>,supportsDbt=false,normalizationIntegrationType=<null>,protocolVersion=Version{version='0.2.0', major='0', minor='2', patch='0'},isCustomConnector=false,allowedHosts=io.airbyte.config.AllowedHosts@18ac4af6[hosts=[harvest.greenhouse.io, *.datadoghq.com, *.datadoghq.eu, *.sentry.io],additionalProperties={}],additionalEnvironmentVariables=<null>,additionalLabels=<null>,priority=<null>,additionalProperties={}]
2024-05-30 12:30:58 replication-orchestrator > Attempt 0 to get the source definition for feature flag checks
2024-05-30 12:30:58 replication-orchestrator > Attempt 0 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
2024-05-30 12:31:01 replication-orchestrator > Attempt 1 to get the source definition for feature flag checks
2024-05-30 12:31:01 replication-orchestrator > Attempt 1 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
2024-05-30 12:31:05 replication-orchestrator > Attempt 2 to get the source definition for feature flag checks
2024-05-30 12:31:05 replication-orchestrator > Attempt 2 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
2024-05-30 12:41:05 replication-orchestrator > Attempt 3 to get the source definition for feature flag checks
2024-05-30 12:41:05 replication-orchestrator > Attempt 3 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
2024-05-30 12:41:05 replication-orchestrator > retryWithJitter caught and ignoring exception:
get the source definition for feature flag checks: java.net.ConnectException
io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
	at io.airbyte.api.client.generated.SourceApi.getSourceWithHttpInfo(SourceApi.java:766) ~[io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.api.client.generated.SourceApi.getSource(SourceApi.java:733) ~[io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.workers.general.ReplicationWorkerFactory.lambda$create$0(ReplicationWorkerFactory.java:149) ~[io.airbyte-airbyte-commons-worker-0.57.3.jar:?]
	at io.airbyte.api.client.AirbyteApiClient.retryWithJitterThrows(AirbyteApiClient.java:292) ~[io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.api.client.AirbyteApiClient.retryWithJitter(AirbyteApiClient.java:232) [io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.api.client.AirbyteApiClient.retryWithJitter(AirbyteApiClient.java:200) [io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.workers.general.ReplicationWorkerFactory.create(ReplicationWorkerFactory.java:148) [io.airbyte-airbyte-commons-worker-0.57.3.jar:?]
	at io.airbyte.container_orchestrator.orchestrator.ReplicationJobOrchestrator.runJob(ReplicationJobOrchestrator.java:118) [io.airbyte-airbyte-container-orchestrator-0.57.3.jar:?]
	at io.airbyte.container_orchestrator.Application.run(Application.java:78) [io.airbyte-airbyte-container-orchestrator-0.57.3.jar:?]
	at io.airbyte.container_orchestrator.Application.main(Application.java:38) [io.airbyte-airbyte-container-orchestrator-0.57.3.jar:?]
Caused by: java.net.ConnectException
	at java.net.http/jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:951) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.HttpClientFacade.send(HttpClientFacade.java:133) ~[java.net.http:?]
	at io.airbyte.api.client.generated.SourceApi.getSourceWithHttpInfo(SourceApi.java:747) ~[io.airbyte-airbyte-api-0.57.3.jar:?]
	... 9 more
Caused by: java.net.ConnectException
	at java.net.http/jdk.internal.net.http.common.Utils.toConnectException(Utils.java:1028) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.connectAsync(PlainHttpConnection.java:227) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.checkRetryConnect(PlainHttpConnection.java:280) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.lambda$connectAsync$2(PlainHttpConnection.java:238) ~[java.net.http:?]
	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[?:?]
Caused by: java.nio.channels.ClosedChannelException
	at java.base/sun.nio.ch.SocketChannelImpl.ensureOpen(SocketChannelImpl.java:202) ~[?:?]
	at java.base/sun.nio.ch.SocketChannelImpl.beginConnect(SocketChannelImpl.java:786) ~[?:?]
	at java.base/sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:874) ~[?:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.lambda$connectAsync$1(PlainHttpConnection.java:210) ~[java.net.http:?]
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.connectAsync(PlainHttpConnection.java:212) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.checkRetryConnect(PlainHttpConnection.java:280) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.lambda$connectAsync$2(PlainHttpConnection.java:238) ~[java.net.http:?]
	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[?:?]
2024-05-30 12:41:05 replication-orchestrator > An error occurred while fetch the max seconds between messages for this source. We are using a default of 24 hours
2024-05-30 12:41:05 replication-orchestrator > Concurrent stream read enabled? false
2024-05-30 12:41:05 replication-orchestrator > Setting up source...
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_MEMORY_LIMIT: '50Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_MEMORY_REQUEST: '25Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_CPU_LIMIT: '2.0'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_CPU_REQUEST: '0.1'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_MEMORY_LIMIT: '50Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_MEMORY_REQUEST: '25Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_CPU_LIMIT: '2.0'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_CPU_REQUEST: '0.1'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_MEMORY_LIMIT: '50Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_MEMORY_REQUEST: '25Mi'
2024-05-30 12:41:05 replication-orchestrator > Setting up destination...
2024-05-30 12:41:05 ERROR i.a.c.Application(run):80 - Killing orchestrator because of an Exception
java.lang.NullPointerException: Parameter specified as non-null is null: method io.airbyte.featureflag.SourceDefinition.<init>, parameter key
	at io.airbyte.featureflag.SourceDefinition.<init>(Context.kt) ~[io.airbyte-airbyte-featureflag-0.57.3.jar:?]
	at io.airbyte.workers.general.ReplicationWorkerFactory.createFieldSelector(ReplicationWorkerFactory.java:301) ~[io.airbyte-airbyte-commons-worker-0.57.3.jar:?]
	at io.airbyte.workers.general.ReplicationWorkerFactory.create(ReplicationWorkerFactory.java:177) ~[io.airbyte-airbyte-commons-worker-0.57.3.jar:?]
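For anyone reading the log above, it can help to see how long the orchestrator waited between retries. The following is only a rough Python sketch: the four "Attempt N" lines are copied from the log, while the function name `attempt_gaps` and the regular expression are illustrative and not part of Airbyte.

```python
import re
from datetime import datetime

# Retry lines copied from the orchestrator log in this thread.
LOG = """\
2024-05-30 12:30:58 replication-orchestrator > Attempt 0 to get the source definition for feature flag checks
2024-05-30 12:31:01 replication-orchestrator > Attempt 1 to get the source definition for feature flag checks
2024-05-30 12:31:05 replication-orchestrator > Attempt 2 to get the source definition for feature flag checks
2024-05-30 12:41:05 replication-orchestrator > Attempt 3 to get the source definition for feature flag checks
"""

# Matches only the "Attempt N ..." lines, not the "... error: ..." lines that follow them.
ATTEMPT = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .* Attempt (\d+) "
    r"to get the source definition for feature flag checks$"
)


def attempt_gaps(log_text):
    """Return the seconds elapsed between consecutive retry attempts."""
    stamps = []
    for line in log_text.splitlines():
        match = ATTEMPT.match(line.strip())
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


print(attempt_gaps(LOG))  # [3, 4, 600]
```

The first retries come a few seconds apart, and the last one comes 600 seconds after attempt 2. That long gap is what makes the pod look like it is hanging. The log itself does not say what produces that wait.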

<@U035912NS77> We’re working on an open-source alternative approach to k8s deployments, as discussed with <@U04197GAK9R>. We’ll announce it on the dev-and-contributions channel once we have something to show.

Yes, I have horizontal autoscaling, and sometimes when there isn’t enough room in the cluster to provision new pods, it takes a few seconds for them to start. The orchestrator seems to time out pretty quickly.

<@U033K0EG51U> I’m deployed on a GKE autopilot cluster, so it can also take a bit for nodes to spin up and become schedulable, but I’m not seeing any errors related to it. (But I did need to make sure that timeouts are set high enough that I don’t get a 502 before it spins up—600 seconds seems to work well for me thus far.)

Hm, for us on GCP it actually lives in the load balancer config for a private cluster (the way they use Network Endpoint Groups/NEGs is a little unique), but unfortunately I’m not as familiar with how that works in EKS (or whether cross-service traffic even passes through the load balancer or not). But I’d start by looking for settings related to the backend timeouts on the load balancer if there is one, as even if this doesn’t apply to the cross-service I/O it will also come up during schema discovery where it’ll time out the client connection while waiting for the orchestrator to spin up the container for discovery. And then maybe check for any timeouts in the service configs as well.
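Since the stack trace bottoms out in a java.net.ConnectException, one way to separate "the service is not reachable at all" from "the service is slow to come up" is to probe the API endpoint at the TCP level from a pod in the same namespace. Below is a minimal Python sketch under stated assumptions: `wait_for_tcp` is a hypothetical helper, the 600-second default simply mirrors the timeout mentioned in this thread, and the host and port in the usage comment are placeholders, not known Airbyte values.

```python
import socket
import time


def wait_for_tcp(host, port, deadline_s=600.0, interval_s=5.0, connect_timeout_s=3.0):
    """Poll host:port until a TCP connection succeeds or deadline_s elapses.

    Returns the seconds waited on success, or None if the deadline passed.
    """
    start = time.monotonic()
    while True:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return time.monotonic() - start
        except OSError:
            # Covers connection refused, DNS failures, and connect timeouts.
            if time.monotonic() - start + interval_s > deadline_s:
                return None
            time.sleep(interval_s)


# Example usage (placeholder service name and port; use the ones from your manifests):
#   waited = wait_for_tcp("my-airbyte-server-svc", 8001)
#   print("not reachable" if waited is None else f"reachable after {waited:.1f}s")
```

If the probe fails immediately and keeps failing, the problem is more likely the address the orchestrator was given than a timeout setting. If it succeeds only after a long wait, the timeout settings discussed above are the more likely place to look.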