Troubleshooting Airbyte v0.57.3 setup on Kubernetes with orchestrator pod timeout

slack-user-airbyte · June 8, 2024, 7:00pm

Summary

When CONTAINER_ORCHESTRATOR_ENABLED is set to true in an Airbyte v0.57.3 setup on Kubernetes, the orchestrator pod times out waiting for a connection. Looking for any other env variables the orchestrator needs in order to work.

Question

Hello! I could use some help with an Airbyte v0.57.3 setup on Kubernetes. I’m currently deploying Airbyte on k8s without using Helm charts and have a partly functional setup, connecting to an external PostgreSQL database and an external MinIO object storage instance.

When the env variable CONTAINER_ORCHESTRATOR_ENABLED is set to false, I can successfully create a connection and perform a sync. However, no logs are written by the worker pod to MinIO (I understand this is by design).

When CONTAINER_ORCHESTRATOR_ENABLED is set to true, a new orchestrator pod is provisioned when a sync is triggered, but the pod times out waiting for a connection with the error "Attempt 0 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException" (I’ll leave the full error in the first comment of this thread so as to avoid spamming the channel).

Are there any other known env variables that need to be set for the orchestrator to work? This issue is also similar to the one mentioned here a couple of weeks ago: https://airbytehq.slack.com/archives/C021JANJ6TY/p1708542621743269

This topic has been created from a Slack thread to give it more visibility.

Tags: airbyte-v0.57.3, kubernetes, orchestrator-pod, env-variables, container-orchestrator-enabled

slack-user-airbyte · June 22, 2024, 6:19am

Out of curiosity, is there a reason you’re avoiding deploying via Helm?

slack-user-airbyte · June 22, 2024, 6:19am

This is the full error from the logs of the orchestrator pod:

Unsetting empty environment variable 'DATA_PLANE_SERVICE_ACCOUNT_CREDENTIALS_PATH'
Unsetting empty environment variable 'FEATURE_FLAG_CLIENT'
Unsetting empty environment variable 'DD_AGENT_HOST'
Unsetting empty environment variable 'OTEL_COLLECTOR_ENDPOINT'
Unsetting empty environment variable 'JAVA_OPTS'
Unsetting empty environment variable 'DATA_PLANE_SERVICE_ACCOUNT_EMAIL'
Unsetting empty environment variable 'FIELD_SELECTION_WORKSPACES'
Unsetting empty environment variable 'METRIC_CLIENT'
Unsetting empty environment variable 'LAUNCHDARKLY_KEY'
Unsetting empty environment variable 'AIRBYTE_API_AUTH_HEADER_NAME'
Unsetting empty environment variable 'CONTROL_PLANE_AUTH_ENDPOINT'
Unsetting empty environment variable 'DD_DOGSTATSD_PORT'
Unsetting empty environment variable 'AIRBYTE_API_AUTH_HEADER_VALUE'
INFO StatusConsoleListener Loading mask data from '/seed/specs_secrets_mask.yaml

    ___    _      __          __
   /   |  (_)____/ /_  __  __/ /____
  / /| | / / ___/ __ \/ / / / __/ _ \
 / ___ |/ / /  / /_/ / /_/ / /_/  __/
/_/  |_/_/_/  /_.___/\__, /\__/\___/
                    /____/
 : airbyte-container-orchestrator :

2024-05-30 12:30:54 INFO i.m.c.e.DefaultEnvironment(<init>):168 - Established active environments: [k8s, cloud, control-plane]
INFO StatusConsoleListener Loading mask data from '/seed/specs_secrets_mask.yaml
2024-05-30 12:30:56 INFO i.m.r.Micronaut(start):100 - Startup completed in 3989ms. Server Running: http://orchestrator-repl-job-41-attempt-0:9000
2024-05-30 12:30:56 INFO i.a.f.ConfigFileClient(<init>):105 - path /flags does not exist, will return default flag values
2024-05-30 12:30:57 WARN i.a.m.l.MetricClientFactory(initialize):72 - MetricClient was not recognized or not provided. Accepted values are `datadog` or `otel`.
2024-05-30 12:30:58 replication-orchestrator > Writing async status INITIALIZING for KubePodInfo[namespace=dev-airbyte, name=orchestrator-repl-job-41-attempt-0, mainContainerInfo=KubeContainerInfo[image=airbyte/container-orchestrator:0.57.3, pullPolicy=IfNotPresent]]...
2024-05-30 12:30:58 replication-orchestrator > sourceLauncherConfig is: io.airbyte.persistence.job.models.IntegrationLauncherConfig@1d9bd1d6[jobId=41,attemptId=0,connectionId=62502f07-5e5d-45a8-b000-c04f7d926459,workspaceId=b77c6db2-99c5-4319-a576-7dac713d29e4,dockerImage=airbyte/source-greenhouse:0.5.1,normalizationDockerImage=<null>,supportsDbt=false,normalizationIntegrationType=<null>,protocolVersion=Version{version='0.2.0', major='0', minor='2', patch='0'},isCustomConnector=false,allowedHosts=io.airbyte.config.AllowedHosts@18ac4af6[hosts=[harvest.greenhouse.io, *.datadoghq.com, *.datadoghq.eu, *.sentry.io],additionalProperties={}],additionalEnvironmentVariables=<null>,additionalLabels=<null>,priority=<null>,additionalProperties={}]
2024-05-30 12:30:58 replication-orchestrator > Attempt 0 to get the source definition for feature flag checks
2024-05-30 12:30:58 replication-orchestrator > Attempt 0 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
2024-05-30 12:31:01 replication-orchestrator > Attempt 1 to get the source definition for feature flag checks
2024-05-30 12:31:01 replication-orchestrator > Attempt 1 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
2024-05-30 12:31:05 replication-orchestrator > Attempt 2 to get the source definition for feature flag checks
2024-05-30 12:31:05 replication-orchestrator > Attempt 2 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
2024-05-30 12:41:05 replication-orchestrator > Attempt 3 to get the source definition for feature flag checks
2024-05-30 12:41:05 replication-orchestrator > Attempt 3 to get the source definition for feature flag checks error: io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
2024-05-30 12:41:05 replication-orchestrator > retryWithJitter caught and ignoring exception:
get the source definition for feature flag checks: java.net.ConnectException
io.airbyte.api.client.invoker.generated.ApiException: java.net.ConnectException
	at io.airbyte.api.client.generated.SourceApi.getSourceWithHttpInfo(SourceApi.java:766) ~[io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.api.client.generated.SourceApi.getSource(SourceApi.java:733) ~[io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.workers.general.ReplicationWorkerFactory.lambda$create$0(ReplicationWorkerFactory.java:149) ~[io.airbyte-airbyte-commons-worker-0.57.3.jar:?]
	at io.airbyte.api.client.AirbyteApiClient.retryWithJitterThrows(AirbyteApiClient.java:292) ~[io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.api.client.AirbyteApiClient.retryWithJitter(AirbyteApiClient.java:232) [io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.api.client.AirbyteApiClient.retryWithJitter(AirbyteApiClient.java:200) [io.airbyte-airbyte-api-0.57.3.jar:?]
	at io.airbyte.workers.general.ReplicationWorkerFactory.create(ReplicationWorkerFactory.java:148) [io.airbyte-airbyte-commons-worker-0.57.3.jar:?]
	at io.airbyte.container_orchestrator.orchestrator.ReplicationJobOrchestrator.runJob(ReplicationJobOrchestrator.java:118) [io.airbyte-airbyte-container-orchestrator-0.57.3.jar:?]
	at io.airbyte.container_orchestrator.Application.run(Application.java:78) [io.airbyte-airbyte-container-orchestrator-0.57.3.jar:?]
	at io.airbyte.container_orchestrator.Application.main(Application.java:38) [io.airbyte-airbyte-container-orchestrator-0.57.3.jar:?]
Caused by: java.net.ConnectException
	at java.net.http/jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:951) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.HttpClientFacade.send(HttpClientFacade.java:133) ~[java.net.http:?]
	at io.airbyte.api.client.generated.SourceApi.getSourceWithHttpInfo(SourceApi.java:747) ~[io.airbyte-airbyte-api-0.57.3.jar:?]
	... 9 more
Caused by: java.net.ConnectException
	at java.net.http/jdk.internal.net.http.common.Utils.toConnectException(Utils.java:1028) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.connectAsync(PlainHttpConnection.java:227) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.checkRetryConnect(PlainHttpConnection.java:280) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.lambda$connectAsync$2(PlainHttpConnection.java:238) ~[java.net.http:?]
	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[?:?]
Caused by: java.nio.channels.ClosedChannelException
	at java.base/sun.nio.ch.SocketChannelImpl.ensureOpen(SocketChannelImpl.java:202) ~[?:?]
	at java.base/sun.nio.ch.SocketChannelImpl.beginConnect(SocketChannelImpl.java:786) ~[?:?]
	at java.base/sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:874) ~[?:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.lambda$connectAsync$1(PlainHttpConnection.java:210) ~[java.net.http:?]
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.connectAsync(PlainHttpConnection.java:212) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.checkRetryConnect(PlainHttpConnection.java:280) ~[java.net.http:?]
	at java.net.http/jdk.internal.net.http.PlainHttpConnection.lambda$connectAsync$2(PlainHttpConnection.java:238) ~[java.net.http:?]
	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[?:?]
2024-05-30 12:41:05 replication-orchestrator > An error occurred while fetch the max seconds between messages for this source. We are using a default of 24 hours
2024-05-30 12:41:05 replication-orchestrator > Concurrent stream read enabled? false
2024-05-30 12:41:05 replication-orchestrator > Setting up source...
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_MEMORY_LIMIT: '50Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_MEMORY_REQUEST: '25Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_CPU_LIMIT: '2.0'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_CPU_REQUEST: '0.1'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_MEMORY_LIMIT: '50Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_MEMORY_REQUEST: '25Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_CPU_LIMIT: '2.0'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_CPU_REQUEST: '0.1'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_KUBE_MEMORY_LIMIT: '50Mi'
2024-05-30 12:41:05 replication-orchestrator > Using default value for environment variable SIDECAR_MEMORY_REQUEST: '25Mi'
2024-05-30 12:41:05 replication-orchestrator > Setting up destination...
2024-05-30 12:41:05 ERROR i.a.c.Application(run):80 - Killing orchestrator because of an Exception
java.lang.NullPointerException: Parameter specified as non-null is null: method io.airbyte.featureflag.SourceDefinition.<init>, parameter key
	at io.airbyte.featureflag.SourceDefinition.<init>(Context.kt) ~[io.airbyte-airbyte-featureflag-0.57.3.jar:?]
	at io.airbyte.workers.general.ReplicationWorkerFactory.createFieldSelector(ReplicationWorkerFactory.java:301) ~[io.airbyte-airbyte-commons-worker-0.57.3.jar:?]
	at io.airbyte.workers.general.ReplicationWorkerFactory.create(ReplicationWorkerFactory.java:177) ~[io.airbyte-airbyte-commons-worker-0.57.3.jar:?]

slack-user-airbyte · June 22, 2024, 6:19am

<@U035912NS77> We’re working on an open-source alternative approach to k8s deployments, as discussed with <@U04197GAK9R>. We’ll announce it on the dev-and-contributions channel once we have something to show.

slack-user-airbyte · June 23, 2024, 6:19am

Yes, I have horizontal autoscaling. Sometimes, when there is not enough capacity in the cluster to provision new pods, they take some seconds to start, and the orchestrator seems to time out pretty fast.

slack-user-airbyte · June 23, 2024, 6:20am

<@U033K0EG51U> I’m deployed on a GKE autopilot cluster, so it can also take a bit for nodes to spin up and become schedulable, but I’m not seeing any errors related to it. (But I did need to make sure that timeouts are set high enough that I don’t get a 502 before it spins up—600 seconds seems to work well for me thus far.)

slack-user-airbyte · June 23, 2024, 6:20am

Hm, for us on GCP it actually lives in the load balancer config for a private cluster (the way they use Network Endpoint Groups/NEGs is a little unique), but unfortunately I’m not as familiar with how that works in EKS (or whether cross-service traffic even passes through the load balancer or not). But I’d start by looking for settings related to the backend timeouts on the load balancer if there is one, as even if this doesn’t apply to the cross-service I/O it will also come up during schema discovery where it’ll time out the client connection while waiting for the orchestrator to spin up the container for discovery. And then maybe check for any timeouts in the service configs as well.

slack-user-airbyte · June 24, 2024, 6:17am

Having the same issue… it succeeds after a couple of retries, but some attempts usually fail because the orchestrator pod times out waiting for the other pods to start.
Did you find a workaround?

slack-user-airbyte · June 24, 2024, 6:17am

This was an issue because the orchestrator pod was not able to connect to the other pods in the namespace. I had some of the relevant environment variables (e.g. INTERNAL_API_HOST and AIRBYTE_SERVER_HOST) set to localhost:<service_port>. Setting these explicitly to <k8s_service_name>:<service_port> resolved the issue for me and allowed the orchestrator pod to talk to the other services it needs in order to fetch the required information.

In your case, it sounds like it’s failing because the other pods/services are not ready yet.
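
As a rough illustration of the fix described above, the following Python sketch flags host variables that still point at loopback, which inside an orchestrator pod resolves to the pod itself rather than to the Airbyte server. The two variable names come from this thread; the function name, the port 8001, and the service name airbyte-server-svc are placeholders chosen for the example, not values confirmed here.

```python
import os
from typing import Mapping

# Variables named in the post above; extend the tuple for your own deployment.
HOST_VARS = ("INTERNAL_API_HOST", "AIRBYTE_SERVER_HOST")

# Hostnames that resolve to the pod itself rather than to another service.
LOOPBACK_NAMES = {"localhost", "127.0.0.1"}


def find_loopback_hosts(env: Mapping[str, str], names=HOST_VARS) -> dict:
    """Return {variable: value} for host variables that point at loopback."""
    flagged = {}
    for name in names:
        value = env.get(name, "").strip()
        if not value:
            continue  # unset or empty: nothing to check
        # Drop an optional scheme and path, then keep the part before the port.
        without_scheme = value.split("://", 1)[-1]
        host = without_scheme.split("/", 1)[0].partition(":")[0]
        if host.lower() in LOOPBACK_NAMES:
            flagged[name] = value
    return flagged


if __name__ == "__main__":
    # Run inside the pod (or against a dumped env) to list suspicious values.
    for var, val in find_loopback_hosts(os.environ).items():
        print(f"{var}={val} points at loopback; use <k8s_service_name>:<service_port>")
```

For example, a mapping with INTERNAL_API_HOST set to localhost:8001 and AIRBYTE_SERVER_HOST set to airbyte-server-svc:8001 would flag only the first variable.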

slack-user-airbyte · June 24, 2024, 6:17am

Great! Where did you find that setting? I’m deploying with Helm on EKS.

slack-user-airbyte · June 24, 2024, 6:17am

thx I’ll see what I find
