Temporal Scheduler Causing Multiple Syncs to Hang

  • Is this your first time deploying Airbyte?: No
  • Memory / Disk: c5a.xlarge EC2 instance - 8 GiB of Memory, 4 vCPUs
  • Deployment: Kube
  • Airbyte Version: 0.39.7-alpha
  • Source name/version: Mindbody (custom source)
  • Destination name/version: Snowflake (0.4.26)
  • Step: sync
  • Description: We upgraded from 0.38.4 → 0.39.7 for the new scheduler (we had been on the default previously and did not flip the feature flag), and now a number of our syncs are getting hung up. Here’s an example log for a sync that normally takes <3 minutes:
2022-06-02 05:02:36 INFO i.a.w.t.TemporalAttemptExecution(get):110 - Cloud storage job log path: /workspace/2461/0/logs.log
2022-06-02 05:02:36 INFO i.a.w.t.TemporalAttemptExecution(get):113 - Executing worker wrapper. Airbyte version: 0.39.7-alpha
2022-06-02 05:02:36 INFO i.a.w.p.KubeProcessFactory(create):100 - Attempting to start pod = source-mindbody-check-2461-0-odrkb for lukeolson13/source-mindbody:0.1.24
2022-06-02 05:02:36 INFO i.a.w.p.KubeProcessFactory(create):103 - source-mindbody-check-2461-0-odrkb stdoutLocalPort = 9014
2022-06-02 05:02:36 INFO i.a.w.p.KubeProcessFactory(create):106 - source-mindbody-check-2461-0-odrkb stderrLocalPort = 9015
2022-06-02 05:02:36 INFO i.a.w.p.KubePodProcess(lambda$setupStdOutAndStdErrListeners$9):583 - Creating stdout socket server...
2022-06-02 05:02:36 INFO i.a.w.p.KubePodProcess(<init>):515 - Creating pod source-mindbody-check-2461-0-odrkb...
2022-06-02 05:02:36 INFO i.a.w.p.KubePodProcess(lambda$setupStdOutAndStdErrListeners$10):601 - Creating stderr socket server...
2022-06-02 05:02:36 INFO i.a.w.p.KubePodProcess(waitForInitPodToRun):305 - Waiting for init container to be ready before copying files...
2022-06-02 05:07:36 INFO i.a.w.t.TemporalAttemptExecution(lambda$getWorkerThread$2):161 - Completing future exceptionally...
io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:86) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.exception.WorkerException: Timed out waiting for [300000] milliseconds for [Pod] with name:[source-mindbody-check-2461-0-odrkb] in namespace [default].
	at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:138) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 3 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [300000] milliseconds for [Pod] with name:[source-mindbody-check-2461-0-odrkb] in namespace [default].
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:860) ~[kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:83) ~[kubernetes-client-5.12.2.jar:?]
	at io.airbyte.workers.process.KubePodProcess.waitForInitPodToRun(KubePodProcess.java:308) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.KubePodProcess.<init>(KubePodProcess.java:544) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:134) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 3 more
2022-06-02 05:07:36 INFO i.a.w.t.TemporalAttemptExecution(get):134 - Stopping cancellation check scheduling...
2022-06-02 05:07:36 WARN i.t.i.a.POJOActivityTaskHandler(activityFailureToResult):307 - Activity failure. ActivityId=8460eceb-8cb8-3778-8e5d-11f5caae584f, activityType=Run, attempt=1
java.util.concurrent.ExecutionException: io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073) ~[?:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.get(TemporalAttemptExecution.java:132) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.check.connection.CheckConnectionActivityImpl.run(CheckConnectionActivityImpl.java:76) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at jdk.internal.reflect.GeneratedMethodAccessor616.invoke(Unknown Source) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
	at io.temporal.internal.activity.POJOActivityTaskHandler$POJOActivityInboundCallsInterceptor.execute(POJOActivityTaskHandler.java:214) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.activity.POJOActivityTaskHandler$POJOActivityImplementation.execute(POJOActivityTaskHandler.java:180) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.activity.POJOActivityTaskHandler.handle(POJOActivityTaskHandler.java:120) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:204) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:164) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93) ~[temporal-sdk-1.8.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:86) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 1 more
Caused by: io.airbyte.workers.exception.WorkerException: Timed out waiting for [300000] milliseconds for [Pod] with name:[source-mindbody-check-2461-0-odrkb] in namespace [default].
	at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:138) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 1 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [300000] milliseconds for [Pod] with name:[source-mindbody-check-2461-0-odrkb] in namespace [default].
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:860) ~[kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:83) ~[kubernetes-client-5.12.2.jar:?]
	at io.airbyte.workers.process.KubePodProcess.waitForInitPodToRun(KubePodProcess.java:308) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.KubePodProcess.<init>(KubePodProcess.java:544) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:134) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 1 more

At the time this log was dumped, the sync had been running for >15 hours. We kick off 27 syncs at the same time each night (same source and destination, just different configs) and had never had issues with concurrency or queueing before. On the first batch run after the upgrade, 4 of those 27 jobs got hung up, all with the same error in their logs before hanging.

Here’s a log from a SQL Server source (0.3.22) connection, in case it’s helpful too:

2022-06-02 19:11:27 INFO i.a.w.t.TemporalAttemptExecution(get):110 - Cloud storage job log path: /workspace/2478/0/logs.log
2022-06-02 19:11:27 INFO i.a.w.t.TemporalAttemptExecution(get):113 - Executing worker wrapper. Airbyte version: 0.39.7-alpha
2022-06-02 19:11:27 INFO i.a.w.p.KubeProcessFactory(create):100 - Attempting to start pod = source-mssql-check-2478-0-myrts for airbyte/source-mssql:0.3.22
2022-06-02 19:11:27 INFO i.a.w.p.KubeProcessFactory(create):103 - source-mssql-check-2478-0-myrts stdoutLocalPort = 9024
2022-06-02 19:11:27 INFO i.a.w.p.KubeProcessFactory(create):106 - source-mssql-check-2478-0-myrts stderrLocalPort = 9025
2022-06-02 19:11:27 INFO i.a.w.p.KubePodProcess(lambda$setupStdOutAndStdErrListeners$9):583 - Creating stdout socket server...
2022-06-02 19:11:27 INFO i.a.w.p.KubePodProcess(<init>):515 - Creating pod source-mssql-check-2478-0-myrts...
2022-06-02 19:11:27 INFO i.a.w.p.KubePodProcess(lambda$setupStdOutAndStdErrListeners$10):601 - Creating stderr socket server...
2022-06-02 19:11:27 INFO i.a.w.p.KubePodProcess(waitForInitPodToRun):305 - Waiting for init container to be ready before copying files...
2022-06-02 19:11:27 INFO i.a.w.p.KubePodProcess(waitForInitPodToRun):309 - Init container present..
2022-06-02 19:11:28 INFO i.a.w.p.KubePodProcess(waitForInitPodToRun):312 - Init container ready..
2022-06-02 19:11:28 INFO i.a.w.p.KubePodProcess(<init>):546 - Copying files...
2022-06-02 19:11:28 INFO i.a.w.p.KubePodProcess(copyFilesToKubeConfigVolume):254 - Uploading file: source_config.json
2022-06-02 19:11:28 INFO i.a.w.p.KubePodProcess(copyFilesToKubeConfigVolume):262 - kubectl cp /tmp/8d1d8180-e3cd-4a2c-819e-c4cece19fc50/source_config.json default/source-mssql-check-2478-0-myrts:/config/source_config.json -c init
2022-06-02 19:11:28 INFO i.a.w.p.KubePodProcess(copyFilesToKubeConfigVolume):265 - Waiting for kubectl cp to complete
2022-06-02 19:11:29 INFO i.a.w.p.KubePodProcess(copyFilesToKubeConfigVolume):279 - kubectl cp complete, closing process
2022-06-02 19:11:29 INFO i.a.w.p.KubePodProcess(copyFilesToKubeConfigVolume):254 - Uploading file: FINISHED_UPLOADING
2022-06-02 19:11:29 INFO i.a.w.p.KubePodProcess(copyFilesToKubeConfigVolume):262 - kubectl cp /tmp/de6c3e78-8911-4c32-a8d0-d3222bf032f0/FINISHED_UPLOADING default/source-mssql-check-2478-0-myrts:/config/FINISHED_UPLOADING -c init
2022-06-02 19:11:29 INFO i.a.w.p.KubePodProcess(copyFilesToKubeConfigVolume):265 - Waiting for kubectl cp to complete
2022-06-02 19:11:29 INFO i.a.w.p.KubePodProcess(copyFilesToKubeConfigVolume):279 - kubectl cp complete, closing process
2022-06-02 19:11:29 INFO i.a.w.p.KubePodProcess(<init>):549 - Waiting until pod is ready...
2022-06-02 19:11:30 INFO i.a.w.p.KubePodProcess(lambda$setupStdOutAndStdErrListeners$9):592 - Setting stdout...
2022-06-02 19:11:30 INFO i.a.w.p.KubePodProcess(lambda$setupStdOutAndStdErrListeners$10):604 - Setting stderr...
2022-06-02 19:11:30 INFO i.a.w.p.KubePodProcess(<init>):564 - Reading pod IP...
2022-06-02 19:11:30 INFO i.a.w.p.KubePodProcess(<init>):566 - Pod IP: 10.226.4.239
2022-06-02 19:11:30 INFO i.a.w.p.KubePodProcess(<init>):573 - Using null stdin output stream...
2022-06-02 19:11:30 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Class path contains multiple SLF4J bindings.
2022-06-02 19:11:30 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Found binding in [jar:file:/airbyte/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-06-02 19:11:30 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Found binding in [jar:file:/airbyte/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-06-02 19:11:30 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2022-06-02 19:11:30 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2022-06-02 19:11:31 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:31 INFO i.a.i.s.m.MssqlSource(main):450 - starting source: class io.airbyte.integrations.source.mssql.MssqlSource
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 INFO i.a.i.b.IntegrationCliParser(parseOptions):118 - integration args: {check=null, config=source_config.json}
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 INFO i.a.i.b.IntegrationRunner(runInternal):121 - Running integration: io.airbyte.integrations.base.ssh.SshWrappedSource
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 INFO i.a.i.b.IntegrationRunner(runInternal):122 - Command: CHECK
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 INFO i.a.i.b.IntegrationRunner(runInternal):123 - Integration config: IntegrationConfig{command=CHECK, configPath='source_config.json', catalogPath='null', statePath='null'}
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 WARN c.n.s.JsonMetaSchema(newValidator):338 - Unknown keyword order - you should define your own Meta Schema. If the keyword is irrelevant for validation, just use a NonValidationKeyword
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 WARN c.n.s.JsonMetaSchema(newValidator):338 - Unknown keyword examples - you should define your own Meta Schema. If the keyword is irrelevant for validation, just use a NonValidationKeyword
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 WARN c.n.s.JsonMetaSchema(newValidator):338 - Unknown keyword airbyte_secret - you should define your own Meta Schema. If the keyword is irrelevant for validation, just use a NonValidationKeyword
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 WARN c.n.s.JsonMetaSchema(newValidator):338 - Unknown keyword multiline - you should define your own Meta Schema. If the keyword is irrelevant for validation, just use a NonValidationKeyword
2022-06-02 19:11:32 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:32 INFO i.a.i.b.s.SshTunnel(getInstance):170 - Starting connection with method: NO_TUNNEL
2022-06-02 19:11:33 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-02 19:11:33 INFO i.a.i.s.j.AbstractJdbcSource(lambda$getCheckOperations$1):89 - Attempting to get metadata from the database to see if we can connect.
2022-06-02 19:11:34 INFO i.a.w.p.KubePodProcess(close):713 - (pod: default / source-mssql-check-2478-0-myrts) - Closed all resources for pod
2022-06-02 19:11:34 INFO i.a.w.t.TemporalAttemptExecution(get):134 - Stopping cancellation check scheduling...
2022-06-02 19:11:35 INFO i.a.w.t.TemporalAttemptExecution(get):110 - Cloud storage job log path: /workspace/2478/0/logs.log
2022-06-02 19:11:35 INFO i.a.w.t.TemporalAttemptExecution(get):113 - Executing worker wrapper. Airbyte version: 0.39.7-alpha
2022-06-02 19:11:35 INFO i.a.w.p.KubeProcessFactory(create):100 - Attempting to start pod = destination-snowflake-check-2478-0-eebzj for airbyte/destination-snowflake:0.4.26
2022-06-02 19:11:35 INFO i.a.w.p.KubeProcessFactory(create):103 - destination-snowflake-check-2478-0-eebzj stdoutLocalPort = 9022
2022-06-02 19:11:35 INFO i.a.w.p.KubeProcessFactory(create):106 - destination-snowflake-check-2478-0-eebzj stderrLocalPort = 9023
2022-06-02 19:11:35 INFO i.a.w.p.KubePodProcess(lambda$setupStdOutAndStdErrListeners$9):583 - Creating stdout socket server...
2022-06-02 19:11:35 INFO i.a.w.p.KubePodProcess(<init>):515 - Creating pod destination-snowflake-check-2478-0-eebzj...
2022-06-02 19:11:35 INFO i.a.w.p.KubePodProcess(lambda$setupStdOutAndStdErrListeners$10):601 - Creating stderr socket server...
2022-06-02 19:11:35 INFO i.a.w.p.KubePodProcess(waitForInitPodToRun):305 - Waiting for init container to be ready before copying files...
2022-06-02 19:16:35 INFO i.a.w.t.TemporalAttemptExecution(lambda$getWorkerThread$2):161 - Completing future exceptionally...
io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:86) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.exception.WorkerException: Timed out waiting for [300000] milliseconds for [Pod] with name:[destination-snowflake-check-2478-0-eebzj] in namespace [default].
	at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:138) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 3 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [300000] milliseconds for [Pod] with name:[destination-snowflake-check-2478-0-eebzj] in namespace [default].
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:860) ~[kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:83) ~[kubernetes-client-5.12.2.jar:?]
	at io.airbyte.workers.process.KubePodProcess.waitForInitPodToRun(KubePodProcess.java:308) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.KubePodProcess.<init>(KubePodProcess.java:544) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:134) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 3 more
2022-06-02 19:16:35 INFO i.a.w.t.TemporalAttemptExecution(get):134 - Stopping cancellation check scheduling...
2022-06-02 19:16:35 WARN i.t.i.a.POJOActivityTaskHandler(activityFailureToResult):307 - Activity failure. ActivityId=06b7068a-67f4-317e-a39a-9ae99f236d3b, activityType=Run, attempt=1
java.util.concurrent.ExecutionException: io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073) ~[?:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.get(TemporalAttemptExecution.java:132) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.check.connection.CheckConnectionActivityImpl.run(CheckConnectionActivityImpl.java:76) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at jdk.internal.reflect.GeneratedMethodAccessor616.invoke(Unknown Source) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
	at io.temporal.internal.activity.POJOActivityTaskHandler$POJOActivityInboundCallsInterceptor.execute(POJOActivityTaskHandler.java:214) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.activity.POJOActivityTaskHandler$POJOActivityImplementation.execute(POJOActivityTaskHandler.java:180) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.activity.POJOActivityTaskHandler.handle(POJOActivityTaskHandler.java:120) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:204) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:164) ~[temporal-sdk-1.8.1.jar:?]
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93) ~[temporal-sdk-1.8.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:86) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 1 more
Caused by: io.airbyte.workers.exception.WorkerException: Timed out waiting for [300000] milliseconds for [Pod] with name:[destination-snowflake-check-2478-0-eebzj] in namespace [default].
	at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:138) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 1 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [300000] milliseconds for [Pod] with name:[destination-snowflake-check-2478-0-eebzj] in namespace [default].
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:860) ~[kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:83) ~[kubernetes-client-5.12.2.jar:?]
	at io.airbyte.workers.process.KubePodProcess.waitForInitPodToRun(KubePodProcess.java:308) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.KubePodProcess.<init>(KubePodProcess.java:544) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:134) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.7-alpha.jar:?]
	... 1 more

This connection isn’t kicked off alongside a large batch like the one above, but given the other hangs, there were probably 5-10 syncs either running or queued when this one hung.

Hi @luke,
Could you please explain what your upgrade procedure was?
Did you follow the instructions from this page?
According to the description in the User Impact section of this PR, the transition should be smooth and nothing specific needs to be done on the user side.
Could you please share your server and scheduler logs? There might be helpful information in these files.

Our devops guy is out today, but I can confirm next week, although this is our 3rd or 4th upgrade and we haven’t had any issues previously.
In the meantime, here are the server logs starting just before the upgrade and here are the scheduler logs, although they appear to stop after the upgrade.

Hi @luke

One of the recent changes to Airbyte is that the “check connection” operation is now run for both the source and the destination at the beginning of each sync, in order to detect configuration issues before actually trying to sync data. This means a couple of extra pods are created for each sync to run these check operations.

From your logs, it looks like this check-connection operation is where things are failing. This error:

Caused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [300000] milliseconds for [Pod] with name:[source-mindbody-check-2461-0-odrkb] in namespace [default].

means that the check-connection pod could not be started in your Kubernetes cluster within the 5-minute (300,000 ms) timeout. When we have seen this in the past, it has usually been because the cluster did not have enough resources/capacity to schedule the pod. Since all of your syncs are kicked off at the same time, that is likely a contributing factor: every sync is trying to launch check-connection and sync job pods simultaneously, which leads to resource contention.
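If you want to confirm this while the nightly batch is running, a couple of kubectl commands will show whether the check pods are stuck in Pending and why the scheduler could not place them. This is just a sketch; the namespace is the one from your logs, and the pod name is the example from your log, so substitute whichever pod is currently stuck:

# List connector pods the scheduler has not been able to place yet
kubectl get pods -n default --field-selector=status.phase=Pending

# The Events section usually says exactly what is missing (e.g. "Insufficient cpu")
kubectl describe pod -n default source-mindbody-check-2461-0-odrkb | grep -A 10 Events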

For any connections that seem stuck, you can retry them by cancelling the running job (if there is one) and/or clicking the Sync Now button. My guess is that retrying the few syncs that are stuck while the rest of the 27 jobs are not running will not run into this same issue again, since there will not be as much resource contention.
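If a lot of connections are stuck at once, cancelling the running jobs through the API can be faster than clicking through the UI. Here is a rough sketch of what that could look like; the host/port, connection IDs, and the exact response fields are placeholders/assumptions based on the standard config API, so please double-check them against the API docs for your version:

#!/usr/bin/env bash
# Sketch: cancel any sync job still marked "running" for a list of connections.
# Assumes the Airbyte config API is reachable and that jq is installed.
API="http://localhost:8001/api/v1"   # placeholder: your airbyte-server address

CONNECTION_IDS=(
  "11111111-1111-1111-1111-111111111111"   # placeholder connection IDs
  "22222222-2222-2222-2222-222222222222"
)

for conn in "${CONNECTION_IDS[@]}"; do
  # List recent sync jobs for this connection and keep only the ones still running
  running=$(curl -s -X POST "$API/jobs/list" \
      -H "Content-Type: application/json" \
      -d "{\"configTypes\": [\"sync\"], \"configId\": \"$conn\"}" \
    | jq -r '.jobs[] | select(.job.status == "running") | .job.id')

  for job_id in $running; do
    echo "Cancelling job $job_id on connection $conn"
    curl -s -X POST "$API/jobs/cancel" \
      -H "Content-Type: application/json" \
      -d "{\"id\": $job_id}" > /dev/null
  done
done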

For a longer-term fix, you will likely need to either spread out your syncs so that they are not running at the same time and fighting over resources on your node, or use a larger instance that is able to handle all of the pods for all 27 syncs at the same time.
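To decide between those two options, it is worth checking how much of the node’s allocatable CPU and memory is already requested while the batch is running; if the “Allocated resources” numbers are close to 100%, either staggering the schedules or moving to a larger node is the fix. A quick sketch (the node name is a placeholder, and kubectl top requires metrics-server to be installed):

kubectl top nodes
kubectl describe node <your-node-name> | grep -A 8 "Allocated resources"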

Thanks @lmossman, that makes sense. I’ll try spreading out the syncs a bit and see how it goes.
A couple of other questions though:

  1. Is there a way to turn off the connection checks, either at the system or connection level? Or to set the cadence at which the checks run? It seems like a nice feature for validating a once-a-day sync, but we may get to the point where we’re running some of these every 10 minutes or so, and a source + destination check before every sync would just be unnecessary overhead in that scenario.
  2. Your suggestion might address the root cause, but assuming this continues to be an issue (or others run into it), it doesn’t address the fact that syncs hang and require a user to actually check the sync status to prevent further missed syncs. Are there any open issues or plans to fix that (i.e. auto-kill after a period of inactivity, or at least some kind of notification)?

@luke thank you for the suggestions! I will raise these with the team to figure out how we should address this.

@luke I’ve opened an issue to brainstorm how we might relax CHECKING before every Sync - please add any ideas you might have! Allow users to relax how often CHECK is run before each Sync · Issue #13459 · airbytehq/airbyte · GitHub

Just to update everyone here: we spaced out the syncs a bit and didn’t see any improvement. In fact, almost every single enabled connection had a stuck sync over the weekend (we had to script killing all of them via the API).
However, one thing we did notice is that we had previously bumped our sync parallelization to 10 (via MAX_SYNC_WORKERS) but left the number of check workers at the default of 5. Our theory is that, with the new scheduler, MAX_CHECK_WORKERS is no longer independent of syncs, so MAX_SYNC_WORKERS needs to equal MAX_CHECK_WORKERS.
Once we made that change (set both to 10) and got back into a non-hung state (with the same resources allocated on the cluster), we have only seen one sync hang with the “waiting on check pod” log since. I can’t confirm this is definitely the cause, given that one sync did still hang, but I can say things are in a much better place now.
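For anyone wanting to try the same thing, the change itself is just two environment variables on the worker. A minimal sketch of applying it; the namespace and deployment name are assumptions (use whatever your install calls them), and the values should also be mirrored in your .env / worker env config map so they survive a redeploy:

kubectl -n default set env deployment/airbyte-worker \
  MAX_SYNC_WORKERS=10 \
  MAX_CHECK_WORKERS=10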

We are having a similar problem after upgrading from 0.38.x to 0.39.17. We’ve had a few syncs hang for up to 10 days. Cancelling fixes it, but it would be nice if Airbyte would just restart them for us. Here are the logs for a couple of examples (I took these logs on the 11th of July, just as a reference for how long they had been hanging):
logs-45338.txt (173 Bytes)
logs-41966.txt (17.2 KB)
logs-45337.txt (173 Bytes)
logs-45336.txt (150 Bytes)
logs-40581.txt (9.0 KB)
I will look into the MAX_WORKERS vars and spreading out the syncs more to see if that will be any help.

A couple of updates here on what we did that might be helpful:

  • The MAX_SYNC_WORKERS change doesn’t appear to have done anything - we later realized we could temporarily fix the issue by restarting the Airbyte pods, so I suspect we simply restarted the pods after the config change and that bought us a few days without hangs.
  • We updated to 0.39.21, which doesn’t fix the issue completely, but jobs at least start failing now instead of hanging indefinitely. Retries don’t help, though; essentially the whole environment goes into a bad state at some point and every new sync fails after hanging for a bit.
  • We’ve “fixed” this issue by running a daily job that completely restarts the entire instance (all pods). It’s not ideal, since it kills any running jobs, but we haven’t seen any hanging jobs since putting this in place ~10 days ago (a rough sketch of what that job can look like is below).
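The restart job itself can be as simple as a small script run from cron (or a Kubernetes CronJob) outside the sync window. This is only a sketch; the namespace and deployment names are placeholders for whatever your install uses:

#!/usr/bin/env bash
# restart-airbyte.sh -- roll the core Airbyte deployments so any wedged workers get replaced.
set -euo pipefail
NS=default   # placeholder namespace
for d in airbyte-worker airbyte-server airbyte-temporal; do   # placeholder deployment names
  kubectl -n "$NS" rollout restart "deployment/$d"
  kubectl -n "$NS" rollout status "deployment/$d" --timeout=5m
done

# example crontab entry -- run once a day, outside the nightly sync window:
# 0 17 * * * /usr/local/bin/restart-airbyte.sh >> /var/log/restart-airbyte.log 2>&1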

Thanks @luke for the info. Could you try increasing ALL of the worker counts: sync, spec, and discover?

@marcosmarxm all four have been set to 10 since early June.
