@davinchia it’s been a while since our last conversation. CONTAINER_ORCHESTRATOR
works, and I saw in the code that the Temporal scheduler is now turned on by default. In our cluster, the maximum number of connections I could reach is 3600. Jobs start to fail after 3.6K, and I can reproduce the issue in Airbyte Cloud:
https://cloud.airbyte.io/workspaces/e20f99b5-a18c-41d8-adc8-2526fe08af7f/connections/e32725de-eb48-45f9-ab58-6aa058d82143/status
I get the exact same log message, and the job hangs at running: it never fails, so the retry is never kicked off. (A rough sketch for surfacing these stuck jobs follows the log below.)
at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:102) ~[io.airbyte-airbyte-workers-0.39.16-alpha.jar:?]
at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:80) ~[io.airbyte-airbyte-workers-0.39.16-alpha.jar:?]
at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:55) ~[io.airbyte-airbyte-workers-0.39.16-alpha.jar:?]
... 3 more
2022-06-13 19:28:18 INFO i.a.w.t.TemporalAttemptExecution(get):134 - Stopping cancellation check scheduling...
2022-06-13 19:28:18 WARN i.t.i.a.POJOActivityTaskHandler(activityFailureToResult):307 - Activity failure. ActivityId=e0439e52-3d6f-3949-afe6-d3017ad93973, activityType=Run, attempt=1
java.util.concurrent.CancellationException: null
at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478) ~[?:?]
at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$3(TemporalAttemptExecution.java:204) ~[io.airbyte-airbyte-workers-0.39.16-alpha.jar:?]
at io.airbyte.workers.temporal.CancellationHandler$TemporalCancellationHandler.checkAndHandleCancellation(CancellationHandler.java:52) ~[io.airbyte-airbyte-workers-0.39.16-alpha.jar:?]
at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$4(TemporalAttemptExecution.java:207) ~[io.airbyte-airbyte-workers-0.39.16-alpha.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2022-06-13 19:28:18 INFO i.a.s.p.DefaultJobPersistence(enqueueJob):139 - enqueuing pending job for scope: baa5c76a-0bf5-4ac8-9326-ae8c729c9529
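For reference, here is a minimal sketch of one way to spot these stuck jobs through the OSS API; the localhost URL (a port-forward to the server) and the placeholder connection id are assumptions, not something from the setup above:

# Minimal sketch: list recent sync jobs for one connection via the Airbyte OSS API
# and flag any job that has been "running" for more than a day.
# Assumptions: the server API is reachable at localhost:8001 and <connection-id>
# is replaced with a real connection id.
import datetime
import requests

API = "http://localhost:8001/api/v1"
CONNECTION_ID = "<connection-id>"  # placeholder

resp = requests.post(
    f"{API}/jobs/list",
    json={"configTypes": ["sync"], "configId": CONNECTION_ID},
    timeout=30,
)
resp.raise_for_status()

now = datetime.datetime.now(datetime.timezone.utc).timestamp()
for entry in resp.json().get("jobs", []):
    job = entry["job"]
    age_hours = (now - job["createdAt"]) / 3600
    if job["status"] == "running" and age_hours > 24:
        print(f"job {job['id']} has been running for {age_hours:.0f}h")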
Since the cap is 3.6K, I reduced the number of connections to 3K and let it run for days. All of the connections are e2e <-> e2e connections. I really appreciate your help. My next step is to check whether the bottleneck on my side is Temporal.
We are using temporalio/auto-setup:1.7.0, which the documentation says is not recommended for production, and the replica count is 1/1. I wonder if it would be better to deploy a full Temporal suite in a new namespace; at the very least, the temporal-admin-tools pod would be useful for debugging.
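As an aside, below is a rough sketch of the kind of check an admin-tools pod would make possible. The namespace, the temporal-admin-tools deployment name, and the kubectl wrapper are assumptions about a hypothetical deployment; only the tctl subcommands themselves are standard:

# Rough sketch of debugging through a temporal-admin-tools pod.
# Assumptions: a deployment named "temporal-admin-tools" in a "temporal" namespace,
# the default Temporal namespace, and tctl available inside that pod.
import subprocess

def tctl(*args: str) -> str:
    cmd = [
        "kubectl", "exec", "-n", "temporal", "deploy/temporal-admin-tools", "--",
        "tctl", "--namespace", "default", *args,
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# List workflow executions in the default namespace (the per-connection workflows
# should appear here).
print(tctl("workflow", "list"))

# Check whether any workers are still polling the CONNECTION_UPDATER task queue.
print(tctl("taskqueue", "describe", "--taskqueue", "CONNECTION_UPDATER"))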
And after running for a few days, that single (and only) Temporal pod kept emitting the following log output.
{"level":"info","ts":"2022-06-13T12:00:00.266Z","msg":"Taskqueue scavenger starting","service":"worker","logging-call-at":"scavenger.go:117"}
{"level":"info","ts":"2022-06-13T12:00:00.266Z","msg":"Taskqueue scavenger started","service":"worker","logging-call-at":"scavenger.go:122"}
{"level":"info","ts":"2022-06-13T12:00:00.303Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/CONNECTION_UPDATER/2","wf-task-queue-type":"Workflow","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.303Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/CONNECTION_UPDATER/1","wf-task-queue-type":"Workflow","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.303Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/CONNECTION_UPDATER/3","wf-task-queue-type":"Workflow","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.303Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/CONNECTION_UPDATER/3","wf-task-queue-type":"Activity","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.303Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/CONNECTION_UPDATER/2","wf-task-queue-type":"Activity","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.303Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/CONNECTION_UPDATER/1","wf-task-queue-type":"Activity","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.303Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/SYNC/1","wf-task-queue-type":"Activity","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.305Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/SYNC/3","wf-task-queue-type":"Activity","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.305Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/SYNC/1","wf-task-queue-type":"Workflow","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.305Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/SYNC/2","wf-task-queue-type":"Workflow","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.306Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"/_sys/SYNC/2","wf-task-queue-type":"Activity","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.306Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"CONNECTION_UPDATER","wf-task-queue-type":"Activity","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.307Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"CONNECTION_UPDATER","wf-task-queue-type":"Workflow","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:00:00.308Z","msg":"scavenger.deleteHandler processed.","service":"worker","wf-namespace-id":"4b3cdd99-2db4-43cb-ae8a-4dcda771b92e","wf-task-queue-name":"SYNC","wf-task-queue-type":"Activity","number-processed":1,"number-deleted":0,"logging-call-at":"handler.go:141"}
{"level":"info","ts":"2022-06-13T12:01:00.304Z","msg":"Taskqueue scavenger stopping","service":"worker","logging-call-at":"scavenger.go:131"}
{"level":"info","ts":"2022-06-13T12:01:00.304Z","msg":"Taskqueue scavenger stopped","service":"worker","logging-call-at":"scavenger.go:135"}
The task queue scavenger keeps starting and stopping like this. In my cluster there are 3K connections in total; about 1K jobs have been stuck in the running state for days, and no new jobs are created for the remaining 2K connections.
And I don’t know whether this is specific to my setup, but even if I recreate all airbyte-worker pods and the temporal pod, the jobs stuck at running are not picked up again. (One way to count them is sketched below.)
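For reference, a rough sketch of one way to count the stuck jobs directly from the Airbyte jobs database; the host, credentials, and the exact jobs table layout are assumptions about a default OSS install:

# Sketch: count jobs per status and list long-running sync jobs straight from the
# Airbyte jobs database. Host/credentials are placeholders, and the jobs table
# columns (id, config_type, scope, status, created_at) are assumed to match a
# default OSS install.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="airbyte", user="docker", password="docker"
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT status, count(*) FROM jobs GROUP BY status ORDER BY 2 DESC")
    for status, count in cur.fetchall():
        print(f"{status}: {count}")

    # Sync jobs that have been 'running' for more than a day.
    cur.execute(
        """
        SELECT id, scope, created_at
        FROM jobs
        WHERE config_type = 'sync'
          AND status = 'running'
          AND created_at < now() - interval '1 day'
        ORDER BY created_at
        """
    )
    for job_id, scope, created_at in cur.fetchall():
        print(job_id, scope, created_at)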
Just sharing some findings here; I appreciate your feedback and help, and we can continue the conversation. So far our load tests in our cluster look good. The Airbyte team never stops improving performance, and I look forward to the updates in future versions.