Sync Hangs on Kubernetes with Airbyte

Summary

Syncs hang randomly on Kubernetes with Airbyte, showing as (BLOCKED on Future.get) in Temporal, with a spike in CPU usage on the ‘worker’ pod when a sync becomes stuck. Using Helm chart 0.64.180 with GCS storage and CONTAINER_ORCHESTRATOR_ENABLED: false.


Question

Hi all,

We’re having trouble running Airbyte on Kubernetes. Randomly, on average at least once per day, a sync hangs, and there are no errors anywhere that we could find.
A job can stay stuck for a long time. For example, we’ve had cases where a sync that usually takes a few minutes got stuck for almost half a day until we cancelled it.
There doesn’t appear to be a resource problem: we set no limits on CPU or memory, and disk space looks fine. The Kubernetes node seems fine as well; nothing major appears to be happening on it. However, we noticed a correlation with a big spike in CPU usage on a worker pod when a sync becomes stuck.

Temporal shows the sync as (BLOCKED on Future.get).

We’re using Helm chart 0.64.180 with GCS storage and CONTAINER_ORCHESTRATOR_ENABLED: false (we had to disable the orchestrator because we didn’t manage to deploy Airbyte successfully otherwise; the instructions on the Airbyte website aren’t working for us).
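For reference, the relevant bit of our values file looks roughly like the sketch below. The key paths shown (global.storage, worker.extraEnv) are assumptions on my part and differ between chart versions, so treat this as an illustration rather than our exact 0.64.180 configuration:

```
# Illustrative sketch only -- key paths vary between chart versions,
# so check them against the 0.64.x values schema before using.
global:
  storage:
    type: gcs            # send logs/state to GCS instead of the bundled MinIO
    # bucket names and the GCS credentials secret are configured here

worker:
  extraEnv:
    - name: CONTAINER_ORCHESTRATOR_ENABLED   # disabled, as mentioned above
      value: "false"
```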

All we did was slowly add more syncs over time, but decreasing concurrency (by a lot) didn’t help resolve the issue.

Is there anything else we could look into that would help identify the problem?



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here.


["sync-hangs", "kubernetes", "airbyte", "helm-charts", "gcs-storage", "cpu-usage", "worker-pod", "temporal"]

Managed to “catch” this live again today.
The issue seems to be that the source-read pod completes the job, but the worker pod doesn’t register that, or loses track of the pod for whatever reason. The CPU usage for the affected worker is spiking like before.

```
airbyte-production-worker-7957db467d-pdz42                        airbyte-worker-container           6690m        4474M
```
This is the source pod for the sync that's stuck. The last logs are:
```
2024-05-06 21:12:46.826Z relay-stdout 2024/05/06 21:12:46 socat[8] N reading from and writing to stdio
2024-05-06T21:12:46.826541889Z 2024/05/06 21:12:46 socat[8] N opening connection to AF=2 10.60.39.237:9028
2024-05-06T21:12:46.826895989Z 2024/05/06 21:12:46 socat[8] N successfully connected from local address AF=2 10.60.46.247:51328
2024-05-06T21:12:46.827682789Z 2024/05/06 21:12:46 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2024-05-06T21:38:33.183168668Z 2024/05/06 21:38:33 socat[8] N socket 1 (fd 0) is at EOF
2024-05-06T21:38:33.714448986Z 2024/05/06 21:38:33 socat[8] N socket 2 (fd 5) is at EOF
2024-05-06T21:38:33.714472216Z 2024/05/06 21:38:33 socat[8] N exiting with status 0
2024-05-06 21:12:46.913Z relay-stderr 2024/05/06 21:12:46 socat[8] N reading from and writing to stdio
2024-05-06T21:12:46.913460603Z 2024/05/06 21:12:46 socat[8] N opening connection to AF=2 10.60.39.237:9032
2024-05-06T21:12:46.913850174Z 2024/05/06 21:12:46 socat[8] N successfully connected from local address AF=2 10.60.46.247:60900
2024-05-06 21:12:46.914Z relay-stderr 2024/05/06 21:12:46 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2024-05-06T21:38:33.183167519Z 2024/05/06 21:38:33 socat[8] N socket 1 (fd 0) is at EOF
2024-05-06T21:38:33.714404106Z 2024/05/06 21:38:33 socat[8] N socket 2 (fd 5) is at EOF
2024-05-06T21:38:33.714441026Z 2024/05/06 21:38:33 socat[8] N exiting with status 0
2024-05-06 21:38:33.330Z call-heartbeat-server Terminated
2024-05-06 21:12:46.596Z main Using existing AIRBYTE_ENTRYPOINT: python /airbyte/integration_code/main.py
2024-05-06 21:12:46.596Z main Waiting on CHILD_PID 7
2024-05-06T21:12:46.596918173Z PARENT_PID: 1
2024-05-06 21:38:33.183Z main EXIT_STATUS: 0
call-heartbeat-server Stream closed: EOF
main Stream closed: EOF
relay-stderr Stream closed: EOF
relay-stdout Stream closed: EOF
```

Would this be similar to https://github.com/airbytehq/airbyte/issues/37837?

Full stack trace from Temporal:

```
io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:94)
io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:387)
io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:46)
io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:85)
io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:75)
io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:45)
io.temporal.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:78)
io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:60)
jdk.proxy2/jdk.proxy2.$Proxy88.replicate(Unknown Source)
io.airbyte.workers.temporal.sync.SyncWorkflowImpl.run(SyncWorkflowImpl.java:106)
SyncWorkflowImplProxy.run$accessor$xJ9EkAcP(Unknown Source)
SyncWorkflowImplProxy$auxiliary$T7YDzngy.call(Unknown Source)
io.airbyte.workers.temporal.support.TemporalActivityStubInterceptor.execute(TemporalActivityStubInterceptor.java:79)
SyncWorkflowImplProxy.run(Unknown Source)
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
java.base/java.lang.reflect.Method.invoke(Method.java:578)
io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:302)
```
I can also share what we've managed to dump from the process in the worker pod.

Hi <@U01MMSDJGC9>, it might be :thinking_face: In the worker dump we do see BufferedReplicationWorker spending lots of CPU time:

```
   java.lang.Thread.State: RUNNABLE
   at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(java.base@19.0.2/ReentrantReadWriteLock.java:494)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@19.0.2/AbstractQueuedSynchronizer.java:1026)
   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@19.0.2/ReentrantReadWriteLock.java:738)
   at io.airbyte.commons.concurrency.BoundedConcurrentLinkedQueue.isClosed(BoundedConcurrentLinkedQueue.java:118)
   at io.airbyte.commons.concurrency.BoundedConcurrentLinkedQueue.isDone(BoundedConcurrentLinkedQueue.java:94)
   at io.airbyte.workers.general.BufferedReplicationWorker.writeToDestination(BufferedReplicationWorker.java:403)
   at io.airbyte.workers.general.BufferedReplicationWorker$$Lambda$1908/0x00000008020e6ed8.run(Unknown Source)
   at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:225)
   at io.airbyte.workers.general.BufferedReplicationWorker$$Lambda$1664/0x000000080201e278.run(Unknown Source)
   at java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@19.0.2/CompletableFuture.java:1804)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.2/ThreadPoolExecutor.java:1144)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.2/ThreadPoolExecutor.java:642)
   at java.lang.Thread.run(java.base@19.0.2/Thread.java:1589)

   Locked ownable synchronizers:
   - <0x00000005050e5628> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"pool-126-thread-4" #685 [1837] prio=5 os_prio=0 cpu=5512742.59ms elapsed=5531.47s allocated=369G defined_classes=0 tid=0x00007bd6c8081f50 nid=1837 runnable  [0x00007bd5cf44f000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@19.0.2/ReentrantReadWriteLock.java:738)
	at io.airbyte.commons.concurrency.BoundedConcurrentLinkedQueue.isDone(BoundedConcurrentLinkedQueue.java:93)
	at io.airbyte.workers.general.BufferedReplicationWorker.processMessage(BufferedReplicationWorker.java:362)
	at io.airbyte.workers.general.BufferedReplicationWorker$$Lambda$1907/0x00000008020e6cc8.run(Unknown Source)
	at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:225)
	at io.airbyte.workers.general.BufferedReplicationWorker$$Lambda$1664/0x000000080201e278.run(Unknown Source)
	at java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@19.0.2/CompletableFuture.java:1804)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.2/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.2/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.2/Thread.java:1589)

   Locked ownable synchronizers:
	- <0x0000000505203dd0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
```

<@U03KEPNB864> Really appreciate your answer, will try that! Thank you very much!

Hi, <@U03GG9X8Z4K>. We think that in our case the issue was the Kubernetes cluster killing sync pods and the Airbyte workers losing track of that happening. TL;DR: we removed jobs.kube.tolerations from the chart and it worked. :woman-shrugging: The CPU getting hung up was a side effect of that. We didn’t investigate in detail what exactly was happening under the hood, though.
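For anyone hitting the same thing: the block we removed is the kind of thing sketched below. The key path and the toleration shown are placeholders rather than the actual values from this deployment, so look for the equivalent entry in your own values.yaml:

```
# Hypothetical example of a jobs.kube.tolerations block -- the taint key,
# value, and exact key path are placeholders, not this deployment's config.
global:
  jobs:
    kube:
      tolerations:
        - key: "dedicated"       # placeholder taint key
          operator: "Equal"
          value: "jobs"          # placeholder taint value
          effect: "NoSchedule"
```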

<@U03KEPNB864> Hi, I’m also running into the same problem. It’s very frustrating, as I haven’t been able to find the root cause so far. Have you managed to solve this issue? Thanks, and I appreciate your input :pray: