Sync Hangs on Kubernetes with Airbyte

Summary

Syncs hang randomly on Kubernetes with Airbyte, showing as (BLOCKED on Future.get) in Temporal, with a spike in CPU usage on the ‘worker’ pod when a sync becomes stuck. Using Helm chart 0.64.180 with GCS storage and CONTAINER_ORCHESTRATOR_ENABLED: false.


Question

Hi all,

We’re having trouble running Airbyte on Kubernetes. Randomly, on average at least once per day, a sync hangs, and there are no errors anywhere that we could find.
A job can stay stuck for a long time. For example, we’ve had cases where a sync that usually takes a few minutes got stuck for almost half a day until we cancelled it.
There doesn’t appear to be a resource problem: we set no limits on CPU or memory, and disk space looks fine. The Kubernetes node seems fine as well; nothing major appears to be happening on it. However, we noticed a correlation with a big spike in CPU usage on a worker pod when a sync becomes stuck.

Temporal shows the sync as (BLOCKED on Future.get).

We’re using Helm chart 0.64.180 with GCS storage and CONTAINER_ORCHESTRATOR_ENABLED: false (we had to disable the orchestrator because we didn’t manage to deploy Airbyte successfully otherwise; the instructions on the Airbyte website aren’t working for us).
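For reference, the relevant bit of our values file looks roughly like the sketch below. The key paths shown (global.storage, worker.extraEnv) are assumptions on my part and differ between chart versions, so treat this as an illustration rather than our exact 0.64.180 configuration:

```
# Illustrative sketch only -- key paths vary between chart versions,
# so check them against the 0.64.x values schema before using.
global:
  storage:
    type: gcs            # send logs/state to GCS instead of the bundled MinIO
    # bucket names and the GCS credentials secret are configured here

worker:
  extraEnv:
    - name: CONTAINER_ORCHESTRATOR_ENABLED   # disabled, as mentioned above
      value: "false"
```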

All we did was slowly add more syncs over time, but decreasing concurrency (by a lot) didn’t help resolve the issue.

Is there anything else we could look into that would help identify the problem?



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here.


["sync-hangs", "kubernetes", "airbyte", "helm-charts", "gcs-storage", "cpu-usage", "worker-pod", "temporal"]

Managed to “catch” this live again today.
The issue seems to be that the source-read pod completes the job, but the worker pod doesn’t register that, or loses track of the pod for whatever reason. The CPU usage for the affected worker is spiking like before.

```
airbyte-production-worker-7957db467d-pdz42                        airbyte-worker-container           6690m        4474M
```
This is the source pod for the sync that's stuck. The last logs are:
```
2024-05-06 21:12:46.826Z relay-stdout 2024/05/06 21:12:46 socat[8] N reading from and writing to stdio
2024-05-06T21:12:46.826541889Z 2024/05/06 21:12:46 socat[8] N opening connection to AF=2 10.60.39.237:9028
2024-05-06T21:12:46.826895989Z 2024/05/06 21:12:46 socat[8] N successfully connected from local address AF=2 10.60.46.247:51328
2024-05-06T21:12:46.827682789Z 2024/05/06 21:12:46 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2024-05-06T21:38:33.183168668Z 2024/05/06 21:38:33 socat[8] N socket 1 (fd 0) is at EOF
2024-05-06T21:38:33.714448986Z 2024/05/06 21:38:33 socat[8] N socket 2 (fd 5) is at EOF
2024-05-06T21:38:33.714472216Z 2024/05/06 21:38:33 socat[8] N exiting with status 0
2024-05-06 21:12:46.913Z relay-stderr 2024/05/06 21:12:46 socat[8] N reading from and writing to stdio
2024-05-06T21:12:46.913460603Z 2024/05/06 21:12:46 socat[8] N opening connection to AF=2 10.60.39.237:9032
2024-05-06T21:12:46.913850174Z 2024/05/06 21:12:46 socat[8] N successfully connected from local address AF=2 10.60.46.247:60900
2024-05-06 21:12:46.914Z relay-stderr 2024/05/06 21:12:46 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2024-05-06T21:38:33.183167519Z 2024/05/06 21:38:33 socat[8] N socket 1 (fd 0) is at EOF
2024-05-06T21:38:33.714404106Z 2024/05/06 21:38:33 socat[8] N socket 2 (fd 5) is at EOF
2024-05-06T21:38:33.714441026Z 2024/05/06 21:38:33 socat[8] N exiting with status 0
2024-05-06 21:38:33.330Z call-heartbeat-server Terminated
2024-05-06 21:12:46.596Z main Using existing AIRBYTE_ENTRYPOINT: python /airbyte/integration_code/main.py
2024-05-06 21:12:46.596Z main Waiting on CHILD_PID 7
2024-05-06T21:12:46.596918173Z PARENT_PID: 1
2024-05-06 21:38:33.183Z main EXIT_STATUS: 0
call-heartbeat-server Stream closed: EOF
main Stream closed: EOF
relay-stderr Stream closed: EOF
relay-stdout Stream closed: EOF
```

Would this be similar to https://github.com/airbytehq/airbyte/issues/37837?

Full stack trace from Temporal:

```
io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:94)
io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:387)
io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:46)
io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:85)
io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:75)
io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:45)
io.temporal.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:78)
io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:60)
jdk.proxy2/jdk.proxy2.$Proxy88.replicate(Unknown Source)
io.airbyte.workers.temporal.sync.SyncWorkflowImpl.run(SyncWorkflowImpl.java:106)
SyncWorkflowImplProxy.run$accessor$xJ9EkAcP(Unknown Source)
SyncWorkflowImplProxy$auxiliary$T7YDzngy.call(Unknown Source)
io.airbyte.workers.temporal.support.TemporalActivityStubInterceptor.execute(TemporalActivityStubInterceptor.java:79)
SyncWorkflowImplProxy.run(Unknown Source)
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
java.base/java.lang.reflect.Method.invoke(Method.java:578)
io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:302)
```
I can also share what we've managed to dump from the process in the worker pod.

Hi <@U01MMSDJGC9>, it might be :thinking_face: In the worker dump we do see BufferedReplicationWorker spending lots of CPU time:

```
   java.lang.Thread.State: RUNNABLE
   at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(java.base@19.0.2/ReentrantReadWriteLock.java:494)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@19.0.2/AbstractQueuedSynchronizer.java:1026)
   at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@19.0.2/ReentrantReadWriteLock.java:738)
   at io.airbyte.commons.concurrency.BoundedConcurrentLinkedQueue.isClosed(BoundedConcurrentLinkedQueue.java:118)
   at io.airbyte.commons.concurrency.BoundedConcurrentLinkedQueue.isDone(BoundedConcurrentLinkedQueue.java:94)
   at io.airbyte.workers.general.BufferedReplicationWorker.writeToDestination(BufferedReplicationWorker.java:403)
   at io.airbyte.workers.general.BufferedReplicationWorker$$Lambda$1908/0x00000008020e6ed8.run(Unknown Source)
   at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:225)
   at io.airbyte.workers.general.BufferedReplicationWorker$$Lambda$1664/0x000000080201e278.run(Unknown Source)
   at java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@19.0.2/CompletableFuture.java:1804)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.2/ThreadPoolExecutor.java:1144)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.2/ThreadPoolExecutor.java:642)
   at java.lang.Thread.run(java.base@19.0.2/Thread.java:1589)

   Locked ownable synchronizers:
   - <0x00000005050e5628> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"pool-126-thread-4" #685 [1837] prio=5 os_prio=0 cpu=5512742.59ms elapsed=5531.47s allocated=369G defined_classes=0 tid=0x00007bd6c8081f50 nid=1837 runnable  [0x00007bd5cf44f000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@19.0.2/ReentrantReadWriteLock.java:738)
	at io.airbyte.commons.concurrency.BoundedConcurrentLinkedQueue.isDone(BoundedConcurrentLinkedQueue.java:93)
	at io.airbyte.workers.general.BufferedReplicationWorker.processMessage(BufferedReplicationWorker.java:362)
	at io.airbyte.workers.general.BufferedReplicationWorker$$Lambda$1907/0x00000008020e6cc8.run(Unknown Source)
	at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:225)
	at io.airbyte.workers.general.BufferedReplicationWorker$$Lambda$1664/0x000000080201e278.run(Unknown Source)
	at java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@19.0.2/CompletableFuture.java:1804)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.2/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.2/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.2/Thread.java:1589)

   Locked ownable synchronizers:
	- <0x0000000505203dd0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
```

<@U03KEPNB864> Really appreciate your answer, will try that! Thank you very much!

Hi, <@U03GG9X8Z4K>. We think that in our case the issue was the Kubernetes cluster killing sync pods and the Airbyte workers losing track of that happening. TL;DR: we removed jobs.kube.tolerations from the chart and it worked. :woman-shrugging: The CPU getting hung up was a side effect of that. We didn’t investigate in detail what exactly was happening under the hood, though.
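For anyone hitting the same thing: the block we removed is the kind of thing sketched below. The key path and the toleration shown are placeholders rather than the actual values from this deployment, so look for the equivalent entry in your own values.yaml:

```
# Hypothetical example of a jobs.kube.tolerations block -- the taint key,
# value, and exact key path are placeholders, not this deployment's config.
global:
  jobs:
    kube:
      tolerations:
        - key: "dedicated"       # placeholder taint key
          operator: "Equal"
          value: "jobs"          # placeholder taint value
          effect: "NoSchedule"
```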

<@U03KEPNB864> Hi, I’m also running into the same problem. It’s very frustrating, as I haven’t been able to find the root cause so far. Have you managed to solve this issue? Thanks, and I appreciate your input :pray: