RuntimeException during sync: "Cannot find pod while trying to retrieve exit code"

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu
  • Memory / Disk: 12 GB / 500 GB
  • Deployment: Kubernetes
  • Airbyte Version: 0.35.10-alpha
  • Source name/version: Custom python connector
  • Destination name/version: destination-snowflake 0.4.8
  • Step: sync
  • Description:

Scenario: we are syncing from a custom connector to the Snowflake destination connector.
~25 GB / ~25,000,000 records appear to import correctly over a little more than 2 hours.
The log states that the streams are finalized and that the tmp tables and stages have been successfully cleaned in the destination.

But then the following exception shows up in the airbyte-worker log:

2022-08-25 14:41:22 destination > 2022-08-25 14:41:22 INFO i.a.i.b.IntegrationRunner(runInternal):153 - Completed integration: io.airbyte.integrations.destination.snowflake.SnowflakeDestination
2022-08-25 14:41:54 INFO i.a.w.p.KubePodProcess(exitValue):710 - Closed all resources for pod airbyte-source-questmanager-6g-sync-2988-2-syksw
2022-08-25 14:41:54 INFO i.a.w.p.KubePodProcess(exitValue):710 - Closed all resources for pod airbyte-source-questmanager-6g-sync-2988-2-syksw
2022-08-25 14:41:54 INFO i.a.w.p.KubePodProcess(exitValue):710 - Closed all resources for pod airbyte-source-questmanager-6g-sync-2988-2-syksw
2022-08-25 14:41:54 INFO i.a.w.p.KubePodProcess(exitValue):710 - Closed all resources for pod airbyte-source-questmanager-6g-sync-2988-2-syksw
2022-08-25 14:41:54 ERROR i.a.w.DefaultReplicationWorker(run):141 - Sync worker failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.

Complete logs: logs-2988.txt (3.4 MB)
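
From what I can tell, the exception means the worker asked the Kubernetes API for the source pod's exit code after the pod was already gone. A rough sketch of that lookup, using the official `kubernetes` Python client (the pod name is taken from the log above; the `airbyte` namespace is an assumption, adjust for your install):

```python
# Sketch: replicate the exit-code lookup that fails in the worker.
# Assumes the official `kubernetes` Python client (pip install kubernetes)
# and a kubeconfig with access to the cluster. The namespace is an
# assumption; the pod name is the one from the log excerpt above.
from kubernetes import client, config
from kubernetes.client.exceptions import ApiException

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

POD = "airbyte-source-questmanager-6g-sync-2988-2-syksw"
NAMESPACE = "airbyte"  # assumption: default Airbyte namespace

try:
    pod = v1.read_namespaced_pod(name=POD, namespace=NAMESPACE)
    for status in pod.status.container_statuses or []:
        terminated = status.state.terminated
        if terminated is not None:
            print(f"{status.name}: exit code {terminated.exit_code}")
        else:
            print(f"{status.name}: not terminated yet")
except ApiException as e:
    if e.status == 404:
        # This is the state the worker runs into: the pod has already
        # been deleted, so there is no exit code left to read.
        print("Pod not found; exit code can no longer be retrieved")
    else:
        raise
```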

After this exception the logs indicate that the sync continues as normal, and the normalization step finally completes successfully. The attempt is nevertheless flagged as ‘failed’ and a new attempt is started.

A few days ago there were 3 failed attempts in a row because of this, but the corresponding Kubernetes pods have since been deleted automatically, so I am unable to look at those logs.

I then performed a sync with the same connector on a much smaller subset of streams (40 MB). This time it succeeded.

After that I performed a sync with the original set of streams again. The first attempt failed in the same way, and I canceled the run during the second attempt. Those pods once again seem to have been deleted automatically, so I cannot look into those logs either.
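
To avoid losing the logs next time, I will try dumping the pod logs to a file while the pod still exists. A minimal sketch, again with the `kubernetes` Python client (pod name and namespace are placeholders for the attempt in question):

```python
# Sketch: save a sync pod's logs before the pod is cleaned up automatically.
# Pod name and namespace are placeholders; substitute the pod of the
# failing attempt (visible via `kubectl get pods` while the sync runs).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

POD = "airbyte-source-questmanager-6g-sync-2988-2-syksw"  # placeholder
NAMESPACE = "airbyte"  # assumption: default Airbyte namespace

# For multi-container pods you can pass container="..."; check the
# container names with `kubectl describe pod <pod>` first.
logs = v1.read_namespaced_pod_log(name=POD, namespace=NAMESPACE)
with open(f"{POD}.log", "w") as f:
    f.write(logs)
print(f"Wrote {len(logs)} characters to {POD}.log")
```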

I would really appreciate any advice on how to debug this further, what might be causing the issue, and how to fix it. Let me know if you need any additional info. Thanks!

I have found an existing issue regarding this problem:

As indicated in the comments, it is a bug that should have been fixed in version 0.35.63-alpha. We are currently on 0.35.10-alpha, so we will have to update to a more recent version.

Hi @BravoDeltaBD, thanks for searching for the issue on GitHub. Let me know if you’re still experiencing the same problem after the upgrade.