Worker Pod Losing Track of Completed Job

Summary

A worker pod in the Airbyte platform loses track of a job that the source-read pod has already completed, and the worker's CPU usage spikes. The source pod's logs show both socat relays reaching EOF and exiting with status 0, and the main process reporting EXIT_STATUS: 0.


Question

Managed to "catch" this live again today.
The issue seems to be that the source-read pod completes the job, but the worker pod can't seem to register it, or loses track of the pod for whatever reason. The CPU usage for the affected worker is spiking like before:

```
airbyte-production-worker-7957db467d-pdz42                        airbyte-worker-container           6690m        4474M
```
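For scale, the CPU figure is easier to read as cores. The sketch below assumes the row above uses the column layout of `kubectl top pod --containers` (pod, container, CPU, memory) and that the `m` suffix means Kubernetes millicores; the helper name `parse_top_line` is made up for illustration.

```python
def parse_top_line(line: str) -> dict:
    """Parse one whitespace-separated `pod container cpu memory` row.

    Assumption: CPU is given in millicores (e.g. "6690m") or plain cores.
    """
    pod, container, cpu, memory = line.split()
    if cpu.endswith("m"):
        # 1000 millicores == 1 core
        cores = int(cpu[:-1]) / 1000
    else:
        cores = float(cpu)
    # Memory is kept as the raw string; its unit is not interpreted here.
    return {"pod": pod, "container": container, "cpu_cores": cores, "memory": memory}


row = parse_top_line(
    "airbyte-production-worker-7957db467d-pdz42   airbyte-worker-container   6690m   4474M"
)
print(row["cpu_cores"])  # 6.69
```

So the affected worker container is using roughly 6.7 cores.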
This is the source pod for the sync that's stuck. The last logs are:
```
2024-05-06 21:12:46.826Z relay-stdout 2024/05/06 21:12:46 socat[8] N reading from and writing to stdio
2024-05-06T21:12:46.826541889Z 2024/05/06 21:12:46 socat[8] N opening connection to AF=2 10.60.39.237:9028
2024-05-06T21:12:46.826895989Z 2024/05/06 21:12:46 socat[8] N successfully connected from local address AF=2 10.60.46.247:51328
2024-05-06T21:12:46.827682789Z 2024/05/06 21:12:46 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2024-05-06T21:38:33.183168668Z 2024/05/06 21:38:33 socat[8] N socket 1 (fd 0) is at EOF
2024-05-06T21:38:33.714448986Z 2024/05/06 21:38:33 socat[8] N socket 2 (fd 5) is at EOF
2024-05-06T21:38:33.714472216Z 2024/05/06 21:38:33 socat[8] N exiting with status 0
2024-05-06 21:12:46.913Z relay-stderr 2024/05/06 21:12:46 socat[8] N reading from and writing to stdio
2024-05-06T21:12:46.913460603Z 2024/05/06 21:12:46 socat[8] N opening connection to AF=2 10.60.39.237:9032
2024-05-06T21:12:46.913850174Z 2024/05/06 21:12:46 socat[8] N successfully connected from local address AF=2 10.60.46.247:60900
2024-05-06 21:12:46.914Z relay-stderr 2024/05/06 21:12:46 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2024-05-06T21:38:33.183167519Z 2024/05/06 21:38:33 socat[8] N socket 1 (fd 0) is at EOF
2024-05-06T21:38:33.714404106Z 2024/05/06 21:38:33 socat[8] N socket 2 (fd 5) is at EOF
2024-05-06T21:38:33.714441026Z 2024/05/06 21:38:33 socat[8] N exiting with status 0
2024-05-06 21:38:33.330Z call-heartbeat-server Terminated
2024-05-06 21:12:46.596Z main Using existing AIRBYTE_ENTRYPOINT: python /airbyte/integration_code/main.py
2024-05-06 21:12:46.596Z main Waiting on CHILD_PID 7
2024-05-06T21:12:46.596918173Z PARENT_PID: 1
2024-05-06 21:38:33.183Z main EXIT_STATUS: 0
call-heartbeat-server Stream closed: EOF
main Stream closed: EOF
relay-stderr Stream closed: EOF
relay-stdout Stream closed: EOF
```
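To double-check that the source side really finished cleanly, the relevant lines can be pulled out mechanically. This is only a rough sketch: the `summarize` helper and the embedded `LOG` excerpt are illustrative, the regexes assume the line formats seen in the logs above, and it says nothing about why the worker fails to notice the exit.

```python
import re
from datetime import datetime

# Excerpt copied from the source pod logs above.
LOG = """\
2024-05-06 21:12:46.596Z main Waiting on CHILD_PID 7
2024-05-06 21:38:33.183Z main EXIT_STATUS: 0
2024-05-06T21:38:33.714472216Z 2024/05/06 21:38:33 socat[8] N exiting with status 0
2024-05-06T21:38:33.714441026Z 2024/05/06 21:38:33 socat[8] N exiting with status 0
"""

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def summarize(log: str) -> dict:
    """Extract exit statuses and the connector run time from the log text."""
    main_exit = re.search(r"main EXIT_STATUS: (\d+)", log)
    socat_exits = [int(s) for s in re.findall(r"socat\[\d+\] N exiting with status (\d+)", log)]
    start = re.search(r"^(\S+ \S+)Z main Waiting on CHILD_PID", log, re.M)
    end = re.search(r"^(\S+ \S+)Z main EXIT_STATUS", log, re.M)

    duration = None
    if start and end:
        delta = datetime.strptime(end.group(1), TS_FORMAT) - datetime.strptime(start.group(1), TS_FORMAT)
        duration = delta.total_seconds()

    main_status = int(main_exit.group(1)) if main_exit else None
    return {
        "main_exit_status": main_status,
        "socat_exit_statuses": socat_exits,
        "run_seconds": duration,
        # Clean only if main exited 0 and at least one relay exited, all with 0.
        "clean_exit": main_status == 0 and bool(socat_exits) and all(s == 0 for s in socat_exits),
    }


summary = summarize(LOG)
print(summary)
```

On the excerpt above this reports a clean exit after about 26 minutes, which matches the reading that the source pod is done and the problem is on the worker side.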


---

This topic has been created from a Slack thread to give it more visibility.
It is in read-only mode here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1715064075973539) if you want to access the original thread.

