Airbyte Worker Failing to Run Replication-Orchestrator on EKS

Summary

The Airbyte worker consistently fails to run the replication-orchestrator on an EKS deployment, triggering retries with a backoff period. The failure occurs for every connection and source, despite checks of network connectivity, resource allocation, configuration, service availability (Temporal and other dependent services), and IAM roles and permissions.


Question

Issue Title: Airbyte Worker Failing to Run Replication-Orchestrator on EKS
Is this your first time deploying Airbyte?: No
OS Version / Instance: Ubuntu
Memory / Disk: 8GB / 50GB
Deployment: Kubernetes EKS AWS
Airbyte Version: 0.59.1
Source name/version: All
Destination name/version: Redshift
Steps to Reproduce:

  1. Deploy Airbyte on EKS following the official documentation.
  2. Create a connection with any source and Redshift as the destination.
  3. Run the connection (a sketch for triggering the sync via the API follows this list).
  4. Observe the failure in the Airbyte worker logs.
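For step 3, the sync can be triggered from the UI or through the API. Below is a minimal sketch of the API route, assuming the airbyte-server service is port-forwarded to `localhost:8001` and using a placeholder connection ID; neither value comes from this deployment.

```python
# Hypothetical illustration of step 3: trigger a sync through the Airbyte Config API.
# The host/port and CONNECTION_ID are placeholders, not values from the original report.
import requests

AIRBYTE_API = "http://localhost:8001/api/v1"
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

resp = requests.post(
    f"{AIRBYTE_API}/connections/sync",
    json={"connectionId": CONNECTION_ID},
    timeout=30,
)
resp.raise_for_status()
print("Triggered job:", resp.json().get("job", {}).get("id"))
```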
Description:
We have been experiencing consistent failures with the Airbyte worker while attempting to run the replication-orchestrator on our EKS deployment. The issue occurs across all connections and sources. Below is the relevant log excerpt:

```
Running the launcher replication-orchestrator failed

message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false

Received failed response: 404, Request ID: XXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX==
Received failed response: 404, Request ID: XXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX==
Creating orchestrator-repl-job-2568-attempt-1 for attempt number: 1
There are currently running pods for the connection: [orchestrator-repl-job-2568-attempt-0]. Killing these pods to enforce one execution at a time.
Attempting to delete pods: [orchestrator-repl-job-2568-attempt-0]
Waiting for deletion...
Successfully deleted all running pods for the connection!
Waiting for pod to be running...
Ready to run resync and reflector for v1/namespaces/default/pods with resync 0
Resync skipped due to 0 full resync period for v1/namespaces/default/pods
```
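To see what actually happens to these orchestrator pods, inspecting their status and recent events directly can help. Below is a minimal sketch using the Kubernetes Python client, assuming cluster access and the `default` namespace shown in the log; the pod-name prefix is taken from the log above.

```python
# Minimal sketch: inspect orchestrator pods and their recent events.
# Assumes in-cluster or local kubeconfig access to the same cluster/namespace as the log.
from kubernetes import client, config

try:
    config.load_incluster_config()   # when run from a pod inside the cluster
except config.ConfigException:
    config.load_kube_config()        # fall back to local kubeconfig

v1 = client.CoreV1Api()

# Print the phase and container states of the orchestrator pods from the log.
pods = v1.list_namespaced_pod(namespace="default")
for pod in pods.items:
    if pod.metadata.name.startswith("orchestrator-repl-job-"):
        print(pod.metadata.name, pod.status.phase)
        for cs in pod.status.container_statuses or []:
            print("  ", cs.name, cs.state)

# Recent events often surface image pull, scheduling, or permission problems.
events = v1.list_namespaced_event(namespace="default")
for ev in events.items:
    name = ev.involved_object.name or ""
    if name.startswith("orchestrator-repl-job-"):
        print(ev.last_timestamp, ev.reason, ev.message)
```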
The error seems to occur during the replication orchestration process. The worker attempts to run the job but consistently fails, triggering a retry mechanism with a backoff period.
We have checked the following potential causes without success:
• Network connectivity issues
• Resource constraints (CPU and memory)
• Configuration issues
• Service availability (Temporal server and other dependent services)
• IAM roles and permissions (see the storage-access sketch after this list)
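One observation on the log above: the two `Received failed response: 404` lines carry a Request ID plus an Extended Request ID ending in `==`, which resembles S3-style responses, so the object storage Airbyte uses for logs/state may be worth a closer look. This is only an assumption. A minimal sketch to confirm the worker's credentials can reach a given bucket follows; the bucket name and prefix are placeholders, not values from this deployment.

```python
# Hypothetical check: verify the pod's credentials can see the Airbyte log/state bucket.
# BUCKET and PREFIX are placeholders, not values from the original report.
import boto3
from botocore.exceptions import ClientError

BUCKET = "my-airbyte-logs-bucket"   # placeholder
PREFIX = "job-logging/"             # placeholder

s3 = boto3.client("s3")
try:
    s3.head_bucket(Bucket=BUCKET)
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=5)
    print("Bucket reachable; sample keys:",
          [obj["Key"] for obj in resp.get("Contents", [])])
except ClientError as err:
    # A 404 here would mirror the 404 responses in the worker log.
    print("S3 request failed:",
          err.response["Error"]["Code"], err.response["Error"]["Message"])
```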
Actions Taken:
1. Verified network connectivity between the pods and the Temporal server (see the probe sketch further below).
2. Ensured sufficient resource allocation to the worker pods.
3. Checked the health and logs of dependent services.
4. Reviewed IAM roles and permissions for the worker pods.
5. Upgraded to the latest Airbyte version.
Despite these efforts, the issue persists. We need assistance in diagnosing and resolving this problem to ensure stable and reliable data replication.
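For reference, a minimal sketch of the kind of connectivity check described in item 1; the service name and port below are common defaults for an in-cluster Temporal frontend and are placeholders, not confirmed values from this deployment.

```python
# Hypothetical connectivity probe for the Temporal frontend (item 1 above).
# Host and port are common defaults, not confirmed values from this deployment.
import socket

TEMPORAL_HOST = "airbyte-temporal"  # placeholder service name
TEMPORAL_PORT = 7233                # Temporal frontend gRPC default

try:
    with socket.create_connection((TEMPORAL_HOST, TEMPORAL_PORT), timeout=5):
        print(f"TCP connection to {TEMPORAL_HOST}:{TEMPORAL_PORT} succeeded")
except OSError as err:
    print(f"TCP connection failed: {err}")
```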
*Environment Details*:
• *Kubernetes Cluster*: EKS
• *Instance Type*: t2.xlarge



---

This topic has been created from a Slack thread to give it more visibility.
It is in read-only mode here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1717362252845269) if you want to access the original thread.

[Join the conversation on Slack](https://slack.airbyte.com)

<sub>
["airbyte-worker", "replication-orchestrator", "eks", "kubernetes", "failure", "retry", "backoff", "network-connectivity", "resource-allocation", "service-availability", "iam-roles", "permissions"]
</sub>