Retries for Connection Check Failure

Summary

The user is inquiring about the expected behavior regarding retries for connection check failures. They have observed that sync attempts which fail a connection check only trigger a single attempt, with no retries.


Question

Hi! Does anyone happen to know if it’s expected behaviour for there to be no retries when the failure is a connection check failure?

We’ve consistently seen that sync attempts which fail a connection check only trigger one attempt, and the logs end with `Backoff before next attempt: 10 seconds` at the same time as `Failing job: <ID>, reason: Connection Check Failed`, e.g.:

```
Backoff before next attempt: 10 seconds
2024-07-09 20:09:18 platform > Failing job: 14957, reason: Connection Check Failed 6b7caac9-113b-438f-8ecb-e24fa09bc059
```
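One way to confirm that only a single attempt was actually made for a failed job is to ask the server API for the job record (a rough sketch; the namespace, service name, and exact endpoint path are assumptions that may vary between Airbyte versions — the job ID is the one from the log above):

```
# Port-forward the Airbyte server (service name assumes the default Helm naming used elsewhere in this thread)
kubectl -n airbyte port-forward svc/airbyte-prod-airbyte-server-svc 8001:8001 &

# Fetch the job and its attempts; a connection-check failure should show a single attempt
curl -s -X POST http://localhost:8001/api/v1/jobs/get \
  -H "Content-Type: application/json" \
  -d '{"id": 14957}'
```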


---

This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1720601550524039) if you want 
to access the original thread.

[Join the conversation on Slack](https://slack.airbyte.com)

<sub>
["retries", "connection-check-failure", "sync-attempts", "logs", "backoff-policy"]
</sub>

[Response from another thread](https://airbytehq.slack.com/archives/C01AHCD885S/p1720618121852509?thread_ts=1720186376.533389&cid=C01AHCD885S) :slightly_smiling_face:
> Yes, I believe that is expected behavior as the source and destination must both be valid to be used in a connection. I would go test those individually in the UI and see if you get a more helpful error
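
If the UI test isn’t giving a useful error, the same checks can also be triggered against the Configuration API directly (again a sketch — the namespace and service name assume the default Helm naming, and the IDs are placeholders):

```
kubectl -n airbyte port-forward svc/airbyte-prod-airbyte-server-svc 8001:8001 &

# Re-run the connection check for a specific source and destination
curl -s -X POST http://localhost:8001/api/v1/sources/check_connection \
  -H "Content-Type: application/json" -d '{"sourceId": "<source-id>"}'

curl -s -X POST http://localhost:8001/api/v1/destinations/check_connection \
  -H "Content-Type: application/json" -d '{"destinationId": "<destination-id>"}'
```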

It is strange that you’re seeing this more on the current version . . . is there a specific connection type you’re seeing this on, or is it all of them?

I’m not sure if it’s related, but our syncs are taking longer and we’ve started hitting the MAX_*_WORKERS limits, so it seems like pods weren’t getting created in time for various steps… I don’t know if this could lead to some timeout errors in the connection check, for example?
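
If it helps, you can double-check what the workers are actually configured with and whether job pods are piling up unscheduled (a sketch; the namespace and the `airbyte-worker` deployment name are assumptions — your Helm release prefix may differ):

```
# Show the MAX_*_WORKERS values the worker deployment is actually running with
kubectl -n airbyte set env deploy/airbyte-worker --list | grep MAX_

# Any job pods stuck waiting to be scheduled or created
kubectl -n airbyte get pods --field-selector=status.phase=Pending
```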

You’re deploying via helm, right? And are you local or on GCP or AWS?

yes, via Helm on AWS

to clarify, the connection check failures haven’t been as big of an issue for us as the longer / failing syncs. But the lack of retries in those syncs was surprising to us, so we wondered if it’s expected behaviour :slightly_smiling_face:

What are the other errors you’re seeing on the long-failing syncs?

we get a lot of rather generic-sounding errors:

```
message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false
```

in these cases, the logs often end with something like:

```
2024-07-11 07:50:20 platform > Terminal state in the State Store is missing. Orchestrator pod orchestrator-repl-job-15131-attempt-2 non-existent. Assume failure.
```
We also get, e.g.:
• `Failed to create pod for check step` (or read/write steps)
• `Destination process is still alive, cannot retrieve exit value.`
• `Refresh Schema took too long` 
• in one case, we got `Failed to connect to airbyte-prod-airbyte-server-svc/172.20.148.18:8001` 
• in one case, the read/orchestrator/write pods from one attempt are still running even though that attempt has already failed
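
For the `Failed to create pod …` and `Terminal state in the State Store is missing` cases, the Kubernetes events around the attempt are usually the quickest place to look (a sketch; the namespace is an assumption, the pod name is the one from the log above):

```
# Events mentioning the orchestrator pod the platform claims is non-existent
kubectl -n airbyte get events --sort-by=.lastTimestamp | grep orchestrator-repl-job-15131

# All recent warnings in the namespace (scheduling failures, image pulls, evictions, ...)
kubectl -n airbyte get events --field-selector=type=Warning --sort-by=.lastTimestamp
```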

are you using EKS on AWS, or running your own cluster? I wonder if there’s some sort of connectivity issue between nodes (or against the db/Temporal) or some type of resource contention happening
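
One quick way to rule basic connectivity in or out is to run the same kind of checks from a throwaway pod inside the cluster (a sketch; the namespace, service names, and ports are assumptions based on defaults and on the `airbyte-prod-airbyte-server-svc:8001` address from the logs above):

```
# Temporary debug pod with networking tools
kubectl -n airbyte run net-debug --rm -it --restart=Never --image=nicolaka/netshoot -- bash

# From inside the pod:
curl -s http://airbyte-prod-airbyte-server-svc:8001/api/v1/health   # Airbyte server
nc -zv airbyte-prod-temporal 7233                                   # Temporal frontend (name/port may differ)
nc -zv <your-db-host> 5432                                          # Airbyte database
```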

Thanks! We’re using AWS/EKS

I’m not familiar with the resource contention topic, but we do share the cluster with other teams. The cluster has an autoscaling group, and when our devops team checked, they said we were using more resources than the other teams

we got close to the max threshold the devops team had set, and last I checked we now have double the nodes associated with our namespace, so I think they increased the limit temporarily to make sure we don’t cause issues for other teams

if you have time, could you please let me know how we could troubleshoot whether it’s a connectivity issue…? I’ve seen a couple of syncs running with no related sync-specific pods, and vice versa, syncs failing while the three orchestrator/read/write pods remain for hours afterwards…
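
Not an authoritative answer, but one way to spot both situations (syncs with no pods, and pods outliving a failed attempt) is to list the job-specific pods and compare their age against the attempts shown in the UI (a sketch; the namespace and pod-name patterns assume the default naming from the logs above):

```
# Job-specific pods (orchestrator/read/write) sorted by creation time
kubectl -n airbyte get pods --sort-by=.metadata.creationTimestamp | grep -i job

# For a pod that is still running long after its attempt failed, see what it's stuck on
kubectl -n airbyte describe pod orchestrator-repl-job-15131-attempt-2
kubectl -n airbyte logs orchestrator-repl-job-15131-attempt-2 --tail=100
```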

we’ve already doubled the max sync and check workers config, increased the worker replica count from 1 to 2, and increased the cluster’s max nodes from 15 to 21, without any clear change in the failure rate… does this make it less likely to be a resource contention issue…?
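
One way to answer that more directly is to look at node-level pressure during a failure window, rather than just at the limits (a sketch; `kubectl top` assumes metrics-server is installed):

```
# Current per-node CPU/memory usage (requires metrics-server)
kubectl top nodes

# How much of each node is already requested/limited by scheduled pods
kubectl describe nodes | grep -A 8 "Allocated resources"
```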

<@U069EMNRPA4> have you seen the `Terminal state in the State Store is missing` error the OP mentions in the thread? (not the original post, but down lower) . . . I haven’t seen this one, and the behavior they’re seeing seems strange to me. Didn’t know if you had any ideas.