Retries for Connection Check Failure

Summary

The user is inquiring about the expected behavior regarding retries for connection check failures. They have observed that sync attempts which fail a connection check only trigger a single attempt, with no retries.


Question

Hi! Does anyone happen to know if it’s expected behaviour for there to be no retries when the failure is a connection check failure?

We’ve consistently seen that sync attempts which fail a connection check only trigger one attempt, and the logs end with `Backoff before next attempt: 10 seconds` at the same time as `Failing job: <ID>, reason: Connection Check Failed`, e.g.:

```
Backoff before next attempt: 10 seconds
2024-07-09 20:09:18 platform > Failing job: 14957, reason: Connection Check Failed 6b7caac9-113b-438f-8ecb-e24fa09bc059
```
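One way to confirm that only a single attempt was actually made for a failed job is to ask the server API for the job record (a rough sketch; the namespace, service name, and exact endpoint path are assumptions that may vary between Airbyte versions — the job ID is the one from the log above):

```
# Port-forward the Airbyte server (service name assumes the default Helm naming used elsewhere in this thread)
kubectl -n airbyte port-forward svc/airbyte-prod-airbyte-server-svc 8001:8001 &

# Fetch the job and its attempts; a connection-check failure should show a single attempt
curl -s -X POST http://localhost:8001/api/v1/jobs/get \
  -H "Content-Type: application/json" \
  -d '{"id": 14957}'
```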


---

This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1720601550524039) if you want 
to access the original thread.

[Join the conversation on Slack](https://slack.airbyte.com)

<sub>
["retries", "connection-check-failure", "sync-attempts", "logs", "backoff-policy"]
</sub>

[Response from another thread](https://airbytehq.slack.com/archives/C01AHCD885S/p1720618121852509?thread_ts=1720186376.533389&cid=C01AHCD885S) :slightly_smiling_face:
> Yes, I believe that is expected behavior as the source and destination must both be valid to be used in a connection. I would go test those individually in the UI and see if you get a more helpful error
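
If the UI test isn’t giving a useful error, the same checks can also be triggered against the Configuration API directly (again a sketch — the namespace and service name assume the default Helm naming, and the IDs are placeholders):

```
kubectl -n airbyte port-forward svc/airbyte-prod-airbyte-server-svc 8001:8001 &

# Re-run the connection check for a specific source and destination
curl -s -X POST http://localhost:8001/api/v1/sources/check_connection \
  -H "Content-Type: application/json" -d '{"sourceId": "<source-id>"}'

curl -s -X POST http://localhost:8001/api/v1/destinations/check_connection \
  -H "Content-Type: application/json" -d '{"destinationId": "<destination-id>"}'
```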

It is strange that you’re seeing this more on the current version . . . is there a specific connection type you’re seeing this on, or is it all of them?

I’m not sure if it’s related, but our syncs are taking longer and we’ve started hitting the MAX_*_WORKERS limits, so it seems like pods weren’t getting created in time for various steps… I don’t know if this could lead to some timeout errors in the connection check, for example?
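
If it helps, you can double-check what the workers are actually configured with and whether job pods are piling up unscheduled (a sketch; the namespace and the `airbyte-worker` deployment name are assumptions — your Helm release prefix may differ):

```
# Show the MAX_*_WORKERS values the worker deployment is actually running with
kubectl -n airbyte set env deploy/airbyte-worker --list | grep MAX_

# Any job pods stuck waiting to be scheduled or created
kubectl -n airbyte get pods --field-selector=status.phase=Pending
```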

You’re deploying via helm, right? And are you local or on GCP or AWS?

yes, via Helm on AWS

to clarify, the connection check failures haven’t been as big of an issue for us as the longer / failing syncs. But the lack of retries in those syncs was surprising to us, so we wondered if it’s expected behaviour :slightly_smiling_face:

What are the other errors you’re seeing on the long-failing syncs?

we get a lot of rather generic-sounding errors:

```
message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false
```

in these cases, the logs often end with something like:

```
2024-07-11 07:50:20 platform > Terminal state in the State Store is missing. Orchestrator pod orchestrator-repl-job-15131-attempt-2 non-existent. Assume failure.
```
We also get, e.g.:
• `Failed to create pod for check step` (or read/write steps)
• `Destination process is still alive, cannot retrieve exit value.`
• `Refresh Schema took too long` 
• in one case, we got `Failed to connect to airbyte-prod-airbyte-server-svc/172.20.148.18:8001` 
• in one case, the read/orchestrator/write pods from one attempt are still running even though that attempt has already failed
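
For the `Failed to create pod …` and `Terminal state in the State Store is missing` cases, the Kubernetes events around the attempt are usually the quickest place to look (a sketch; the namespace is an assumption, the pod name is the one from the log above):

```
# Events mentioning the orchestrator pod the platform claims is non-existent
kubectl -n airbyte get events --sort-by=.lastTimestamp | grep orchestrator-repl-job-15131

# All recent warnings in the namespace (scheduling failures, image pulls, evictions, ...)
kubectl -n airbyte get events --field-selector=type=Warning --sort-by=.lastTimestamp
```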

are you using EKS on AWS, or running your own cluster? I wonder if there’s some sort of connectivity issue between nodes (or against the db/Temporal) or some type of resource contention happening
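
One quick way to rule basic connectivity in or out is to run the same kind of checks from a throwaway pod inside the cluster (a sketch; the namespace, service names, and ports are assumptions based on defaults and on the `airbyte-prod-airbyte-server-svc:8001` address from the logs above):

```
# Temporary debug pod with networking tools
kubectl -n airbyte run net-debug --rm -it --restart=Never --image=nicolaka/netshoot -- bash

# From inside the pod:
curl -s http://airbyte-prod-airbyte-server-svc:8001/api/v1/health   # Airbyte server
nc -zv airbyte-prod-temporal 7233                                   # Temporal frontend (name/port may differ)
nc -zv <your-db-host> 5432                                          # Airbyte database
```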

Thanks! We’re using AWS/EKS

I’m not familiar with the resource contention topic, but we do share the cluster with other teams. The cluster has an autoscaling group, and when our devops team checked, they said we were using more resources than the other teams

we got close to the max threshold the devops team had set, and last I checked we now have double the nodes associated with our namespace, so I think they increased the limit temporarily to make sure we don’t cause issues for other teams

if you have time, could you please let me know how we could troubleshoot whether it’s a connectivity issue…? I’ve seen a couple of syncs running with no related sync-specific pods, and vice versa, syncs failing while the three orchestrator/read/write pods remain for hours afterwards…
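
Not an authoritative answer, but one way to spot both situations (syncs with no pods, and pods outliving a failed attempt) is to list the job-specific pods and compare their age against the attempts shown in the UI (a sketch; the namespace and pod-name patterns assume the default naming from the logs above):

```
# Job-specific pods (orchestrator/read/write) sorted by creation time
kubectl -n airbyte get pods --sort-by=.metadata.creationTimestamp | grep -i job

# For a pod that is still running long after its attempt failed, see what it's stuck on
kubectl -n airbyte describe pod orchestrator-repl-job-15131-attempt-2
kubectl -n airbyte logs orchestrator-repl-job-15131-attempt-2 --tail=100
```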

we’ve already doubled the max sync and check workers config, increased the worker replica count from 1 to 2, and increased the cluster’s max nodes from 15 to 21, without any clear change in the failure rate… does this make it less likely to be a resource contention issue…?
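
One way to answer that more directly is to look at node-level pressure during a failure window, rather than just at the limits (a sketch; `kubectl top` assumes metrics-server is installed):

```
# Current per-node CPU/memory usage (requires metrics-server)
kubectl top nodes

# How much of each node is already requested/limited by scheduled pods
kubectl describe nodes | grep -A 8 "Allocated resources"
```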

<@U069EMNRPA4> have you seen the `Terminal state in the State Store is missing` error the OP mentions in the thread? (not the original post, but down lower) . . . I haven’t seen this one, and the behavior they’re seeing seems strange to me. Didn’t know if you had any ideas.