Airbyte sync requiring multiple attempts after upgrade to 0.63.3

Summary

After upgrading to version 0.63.3, Airbyte syncs are failing and requiring multiple attempts across various sources such as MySQL, PostgreSQL, the Stripe API, and Shopify. Errors indicate the destination process is still alive when the worker tries to read its exit value, along with stream-closure messages. Upgrading to 0.63.4 and 0.63.5 did not resolve the issue. The destination is Snowflake, using connector version 3.11.0.


Question

Since upgrading to 0.63.3, Airbyte needs multiple attempts before a sync succeeds. It's happening on multiple sources (MySQL, Postgres, Stripe API, Shopify) without any clear pattern. Usually it finishes successfully somewhere between the 3rd and 6th attempt, but this increases sync times significantly (e.g. a sync that usually took ~10 minutes now needs 6 x 10 min because of retries).

The situation did not improve after upgrading to 0.63.4 / 0.63.5.

I’m seeing errors like this:

```
java.lang.IllegalStateException: Destination process is still alive, cannot retrieve exit value.
        at com.google.common.base.Preconditions.checkState(Preconditions.java:515) ~[guava-33.2.0-jre.jar:?]
        at io.airbyte.workers.internal.DefaultAirbyteDestination.getExitValue(DefaultAirbyteDestination.java:210) ~[io.airbyte-airbyte-commons-worker-0.63.5.jar:?]
        at io.airbyte.workers.general.BufferedReplicationWorker.readFromDestination(BufferedReplicationWorker.java:492) ~[io.airbyte-airbyte-commons-worker-0.63.5.jar:?]
        at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:235) ~[io.airbyte-airbyte-commons-worker-0.63.5.jar:?]
        at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
2024-07-10 05:16:01 platform > readFromDestination: done. (writeToDestFailed:true, dest.isFinished:false)
2024-07-10 05:17:01 platform > airbyte-destination gobbler IOException: Stream closed. Typically happens when cancelling a job.
2024-07-10 05:18:01 platform > airbyte-source gobbler IOException: Stream closed. Typically happens when cancelling a job.
2024-07-10 05:18:01 platform > Closing StateCheckSumCountEventHandler
```

The destination is Snowflake; I'm using the latest Snowflake connector, 3.11.0.

Has anyone else experienced something similar? I'm not even sure where to start looking for the problem, so any hint is more than welcome.


---

This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1720599281960249) if you want to access the original thread.

[Join the conversation on Slack](https://slack.airbyte.com)

<sub>
["airbyte-sync", "upgrade-0.63.3", "multiple-attempts", "destination-process", "stream-closed", "snowflake-connector"]
</sub>

Hi Gasper, we're on 0.63.1 and are also noticing that our syncs are taking longer / needing more retries, and we're feeling confused. We're syncing from e.g. MySQL, Postgres, Stripe, Zendesk; the destination is BigQuery (2.6.0, not the latest version though).

Hi again, we increased MAX_SPEC_WORKERS, MAX_SYNC_WORKERS and worker.replicaCount as described [here](https://docs.airbyte.com/enterprise-setup/scaling-airbyte#concurrent-sync-limits) (even though we use the OSS version) in our dev environment, and in our case we see improvements. I guess it's addressing the symptom rather than the cause, but thought I'd share just in case it'd be relevant for you too :slightly_smiling_face:
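As a rough illustration of that kind of change, here is a minimal Helm values sketch. `worker.replicaCount` and the worker-count variables come from this thread; the `extraEnv` mechanism and the concrete numbers are assumptions that depend on your chart version and workload.

```yaml
# Sketch of Helm value overrides for scaling Airbyte workers.
# The extraEnv key and the numbers below are illustrative assumptions;
# check your airbyte/airbyte chart version for the exact structure.
worker:
  replicaCount: 2              # run additional worker pods (example value)
  extraEnv:                    # assumed way to pass worker env vars
    - name: MAX_SYNC_WORKERS
      value: "10"
    - name: MAX_SPEC_WORKERS
      value: "10"
```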

<@U06MYV00TKQ> I saw your message yesterday that you were trying to change the concurrent sync limit parameters, as described here: https://docs.airbyte.com/enterprise-setup/scaling-airbyte#concurrent-sync-limits

yes, it made things better in dev on some test syncs (we could reproduce and resolve the kinds of errors we were getting in prod). But once we rolled it out in prod, we still had some issues. So I thought maybe I shared that too early… will definitely update if we find something that works for us

I'm just testing a change from 5 -> 10 on all workers (listed below; see the `.env` sketch after the list), together with an Airbyte upgrade to the latest release 0.63.6, but unfortunately it doesn't seem to have any effect on the situation.
• MAX_SYNC_WORKERS
• MAX_SPEC_WORKERS
• MAX_CHECK_WORKERS
• MAX_DISCOVER_WORKERS
• MAX_NOTIFY_WORKERS
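
For reference, a minimal sketch of what that change could look like in the docker compose deployment's `.env` file; the variable names are the ones listed above, and the values simply reflect the 5 -> 10 change being tested:

```
# Worker concurrency limits, raised from 5 to 10 as described above
# (sketch only; tune to what your instance can actually handle).
MAX_SYNC_WORKERS=10
MAX_SPEC_WORKERS=10
MAX_CHECK_WORKERS=10
MAX_DISCOVER_WORKERS=10
MAX_NOTIFY_WORKERS=10
```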

Would it be possible to maybe try rolling back to an older version instead…? Our team decided to adjust the cluster autoscaling on top of the Airbyte worker configurations and live with restarting some syncs for now… but I see things that suggest something's wrong with the Airbyte instance itself, e.g. a couple of syncs running with no related sync-specific pods, or vice versa, a sync failing but the three orchestrator/read/write pods remaining for hours afterwards.

sorry to bother you <@U01MMSDJGC9>, but have you heard of similar repeated failed sync attempt issues from people who upgraded to Airbyte 0.63.X…?

<@U06MYV00TKQ> just a quick update: one thing I did was reconfigure the scheduled jobs so that they don't overlap, and it seems that this helps (with the exception of one job that has a custom connector).
It looks like a performance issue, as we are currently running Airbyte on a single EC2 instance (r5.2xlarge) with a docker compose deployment. As a next step, I'll migrate to Kubernetes and see how much we have to scale to get this under control.
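
For anyone trying the same approach: one way to keep jobs from overlapping is Airbyte's cron scheduling, which uses Quartz-style expressions. A hypothetical staggering, with made-up connection names and times:

```
# Hypothetical staggered schedules (Quartz-style cron, as used by Airbyte's "Cron" option);
# connection names and times are illustrative only.
mysql    -> snowflake:  0 0 2 * * ?    # 02:00
postgres -> snowflake:  0 30 2 * * ?   # 02:30
stripe   -> snowflake:  0 0 3 * * ?    # 03:00
```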

<@U075ZN9GT96>
I had a similar problem with this using Docker Compose and followed your suggestion of making sure the scheduled jobs don't overlap, and it seems to have solved my current problem too.

Are you planning to run Kubernetes in Docker containers on a single EC2 instance, or to run it using EKS or multiple EC2 instances?