Large sync fails but successfully loads data after failed retry (with different error)

  • Is this your first time deploying Airbyte?: No
  • Deployment: Kubernetes
  • Airbyte Version: 0.36.1
  • Source name/version: MSSQL 0.3.19
  • Destination name/version: Snowflake 0.4.24
  • Step: sync
  • Description:
    One of our largest syncs keeps failing in the same manner:
  1. First, it gets all the way through extract and then fails with a very generic java.util.concurrent.ExecutionException: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143. After that, normalization runs and seemingly succeeds (however, the destination data is not updated)
    • Example log here (178.89 GB | 98,506,711 emitted records | no records | 10h 6m 45s)
  2. A retry sync runs, and fails much more quickly, this time with the (much more explicit) error Stage 'AIRBYTE.SOME_SCHEMA_THAT_SHOULD_NOT_BE_WRITTEN_TO__OVERWRITE_IN_CONNECTION.MSSQL_QCS_QCS__LLC_ITEM_LEDGER_ENTRY' does not exist or not authorized.
    • Example log here (35.33 GB | 23,796,982 emitted records | no records | 2h 17m 44s)
    • The schema SOME_SCHEMA_THAT_SHOULD_NOT_BE_WRITTEN_TO__OVERWRITE_IN_CONNECTION is the default schema set in our Snowflake destination config, but it is overridden in the connection’s “Namespace Custom Format” field as “MSSQL_QCS” - this is the schema name you can see in both logs during normalization, too
    • The above schema does not contain any records, so it’s never written to
    • This is where things get weird, because normalization on this failed retry actually writes data to the normalized destination tables successfully. For example, max(_airbyte_normalized_at) returns 2022-04-25 15:36:49.118 +0000, which was clearly during the first retry’s normalization, even though only the first run actually extracted that data
  3. The second retry looks very similar to the first sync, extracting all data and failing with the exit code 143 issue (can drop that log if it’d be useful)
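The max(_airbyte_normalized_at) check mentioned above can be reproduced with a query along these lines (the schema and table names here are illustrative guesses based on the namespace and stream names in the logs; substitute your own):

```sql
-- Illustrative check of when normalization last wrote to a normalized table.
-- Schema (MSSQL_QCS) and table name are assumptions derived from the
-- "Namespace Custom Format" and stream names mentioned above.
select max(_airbyte_normalized_at) as last_normalized_at
from MSSQL_QCS.QCS__LLC_ITEM_LEDGER_ENTRY;
```

If the timestamp falls inside a retry attempt’s normalization window rather than the first run’s, that supports the theory that the retry’s normalization picked up the first attempt’s staged data.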

So it kind of seems like there might be two issues here. The first is that the initial job is “failing” even though everything extracts (normalization runs but does not actually update the destination), and then normalizes fine on the retry. The second (potentially Snowflake-specific) issue is that the custom namespace format (schema) is being ignored somewhere in the code (it is instead being appended to the table name), but not during normalization, where it matters.

We are evaluating Airbyte and testing its load features, and have some questions. We have a large volume of data to load (full refresh) from our data source, and we see that Airbyte fails to load it in one full-refresh sync. Any suggestions or ideas for how to handle large-volume full refreshes at the Airbyte level? Thank you. Please find the log and error message below - what is the real cause of this error?

Please find the full log details
logs-64.txt (61.2 KB)

Is this a source connection timeout, or something else? Relevant log excerpt:

```
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):301 - Records read: 24294000 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):301 - Records read: 24295000 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.p.KubePodProcess(close):708 - (pod: airbyte-dev / source-mssql-sync-649-2-labsr) - Closed all resources for pod
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):305 - Total records read: 24295634 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.p.KubePodProcess(close):708 - (pod: airbyte-dev / destination-s3-sync-649-2-emnch) - Closed all resources for pod
2022-04-19 20:44:01 WARN i.a.c.i.LineGobbler(voidCall):86 - airbyte-destination gobbler IOException: Socket closed. Typically happens when cancelling a job.
2022-04-19 20:44:01 ERROR i.a.w.DefaultReplicationWorker(run):169 - Sync worker failed.
java.util.concurrent.ExecutionException: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073) ~[?:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:162) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
	Suppressed: io.airbyte.workers.WorkerException: Source process exit with code 143. This warning is normal if the job was cancelled.
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteSource.close(DefaultAirbyteSource.java:136) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:126) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
	Suppressed: io.airbyte.workers.WorkerException: Destination process exit with code 143. This warning is normal if the job was cancelled.
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.close(DefaultAirbyteDestination.java:119) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:126) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getReplicationRunnable$5(DefaultReplicationWorker.java:312) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
```

Hey @luke, can you create an issue on GitHub so that the team can check and get back to you?

Hey @jyothi.kukunoor, it looks like the API is backing off. Can you try with a closer date range and see if it works?

Here we are connecting to a SQL Server data source. What API are we talking about that is backing off?

Opened an issue here

Thanks @luke. @jyothi.kukunoor, the logs you have posted are from Google Analytics, I think; can you confirm?

@alafanechere this seems to be similar to the problem I’m experiencing with a Gitlab → Snowflake connection for mature projects/groups.

For reference: Sync Jobs Failing: Non-Zero Error Code 143

@harshith: Please find the sync log for the failing large-volume sync attached: logs-64.txt (61.2 KB)
The source is SQL Server and the destination is AWS S3.

Hey Jyoti, I can see the above logs are still for Google Analytics:

Checking if airbyte/source-google-analytics-v4:0.1.16 exists...

@harshith I haven’t heard anything back yet, even about possible workarounds. Is anyone else able to look into this? We’ve lost a lot of confidence in Airbyte as our EL solution because this is impacting the majority of our DB jobs at this point.

Hey @luke, sorry for that. Let me add the critical tag and ask the team to look into it early.


Hey @luke, one last question: can you help with the resource details, like memory and CPU?

Sure thing @harshith. We are running in an EKS cluster using AWS t3.large instances (2 vCPUs with 8 GiB memory) with cluster-autoscaler, and have deployed Airbyte using the Helm chart.
We have tried overriding the memory resource limits with those from the stable-with-resource-limits Kustomize overlay, and configured the chart with jobs.resources.limits.memory: 512Mi, with no noticeable impact.
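For anyone following along, the Helm override we attempted corresponds roughly to this values fragment (a sketch; the exact key paths depend on the chart version, so verify against your chart’s values.yaml):

```yaml
# values.yaml fragment (sketch): memory limit for Airbyte job pods.
# Key path follows the jobs.resources convention mentioned above;
# confirm against your Helm chart version before applying.
jobs:
  resources:
    limits:
      memory: 512Mi
```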

Hey, is it possible to increase the resources and try again, to see if it still happens?

Same issue here. Will try beefing up the virtual machine.

Alright, so we appear to have fixed this through a variety of changes:

  1. Moved our Kubernetes config from the Helm chart to Kustomize (after seeing that Kustomize gets preferential treatment for fixes within Airbyte)
  2. Set jobs.resources.requests.memory to 1Gi
  3. Upgraded Airbyte to 0.38.4, as well as upgrading all of our sources to their latest versions (after seeing a bunch of updates regarding memory usage)
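For reference, in the Kustomize deployment the memory request from step 2 is applied through environment variables rather than Helm values. A sketch of the .env fragment we ended up with (the JOB_MAIN_CONTAINER_* variable names are from memory; verify them against your Airbyte version’s .env template):

```
# .env fragment (sketch): resource requests for Airbyte job containers.
# Variable names assume Airbyte's JOB_MAIN_CONTAINER_* convention;
# check your version's .env template before relying on them.
JOB_MAIN_CONTAINER_MEMORY_REQUEST=1Gi
```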

So, we are leaning towards this being at least related to memory/resource allocation, but we can’t say for sure given our shotgun approach to fixes. The only thing we ruled out as beneficial was upgrading our EKS cluster.