Large sync fails but successfully loads data after failed retry (with different error)

  • Is this your first time deploying Airbyte?: No
  • Deployment: Kubernetes
  • Airbyte Version: 0.36.1
  • Source name/version: MSSQL 0.3.19
  • Destination name/version: Snowflake 0.4.24
  • Step: sync
  • Description:
    One of our largest syncs keeps failing in the same manner:
  1. First, it gets all the way through extract and then fails with a very generic java.util.concurrent.ExecutionException: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143. After that, normalization runs and seemingly succeeds (however, the destination data is not updated)
    • Example log here (178.89 GB | 98,506,711 emitted records | no records | 10h 6m 45s)
  2. A retry sync runs, and fails much more quickly, this time with the (much more explicit) error Stage 'AIRBYTE.SOME_SCHEMA_THAT_SHOULD_NOT_BE_WRITTEN_TO__OVERWRITE_IN_CONNECTION.MSSQL_QCS_QCS__LLC_ITEM_LEDGER_ENTRY' does not exist or not authorized.
    • Example log here (35.33 GB | 23,796,982 emitted records | no records | 2h 17m 44s)
    • The schema SOME_SCHEMA_THAT_SHOULD_NOT_BE_WRITTEN_TO__OVERWRITE_IN_CONNECTION is the default schema set in our Snowflake destination config, but it is overridden in the connection’s “Namespace Custom Format” field as “MSSQL_QCS” - this is the schema name you can see in both logs during normalization, too
    • The above schema does not contain any records, so it’s never written to
    • This is where things get weird, because normalization on this failed retry actually writes data to the normalized destination tables successfully. For example, max(_airbyte_normalized_at) returns 2022-04-25 15:36:49.118 +0000, which was clearly during the first retry’s normalization, even though only the first run actually extracted that data
  3. The second retry looks very similar to the first sync, extracting all data and failing with the exit code 143 issue (can drop that log if it’d be useful)
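The max(_airbyte_normalized_at) check mentioned above can be reproduced with a query along these lines (the schema and table names here are illustrative guesses based on the namespace and stream names in the logs; substitute your own):

```sql
-- Illustrative check of when normalization last wrote to a normalized table.
-- Schema (MSSQL_QCS) and table name are assumptions derived from the
-- "Namespace Custom Format" and stream names mentioned above.
select max(_airbyte_normalized_at) as last_normalized_at
from MSSQL_QCS.QCS__LLC_ITEM_LEDGER_ENTRY;
```

If the timestamp falls inside a retry attempt’s normalization window rather than the first run’s, that supports the theory that the retry’s normalization picked up the first attempt’s staged data.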

So it kind of seems like there might be two issues here. The first is that the initial job is “failing” even though everything extracts (normalization runs but does not actually update the destination), and then normalizes fine on the retry. The second (potentially Snowflake-specific) issue is that the custom namespace format (schema) is being ignored somewhere in the code (it is instead being appended to the table name), but not during normalization, where it matters.

We are evaluating Airbyte and testing its load features, and have some questions. We have a large volume of data to load (full refresh) from our data source, and we see that Airbyte fails to load it in one full-refresh sync. Any suggestions or ideas for how to handle large-volume full refreshes at the Airbyte level? Thank you. Please find the log and error message below - what is the real cause of this error?

Please find the full log details
logs-64.txt (61.2 KB)

Is this a source connection timeout, or something else? Relevant log excerpt:

```
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):301 - Records read: 24294000 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):301 - Records read: 24295000 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.p.KubePodProcess(close):708 - (pod: airbyte-dev / source-mssql-sync-649-2-labsr) - Closed all resources for pod
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):305 - Total records read: 24295634 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.p.KubePodProcess(close):708 - (pod: airbyte-dev / destination-s3-sync-649-2-emnch) - Closed all resources for pod
2022-04-19 20:44:01 WARN i.a.c.i.LineGobbler(voidCall):86 - airbyte-destination gobbler IOException: Socket closed. Typically happens when cancelling a job.
2022-04-19 20:44:01 ERROR i.a.w.DefaultReplicationWorker(run):169 - Sync worker failed.
java.util.concurrent.ExecutionException: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073) ~[?:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:162) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
	Suppressed: io.airbyte.workers.WorkerException: Source process exit with code 143. This warning is normal if the job was cancelled.
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteSource.close(DefaultAirbyteSource.java:136) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:126) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
	Suppressed: io.airbyte.workers.WorkerException: Destination process exit with code 143. This warning is normal if the job was cancelled.
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.close(DefaultAirbyteDestination.java:119) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:126) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getReplicationRunnable$5(DefaultReplicationWorker.java:312) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
```

Hey @luke, can you create an issue on GitHub so that the team can check and get back to you?

Hey @jyothi.kukunoor, it looks like the API is backing off. Can you try with a closer date range and see if it works?

Here we are connecting to a SQL Server data source. What API are we talking about that is backing off?

Opened an issue here

Thanks @luke. @jyothi.kukunoor, the logs you have posted are from Google Analytics, I think; can you confirm?

@alafanechere this seems to be similar to the problem I’m experiencing with a Gitlab → Snowflake connection for mature projects/groups.

For reference: Sync Jobs Failing: Non-Zero Error Code 143

@harshith: Please find the sync log for the failing large-volume sync attached: logs-64.txt (61.2 KB)
The source is SQL Server and the destination is AWS S3.

Hey Jyoti, I can see the above logs are still for Google Analytics:

Checking if airbyte/source-google-analytics-v4:0.1.16 exists...

@harshith I haven’t heard anything back yet, even about possible workarounds. Is anyone else able to look into this? We’ve lost a lot of confidence in Airbyte as our EL solution because this is impacting the majority of our DB jobs at this point.

Hey @luke, sorry for that. Let me add the critical tag and ask the team to look into it early.


Hey @luke, one last question: can you help with the resource details, like memory and CPU?

Sure thing @harshith. We are running in an EKS cluster using AWS t3.large instances (2 vCPUs with 8 GiB memory) with cluster-autoscaler, and have deployed Airbyte using the Helm chart.
We have tried overriding the memory resource limits with those from the stable-with-resource-limits Kustomize overlay, and configured the chart with jobs.resources.limits.memory: 512Mi, with no noticeable impact.
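For anyone following along, the Helm override we attempted corresponds roughly to this values fragment (a sketch; the exact key paths depend on the chart version, so verify against your chart’s values.yaml):

```yaml
# values.yaml fragment (sketch): memory limit for Airbyte job pods.
# Key path follows the jobs.resources convention mentioned above;
# confirm against your Helm chart version before applying.
jobs:
  resources:
    limits:
      memory: 512Mi
```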

Hey, is it possible to increase the resources and try again, to see if it still happens?

Same issue here. Will try beefing up the virtual machine.

Alright, so we appear to have fixed this through a variety of changes:

  1. Moved our Kubernetes config from the Helm chart to Kustomize (after seeing that Kustomize gets preferential treatment for fixes within Airbyte)
  2. Set jobs.resources.requests.memory to 1Gi
  3. Upgraded Airbyte to 0.38.4, as well as upgrading all of our sources to their latest versions (after seeing a bunch of updates regarding memory usage)
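For reference, in the Kustomize deployment the memory request from step 2 is applied through environment variables rather than Helm values. A sketch of the .env fragment we ended up with (the JOB_MAIN_CONTAINER_* variable names are from memory; verify them against your Airbyte version’s .env template):

```
# .env fragment (sketch): resource requests for Airbyte job containers.
# Variable names assume Airbyte's JOB_MAIN_CONTAINER_* convention;
# check your version's .env template before relying on them.
JOB_MAIN_CONTAINER_MEMORY_REQUEST=1Gi
```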

So, we are leaning towards this being at least related to memory/resource allocation, but we can’t say for sure given our shotgun approach to fixes. The only thing we ruled out as beneficial was upgrading our EKS cluster.