Description:
One of our largest syncs keeps failing in the same manner:
First, it gets all the way through extraction and then fails with a very generic `java.util.concurrent.ExecutionException: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143`. After that, normalization runs and seemingly succeeds (however, the destination data is not updated).
Example log here (178.89 GB | 98,506,711 emitted records | no records | 10h 6m 45s)
A retry sync then runs and fails much more quickly, this time with the much more explicit error `Stage 'AIRBYTE.SOME_SCHEMA_THAT_SHOULD_NOT_BE_WRITTEN_TO__OVERWRITE_IN_CONNECTION.MSSQL_QCS_QCS__LLC_ITEM_LEDGER_ENTRY' does not exist or not authorized.`
Example log here (35.33 GB | 23,796,982 emitted records | no records | 2h 17m 44s)
The schema SOME_SCHEMA_THAT_SHOULD_NOT_BE_WRITTEN_TO__OVERWRITE_IN_CONNECTION is the default schema set in our Snowflake destination config, but it is overridden in the connection via the “Namespace Custom Format” field, set to “MSSQL_QCS”; that is also the schema name you can see during normalization in both logs.
The above schema does not contain any records, so it’s never written to
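To spell the namespace setup out, the connection is configured roughly like this (a sketch using Airbyte's namespaceDefinition/namespaceFormat connection settings; the field names here are my best recollection, not copied from the API, so treat them as illustrative):

```yaml
# Connection namespace settings (sketch; field names assumed)
namespaceDefinition: customformat   # "Namespace Custom Format" in the UI
namespaceFormat: MSSQL_QCS          # schema the raw tables and normalization should target
# Snowflake destination default schema (should never receive data):
#   SOME_SCHEMA_THAT_SHOULD_NOT_BE_WRITTEN_TO__OVERWRITE_IN_CONNECTION
```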
This is where things get weird, because normalization on this failed retry actually writes data to the normalized destination tables successfully. For example, `max(_airbyte_normalized_at)` returns 2022-04-25 15:36:49.118 +0000, which clearly falls within the first retry’s normalization, even though only the first run actually extracted that data.
The second retry looks very similar to the first sync, extracting all data and failing with the exit code 143 issue (can drop that log if it’d be useful)
So it seems like there might be two issues here. The first is that the initial job is reported as “failed” even though everything extracts (it just isn’t normalized, despite those steps appearing to run fine), and then normalization completes fine on the retry. The second, potentially Snowflake-specific, issue is that the custom namespace format (schema) is being ignored somewhere in the code (and appended to the table name instead), but not during normalization, where it matters.
We are evaluating Airbyte and testing its load features, and have some questions. We have a large volume of data to load (full refresh) from our source, and we see that it fails to load large volumes in a single full refresh. Do you have any suggestions or ideas for handling large volumes (full refresh) at the Airbyte level? Thank you. Please see the log and error message below; what is the real cause of the error?
Please find the full log details logs-64.txt (61.2 KB)
Is it a source connection timeout or something else?

```
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):301 - Records read: 24294000 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):301 - Records read: 24295000 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.p.KubePodProcess(close):708 - (pod: airbyte-dev / source-mssql-sync-649-2-labsr) - Closed all resources for pod
2022-04-19 20:44:01 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):305 - Total records read: 24295634 (72 GB)
2022-04-19 20:44:01 INFO i.a.w.p.KubePodProcess(close):708 - (pod: airbyte-dev / destination-s3-sync-649-2-emnch) - Closed all resources for pod
2022-04-19 20:44:01 WARN i.a.c.i.LineGobbler(voidCall):86 - airbyte-destination gobbler IOException: Socket closed. Typically happens when cancelling a job.
2022-04-19 20:44:01 ERROR i.a.w.DefaultReplicationWorker(run):169 - Sync worker failed.
java.util.concurrent.ExecutionException: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073) ~[?:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:162) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
	Suppressed: io.airbyte.workers.WorkerException: Source process exit with code 143. This warning is normal if the job was cancelled.
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteSource.close(DefaultAirbyteSource.java:136) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:126) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
	Suppressed: io.airbyte.workers.WorkerException: Destination process exit with code 143. This warning is normal if the job was cancelled.
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.close(DefaultAirbyteDestination.java:119) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:126) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 143
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getReplicationRunnable$5(DefaultReplicationWorker.java:312) ~[io.airbyte-airbyte-workers-0.35.67-alpha.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
```
@harshith: Please find the sync log for the larger-volume failure attached: logs-64.txt (61.2 KB)
The source is SQL Server and the destination is AWS S3.
@harshith I haven’t heard anything back yet, even about possible workarounds. Is there anyone else who can look into this? We’ve lost a lot of confidence in Airbyte as our EL solution because this is impacting the majority of our DB jobs at this point.
Sure thing @harshith. We are running in an EKS cluster on AWS t3.large instances (2 vCPUs, 8 GiB memory) with cluster-autoscaler, and we deployed Airbyte using the Helm chart.
We have tried overriding the memory resource limits with those from the stable-with-resource-limits Kustomize overlay and configured the chart with jobs.resources.limits.memory: 512Mi, with no noticeable impact.
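In case it helps anyone reproduce this, here is roughly what that override looked like as Helm values (a sketch only; the key path is assumed from the setting quoted above and may differ between chart versions):

```yaml
# values.yaml sketch -- key path assumed, not verified against every chart version
jobs:
  resources:
    limits:
      memory: 512Mi   # memory cap applied to the sync job pods
```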
Alright, so we appear to have fixed this through a variety of changes:
- Moved our Kubernetes config from the Helm chart to Kustomize (after seeing that Kustomize gets preferential treatment for fixes within Airbyte)
- Set jobs.resources.requests.memory to 1Gi (see the sketch after this list)
- Upgraded Airbyte to 0.38.4 and upgraded all of our sources to the latest versions (after seeing a bunch of updates regarding memory usage)
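For reference, the memory request change amounted to something like this (again a sketch: the jobs.resources key path is assumed from the settings above and may vary by chart or overlay version):

```yaml
# values sketch for the fix -- key path assumed
jobs:
  resources:
    requests:
      memory: 1Gi   # asks the scheduler to reserve at least 1Gi for each sync job pod
```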
So, leaning towards this being at least related to memory/resource allocation, but can’t say for sure given our shotgun approach to fixes. The only thing we ruled out as being beneficial here was upgrading our EKS cluster.