- Is this your first time deploying Airbyte?: No
- OS Version / Instance: Kubernetes
- Memory / Disk: unlimited (scalable cluster)
- Deployment: Kubernetes
- Airbyte Version: 0.35.64-alpha
- Source name/version: source-mssql 0.3.22
- Destination name/version: destination-s3 / 0.3.5
- Step: The issue happens during sync (when the sync succeeds after one or multiple attempts)
- Description: We have some connection syncs that sometimes succeed only after multiple failed attempts. I am fine with that (sync time is not important to us). However, the data is duplicated in the destination (S3 bucket) because data from failed attempts is not cleaned up by the S3 destination. I am wondering if this is a choice made by the devs who developed the connector, or something that will be fixed in future releases of the connector?
Hey, I have a couple of questions about this:
- Is there any reason why it succeeds only after multiple attempts?
- Also, you could use a custom dbt transformation if you can figure out a unique cursor so that dedup can happen, but this is not available out of the box.
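To illustrate what a cursor-based dedup would look like, here is a minimal Python sketch (not connector code): it keeps only the latest record per key, ordered by a cursor field. The field names `id` and `updated_at` are hypothetical placeholders; a dbt model would express the same logic in SQL.

```python
# Sketch: keep only the most recent record per primary key, using a
# cursor field to decide recency. Field names are illustrative only.

def dedup_latest(records, key_field="id", cursor_field="updated_at"):
    latest = {}
    for rec in records:
        k = rec[key_field]
        # Keep this record if it is the first seen for the key, or
        # if its cursor value is newer than what we kept so far.
        if k not in latest or rec[cursor_field] > latest[k][cursor_field]:
            latest[k] = rec
    return list(latest.values())
```

This only works when a unique key plus a monotonically increasing cursor exist, which is exactly the limitation raised below.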
- We get generic errors like these:
errors: $.flattening: is missing but it is required, $.format_type: does not have a value in the enumeration [CSV]
2022-06-10 02:37:53 INFO i.a.v.j.JsonSchemaValidator(test):56 - JSON schema validation failed. errors: $.format_type: does not have a value in the enumeration [JSONL]
We suspect that it is the MSSQL server that times out, because this happens only during syncs of large tables.
- I don’t think the S3 destination supports dbt transformations, and even if it did, it still wouldn’t be an option because the tables don’t have a field that can be used for dedup.
A solution for the S3 destination would be something like this: keep track of the files written during an attempt, and delete them if the attempt fails.
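The cleanup idea above can be sketched roughly as follows. This is a hypothetical Python illustration, not the connector's actual implementation: every object key written during an attempt is recorded, and on failure those keys are deleted so a retry does not leave duplicates behind. The `FakeS3` class is an in-memory stand-in for a real S3 client (boto3's S3 client exposes compatible `put_object`/`delete_object` methods).

```python
# Sketch of attempt-scoped cleanup for an S3-like destination.
# FakeS3 is an in-memory stand-in so the example is self-contained.

class FakeS3:
    def __init__(self):
        self.objects = {}  # (bucket, key) -> body

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body

    def delete_object(self, Bucket, Key):
        self.objects.pop((Bucket, Key), None)


class AttemptScopedWriter:
    def __init__(self, client, bucket):
        self.client = client
        self.bucket = bucket
        self.written_keys = []  # keys written during the current attempt

    def write(self, key, body):
        self.client.put_object(Bucket=self.bucket, Key=key, Body=body)
        self.written_keys.append(key)

    def rollback(self):
        # Delete everything this attempt wrote, so a failed attempt
        # leaves no partial data in the bucket.
        for key in self.written_keys:
            self.client.delete_object(Bucket=self.bucket, Key=key)
        self.written_keys.clear()

    def commit(self):
        # On success, keep the files and just reset the tracking list.
        self.written_keys.clear()


def run_attempt(writer, records):
    """Write all records; on any failure, roll back this attempt's files."""
    try:
        for i, body in enumerate(records):
            writer.write(f"part-{i}.jsonl", body)
        writer.commit()
        return True
    except Exception:
        writer.rollback()
        return False
```

With this pattern, a retry that eventually succeeds leaves only the successful attempt's files in the bucket, which is the behavior being requested.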
Got it, this looks like a problem. Could you create an issue around this so that the team can give some suggestions, or we can get to a solution?