Failed attempts leading to duplicates

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Google Kubernetes Engine default OS
  • Memory / Disk: 4 x e2-medium
  • Deployment: Kubernetes
  • Airbyte Version: 0.35.67-alpha
  • Source name/version: postgres@0.4.10
  • Destination name/version: bigquery@1.1.0
  • Step: Sync (Append Incremental)
  • Description:

When running a sync, it sometimes happens that an attempt encounters an error like this one:

State hash XXXXXXXXX was already committed, likely indicating a state hash collision.
(By the way, I have no idea what could be causing this error.)

Anyway, it seems to happen randomly: the sync reads data correctly, then after some time the error is thrown, the attempt fails, a new attempt starts, and with a bit of luck it succeeds. Great!

My problem here is that even though the first attempt failed in my example, some data was still inserted into my destination database. For instance, here are two rows resulting from this (grouped by the _airbyte_emitted_at column):

| # | _airbyte_emitted_at            | inserted rows | attempt   |
|---|--------------------------------|---------------|-----------|
| 1 | 2022-04-21 13:00:35.375000 UTC | 71048         | attempt 1 |
| 2 | 2022-04-21 13:20:58.160000 UTC | 72128         | attempt 2 |

Here we can see that some data was inserted during the first attempt; the attempt failed, the cursor was reset, and the same data was eventually re-inserted during the second attempt.
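To make the failure mode above concrete, here is a toy Python sketch of the resulting raw table and of deduplicating it by primary key, keeping only the row from the most recent attempt (latest `_airbyte_emitted_at`). The column names follow Airbyte's raw-table convention, but the records and the `id` key are made up for illustration:

```python
from datetime import datetime

rows = [
    # attempt 1 (failed mid-sync, but its rows were still committed)
    {"id": 1, "_airbyte_emitted_at": datetime(2022, 4, 21, 13, 0, 35)},
    {"id": 2, "_airbyte_emitted_at": datetime(2022, 4, 21, 13, 0, 35)},
    # attempt 2 (succeeded, and re-read the same records)
    {"id": 1, "_airbyte_emitted_at": datetime(2022, 4, 21, 13, 20, 58)},
    {"id": 2, "_airbyte_emitted_at": datetime(2022, 4, 21, 13, 20, 58)},
]

def dedupe(rows, key="id"):
    """Keep, for each primary key, only the most recently emitted row."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row["_airbyte_emitted_at"] > latest[k]["_airbyte_emitted_at"]:
            latest[k] = row
    return list(latest.values())

deduped = dedupe(rows)  # only the two rows from attempt 2 remain
```

This is essentially what an Incremental Dedupe normalization does for you in the destination, keyed on the stream's primary key.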

In the end, all the data inserted during the first attempt is nothing but duplicates, and I wanted to know whether there is a setting somewhere to either:

  • A / prevent data insertion until an attempt succeeds?
  • B / delete data inserted during failed attempts?

Hello @Clovis,
Thank you for this great question.
You are in a partial failure scenario: some records were successfully committed to the destination, and the sync failed afterward on other records. Airbyte does not clean up already-synced data when a failure happens.
In incremental syncs, our connectors use a cursor stored in the state to resume reading from where they stopped in the last sync. Hence, the more frequently the cursor is checkpointed, the less prone you are to duplicate data in your destination.
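The effect of checkpoint frequency can be sketched with a toy calculation (this is not Airbyte's actual implementation; the function name and numbers are hypothetical). On failure, the next attempt resumes from the last *committed* cursor, so everything read after that checkpoint is re-read and duplicated:

```python
def records_reread_after_failure(checkpoint_every, fail_at):
    """Number of records a retry reads again, given the last committed checkpoint."""
    last_checkpoint = (fail_at // checkpoint_every) * checkpoint_every
    return fail_at - last_checkpoint

# Checkpointing every 10,000 records vs. only once at the end of a
# 100,000-record sync, with a failure after 71,048 records read:
print(records_reread_after_failure(10_000, 71_048))   # 1048
print(records_reread_after_failure(100_000, 71_048))  # 71048
```

With frequent checkpoints only the tail of the failed attempt is re-read; with a single end-of-sync checkpoint the whole attempt is re-read, which matches the duplicate counts you observed.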

I suggest two approaches to work around your problem:

  • If you use normalization, you can choose the Incremental Dedupe mode, which will remove duplicates in the destination
  • What is your current cursor field? Using a more fine-grained cursor field might increase the state checkpointing frequency and reduce the duplicates.

Hi @alafanechere ,
Thanks for the reply; good to know this is expected behavior. I will try tweaking my cursor field as you suggested and see what happens :+1:


I investigated a bit more, and it looks like state checkpointing for databases happens at the end of a sync; in other words, no cursor is saved if the sync fails. A second attempt may then re-read exactly the same data as the previous attempt and generate duplicates.
This is specific to database connectors; our API connectors perform more frequent state checkpointing.

To limit the amount of duplicates I would:

  • Try to understand the root cause of the intermittent failures and limit them
  • Run frequent syncs; this will increase the chance of successful reads and lead to more frequent state checkpointing
  • Use the Incremental Dedupe mode to remove duplicates in the destination