Failed attempts leading to duplicates

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Google Kubernetes Engine default OS
  • Memory / Disk: 4 x e2-medium
  • Deployment: Kubernetes
  • Airbyte Version: 0.35.67-alpha
  • Source name/version: postgres@0.4.10
  • Destination name/version: bigquery@1.1.0
  • Step: Sync (Append Incremental)
  • Description:

When running a sync, it sometimes happens that an attempt encounters an error like this one:

State hash XXXXXXXXX was already committed, likely indicating a state hash collision.
(By the way, I have no idea what could be causing this error.)

Anyway, it seems to happen randomly: the sync reads data correctly, then after some time the error is thrown, the attempt fails, a new attempt starts, and with a bit of luck it succeeds. Great!

My problem here is that even though the first attempt failed in my example, some data was still inserted into my destination database. For instance, here are two rows resulting from this (grouped by the _airbyte_emitted_at column):

| # | _airbyte_emitted_at            | inserted rows | attempt   |
|---|--------------------------------|---------------|-----------|
| 1 | 2022-04-21 13:00:35.375000 UTC | 71048         | attempt 1 |
| 2 | 2022-04-21 13:20:58.160000 UTC | 72128         | attempt 2 |

Here we can see that some data was inserted during the first attempt; the attempt failed, the cursor was reset, and the same data was eventually re-inserted during the second attempt.
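To make the failure mode above concrete, here is a toy Python sketch of the resulting raw table and of deduplicating it by primary key, keeping only the row from the most recent attempt (latest `_airbyte_emitted_at`). The column names follow Airbyte's raw-table convention, but the records and the `id` key are made up for illustration:

```python
from datetime import datetime

rows = [
    # attempt 1 (failed mid-sync, but its rows were still committed)
    {"id": 1, "_airbyte_emitted_at": datetime(2022, 4, 21, 13, 0, 35)},
    {"id": 2, "_airbyte_emitted_at": datetime(2022, 4, 21, 13, 0, 35)},
    # attempt 2 (succeeded, and re-read the same records)
    {"id": 1, "_airbyte_emitted_at": datetime(2022, 4, 21, 13, 20, 58)},
    {"id": 2, "_airbyte_emitted_at": datetime(2022, 4, 21, 13, 20, 58)},
]

def dedupe(rows, key="id"):
    """Keep, for each primary key, only the most recently emitted row."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row["_airbyte_emitted_at"] > latest[k]["_airbyte_emitted_at"]:
            latest[k] = row
    return list(latest.values())

deduped = dedupe(rows)  # only the two rows from attempt 2 remain
```

This is essentially what an Incremental Dedupe normalization does for you in the destination, keyed on the stream's primary key.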

In the end, all the data inserted during the first attempt is nothing but duplicates, and I wanted to know whether there is a setting somewhere to either:

  • A / prevent data insertion until an attempt succeeds?
  • B / delete data inserted during failed attempts?

Hello @Clovis,
Thank you for this great question.
You are in a partial failure scenario: some records were successfully committed to the destination, and the sync failed afterward on other records. Airbyte does not clean up already-synced data when a failure happens.
In incremental syncs, our connectors use a cursor stored in the state to resume reading from where they stopped in the last sync. Hence, the more frequently the cursor is checkpointed, the less prone you are to duplicate data in your destination.
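The effect of checkpoint frequency can be sketched with a toy calculation (this is not Airbyte's actual implementation; the function name and numbers are hypothetical). On failure, the next attempt resumes from the last *committed* cursor, so everything read after that checkpoint is re-read and duplicated:

```python
def records_reread_after_failure(checkpoint_every, fail_at):
    """Number of records a retry reads again, given the last committed checkpoint."""
    last_checkpoint = (fail_at // checkpoint_every) * checkpoint_every
    return fail_at - last_checkpoint

# Checkpointing every 10,000 records vs. only once at the end of a
# 100,000-record sync, with a failure after 71,048 records read:
print(records_reread_after_failure(10_000, 71_048))   # 1048
print(records_reread_after_failure(100_000, 71_048))  # 71048
```

With frequent checkpoints only the tail of the failed attempt is re-read; with a single end-of-sync checkpoint the whole attempt is re-read, which matches the duplicate counts you observed.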

I suggest two approaches to work around your problem:

  • If you use normalization, you can choose the Incremental Dedupe mode, which will remove duplicates in the destination
  • What is your current cursor field? Using a more fine-grained cursor field might increase the state checkpointing frequency and reduce the duplicates.

Hi @alafanechere ,
Thanks for the reply; good to know this is expected behavior. I will try tweaking my cursor field as you suggested and see what happens :+1:


I investigated a bit more, and it looks like state checkpointing for databases happens at the end of a sync; in other words, no cursor is saved if the sync fails. A second attempt may then re-read exactly the same data as the previous attempt and generate duplicates.
This is specific to database connectors; our API connectors perform more frequent state checkpointing.

To limit the amount of duplicates I would:

  • Try to understand the root cause of the intermittent failures and limit them
  • Run frequent syncs; this will increase the chance of successful reads and lead to more frequent state checkpointing
  • Use the Incremental Dedupe mode to remove duplicates in the destination