Duplicate records with exactly matching _AIRBYTE_START_AT

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu 20.04.4 LTS
  • Memory / Disk: 16 GB / 1 TB
  • Deployment: Docker
  • Airbyte Version: 0.39.20
  • Source name/version: Salesforce / 1.0.10
  • Destination name/version: Snowflake / 0.4.28
  • Step: Unknown
  • Description:

Whilst developing, we’ve realised that every SCD table contains duplicate records from one particular Airbyte run. Every record loaded into every table by that run appears twice, with the table’s native fields absolutely identical and the Airbyte fields looking something like this:

_AIRBYTE_UNIQUE_KEY | _AIRBYTE_UNIQUE_KEY_SCD | _AIRBYTE_START_AT             | _AIRBYTE_END_AT
123                 | 456                     | 2022-04-29 11:26:45.000 +0000 | 2022-04-29 11:26:45.000 +0000
123                 | 789                     | 2022-04-29 11:26:45.000 +0000 |

What’s more, these datetimes pre-date the server’s existence (we think that the run was on 2022-06-20 and the server’s clock and Airbyte logs appear to give the correct time).
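For anyone wanting to check their own tables for the same pattern, here is a minimal Python sketch of the check. It uses only the two sample rows from the table above; the dictionary keys and the function name find_duplicate_versions are made up for illustration and are not part of Airbyte.

```python
from collections import defaultdict

# Sample SCD rows copied from the table above.
# end_at is None where _AIRBYTE_END_AT is empty.
rows = [
    {"unique_key": "123", "unique_key_scd": "456",
     "start_at": "2022-04-29 11:26:45.000 +0000",
     "end_at": "2022-04-29 11:26:45.000 +0000"},
    {"unique_key": "123", "unique_key_scd": "789",
     "start_at": "2022-04-29 11:26:45.000 +0000",
     "end_at": None},
]


def find_duplicate_versions(rows):
    """Group rows by (unique_key, start_at) and keep groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["unique_key"], row["start_at"])].append(row)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


duplicates = find_duplicate_versions(rows)
for (unique_key, start_at), grp in duplicates.items():
    # Print each key/start pair together with the SCD keys that collide on it.
    print(unique_key, start_at, [r["unique_key_scd"] for r in grp])
```

The same grouping could be done directly in the warehouse; the point is only that the pair of _AIRBYTE_UNIQUE_KEY and _AIRBYTE_START_AT is what repeats, while _AIRBYTE_UNIQUE_KEY_SCD differs.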

We’ve got no explanation other than some form of Airbyte bug / load failure. We’re obviously concerned that this could present issues and would like to understand what’s gone wrong.

Input welcome.

Thanks

Stuart
logs-49.txt (305.5 KB)

Pretty much as soon as I finished posting I noticed something potentially significant in the _AIRBYTE_EMITTED_AT and _AIRBYTE_NORMALIZED_AT fields.

_AIRBYTE_EMITTED_AT | _AIRBYTE_NORMALIZED_AT
2022-06-13 16:29:34.520 -0700 | 2022-06-20 12:32:13.848 +0000
2022-06-20 12:23:15.774 -0700 | 2022-06-20 12:32:13.848 +0000

It looks a bit like the data didn’t normalise on an old run and then two records were normalised on a second run.
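A rough way to confirm that reading is to parse the two rows above and compare the timestamps. This is only a sketch using the values shown; the variable names are illustrative.

```python
from datetime import datetime

# Timestamp layout used in the rows above, e.g. "2022-06-13 16:29:34.520 -0700".
FMT = "%Y-%m-%d %H:%M:%S.%f %z"

# (_AIRBYTE_EMITTED_AT, _AIRBYTE_NORMALIZED_AT) pairs copied from the rows above.
rows = [
    ("2022-06-13 16:29:34.520 -0700", "2022-06-20 12:32:13.848 +0000"),
    ("2022-06-20 12:23:15.774 -0700", "2022-06-20 12:32:13.848 +0000"),
]

parsed = [(datetime.strptime(e, FMT), datetime.strptime(n, FMT)) for e, n in rows]

# Distinct normalisation timestamps: a single value means one normalisation run.
normalized_runs = {n for _, n in parsed}

# Time between the earliest and latest emission of the two records.
emitted_gap = max(e for e, _ in parsed) - min(e for e, _ in parsed)

print(len(normalized_runs), emitted_gap)
```

If both records share one _AIRBYTE_NORMALIZED_AT value while their _AIRBYTE_EMITTED_AT values are days apart, they came from different syncs but were normalised together.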

Still not sure that I understand, but it might give some flavour to the problem.

Stuart

Hey, do you have the dedup feature enabled on this connection?

Hi. All set to dedup + history.

Hey, if normalisation didn’t run in the previous sync and a later sync did run normalisation, then yes, the dedup would have happened in that later run.
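To illustrate how that could produce the rows in the original post, here is a small Python simulation. It rests on an assumption that is not confirmed in this thread: that the SCD model takes _AIRBYTE_START_AT from the stream's cursor field and sets _AIRBYTE_END_AT to the start of the next newer version of the same key. The function build_scd and the field names pk and cursor are hypothetical, and the cursor value is simply the _AIRBYTE_START_AT shown above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f %z"

# Two raw records for the same key, emitted by two different syncs.
# "cursor" stands in for the stream's cursor field (assumed, not named in the thread).
raw_records = [
    {"pk": "123", "cursor": "2022-04-29 11:26:45.000 +0000",
     "emitted_at": "2022-06-13 16:29:34.520 -0700"},
    {"pk": "123", "cursor": "2022-04-29 11:26:45.000 +0000",
     "emitted_at": "2022-06-20 12:23:15.774 -0700"},
]


def build_scd(records):
    """Hypothetical SCD builder: newest version first, end_at taken from the newer version."""
    ordered = sorted(
        records,
        key=lambda r: (
            r["pk"],
            datetime.strptime(r["cursor"], FMT),
            datetime.strptime(r["emitted_at"], FMT),
        ),
        reverse=True,
    )
    scd = []
    newer = None
    for rec in ordered:
        same_key = newer is not None and newer["pk"] == rec["pk"]
        scd.append({
            "pk": rec["pk"],
            "start_at": rec["cursor"],
            # The newest version of a key stays open (no end date).
            "end_at": newer["cursor"] if same_key else None,
            "emitted_at": rec["emitted_at"],
        })
        newer = rec
    return scd


scd_rows = build_scd(raw_records)
for row in scd_rows:
    print(row["pk"], row["start_at"], row["end_at"])
```

Under these assumptions, one version ends at the same instant it starts and the other stays open, which is the shape of the two rows in the first post. It would also mean the start date comes from the source record and not from the sync time, though that is a guess on the basis of this sketch.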