When using Airbyte to sync CDC records from Microsoft SQL Server, we are facing an issue with duplicated records. The problem is that the _airbyte_emitted_at field is the same, but the contents of the records differ. This is very problematic as there is no way for us to know which of the records is the most recent when doing the dense rank for the subsequent layers of the data lake.
The two images represent the same four highlighted records. The _airbyte_ab_id for each one is different, but the _airbyte_emitted_at field is exactly the same, and other fields such as dt_fim_transacao_preset also differ (with one being null). This is happening on a multitude of different tables, and directly impacts our operation, as there is no reliable way of knowing which of the records is the most recent one that accurately represents the source.
Can you guys help shed some light on this?
We’ve been trying to use Airbyte as our main CDC solution, but the problems so far have been myriad. I really do like the product, it’s simple to use, the open source nature is something we value a lot at our company, with active development and what seems to be an incredibly talented team, but trying to reach out for help sometimes with issues like this can be very frustrating.
Hey, @Matheus_Pinheiro! I’m glad to hear you like the product and sorry to hear about the frustrating problems - I am part of the community assistance team that helps the community figure things like this out and we are currently understaffed. I’ll try my best to help out here though!
In cases like this it’s always best to start with the basics:
Upgrade the Airbyte instance to the latest version, you’re a few versions behind. You’re right - there’s a lot of development going on and simply updating sometimes fixes bugs.
However, as can be seen, the __$operation is different for each one. Consulting the Microsoft documentation, it seems like the __$operation=3 rows refer to the data pre-update, while the __$operation=4 rows refer to the data post-update, meaning only __$operation=4 rows are relevant to us.
Also, there seems to be no other fields on the generated Airbyte files that are uploaded to Google Cloud Storage besides _airbyte_ab_id, and _airbyte_emitted_at. This is very strange as the Airbyte documentation for SQL Server CDC states that INSERT , UPDATE and DELETE operations are caught and ingested. How are those identified without an operation field in the output file?
How can I make it so that Airbyte ignores the __$operation=3 events? Or at least, how can I identify the operation so that I can filter these on my ingested csv and avro files?
Can anyone help shed some light on this? Just knowing how airbyte differentiates between CDC operations like updates and deletes would already be very helpful.