Duplicated registries when syncing from mssql CDC

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu
  • Memory / Disk: 16gb / 1 Tb
  • Deployment: Docker
  • Airbyte Version: 0.40.26
  • Source name/version: source-mssql 0.4.26
  • Destination name/version: destination-gcs 0.2.12
  • Step: Sync

Hello everyone,

When using Airbyte to sync CDC records from Microsoft SQL Server, we are facing an issue with duplicated records. The problem is that the _airbyte_emitted_at field is the same, but the contents of the records differ. This is very problematic as there is no way for us to know which of the records is the most recent when doing the dense rank for the subsequent layers of the data lake.

Here are some examples:

image

The two images represent the same four highlighted records. The _airbyte_ab_id for each one is different, but the _airbyte_emitted_at field is exactly the same, and other fields such as dt_fim_transacao_preset also differ (with one being null). This is happening on a multitude of different tables, and directly impacts our operation, as there is no reliable way of knowing which of the records is the most recent one that accurately represents the source.

Can you guys help shed some light on this?

We’ve been trying to use Airbyte as our main CDC solution, but the problems so far have been myriad. I really do like the product, it’s simple to use, the open source nature is something we value a lot at our company, with active development and what seems to be an incredibly talented team, but trying to reach out for help sometimes with issues like this can be very frustrating.

Thanks in advance!

Hey, @Matheus_Pinheiro! I’m glad to hear you like the product and sorry to hear about the frustrating problems - I am part of the community assistance team that helps the community figure things like this out and we are currently understaffed. I’ll try my best to help out here though!

In cases like this it’s always best to start with the basics:

  1. Upgrade the Airbyte instance to the latest version, you’re a few versions behind. You’re right - there’s a lot of development going on and simply updating sometimes fixes bugs.
  2. Update the connectors: I see MySQL is up to 1.0.20 https://docs.airbyte.com/integrations/sources/mysql#changelog and GCS is up to 0.2.14 https://docs.airbyte.com/integrations/destinations/gcs#changelog
  3. Provide the full logs - this will allow me to see what might be going wrong.

I’ll try to update the connectors, but you mentioned MySQL when my problem is with SQL Server.

So, as expected, upgrading connectors did not solve the issue.

We’re still getting duplicate events.

Here’s the log file:

cc862601_2cef_4dda_868e_035220f3bf2a_logs_53556_txt.txt (677.5 KB)

On our SQL Server, I checked the CDC and confirmed there are indeed events from the same transaction with the same timestamp:

+-------------------------+------------+------------------------+----------------+
| start_time              | end_time   | __$seqval              | __$operation   |
|-------------------------+------------+------------------------+----------------|
| 2023-02-04 02:11:48.880 | NULL       | 0x0002D69200000DF1000C | 3              |
| 2023-02-04 02:11:48.880 | NULL       | 0x0002D69200000DF1000C | 4              |
| 2023-02-04 02:11:54.223 | NULL       | 0x0002D692000013460004 | 3              |
| 2023-02-04 02:11:54.223 | NULL       | 0x0002D692000013460004 | 4              |
| 2023-02-04 02:11:54.527 | NULL       | 0x0002D6920000137F0003 | 3              |
| 2023-02-04 02:11:54.527 | NULL       | 0x0002D6920000137F0003 | 4              |
| 2023-02-04 02:15:02.687 | NULL       | 0x0002D6920001622F0003 | 3              |
| 2023-02-04 02:15:02.687 | NULL       | 0x0002D6920001622F0003 | 4              |
| 2023-02-04 02:15:06.270 | NULL       | 0x0002D6920001F8DC0003 | 3              |
| 2023-02-04 02:15:06.270 | NULL       | 0x0002D6920001F8DC0003 | 4              |
+-------------------------+------------+------------------------+----------------+

However, as can be seen, the __$operation is different for each one. Consulting the Microsoft documentation, it seems like the __$operation=3 rows refer to the data pre-update, while the __$operation=4 rows refer to the data post-update, meaning only __$operation=4 rows are relevant to us.

Also, there seems to be no other fields on the generated Airbyte files that are uploaded to Google Cloud Storage besides _airbyte_ab_id, and _airbyte_emitted_at. This is very strange as the Airbyte documentation for SQL Server CDC states that INSERT , UPDATE and DELETE operations are caught and ingested. How are those identified without an operation field in the output file?

How can I make it so that Airbyte ignores the __$operation=3 events? Or at least, how can I identify the operation so that I can filter these on my ingested csv and avro files?

Can anyone help shed some light on this? Just knowing how airbyte differentiates between CDC operations like updates and deletes would already be very helpful.

Upgrade the connector. Extra column is added to figured out the changes if lsn is same.

I have created a pull request for this usecase