Summary
User reports an issue with the incremental append sync in Airbyte where the same data is loaded multiple times due to identical cursor values in the source MSSQL table. The problem arises when updating a row to a newer date, but the sync still loads all rows instead of just the updated one.
Question
Did anyone had this experience with cursor where it loads the same data again?
I have 10 rows in the MSSQL source table. They all have the same date (1975-01-01T00:00:00.000000) in the column that is used as cursor field.
I run the sync in Airbyte (Incremental Append) and it loads these 10 rows nicely to the target table in Clickhouse. The cursor value is set to 1975-01-01T00:00:00.000000.
I update one of the rows in the source table so it has newer date - 1980-01-01T00:00:00.000000. I run the sync for the second time and it loads 10 rows again instead of only one row with the updated date. Why?
But it works good when I first load 10 rows and then insert one more row with newer date and then it correctly loads only this one row.
This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want
to access the original thread.
Join the conversation on Slack
['incremental-append', 'cursor', 'mssql', 'clickhouse', 'data-loading']
It’s likely that the cursor is being stored with the original value (1975-01-01
). Then that cursor value is sent and rows newer or equal to it are retrieved. This helps to catch conditions in which data is incomplete or still being written (e.g. when you’re at the current date and the type is date only).
Since these overlaps tend to be tiny in real-world scenarios, it’s better to err on the side of caution than to risk missing even a single row of data.
At least that’s my assumption of how this is working; it varies a bit for each system and connector type depending on how they interpret the cursors. (A few connectors also have settings where you can add a lookback window to re-pull additional overlap.)
It’s likely that the cursor is being stored with the original value (1975-01-01
). Then that cursor value is sent and rows newer or equal to it are retrieved. This helps to catch conditions in which data is incomplete or still being written (e.g. when you’re at the current date and the type is date only).
Since these overlaps tend to be tiny in real-world scenarios, it’s better to err on the side of caution than to risk missing even a single row of data.
At least that’s my assumption of how this is working; it varies a bit for each system and connector type depending on how they interpret the cursors. (A few connectors also have settings where you can add a lookback window to re-pull additional overlap.)