Duplicate Rows Issue with Airbyte Sync

Summary

Airbyte re-syncs rows it has already retrieved, even though the saved state holds the timestamp of the last retrieval. The issue appears to be related to the “updated_at” cursor field used with the “Incremental | Append” sync mode. The user is looking for a way to fix this.


Question

Hello everyone,

I’m using Airbyte with Snowflake as the source and Postgres as the destination.
I’m using “updated_at” as the cursor field, with the “Incremental | Append” sync mode.

For some reason Airbyte brings the same rows again, even though it knows the timestamp it has already retrieved:

airbyte_emitted_at,updated_at
2024-02-26 00:52:50.615000 +00:00,2024-02-26T00:12:02.700
2024-02-26 00:52:50.615000 +00:00,2024-02-26T00:12:02.700
2024-02-26 00:52:50.615000 +00:00,2024-02-26T00:12:02.700
2024-02-26 00:20:07.308000 +00:00,2024-02-26T00:12:02.700
2024-02-26 00:20:07.308000 +00:00,2024-02-26T00:12:02.700
2024-02-26 00:20:07.308000 +00:00,2024-02-26T00:12:02.700

As you can see in this example, it brought the same rows again. In the sync logs I can see that it already “knew” this was the timestamp of the last retrieval:

2024-02-26 00:52:50 source > INFO i.a.i.s.r.s.CursorManager(createCursorInfoForStream):169 Found matching cursor in state. Stream: MY_TABLE. Cursor Field: UPDATED_AT Value: 2024-02-26T00:12:02.700 Count: 2051
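
A quick way to confirm that rows were re-emitted is to group the destination rows by cursor value and count how many distinct sync times delivered each value. The snippet below is a minimal sketch using only the six sample rows above; the function names are made up for illustration and are not part of Airbyte.

```python
from collections import defaultdict

# (airbyte_emitted_at, updated_at) pairs copied from the sample output above.
rows = [
    ("2024-02-26 00:52:50.615000 +00:00", "2024-02-26T00:12:02.700"),
    ("2024-02-26 00:52:50.615000 +00:00", "2024-02-26T00:12:02.700"),
    ("2024-02-26 00:52:50.615000 +00:00", "2024-02-26T00:12:02.700"),
    ("2024-02-26 00:20:07.308000 +00:00", "2024-02-26T00:12:02.700"),
    ("2024-02-26 00:20:07.308000 +00:00", "2024-02-26T00:12:02.700"),
    ("2024-02-26 00:20:07.308000 +00:00", "2024-02-26T00:12:02.700"),
]


def syncs_per_cursor_value(rows):
    """Map each updated_at value to the set of emit times that delivered it."""
    seen = defaultdict(set)
    for emitted_at, updated_at in rows:
        seen[updated_at].add(emitted_at)
    return dict(seen)


def re_emitted(rows):
    """Return the cursor values that arrived in more than one sync."""
    per_value = syncs_per_cursor_value(rows)
    return sorted(value for value, emits in per_value.items() if len(emits) > 1)


print(re_emitted(rows))
```

Any cursor value this returns was delivered by at least two separate syncs, which is the duplication described here.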

What can cause this issue and how can I fix this?

Thank you!



This topic has been created from a Slack thread to give it more visibility.
It is in read-only mode here.

["duplicate-rows", "airbyte-sync", "cursor-field", "updated-at", "incremental-append-sync"]

If it helps, I also get this schema validation error (which shouldn’t affect anything):
Error messages: [$.DATE_: 2023-08-29T17:33:23 is an invalid date-time]

And when reading the data I get this count by cursor:
2024-02-26 00:55:18 source > INFO i.a.i.s.j.AbstractJdbcSource(lambda$queryTableIncremental$18):337 Table MY_TABLE cursor count: expected 2051, actual 2534

But when I look at the actual count of records it brought I get:
2024-02-26 00:55:24 destination > INFO i.a.i.d.r.InMemoryRecordBufferingStrategy(flushAllBuffers):85 Flushing MY_TABLE: 4432 records (24 MB)

What is going on?
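
One possible reading of the "cursor count: expected 2051, actual 2534" line, offered as an assumption and not confirmed anywhere in this thread: the saved state records how many rows had the cursor value at the last sync. If the source now holds a different number of rows at that value, a strict ">" filter could miss rows, so the reader may fall back to ">=" and re-read every row at 2024-02-26T00:12:02.700. The sketch below only illustrates that idea in Python. It is not Airbyte's code, and `choose_operator` and `incremental_read` are hypothetical names.

```python
def choose_operator(expected_count, actual_count):
    """Pick the cursor comparison (hypothetical logic, see the assumption above).

    Strict '>' is used when no count was recorded or the counts agree;
    otherwise '>=' so rows sharing the cursor value are read again.
    """
    if expected_count <= 0 or expected_count == actual_count:
        return ">"
    return ">="


def incremental_read(rows, cursor_value, expected_count):
    """Return the rows an incremental read would emit under that assumption."""
    actual_count = sum(1 for r in rows if r["updated_at"] == cursor_value)
    if choose_operator(expected_count, actual_count) == ">":
        return [r for r in rows if r["updated_at"] > cursor_value]
    return [r for r in rows if r["updated_at"] >= cursor_value]


# Toy source table: three rows at the saved cursor value and one newer row.
cursor = "2024-02-26T00:12:02.700"
source_rows = [{"updated_at": cursor}] * 3 + [{"updated_at": "2024-02-26T00:30:00.000"}]

# State recorded 2 rows at the cursor, but the source now has 3: all 4 rows are read.
mismatch = incremental_read(source_rows, cursor, expected_count=2)
# State recorded 3 rows, matching the source: only the newer row is read.
match = incremental_read(source_rows, cursor, expected_count=3)
```

If this reading is right, the duplicates come from many rows sharing one "updated_at" value while that count keeps changing between syncs, and not from the state being ignored.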

Did you also try using incremental_deduped?
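
To make the suggestion concrete: a deduped sync mode keeps only the latest version of each record per primary key, so rows that are re-emitted collapse into one. Here is a minimal sketch of that idea in Python, assuming a hypothetical primary key column named `id` (the thread does not say which key MY_TABLE has). It illustrates the concept and is not how Airbyte implements it.

```python
def dedupe_latest(rows, key="id"):
    """Keep one row per primary key: the one with the newest
    (updated_at, airbyte_emitted_at) pair. 'id' is a hypothetical key column."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        candidate = (row["updated_at"], row["airbyte_emitted_at"])
        if current is None or candidate >= (
            current["updated_at"],
            current["airbyte_emitted_at"],
        ):
            latest[row[key]] = row
    return list(latest.values())


# Record 1 was delivered by two syncs; record 2 by one sync only.
appended = [
    {"id": 1, "updated_at": "2024-02-26T00:12:02.700",
     "airbyte_emitted_at": "2024-02-26 00:20:07.308000 +00:00"},
    {"id": 1, "updated_at": "2024-02-26T00:12:02.700",
     "airbyte_emitted_at": "2024-02-26 00:52:50.615000 +00:00"},
    {"id": 2, "updated_at": "2024-02-26T00:12:02.700",
     "airbyte_emitted_at": "2024-02-26 00:20:07.308000 +00:00"},
]

deduped = dedupe_latest(appended)
```

This only works if the stream has a reliable primary key. Without one, duplicates from repeated reads cannot be told apart from genuinely distinct rows.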