Am I missing something? It appears that the Incremental feature on various connectors pulls duplicate records: it finds the last updated_at timestamp (or equivalent) and then uses that as the starting cursor for the next sync, meaning the last record gets re-added to the _AIRBYTE_RAW table.
I am noticing this on a few connectors already and am curious why no other users have experienced this issue.
@alafanechere i’ve tagged the survey monkey and zendesk connectors in the post.
both survey-monkey and zendesk-talk
…for records with a cursor value that is greater than the one retrieved from the state.
that doesn’t seem to be the case from what i’ve seen in most non-DB connectors in airbyte. it’s greater than or equal to.
see the Survey Monkey get_updated_state definition, which takes the most recent updated_at value from the response and sets it as the updated cursor field. It then uses this timestamp as the starting point for the next incremental sync, meaning that last record is re-added (a duplicate record).
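To make the behaviour concrete, here is a minimal sketch of the pattern being described. The names (get_updated_state, read_incremental, CURSOR_FIELD) follow the Airbyte CDK convention but this is illustrative code, not the actual Survey Monkey connector:

```python
from typing import Any, Dict, Iterable, List

CURSOR_FIELD = "updated_at"

def get_updated_state(current_state: Dict[str, Any], latest_record: Dict[str, Any]) -> Dict[str, Any]:
    # Keep the max cursor value seen so far as the new saved state.
    latest = latest_record.get(CURSOR_FIELD, "")
    current = current_state.get(CURSOR_FIELD, "")
    return {CURSOR_FIELD: max(latest, current)}

def read_incremental(records: List[Dict[str, Any]], state: Dict[str, Any]) -> Iterable[Dict[str, Any]]:
    # Inclusive comparison (>=): a record whose cursor equals the saved
    # state gets emitted again on the next sync, i.e. the duplicate.
    cursor = state.get(CURSOR_FIELD, "")
    for record in records:
        if record[CURSOR_FIELD] >= cursor:
            yield record

# Sync 1 reads everything and saves state = the last updated_at seen.
sync1 = [
    {"id": 1, "updated_at": "2020-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2020-01-02T00:00:00Z"},
]
state: Dict[str, Any] = {}
for r in sync1:
    state = get_updated_state(state, r)

# Sync 2 with no new source records still re-emits record id=2.
dupes = list(read_incremental(sync1, state))
print(dupes)  # the id=2 record appears a second time
```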
Airbyte’s incremental syncs are at-least-once delivery, so this is in line with that contract. If you use a dedupe sync mode, I believe these duped records should get taken care of.
The reason this works this way is to avoid race conditions where on sync 1, the connector reads all records until Oct 1, but if a record was generated on oct 1 that wasn’t caught by that sync (e.g if sync 1 was also on Oct 1) then it would get exported the next time out.
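The race-condition reasoning above can be sketched in a few lines; the dates and variable names are illustrative only:

```python
# After sync 1, the saved cursor is Oct 1. A record is then created with
# that same cursor value (e.g. sync 1 itself ran on Oct 1).
state_cursor = "2020-10-01"
late_record = {"id": 3, "updated_at": "2020-10-01"}

# With a strict `>` the late record is never picked up again -- data loss.
emitted_strict = late_record["updated_at"] > state_cursor      # False
# With `>=` it is caught on the next sync, at the cost of re-reading
# records that share the boundary cursor value (at-least-once).
emitted_inclusive = late_record["updated_at"] >= state_cursor  # True
print(emitted_strict, emitted_inclusive)
```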
the connector reads all records until Oct 1, but if a record was generated on oct 1 that wasn’t caught by that sync
that example truncates by day, but the timestamps here include seconds (e.g. 2020-01-01T00:01:05Z), so the race-condition window would be reduced to a 1-second window.
I completely appreciate at-least-once delivery (I've dealt with a lot of Segment event tracking frameworks, e.g. at-least-once plus deduping within a 24-hour window), but for Airbyte this implies that incremental mode has an implied dependency on the dedupe sync mode, and dedupe mode requires basic normalization in order to work (please confirm that).
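For anyone following along, the dedupe step being discussed boils down to keeping one row per primary key, preferring the latest cursor value. A hypothetical in-memory sketch (the real dedupe sync mode does this in the warehouse via normalization, not in Python):

```python
from typing import Any, Dict, List

def dedupe(raw_records: List[Dict[str, Any]], primary_key: str = "id",
           cursor_field: str = "updated_at") -> List[Dict[str, Any]]:
    # Keep one record per primary key, preferring the highest cursor value.
    latest: Dict[Any, Dict[str, Any]] = {}
    for record in raw_records:
        key = record[primary_key]
        if key not in latest or record[cursor_field] > latest[key][cursor_field]:
            latest[key] = record
    return sorted(latest.values(), key=lambda r: r[primary_key])

raw = [
    {"id": 2, "updated_at": "2020-01-02T00:00:00Z"},  # read by sync 1
    {"id": 2, "updated_at": "2020-01-02T00:00:00Z"},  # re-read by sync 2 (the duplicate)
    {"id": 3, "updated_at": "2020-01-03T00:00:00Z"},  # new record
]
deduped = dedupe(raw)
print(deduped)  # one row for id 2, one row for id 3
```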
basic normalization has undesirable implications, e.g. the cost of scanning large tables to dedupe, as well as the unnecessary additional tables required to dedupe (raw, _SCD and final).
but i guess that’s the trade-off between at-least-once and definitely-not-duplicate records
It would go against the at-least-once philosophy, so I'm inclined not to modify this. It would behave differently from all the other connectors, and I'd prefer consistency in the interest of understanding how they all work.