Incremental Syncing Issues on Several Connectors

Am I missing something? It appears that the incremental feature on various connectors pulls duplicate records, i.e. it finds the last updated_at timestamp (or equivalent) and then uses that to pull records for the next sync, meaning that the last record gets re-added to the _AIRBYTE_RAW table.

I am noticing this on a few connectors already and am curious why no other users have experienced this issue.

Hello @danieldiamond,
Could you please share for which connector you see this happening?

The incremental stream should use the cursor value stored in the state and query the source API / database for records with a cursor value that is greater than the one retrieved from the state.

@alafanechere i’ve tagged the survey monkey and zendesk connectors in the post.
both survey-monkey and zendesk-talk

…for records with a cursor value that is greater than the one retrieved from the state.

that doesn’t seem to be the case from what i’ve seen in most non-DB connectors in airbyte. it’s greater than or equal to.

see the survey monkey get_updated_state definition, which takes the most recent updated_at value from the response and sets that as the updated cursor value. It then uses this timestamp for the next incremental sync, meaning this last record is re-added (duplicate record).

same for zendesk-talk get_updated_state

the general logic being:

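    def get_updated_state(self, current_stream_state, latest_record):
        # fall back to the legacy cursor field if the current one is not in the state yet
        latest_state = current_stream_state.get(self.cursor_field, current_stream_state.get(self.legacy_cursor_field))
        # keep the larger of the stored cursor and the cursor of the record just read
        new_cursor_value = max(latest_record[self.cursor_field], latest_state or latest_record[self.cursor_field])
        return {self.cursor_field: new_cursor_value}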

the API then returns records that are on or after this timestamp.
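To make the effect concrete, here is a minimal sketch of that behaviour. It is not the connector's code: `fake_api`, `run_sync` and the sample records are made up for illustration, and the only assumption carried over from above is that the API filter is inclusive (>=).

```python
# Hypothetical sample data; fixed-width UTC ISO timestamps compare correctly as strings.
RECORDS = [
    {"id": 1, "updated_at": "2020-01-01T00:00:10Z"},
    {"id": 2, "updated_at": "2020-01-01T00:00:45Z"},
    {"id": 3, "updated_at": "2020-01-01T00:01:05Z"},
]


def fake_api(start_time=None):
    """Stand-in for the source API: returns records with updated_at >= start_time."""
    if start_time is None:
        return list(RECORDS)
    return [r for r in RECORDS if r["updated_at"] >= start_time]


def run_sync(state):
    """Read from the stored cursor and advance it with the same max() logic as above."""
    records = fake_api(state.get("updated_at"))
    for record in records:
        current = state.get("updated_at")
        state = {"updated_at": max(record["updated_at"], current or record["updated_at"])}
    return records, state


first_batch, state = run_sync({})
# Nothing changed in the source, yet the second sync still returns the last record.
second_batch, state = run_sync(state)
```

Running the second sync immediately after the first returns only the record whose updated_at equals the stored cursor, which is the duplicate row that ends up in the raw table.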

See screenshot below. Syncing immediately after the previous sync keeps pulling the same last record.

Airbyte’s incremental syncs are at-least-once delivery, so this is in line with that contract. If you use a dedupe sync mode, I believe these duped records should get taken care of.

The reason it works this way is to avoid a race condition: if sync 1 reads all records up to Oct 1, but a record was generated on Oct 1 that wasn’t caught by that sync (e.g. if sync 1 also ran on Oct 1), then with an inclusive filter it still gets exported the next time around.
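A rough sketch of that race, assuming a day-granularity cursor as in the Oct 1 example. The `fetch` helper and the records are hypothetical and only exist to compare a strict filter with an inclusive one.

```python
def fetch(records, cursor, inclusive):
    """Hypothetical source query: filter records relative to the stored cursor."""
    if cursor is None:
        return list(records)
    if inclusive:
        return [r for r in records if r["updated_at"] >= cursor]
    return [r for r in records if r["updated_at"] > cursor]


# Sync 1 runs on Oct 1 and sees one record.
source = [{"id": 1, "updated_at": "2020-10-01"}]
cursor = max(r["updated_at"] for r in fetch(source, None, inclusive=True))

# A second record is written later on Oct 1, after sync 1 has finished.
source.append({"id": 2, "updated_at": "2020-10-01"})

# Sync 2 with a strict filter silently misses record 2 ...
strict_ids = [r["id"] for r in fetch(source, cursor, inclusive=False)]
# ... while the inclusive filter picks it up, at the cost of re-reading record 1.
inclusive_ids = [r["id"] for r in fetch(source, cursor, inclusive=True)]
```

With the strict filter the late record is never delivered; with the inclusive filter it arrives, together with one duplicate.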


the connector reads all records until Oct 1, but if a record was generated on oct 1 that wasn’t caught by that sync

that example truncates by day, though. the timestamps here are 2020-01-01T00:01:05Z (i.e. second resolution), meaning the race condition would be reduced to a 1-second window.

I completely appreciate at-least-once (dealt with a lot of segment event tracking frameworks e.g. at-least-once + deduping within 24 hour window) but for airbyte, this implies that the incremental mode has an implied dependency on the dedupe sync mode. and dedupe mode requires basic normalization in order to work (please confirm that).

basic normalization has undesirable implications, e.g. the cost of scanning large tables to dedupe, as well as the additional tables required to dedupe (raw, _SCD and final).

but i guess that’s the trade-off between at-least-once and definitely-not-duplicate records

Thank you @sherifnada1 for the explanation!
Yes @danieldiamond you are right, if you want to guarantee uniqueness of records you need to run the normalization that will deduplicate the records.
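For anyone who wants to see what the deduplication amounts to conceptually, here is a small sketch. It is not how Airbyte's normalization is implemented; `dedupe_latest`, the `id` primary key and the sample rows are assumptions made for illustration.

```python
def dedupe_latest(raw_rows, primary_key="id", cursor_field="updated_at"):
    """Keep one row per primary key, preferring the row with the highest cursor value."""
    latest = {}
    for row in raw_rows:
        key = row[primary_key]
        # On a tie the later-emitted row wins, so a re-read duplicate replaces the original.
        if key not in latest or row[cursor_field] >= latest[key][cursor_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical raw table after two syncs: record 3 was emitted twice.
raw = [
    {"id": 1, "updated_at": "2020-01-01T00:00:10Z"},
    {"id": 3, "updated_at": "2020-01-01T00:01:05Z"},
    {"id": 3, "updated_at": "2020-01-01T00:01:05Z"},
]
final = dedupe_latest(raw)
```

Note that this still has to look at every raw row, which is the scanning cost mentioned above.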

@sherifnada1 interestingly I just stumbled upon this line
# +1 because the API returns all records where generated_timestamp >= start_time

Is that increment currently happening? I’m currently noticing missing records and I wonder if that would be why

Hi Daniel, did you try to tweak this method to check if it solves your missing records issue?
I don’t get the # + 1 comment as I don’t see an incremented variable in this method :thinking:
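For reference, the increment that the comment describes would presumably look something like the sketch below. This is an assumption about the intent, not code from the connector, and `next_start_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def next_start_time(cursor_value: str) -> str:
    """Shift the stored cursor forward by one second so an inclusive (>=) API
    filter behaves like a strict (>) one at second resolution."""
    parsed = datetime.strptime(cursor_value, FMT).replace(tzinfo=timezone.utc)
    return (parsed + timedelta(seconds=1)).strftime(FMT)


shifted = next_start_time("2020-01-01T00:01:05Z")
```

The trade-off is that any record written later within the same second as the stored cursor would be skipped.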

It would go against the at-least-once philosophy, so I’m inclined not to modify this. It would act differently from all other connectors, and I’d prefer consistency in the interest of understanding how they all work.