Incremental Syncing Issues on Several Connectors

danieldiamond · April 8, 2022, 6:41am

Am I missing something? It appears that the Incremental feature on various connectors pulls duplicate records, i.e. it finds the last updated_at timestamp (or equivalent) and then uses that to pull records for the next sync, meaning that the last record will be re-added to the _AIRBYTE_RAW table.

I am noticing this on a few connectors already and am curious why no other users have experienced this issue.

alafanechere · April 8, 2022, 10:13pm

Hello @danieldiamond,
Could you please share for which connector you see this happening?

The incremental stream should use the cursor value stored in the state and query the source API / database for records with a cursor value that is greater than the one retrieved from the state.

danieldiamond · April 9, 2022, 9:11am

@alafanechere i’ve tagged the survey monkey and zendesk connectors in the post.
both survey-monkey and zendesk-talk

…for records with a cursor value that is greater than the one retrieved from the state.

that doesn’t seem to be the case from what i’ve seen in most non-DB connectors in airbyte. it’s greater than or equal to.

see the survey monkey get_updated_state definition, which takes the most recent updated_at value from the response and sets that as the updated cursor value. It then uses this timestamp for the next incremental sync, meaning this last record is re-added (duplicate record).

same for zendesk-talk get_updated_state

the general logic being:

        latest_state = current_stream_state.get(self.cursor_field, current_stream_state.get(self.legacy_cursor_field))
        new_cursor_value = max(latest_record[self.cursor_field], latest_state or latest_record[self.cursor_field])
        return {self.cursor_field: new_cursor_value}

the API then returns records that are on or after this timestamp.
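A minimal sketch of the behaviour described above, not the connectors' actual code: `SOURCE`, `fetch_on_or_after` and `sync` are hypothetical names, and the in-memory list stands in for an API whose start filter is inclusive. The state update mirrors the `max(...)` logic quoted above.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]

# Hypothetical stand-in for the source API. ISO-8601 UTC strings of the same
# format compare correctly as plain text.
SOURCE: List[Record] = [
    {"id": "a", "updated_at": "2020-01-01T00:00:10Z"},
    {"id": "b", "updated_at": "2020-01-01T00:01:05Z"},
]


def fetch_on_or_after(start: Optional[str]) -> List[Record]:
    """Mimic an API whose filter is inclusive: updated_at >= start."""
    return [r for r in SOURCE if start is None or r["updated_at"] >= start]


def sync(state: Dict[str, str]) -> Tuple[List[Record], Dict[str, str]]:
    """Run one incremental sync and advance the cursor to the max value seen."""
    records = fetch_on_or_after(state.get("updated_at"))
    for record in records:
        latest = state.get("updated_at")
        state = {"updated_at": max(record["updated_at"], latest or record["updated_at"])}
    return records, state


first_batch, state = sync({})
second_batch, state = sync(state)  # nothing changed in the source in between

print([r["id"] for r in first_batch])   # ['a', 'b']
print([r["id"] for r in second_batch])  # ['b']
```

Because the saved cursor equals the last record's own timestamp and the filter is inclusive, every later sync emits that record again, even when the source has not changed.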

See screenshot below. syncing immediately after previous sync and it keeps pulling the same last record

sherifnada1 · April 12, 2022, 12:42am

Airbyte’s incremental syncs are at-least-once delivery, so this is in line with that contract. If you use a dedupe sync mode, I believe these duped records should get taken care of.

The reason this works this way is to avoid race conditions: if on sync 1 the connector reads all records until Oct 1, but a record was generated on Oct 1 that wasn’t caught by that sync (e.g. if sync 1 also ran on Oct 1), then it would get exported on the next sync.
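The race can be illustrated with a small sketch. This is an assumption-laden toy, not Airbyte code: `read` is a hypothetical helper, and record "c" is assumed to have been written with the same timestamp as the saved cursor, after sync 1 had already read the source.

```python
from typing import Dict, List

Record = Dict[str, str]


def read(source: List[Record], cursor: str, inclusive: bool) -> List[Record]:
    """Return records at or after the cursor (inclusive) or strictly after it."""
    if inclusive:
        return [r for r in source if r["updated_at"] >= cursor]
    return [r for r in source if r["updated_at"] > cursor]


# Cursor saved by sync 1 after it read record "b".
cursor = "2020-01-01T00:01:05Z"

# State of the source at sync 2: "c" landed in the same second as "b",
# but only after sync 1 had finished reading.
source_at_sync_2: List[Record] = [
    {"id": "b", "updated_at": cursor},
    {"id": "c", "updated_at": cursor},
]

strict = read(source_at_sync_2, cursor, inclusive=False)
at_least_once = read(source_at_sync_2, cursor, inclusive=True)

print([r["id"] for r in strict])         # []
print([r["id"] for r in at_least_once])  # ['b', 'c']
```

The strict comparison never delivers "c", while the inclusive comparison delivers it at the cost of re-sending "b". That is the at-least-once trade-off.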

danieldiamond · April 12, 2022, 6:38am

the connector reads all records until Oct 1, but if a record was generated on oct 1 that wasn’t caught by that sync

so that is truncating by day, the timestamps here are 2020-01-01T00:01:05Z (i.e. seconds), meaning a race-condition would be reduced to a 1 second window.

I completely appreciate at-least-once (dealt with a lot of segment event tracking frameworks e.g. at-least-once + deduping within 24 hour window) but for airbyte, this implies that the incremental mode has an implied dependency on the dedupe sync mode. and dedupe mode requires basic normalization in order to work (please confirm that).

basic normalization has undesirable implications, e.g. costs in scanning large tables to dedupe as well as unnecessary additional tables required to dedupe (raw, _SCD and final).

but i guess that’s the trade-off between at-least-once and definitely-not-duplicate records

alafanechere · April 14, 2022, 3:50pm

Thank you @sherifnada1 for the explanation!
Yes @danieldiamond you are right, if you want to guarantee uniqueness of records you need to run the normalization that will deduplicate the records.
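For readers who want to see what deduplication amounts to, here is a rough sketch of the idea. It is not Airbyte's normalization code: `dedupe_latest`, the `id` primary key and the sample rows are all hypothetical, and it simply keeps one row per key, preferring the highest cursor value.

```python
from typing import Dict, List

Row = Dict[str, str]


def dedupe_latest(raw: List[Row], primary_key: str = "id", cursor_field: str = "updated_at") -> List[Row]:
    """Keep one row per primary key, preferring the row with the highest cursor value."""
    latest: Dict[str, Row] = {}
    for row in raw:
        key = row[primary_key]
        if key not in latest or row[cursor_field] >= latest[key][cursor_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical raw rows after several at-least-once syncs.
raw_rows: List[Row] = [
    {"id": "a", "updated_at": "2020-01-01T00:00:10Z"},
    {"id": "b", "updated_at": "2020-01-01T00:01:05Z"},
    {"id": "b", "updated_at": "2020-01-01T00:01:05Z"},  # re-read on the next sync
    {"id": "b", "updated_at": "2020-01-01T00:02:00Z"},  # a genuine later update
]

final_rows = dedupe_latest(raw_rows)
print([(r["id"], r["updated_at"]) for r in final_rows])
```

A pass like this has to read every raw row, which is the scanning cost mentioned earlier in the thread.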

danieldiamond · April 15, 2022, 4:36am

@sherifnada1 interestingly I just stumbled upon this line
# +1 because the API returns all records where generated_timestamp >= start_time

Is that increment currently happening? I’m currently noticing missing records and I wonder if that would be why

github.com/airbytehq/airbyte/blob/d41c3f7d6f9e95b5c53087d2aa9a09956b0940ec/airbyte-integrations/connectors/source-zendesk-support/source_zendesk_support/streams.py#L245

        def request_params(
            self, stream_state: Mapping[str, Any] = None, next_page_token: Mapping[str, Any] = None, **kwargs
        ) -> MutableMapping[str, Any]:
            params = {}
            stream_state = stream_state or {}
            # try to search all records with generated_timestamp > start_time
            current_state = stream_state.get(self.cursor_field)
            if current_state and isinstance(current_state, str) and not current_state.isdigit():
                current_state = self.str2unixtime(current_state)
            start_time = current_state or calendar.timegm(pendulum.parse(self._start_date).utctimetuple())
            # +1 because the API returns all records where generated_timestamp >= start_time
            now = calendar.timegm(datetime.now().utctimetuple())
            if start_time > now - 60:
                # start_time must be more than 60 seconds ago
                start_time = now - 61
            params["start_time"] = start_time
            return params

        def read_records(

alafanechere · April 29, 2022, 12:34pm

Hi Daniel, did you try to tweak this method to check if it solves your missing records issue?
I don’t get the # + 1 comment as I don’t see an incremented variable in this method
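To make the question concrete, here is a hedged sketch of the two variants. `next_start_time` and its `skip_cursor_second` flag are hypothetical and do not exist in the connector; the 60-second clamp is copied from the snippet quoted above, and timestamps are assumed to be Unix seconds.

```python
def next_start_time(cursor: int, now: int, skip_cursor_second: bool) -> int:
    """Compute the start_time request parameter from the saved cursor.

    skip_cursor_second=False matches the quoted method (no increment).
    skip_cursor_second=True is what the "+1" comment would imply.
    """
    start_time = cursor + 1 if skip_cursor_second else cursor
    if start_time > now - 60:
        # start_time must be more than 60 seconds ago
        start_time = now - 61
    return start_time


print(next_start_time(1000, 2000, skip_cursor_second=False))  # 1000
print(next_start_time(1000, 2000, skip_cursor_second=True))   # 1001
print(next_start_time(1990, 2000, skip_cursor_second=False))  # 1939
```

With an API that returns `generated_timestamp >= start_time`, the first variant re-reads the record at the cursor, while the second skips it along with any other record that shares that second.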

danieldiamond · May 5, 2022, 12:54pm

It would go against the at-least-once philosophy, so I’m inclined not to modify this. It would act differently from all other connectors, and I’d prefer consistency in the interest of understanding how they all work.

harshith · July 25, 2022, 5:03am

Hey is this resolved?
