Configuring Incremental Substreams in Connector Builder

Summary

The user is trying to create an incremental substream in Connector Builder that pulls only the survey responses submitted since the last sync, using the date_submitted field as the cursor. However, Airbyte is extracting and loading all survey responses on every sync. The user provided their incremental sync configuration snippet.


Question

Having successfully created an incremental stream, I am now trying to create an incremental substream. My substream is https://api.alchemer.com/v5/survey/<survey_id>/surveyresponse, where I iterate through the surveys to pull the survey responses for each survey. Each response has a date_submitted field. The API allows filtering based on this field. So, for each survey, I would like to pull only the responses that were submitted since the last sync (and I will iterate through all surveys on each sync). I am using the connector builder.

However, Airbyte extracts and loads all survey responses for each survey on every sync.

I am using the following configuration:

```yaml
      type: DatetimeBasedCursor
      datetime_format: '%Y-%m-%d+%H:%M:%S'
      cursor_field: date_submitted
      cursor_datetime_formats:
        - '%Y-%m-%d %H:%M:%S EST'
      start_datetime:
        type: MinMaxDatetime
        datetime: '{{ config[''start_date''] }}'
        datetime_format: '%Y-%m-%dT%H:%M:%SZ'
      start_time_option:
        type: RequestOption
        field_name: filter[value][0]
        inject_into: request_parameter
```
What am I doing wrong? Do incremental substreams need to be configured differently than incremental streams?


---

This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here. [Click here](https://airbytehq.slack.com/archives/C027KKE4BCZ/p1704132730898039) if you want to access the original thread.

[Join the conversation on Slack](https://slack.airbyte.com)


Based on the information provided in the extracts, it seems like you have correctly set up the incremental_sync configuration for your substream. However, there are a few things you might want to check:

1. Cursor Field: Ensure that the cursor_field is correctly set to the field that represents the "last changed" or "updated at" timestamp in your data. In your case, it should be submitted_at if that's the field in your data that represents when the record was last updated.
2. Datetime Format: Make sure the datetime_format matches the format of the date and time in your cursor_field. You have mentioned two different datetime formats in your configuration ('%Y-%m-%d+%H:%M:%S' and '%Y-%m-%d %H:%M:%S EST'). Ensure that these formats correctly match the format of the submitted_at field in your data.
3. Start and End Datetime: The start_datetime and end_datetime fields are used to specify the time range for fetching records. If you're using the MinMaxDatetime type for start_datetime, make sure the datetime field is correctly set to represent the start of the time range for fetching records. The end_datetime is usually set to the current date and time when the sync is started, to sync all changes that happened so far.
4. Request Options: The start_time_option and end_time_option fields are used to set request options. Make sure the field_name is correctly set to the request parameter that should be used to filter the data based on the start and end time. Also, ensure that inject_into is set to request_parameter if the start and end time should be added as a query parameter to the request.

If you've checked all these points and you're still facing issues, it might be helpful to refer back to the Airbyte documentation or ask for help on the Airbyte forum.

Remember, incremental syncs work by pulling only the data that has changed since the previous sync. This is usually implemented using a cursor value (like a timestamp) that delineates which data was pulled and which data is new. The cursor value is typically an updated_at timestamp: records whose updated_at value is less than or equal to that cursor value have been synced already, and the next sync should only export records whose updated_at value is greater than the cursor value.
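To tie those points together, the sketch below shows what a complete incremental_sync block along those lines could look like for this substream. The end_datetime value and the static filter[field][0] / filter[operator][0] parameters are assumptions about how the Alchemer filter is addressed, not settings confirmed in this thread:

```yaml
incremental_sync:
  type: DatetimeBasedCursor
  cursor_field: date_submitted
  # Format of date_submitted as it appears in API responses
  cursor_datetime_formats:
    - '%Y-%m-%d %H:%M:%S EST'
  # Format used when the cursor is injected into outgoing requests
  datetime_format: '%Y-%m-%d+%H:%M:%S'
  start_datetime:
    type: MinMaxDatetime
    datetime: "{{ config['start_date'] }}"
    datetime_format: '%Y-%m-%dT%H:%M:%SZ'
  end_datetime:
    type: MinMaxDatetime
    datetime: "{{ now_utc().strftime('%Y-%m-%dT%H:%M:%SZ') }}"
    datetime_format: '%Y-%m-%dT%H:%M:%SZ'
  start_time_option:
    type: RequestOption
    field_name: filter[value][0]
    inject_into: request_parameter
# (assumption) the requester would also carry static request parameters such as
#   filter[field][0]: date_submitted
#   filter[operator][0]: '>='
# so that the injected filter[value][0] is interpreted as a lower bound.
```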

<@U04SE163WC8> is the start_datetime required?

Yes, the start_datetime is required when configuring a stream for incremental syncs in Airbyte. It represents the initial start date of the time range to fetch records for. When doing incremental syncs, the second sync will overwrite this date with the cursor value of the last record that got synced so far.

In the context of the Python CDK tutorial, the start_date parameter is passed into the stream and used to specify the date to pull exchange rates for. It is also added to the constructor of the ExchangeRates class. Here is an example of how it's used in the code:

```python
def streams(self, config: Mapping[str, Any]) -> List[Stream]:
    auth = NoAuth()
    # Parse the date from a string into a datetime object
    start_date = datetime.strptime(config['start_date'], '%Y-%m-%d')
    return [ExchangeRates(authenticator=auth, config=config, start_date=start_date)]
```

And in the constructor:

```python
class ExchangeRates(HttpStream, IncrementalMixin):
    url_base = "https://api.apilayer.com/exchangerates_data/"
    cursor_field = "date"
    primary_key = "date"

    def __init__(self, config: Mapping[str, Any], start_date: datetime, **kwargs):
        super().__init__()
        self.base = config['base']
        self.apikey = config['apikey']
        self.start_date = start_date
        self._cursor_value = None
```

Sources:
- Python CDK Tutorial
- Connector Builder UI
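In the Connector Builder / low-code manifest, the `{{ config['start_date'] }}` value referenced earlier also has to be declared in the connector spec so it can be set when configuring the source. A minimal sketch is below; the title, description, and pattern are illustrative, not taken from this thread:

```yaml
spec:
  type: Spec
  connection_specification:
    $schema: http://json-schema.org/draft-07/schema#
    type: object
    required:
      - start_date
    properties:
      start_date:
        type: string
        title: Start Date
        description: Records submitted before this date will not be replicated.
        format: date-time
        pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
```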

I am trying to check the value of the start_datetime in the logs when the second sync runs. I see that the state is printed to the logs. The cursor value (2024-01-01+17:36:46) is what I expect, but Airbyte does not seem to use this value in the request; instead it syncs all records submitted after 2023-12-01+00:00:00. Is this a bug?

```
2024-01-01 17:39:04 source > Setting state of ManifestDeclarativeSource stream to {'states': [{'cursor': {'date_submitted': '2024-01-01+17:36:46'}, 'partition': {'parent_id': '7628838', 'parent_slice': {'end_time': '2024-01-01+17:36:46', 'start_time': '2023-12-01+00:00:00'}}}]}
```
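For context, the 'partition' entries in that state message come from the substream's partition router: each parent survey gets its own slice with its own cursor. A minimal sketch of a retriever that would produce partitions keyed by parent_id is shown below; the surveys stream reference, parent_key, record selector field path, and request path are assumptions rather than values confirmed in this thread:

```yaml
retriever:
  type: SimpleRetriever
  requester:
    type: HttpRequester
    url_base: https://api.alchemer.com/v5/
    # The parent survey id from the current partition is interpolated into the path
    path: "survey/{{ stream_partition.parent_id }}/surveyresponse"
  record_selector:
    type: RecordSelector
    extractor:
      type: DpathExtractor
      # (assumption) responses are returned under a top-level "data" key
      field_path: ["data"]
  partition_router:
    type: SubstreamPartitionRouter
    parent_stream_configs:
      - type: ParentStreamConfig
        # (assumption) the parent surveys stream exposes the survey id as "id"
        parent_key: id
        # This name is what shows up as the partition key in the state log above
        partition_field: parent_id
        stream: "#/definitions/streams/surveys"
```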

Hi Trevor, two questions to help me troubleshoot this:
• does the survey response include a timestamp or do you need to rely on the timestamp of the survey?
• Is 2023-12-01 your configured start date?

can you also confirm whether the cursor field should be submitted_at or date_submitted? I’m seeing different values between the config and this statement
> Each response has a submitted_at field

<@U033JKD3W8G> thanks for following up.

> • does the survey response include a timestamp or do you need to rely on the timestamp of the survey?
The survey response includes a timestamp

> • Is 2023-12-01 your configured start date?
Yes

> can you also confirm whether the cursor field should be submitted_at or date_submitted?
It is date_submitted. I accidentally wrote submitted_at in my message (which I have now edited).

<@U033JKD3W8G> any updates here?

<@U0687FZ8PF0> we started looking into this issue last week. I’ll get back to you on Thursday with an updated status