Summary
The user is trying to create an incremental substream in Connector Builder that pulls only the survey responses submitted since the last sync, filtering on the ‘date_submitted’ field. However, Airbyte is extracting and loading all survey responses on every sync. The user provided a configuration snippet for the incremental sync.
Question
Having successfully created an incremental stream, I am now trying to create an incremental substream. My substream is `https://api.alchemer.com/v5/survey/<survey_id>/surveyresponse`, where I iterate through the surveys to pull the survey responses for each survey. Each response has a `date_submitted` field, and the API allows filtering based on this field. So, for each survey, I would like to pull only the responses that were submitted since the last sync (and I will iterate through all surveys on each sync). I am using the Connector Builder.
However, Airbyte extracts and loads all survey responses for each survey on every sync.
I am using the following configuration:
```
type: DatetimeBasedCursor
datetime_format: '%Y-%m-%d+%H:%M:%S'
cursor_field: date_submitted
cursor_datetime_formats:
  - '%Y-%m-%d %H:%M:%S EST'
start_datetime:
  type: MinMaxDatetime
  datetime: '{{ config[''start_date''] }}'
  datetime_format: '%Y-%m-%dT%H:%M:%SZ'
start_time_option:
  type: RequestOption
  field_name: filter[value][0]
  inject_into: request_parameter
```
What am I doing wrong? Do incremental substreams need to be configured differently than incremental streams?
---
This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. [Click here](https://airbytehq.slack.com/archives/C027KKE4BCZ/p1704132730898039) if you want to access the original thread.
[Join the conversation on Slack](https://slack.airbyte.com)
Based on the information provided in the extracts, it seems like you have correctly set up the `incremental_sync` configuration for your substream. However, there are a few things you might want to check:
1. Cursor Field: Ensure that the `cursor_field` is correctly set to the field that represents the “last changed” or “updated at” timestamp in your data. In your case, it should be `submitted_at` if that’s the field in your data that represents when the record was last updated.
2. Datetime Format: Make sure the `datetime_format` matches the format of the date and time in your `cursor_field`. You have mentioned two different datetime formats in your configuration (`'%Y-%m-%d+%H:%M:%S'` and `'%Y-%m-%d %H:%M:%S EST'`). Ensure that these formats correctly match the format of the `submitted_at` field in your data.
3. Start and End Datetime: The `start_datetime` and `end_datetime` fields specify the time range for fetching records. If you’re using the `MinMaxDatetime` type for `start_datetime`, make sure its `datetime` field is correctly set to the start of the time range. The `end_datetime` is usually set to the current date and time when the sync starts, so that all changes made so far are synced.
4. Request Options: The `start_time_option` and `end_time_option` fields set request options. Make sure `field_name` is set to the request parameter that should be used to filter the data by start and end time, and that `inject_into` is set to `request_parameter` if the start and end time should be added as query parameters to the request.

If you’ve checked all these points and you’re still facing issues, it might be helpful to refer back to the Airbyte documentation or ask for help on the Airbyte forum.

Remember, incremental syncs work by pulling only the data that has changed since the previous sync. This is usually implemented using a cursor value (like a timestamp) that delineates which data has already been pulled and which data is new. The cursor value is typically an `updated_at` timestamp: records whose `updated_at` value is less than or equal to the cursor value have already been synced, and the next sync should only export records whose `updated_at` value is greater than the cursor value.
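To make points 3 and 4 concrete, here is a minimal sketch of an `incremental_sync` block that also closes the window with `end_datetime` and `end_time_option`. The `now_utc()` macro usage and the `filter[value][1]` parameter name are assumptions for illustration; check them against the declarative framework reference and the Alchemer filter syntax:
```
incremental_sync:
  type: DatetimeBasedCursor
  cursor_field: date_submitted
  # Format used when the cursor value is injected into outgoing requests.
  datetime_format: '%Y-%m-%d+%H:%M:%S'
  # Format(s) used to parse date_submitted from the records themselves.
  cursor_datetime_formats:
    - '%Y-%m-%d %H:%M:%S EST'
  start_datetime:
    type: MinMaxDatetime
    datetime: '{{ config[''start_date''] }}'
    datetime_format: '%Y-%m-%dT%H:%M:%SZ'
  # Close the window at the time the sync starts (assumed macro usage).
  end_datetime:
    type: MinMaxDatetime
    datetime: '{{ now_utc().strftime(''%Y-%m-%d+%H:%M:%S'') }}'
    datetime_format: '%Y-%m-%d+%H:%M:%S'
  start_time_option:
    type: RequestOption
    field_name: filter[value][0]
    inject_into: request_parameter
  # Hypothetical second filter slot for the upper bound.
  end_time_option:
    type: RequestOption
    field_name: filter[value][1]
    inject_into: request_parameter
```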
<@U04SE163WC8> is the `start_datetime` required?
Yes, the `start_datetime` is required when configuring a stream for incremental syncs in Airbyte. It represents the initial start date of the time range to fetch records for. When doing incremental syncs, the second sync will overwrite this date with the last record that got synced so far.
In the context of the Python CDK tutorial, the `start_date` parameter is passed into the stream and used to specify the date to pull exchange rates for. It is also added to the constructor of the `ExchangeRates` class. Here is an example of how it’s used in the code:
```python
def streams(self, config: Mapping[str, Any]) -> List[Stream]:
    auth = NoAuth()
    # Parse the date from a string into a datetime object
    start_date = datetime.strptime(config['start_date'], '%Y-%m-%d')
    return [ExchangeRates(authenticator=auth, config=config, start_date=start_date)]
```
And in the constructor:
```python
class ExchangeRates(HttpStream, IncrementalMixin):
    url_base = "https://api.apilayer.com/exchangerates_data/"
    cursor_field = "date"
    primary_key = "date"

    def __init__(self, config: Mapping[str, Any], start_date: datetime, **kwargs):
        super().__init__()
        self.base = config['base']
        self.apikey = config['apikey']
        self.start_date = start_date
        self._cursor_value = None
```
Sources:
- Python CDK Tutorial
- Connector Builder UI
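In the low-code Connector Builder (which the original question uses), the same role is played by `start_datetime` inside the `DatetimeBasedCursor`. A minimal sketch, assuming a daily `%Y-%m-%d` cursor format for illustration:
```
incremental_sync:
  type: DatetimeBasedCursor
  cursor_field: date
  datetime_format: '%Y-%m-%d'
  start_datetime:
    type: MinMaxDatetime
    # The first sync starts from the user-supplied start date; later syncs
    # resume from the cursor value saved in the stream state.
    datetime: '{{ config[''start_date''] }}'
    datetime_format: '%Y-%m-%d'
```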
I am trying to check the value of the `start_datetime` in the logs when the second sync runs. I see that the state is printed to the logs. The cursor value (`2024-01-01+17:36:46`) is what I expect, but Airbyte does not seem to use this value in the request and instead syncs all records submitted after `2023-12-01+00:00:00`. Is this a bug?
```
2024-01-01 17:39:04 source > Setting state of ManifestDeclarativeSource stream to {'states': [{'cursor': {'date_submitted': '2024-01-01+17:36:46'}, 'partition': {'parent_id': '7628838', 'parent_slice': {'end_time': '2024-01-01+17:36:46', 'start_time': '2023-12-01+00:00:00'}}}]}
```
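For context, the per-partition `states` list in that log line is what the declarative framework emits when a `DatetimeBasedCursor` is combined with a `SubstreamPartitionRouter`. A rough sketch of that substream layout, assuming a parent surveys stream and reusing the `parent_id` partition field from the log (the stream names, record field path, and `$ref` target are illustrative, not the actual manifest):
```
survey_responses_stream:
  type: DeclarativeStream
  name: survey_responses
  retriever:
    type: SimpleRetriever
    requester:
      type: HttpRequester
      url_base: https://api.alchemer.com/v5
      # The parent survey id is interpolated into the path for each partition.
      path: '/survey/{{ stream_partition.parent_id }}/surveyresponse'
    record_selector:
      type: RecordSelector
      extractor:
        type: DpathExtractor
        field_path:
          - data   # assumed response envelope
    partition_router:
      type: SubstreamPartitionRouter
      parent_stream_configs:
        - type: ParentStreamConfig
          stream: '#/definitions/surveys_stream'   # hypothetical parent stream
          parent_key: id
          partition_field: parent_id
  incremental_sync:
    # The DatetimeBasedCursor from the question attaches here; state is then
    # tracked per partition, which is why the log shows a cursor plus a
    # parent_slice for each survey.
    $ref: '#/definitions/date_submitted_cursor'    # hypothetical definition
```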
Hi Trevor, two questions to help me troubleshoot this:
• Does the survey response include a timestamp, or do you need to rely on the timestamp of the survey?
• Is 2023-12-01 your configured start date?
Can you also confirm whether the cursor field should be `submitted_at` or `date_submitted`? I’m seeing different values between the config and this statement:
> Each response has a `submitted_at` field
<@U033JKD3W8G> thanks for following up.
> • Does the survey response include a timestamp, or do you need to rely on the timestamp of the survey?
The survey response includes a timestamp.
> • Is 2023-12-01 your configured start date?
Yes.
> Can you also confirm whether the cursor field should be `submitted_at` or `date_submitted`?
It is `date_submitted`. I accidentally wrote `submitted_at` in my message (which I have now edited).
<@U033JKD3W8G> any updates here?
<@U0687FZ8PF0> we started looking into this issue last week. I’ll get back to you on Thursday with an updated status.