I am having an API with for each model 3 different endpoints:
-
/data.xml
Giving back all records until now
-
/updates.xml
Giving back all updates since a certain point in time
-
/deletes.xml
Giving back all deletes since a certain point in time
What are the best practices setting up an incremental sync in Python knowing the concept of these 3 endpoints. With only updates it was easy, and I could use the incremental sync - deduped history concept I suppose after reading about the sync modes. But how can I implement this if I want to also manage incremental deletes? Or is the only way with this set-up to use a full refresh every time, and incremental isn’t possible?
Or is the idea that I should add another field to the model in the data warehouse, for instance deleted true/false, and handle the “deletes” as an actual update to the records where deleted is set to true?
The upside to this is that you still have those records in your data warehouse.Does anyone know any connectors with similar behavior in airbyte (python), would be nice to take a look at an example implementation.
Hey @ramonvermeulen,
Thank you for this great question.
In the context of the development of a new connector here are my guidelines:
-
The source connector should not be aware nor dependent on the data available at the destination. Source and destination must be independent, the source should not read the destination to replicate new date
-
Following ELT (Extract / Load / Transform) principles I would suggest replicating and archiving all the states of a record you can have, giving the responsibility to the data warehouse to represent the latest state of a record.
For your use case, I believe your XML endpoint can’t be queried for a specific time interval so here’s my suggestion:
-
Declare a full refresh stream for data.xml
-
Declare incremental streams for updates.xml
and deletes.xml
and drop records whose cursor value are older than the latest cursor value stored in the state object.
-
If the updates.xml
gives you the whole state of the record on each update
and not only the updated
fields:
- I suggest you build a final table in your data warehouse (outside of Airbyte or with custom DBT transformation with Airbyte): retrieve the latest state of each record with a
group by
on the record id and use a MAX aggregation on the cursor field + LEFT JOIN
with the deletes
table on record id.
Let me know if it makes sense for your use case!
Hi there from the Community Assistance team.
We’re letting you know about an issue we discovered with the back-end process we use to handle topics and responses on the forum. If you experienced a situation where you posted the last message in a topic that did not receive any further replies, please open a new topic to continue the discussion. In addition, if you’re having a problem and find a closed topic on the subject, go ahead and open a new topic on it and we’ll follow up with you. We apologize for the inconvenience, and appreciate your willingness to work with us to provide a supportive community.