Airbyte incremental sync with incremental deletes

I am having an API with for each model 3 different endpoints:

  • /data.xml Giving back all records until now
  • /updates.xml Giving back all updates since a certain point in time
  • /deletes.xml Giving back all deletes since a certain point in time

What are the best practices setting up an incremental sync in Python knowing the concept of these 3 endpoints. With only updates it was easy, and I could use the incremental sync - deduped history concept I suppose after reading about the sync modes. But how can I implement this if I want to also manage incremental deletes? Or is the only way with this set-up to use a full refresh every time, and incremental isn’t possible?

Or is the idea that I should add another field to the model in the data warehouse, for instance deleted true/false, and handle the “deletes” as an actual update to the records where deleted is set to true?

The upside to this is that you still have those records in your data warehouse.Does anyone know any connectors with similar behavior in airbyte (python), would be nice to take a look at an example implementation.

Hey @ramonvermeulen,
Thank you for this great question.
In the context of the development of a new connector here are my guidelines:

  • The source connector should not be aware nor dependent on the data available at the destination. Source and destination must be independent, the source should not read the destination to replicate new date

  • Following ELT (Extract / Load / Transform) principles I would suggest replicating and archiving all the states of a record you can have, giving the responsibility to the data warehouse to represent the latest state of a record.
    For your use case, I believe your XML endpoint can’t be queried for a specific time interval so here’s my suggestion:

  • Declare a full refresh stream for data.xml

  • Declare incremental streams for updates.xml and deletes.xmland drop records whose cursor value are older than the latest cursor value stored in the state object.

  • If the updates.xml gives you the whole state of the record on each update and not only the updated fields:

    • I suggest you build a final table in your data warehouse (outside of Airbyte or with custom DBT transformation with Airbyte): retrieve the latest state of each record with a group by on the record id and use a MAX aggregation on the cursor field + LEFT JOIN with the deletes table on record id.

Let me know if it makes sense for your use case!