If I understood correctly, the state message only works if the cursor field is sorted in ascending order. Is my assumption correct?
You’re correct.
Since I can't get the data sorted, I was thinking of creating a unique field (an md5 hash of the record), which would be unique for each record. Then I would create a destination connector where data is written to disk (i.e. JSON lines). I would then make sure those files don't already contain entries matching the incoming hashes. What do you think about this approach?
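To make the idea concrete, here is a rough sketch of the hash-based dedup described above (the file path, record shape, and helper names are placeholders, not anything from Airbyte):

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    # Hash a canonical JSON serialization so key order doesn't matter.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def write_new_records(records: list[dict], path: str) -> None:
    # Collect hashes already present in the JSON-lines file, if it exists.
    seen = set()
    try:
        with open(path) as f:
            for line in f:
                seen.add(record_hash(json.loads(line)))
    except FileNotFoundError:
        pass
    # Append only records whose hash has not been seen before.
    with open(path, "a") as f:
        for rec in records:
            if record_hash(rec) not in seen:
                f.write(json.dumps(rec, sort_keys=True) + "\n")
```

Note that this dedups records at the destination; it does not reduce what the source has to read, which is the point raised in the reply below.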
Do you have a lot of data to sync? Is that why you want incremental? To be honest, I don't think paying the cost of checking in the destination whether each record already exists is the best approach.
Let’s do a little exercise:
You’re trying to sync orders from the API
[{"order_id": 1, "amount": 100, "created_at": "2021-01-01T00:00:00Z", "updated_at": "2021-01-01T00:00:00Z"}, {"order_id": 2, "amount": 30, "created_at": "2021-10-01T00:00:00Z", "updated_at": "2021-10-01T00:00:00Z"}]
the first sync will execute and you're going to save the state of the stream as updated_at = 2021-10-01T00:00:00Z
when doing the second sync your data at source is:
[{"order_id": 1, "amount": 150, "created_at": "2021-01-01T00:00:00Z", "updated_at": "2021-11-01T00:00:00Z"}, {"order_id": 2, "amount": 30, "created_at": "2021-10-01T00:00:00Z", "updated_at": "2021-10-01T00:00:00Z"}]
so the second sync will capture the update from the first record, because the call to the API will be something like:
api-example.com/orders?updated_at>=2021-10-01T00:00:00Z
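The flow above can be sketched like this (fetch_orders stands in for the real HTTP call to the placeholder api-example.com endpoint; the data is the second-sync example from above):

```python
# Source data at the time of the second sync (order 1 was updated).
ORDERS = [
    {"order_id": 1, "amount": 150, "updated_at": "2021-11-01T00:00:00Z"},
    {"order_id": 2, "amount": 30, "updated_at": "2021-10-01T00:00:00Z"},
]

def fetch_orders(updated_at_min: str) -> list[dict]:
    # Simulates GET /orders?updated_at>=<cursor> on the API side.
    return [o for o in ORDERS if o["updated_at"] >= updated_at_min]

def incremental_sync(state: dict) -> tuple[list[dict], dict]:
    # Read the cursor saved by the previous sync.
    cursor = state.get("updated_at", "1970-01-01T00:00:00Z")
    records = fetch_orders(cursor)
    # Advance the cursor to the newest value seen.
    # ISO-8601 timestamps in UTC compare correctly as strings.
    for rec in records:
        if rec["updated_at"] > state.get("updated_at", ""):
            state["updated_at"] = rec["updated_at"]
    return records, state

# Second sync, resuming from the state saved by the first sync:
records, state = incremental_sync({"updated_at": "2021-10-01T00:00:00Z"})
# Captures the update to order 1 and advances the cursor to 2021-11-01.
```

Because the filter is `>=`, a record whose cursor equals the saved state (order 2 here) is re-emitted; that's a common trade-off to avoid missing records that share the same timestamp.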
I don't think you can reproduce this with an md5 hash: the hash only tells you a record is new or changed after you've already fetched it, so the source still has to read everything on every sync.
Hope this helps. For your knowledge, some streams from some sources don't support incremental because of their API design, so Airbyte only does full refresh for them.