Thought it would be worth having a thread for best practices on data normalisation and loading with respect to connector data resets.
So you have your connector set up and you’re loading data via incremental-append from one database to another.
Some schema changes happen in your source - changed columns, new tables, whatever - and you now need to reset your data and run the sync again.
This can lead to:
- data loss, as that historical data no longer exists in the source
- data delays, since if you have many large tables the sync can take a very long time
In our case (and I'm sure many others'), we use dbt for our transforms. If we lose the landed airbyte data, any full refresh we do on downstream tables would be affected, losing the data history as mentioned.
I’d like to explore ways to handle this.
Hey, I hope this is what you are looking for: https://github.com/airbytehq/airbyte/issues/12089. You can comment on the issue.
That’s the dream! I understand right now it needs to be handled outside of airbyte though.
My thinking is we need a landing area that airbyte sends data to. We'd then have dbt do incremental updates, grabbing this data and loading it into a "source" table. Then if we need to reset the airbyte connection and reset the data, we can do so without losing the history in the "source" table (rough sketch below).
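Something like this dbt incremental model is what I have in mind. This is just a minimal sketch, assuming a hypothetical `airbyte_landing.orders` landing table with an `order_id` key and Airbyte's `_airbyte_emitted_at` metadata column (the exact metadata columns can vary by Airbyte version):

```sql
-- models/source_orders.sql (hypothetical names throughout)
-- Persist Airbyte-landed rows into a dbt-managed "source" table so the
-- history survives an Airbyte connection/data reset.
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select
    order_id,
    customer_id,
    amount,
    _airbyte_emitted_at
from {{ source('airbyte_landing', 'orders') }}

{% if is_incremental() %}
  -- only pick up rows landed since the last run of this model,
  -- so an Airbyte reset that truncates the landing table doesn't
  -- wipe out what we've already accumulated here
  where _airbyte_emitted_at > (select max(_airbyte_emitted_at) from {{ this }})
{% endif %}
```

Downstream models would then select from this "source" table rather than from the airbyte-landed table directly, so a reset of the connection only affects the landing area.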
Or maybe people are using the airbyte tables directly - backing up before any schema change reset and then adding the backed-up data back in? Seems messier, but it means you're not storing the entire dataset twice (once in the airbyte landed tables and again in the dbt "source" tables).
Another possibility is just backing up the table, then upserting it back in after the reset (I don't know if this messes with airbyte though, and it generally seems like poor practice).
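For illustration, the backup-and-restore idea would look roughly like this. This is a hedged sketch with hypothetical table and column names (`airbyte_landing.orders`, `order_id`), and I haven't verified how writing back into an airbyte-managed table interacts with later syncs, which is exactly the concern above:

```sql
-- 1) Before the reset: snapshot the landed table.
create table backup.orders_pre_reset as
select * from airbyte_landing.orders;

-- 2) After the reset and re-sync: re-insert historical rows that the
--    fresh sync no longer contains (i.e. rows deleted from the source).
insert into airbyte_landing.orders
select b.*
from backup.orders_pre_reset b
left join airbyte_landing.orders o
  on o.order_id = b.order_id
where o.order_id is null;
```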
Hey, that can be done using our custom dbt feature - you write custom dbt and add it to the airbyte connection configuration. You can check it out here: https://docs.airbyte.com/understanding-airbyte/operations#dbt-transformations
Thanks! I’ll try it out and let you know how I get on. I was a bit concerned when I saw that before that it would still leave us open to data loss during schema changes/resets. I would expect that no matter what you put in the custom dbt transformation, it will still do a full refresh (and cause data loss) when the connection is reset, right?