Recommended approach to performing a one-time partial data resync?


Airbyte ingests hourly data from our app DBs to BigQuery in sync mode Incremental → Deduped + history.

For some reason, we have missing or outdated records in BigQuery. I’ve identified these records. I’d like to [re]ingest them in BigQuery.

I can’t perform a full refresh because the tables are too huge.

Is there a recommended approach to performing a one-time partial data resync?


Hey @Florent, thanks for your question, it’s certainly a good one. Could you talk more about how many of these records you’ve identified that you’d want to partially resync? I don’t know if there is a recommended approach but have you thought about creating a separate connector with the same BigQuery destination but with the source being those identified records? Just trying to throw an idea out there, let me know if that makes sense.

Thank @sajarin for your answer!

There are about 100k records. We thought about the separate connector, creating in the source a PostgreSQL view that lists these records. But that’s quite involved.

Another approach we may go with would be to no longer use the updated_at column as the cursor field. Instead, we would use a new to_ingest_at column that a DB trigger would automatically update with the updated_at value anytime the updated_at value is changed. So Airbyte would still ingest recently updated records.

To resync the records missing or outdated in BigQuery, we would just reset their to_ingest_at value to the current time, and thus our regular Airbyte pipeline would automatically re-ingest them.

We could skip this new column and just reset the updated_at value of the records but the updated_at value for other purposes in the application; we prefer to avoid changing this value if that’s only to re-ingest record in BigQuery.

Do you think it makes sense?

Hey @Florent, creating a new column that tracks the updated_at value seems to make sense and it’s a pretty clever workaround. I know it’s been some time since your last post. Have you managed to resync the data? Did you end up using the to_ingest_at column or did you do something else to resolve the issue?

We’re exploring another solution, I’ll report in this thread our final solution once it’s completed.

The final decision is to put in place a separate connection to do a one-time full re ingestion of the records created or updated in the past 12 months (to “fix” records currently missing or outdated in BigQuery) and another connection to perform recurring monthly full reingestion of the records created or updated in the past month (to “fix” records that got missing or outdated in BigQuery during the past month).

Thanks @Florent for coming back here to report your team’s solution!