Sync BigQuery Daily Partitioned Tables to another destination

Our Google Analytics data lands in tables named:

  • ga_sessions_20220614
  • ga_sessions_20220613
  • ga_sessions_20220612

Using the AirByte UI we’d have to re-scan the schema, and reimport all the data each day to detect new tables that are created each day in BigQuery. The other anti-pattern solution would be to create a new connector for each day that only contains a single daily table, then delete the connector once the sync is complete.

What is the best low-code way of syncing these newly created tables each day to another source, or in my case to Snowflake?

Hey @alexberryman,
Let me know if I got this right:

  1. You have Google Analytics data automatically replicated to BigQuery, having one table per day of data.
  2. You are using Airbyte to replicate these data to Snowflake
  3. As you have a new table every day you need to change the stream selection in Airbyte UI to sync the new days. It leads to a reset and a full reload.

We are currently working on trying to avoid a full reload when the stream selection changes. So in a near future, you might be able to use our API to automatically select the new source table to sync and avoid the full reload.
This is still not an optimal solution as you’ll have to maintain additional scripting on top of Airbyte.

In my opinion, the best approach to take would be to create a view in Bigquery that is unionizing the daily tables and sync this unionized with Airbyte to Snowflake. You can use wildcard queries in BQ to create the view.

Thanks for your reply!

Your recap accurately describes my problem. The in-progress “avoid a full reload when the stream selection” changes sounds like the best long-term solution.

We were planning on unioning the individual tables once they were copied to Snowflake, but I like the idea of doing this on the BigQuery side. We’re setting up a BigQuery view now to sync with Airbyte. We’re curious to see if an “Incremental Deduped History” in Airbyte has to scan all the terabytes of data for each sync or if it’ll minimal so we can use a high sync frequency without racking up a bunch of BigQuery charges.

I’ll report back on our findings.

The Incremental Deduped history relies on “technical” scd tables on the destination side which should optimize the deduplication process. Feel free to open a new topic on the forum if the performance are not those you’d expect.