Sync BigQuery Daily Partitioned Tables to another destination

alexberryman · June 16, 2022, 7:19pm

Our Google Analytics data lands in tables named:

ga_sessions_20220614
ga_sessions_20220613
ga_sessions_20220612

Using the AirByte UI we’d have to re-scan the schema, and reimport all the data each day to detect new tables that are created each day in BigQuery. The other anti-pattern solution would be to create a new connector for each day that only contains a single daily table, then delete the connector once the sync is complete.

What is the best low-code way of syncing these newly created tables each day to another source, or in my case to Snowflake?

alafanechere · June 20, 2022, 10:03am

Hey @alexberryman,
Let me know if I got this right:

You have Google Analytics data automatically replicated to BigQuery, having one table per day of data.
You are using Airbyte to replicate these data to Snowflake
As you have a new table every day you need to change the stream selection in Airbyte UI to sync the new days. It leads to a reset and a full reload.

We are currently working on trying to avoid a full reload when the stream selection changes. So in a near future, you might be able to use our API to automatically select the new source table to sync and avoid the full reload.
This is still not an optimal solution as you’ll have to maintain additional scripting on top of Airbyte.

In my opinion, the best approach to take would be to create a view in Bigquery that is unionizing the daily tables and sync this unionized with Airbyte to Snowflake. You can use wildcard queries in BQ to create the view.

alexberryman · June 21, 2022, 7:49pm

Thanks for your reply!

Your recap accurately describes my problem. The in-progress “avoid a full reload when the stream selection” changes sounds like the best long-term solution.

We were planning on unioning the individual tables once they were copied to Snowflake, but I like the idea of doing this on the BigQuery side. We’re setting up a BigQuery view now to sync with Airbyte. We’re curious to see if an “Incremental Deduped History” in Airbyte has to scan all the terabytes of data for each sync or if it’ll minimal so we can use a high sync frequency without racking up a bunch of BigQuery charges.

I’ll report back on our findings.

alafanechere · June 22, 2022, 11:10am

The Incremental Deduped history relies on “technical” scd tables on the destination side which should optimize the deduplication process. Feel free to open a new topic on the forum if the performance are not those you’d expect.

marcosmarxm · July 13, 2022, 12:00am

Hi there from the Community Assistance team.
We’re letting you know about an issue we discovered with the back-end process we use to handle topics and responses on the forum. If you experienced a situation where you posted the last message in a topic that did not receive any further replies, please open a new topic to continue the discussion. In addition, if you’re having a problem and find a closed topic on the subject, go ahead and open a new topic on it and we’ll follow up with you. We apologize for the inconvenience, and appreciate your willingness to work with us to provide a supportive community.

Topic		Replies	Views
Best practices for syncing large tables in Airbyte Connector Questions airbyte , connector , incremental-sync , question , best-practices	12	1201	July 14, 2024
Recommended approach to performing a one-time partial data resync? Q&A data-loading	6	661	August 11, 2022
High costs due to data load into BigQuery Connector Questions & Issues destination-bigquery	4	578	July 14, 2022
Detect new tables after the source added Connector Development	1	528	August 22, 2022
Incremental append and deduped + history Connector Questions & Issues source-mysql , destination-bigquery , data-loading , connectors	4	1145	August 8, 2022

Sync BigQuery Daily Partitioned Tables to another destination

Related topics