Syncing data sources to different BigQuery datasets

slack-user-airbyte · July 2, 2024, 6:12am

Summary

Deciding whether to sync several data sources (e.g. Intercom, PostHog, a Postgres database) into one shared BigQuery dataset or to give each source its own dataset.

Question

Maybe dumb question: should I be syncing different data sources (e.g. Intercom, PostHog, Postgres DB) into the same BigQuery dataset, or should I have a different dataset for each one?



slack-user-airbyte · July 3, 2024, 6:17am

Thank you, super helpful! Went with the second approach

slack-user-airbyte · July 11, 2024, 6:16am

Also, not at all a dumb question. It's always smarter to ask stuff like this now than to have to ask how to untangle the mess later!

slack-user-airbyte · July 16, 2024, 6:16am

Airbyte doesn’t enforce any constraint here, and it really depends on your needs—but here are the two models I would personally recommend:

Agency-like orgs:
• Airbyte lives in a separate project (helps with access control and cost reporting)
• One dataset per client
• All the client’s sources under that dataset
• This segregates client data while minimizing unique datasets (but also allows for client-level cost reporting from the GCP log sink if needed)
• Data modeling (e.g. dbt) builds models into either a dedicated modeling project or your production project

Internal org use:
• Airbyte lives in a separate project (separates access control/billing)
• One dataset per source system
• Similar benefits for reporting and management, making it easy to delete data no longer needed
• Data modeling (e.g. dbt) builds models into either a dedicated modeling project or your production project

In both of these cases, I would generally leave all the Airbyte temp tables in airbyte_internal (the default), unless you have strict compliance requirements that would prevent intermixing at that transient level. Keeping them there just reduces the noise in the client/system datasets, so all you see is the final output tables.
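
To make the two layouts above concrete, here is a rough Python sketch of how sources would map to dataset names under each model. Nothing here is Airbyte API: `dataset_id` and `plan_datasets` are hypothetical helpers, the client names are made up, and the cleanup rule is a conservative assumption (lowercase letters, digits, and underscores only). You would still enter the resulting names in your Airbyte destination or connection settings yourself.

```python
import re


def dataset_id(name: str) -> str:
    """Turn a client or source name into a conservative dataset name.

    Hypothetical helper: keeps lowercase letters, digits and underscores only.
    """
    cleaned = re.sub(r"[^a-z0-9_]+", "_", name.strip().lower()).strip("_")
    if not cleaned:
        raise ValueError(f"cannot derive a dataset name from {name!r}")
    return cleaned


def plan_datasets(connections, model):
    """Group sources by target dataset.

    connections: iterable of (client, source) pairs.
    model: "agency"   -> one dataset per client, all its sources inside
           "internal" -> one dataset per source system (client is ignored)
    """
    if model not in ("agency", "internal"):
        raise ValueError(f"unknown model: {model!r}")
    plan = {}
    for client, source in connections:
        key = dataset_id(client if model == "agency" else source)
        plan.setdefault(key, []).append(source)
    return plan


# Internal org use: one dataset per source system.
internal = plan_datasets(
    [("", "Intercom"), ("", "PostHog"), ("", "postgres db")], "internal"
)

# Agency-like org: one dataset per (made-up) client.
agency = plan_datasets(
    [("Acme Corp", "Intercom"), ("Acme Corp", "PostHog"), ("Globex", "Intercom")],
    "agency",
)
```

With the sources from the question, the internal layout gives three datasets (intercom, posthog, postgres_db), while the agency layout gives one dataset per client with that client's sources grouped inside it.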

The advantage of having some level of segregation is that it makes searching/filtering/reporting easier, and it also makes it easier to purge data for systems or clients that are no longer needed. And keeping each ETL system in its own project not only helps you know the total cost of ownership, it also means that when you migrate you can shut projects down safely, knowing they're no longer in use for the old system.
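
As a small illustration of the purge point: when a source or client has its own dataset, removing its data is a single statement. The sketch below just builds BigQuery's `DROP SCHEMA ... CASCADE` DDL as a string. The helper `drop_dataset_sql` and the project name `airbyte-raw` are hypothetical, and the name checks are deliberately simplified assumptions, so review the output before running it anywhere real.

```python
import re


def drop_dataset_sql(project: str, dataset: str) -> str:
    """Build the DDL that deletes a dataset and every table in it.

    Hypothetical helper. The checks below are simplified and only guard
    against obviously malformed names ending up inside the backticks.
    """
    if not re.fullmatch(r"[a-z][a-z0-9-]*", project):
        raise ValueError(f"unexpected project ID: {project!r}")
    if not re.fullmatch(r"[A-Za-z0-9_]+", dataset):
        raise ValueError(f"unexpected dataset name: {dataset!r}")
    # CASCADE drops the dataset together with the tables it contains.
    return f"DROP SCHEMA IF EXISTS `{project}.{dataset}` CASCADE;"


# Example: Intercom is no longer used, so its dataset can go.
statement = drop_dataset_sql("airbyte-raw", "intercom")
print(statement)
```

With everything in one shared dataset, the same cleanup would mean finding and dropping that source's tables one by one.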

In cases where you need stricter compliance, it may be more desirable to have a separate destination for each client and write them into their own projects (for example, if some need strict EU data residency and others don’t).

Again, just my personal suggestions . . . but choosing a good architecture early on just makes your future life easier.
