Syncing data sources to different BigQuery datasets

Summary

Determining whether to sync different data sources to different BigQuery datasets or have a different dataset for each one.


Question

Maybe dumb question: should I be syncing different data sources to different bigquery datasets (e.g. Intercom, PostHog, postgres db) or should I have a different dataset for each one?



This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want
to access the original thread.

Join the conversation on Slack

["syncing-data-sources", "bigquery-datasets", "intercom", "posthog", "postgres-db"]

Thank you, super helpful! Went with the second approach

Also, not at all a dumb question. always smarter to ask stuff like this now than to have to ask how to untangle the mess later!

Airbyte doesn’t enforce any constraint here, and it really depends on your needs—but here are the two models I would personally recommend:

Agency-like orgs:
• Airbyte lives in a separate project (helps with access control and cost reporting)
• One dataset per client
• All the client’s sources under that dataset
• This segregates client data while minimizing unique datasets (but also allows for client-level cost reporting from the GCP log sink if needed)
• Data modeling (e.g. dbt) builds models into either a dedicated modeling project or your production project
Internal org use:
• Airbyte lives in a separate project (separates access control/billing)
• One dataset per source system
• Similar benefits for reporting and management, making it easy to delete data no longer needed
• Data modeling (e.g. dbt) builds models into either a dedicated modeling project or your production project
In both of these cases, I would generally leave all the Airbyte temp tables in airbyte_internal (the default), unless you have strict compliance requirements that would prevent intermixing at that transient level—this just reduces the noise in the client/system datasets so all you see is the final output tables.

Advantages of having some level of segregation is that it makes searching/filtering/reporting easier, but it also makes it easier to purge data for systems or clients that are no longer needed. And keeping each ETL system in its own project not only helps you know total cost of ownership, it also means when you migrate you can shut projects done safely, knowing they’re no longer in use for the old system.

In cases where you need stricter compliance, it may be more desirable to have a separate destination for each client and write them into their own projects (for example, if some need strict EU data residency and others don’t).

Again, just my personal suggestions . . . but choosing a good architecture early on just makes your future life easier.