Best practices for syncing large tables in Airbyte

Summary

Exploring best practices for syncing large tables in Airbyte, specifically whether it’s better to set up a connector per table for incremental syncs.


Question

Hi, I’ve used Airbyte to manually load data from MySQL to Postgres using a full refresh. There are some tables that should only require periodic refreshes, but there’s a decent subset of at least 20 tables I’d like to keep in sync daily. I’m curious what the best practices are, particularly for large tables. Is it better to set up a connector per table I’d like to keep in sync incrementally, so they can run in parallel rather than in sequence?



This topic has been created from a Slack thread to give it more visibility.
It is in read-only mode here.


["airbyte", "syncing-large-tables", "best-practices", "connector", "incremental-sync"]

Hi PJ - Great question that ultimately boils down to your requirements.

When the streams are all in the same connection, they are processed sequentially, so there’s a single job that will run longer overall.

With a connection per table, you will achieve greater overall throughput, but your CPU and memory consumption will also increase, as each connection will spin up its own jobs.

Without knowing your data volume it’s difficult to provide guidance, but a daily frequency leads me to believe you’d be okay with having 2 connections:

  1. One for the tables you want to run daily
  2. Another for the tables to sync infrequently (using a cron string; see the example schedules sketched below)

The consideration then is how long the daily job takes if they are full refresh tables. If you need to increase throughput, you can split that connection up further.
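For illustration, here is a minimal sketch of what those two schedules might look like. Airbyte’s cron scheduling accepts Quartz-style expressions (with a leading seconds field); the specific times, expressions, and connection names below are placeholder assumptions, not values from the thread, so verify the format against your Airbyte version.

```python
# Hypothetical cron schedules for the two-connection split described above.
# These Quartz-style strings and connection names are illustrative placeholders;
# adjust the timing to your own needs and confirm the expression format in the
# Airbyte scheduling docs for your version.

SCHEDULES = {
    # Connection 1: the ~20 tables that need a daily sync, e.g. 02:00 UTC.
    "daily_tables": "0 0 2 * * ?",
    # Connection 2: tables that only need a periodic refresh, e.g. Sundays at 03:00 UTC.
    "infrequent_tables": "0 0 3 ? * SUN",
}

if __name__ == "__main__":
    for connection_name, cron_expression in SCHEDULES.items():
        print(f"{connection_name}: {cron_expression}")
```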

Hi Wass, thank you for your reply. I’m dealing with some tables in the GB range, the largest being around 14 GB, that I need to keep in sync; the majority are in the MB range. I was thinking about having a connection per table and then staggering them based on a cron so it would be easier to manually refresh if needed. It sounds like this may not be the way to go either, as I’ll likely run into jobs stepping on each other without grouping them. Is this correct?

That approach is fine with a large enough machine(s) - basically, each job requires its own default CPU and memory, so if there isn’t enough, some jobs will fail to start. The issue is it could be overkill for the amount of data you need to move.
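A rough way to sanity-check this is to multiply your per-job resource requests by the number of connections that could fire at once. In self-hosted Airbyte the per-job requests are typically controlled by settings such as JOB_MAIN_CONTAINER_CPU_REQUEST and JOB_MAIN_CONTAINER_MEMORY_REQUEST (with matching LIMIT variables); the numbers in this sketch are assumptions for illustration, not Airbyte defaults.

```python
# Back-of-envelope sizing for running many connections in parallel.
# The per-job figures below are assumed values for illustration only; substitute
# the resource requests actually configured for your deployment (e.g. the
# JOB_MAIN_CONTAINER_CPU_REQUEST / JOB_MAIN_CONTAINER_MEMORY_REQUEST settings
# in self-hosted Airbyte).

CPU_PER_JOB = 0.5            # assumed CPU request per sync job (cores)
MEMORY_PER_JOB_GB = 1.0      # assumed memory request per sync job (GB)
CONCURRENT_CONNECTIONS = 20  # one connection per table, all running at once


def required_capacity(jobs):
    """Return (cpu_cores, memory_gb) needed for `jobs` simultaneous syncs."""
    return jobs * CPU_PER_JOB, jobs * MEMORY_PER_JOB_GB


cpu, mem = required_capacity(CONCURRENT_CONNECTIONS)
print(f"{CONCURRENT_CONNECTIONS} parallel syncs need ~{cpu} cores and ~{mem} GB "
      "just to satisfy resource requests; jobs that can't get this will fail to start.")
```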

Any reason you can’t use incremental CDC sync instead of full refresh? This would resolve a lot of issues for you.

I had some large data to start and was trying to get a sense of the total volume, so I used the full refresh approach. I’ve started switching them to incremental now that I’ve moved the data needed the first time through, and that large 14 GB table now only took 27 s to sync, as there were only 61 records overnight. I’m planning to add more streams to this connector over the next few days and keep an eye on the load.

I am curious about the warning message that pops up recommending deleting data when you change the sync type for a stream. If I used full refresh to start, and the first incremental + dedup sync acts the same as a full refresh, why would I want to delete the data in the destination and have to re-insert all of it when changing from full refresh to incremental + dedup?

Refreshes are recommended when switching sync modes because the incremental method requires a cursor to track changes. When starting on Full Refresh, that cursor is not tracked or stored, so an initial refresh is required.
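Conceptually, the cursor is just a high-water mark on a column such as an updated-at timestamp; a full-refresh-only history never records one, which is why the switch needs an initial backfill. The sketch below illustrates the idea only; the table and column names are made up, and this is not Airbyte’s actual implementation.

```python
# Conceptual sketch of incremental reads driven by a cursor (illustrative only;
# not Airbyte's actual implementation). Table and column names are invented.

last_cursor = None  # a full-refresh-only history leaves no stored cursor value


def build_incremental_query(table, cursor_field, cursor_value):
    """Read only rows newer than the last recorded cursor value."""
    if cursor_value is None:
        # No cursor recorded yet -> the first incremental run has to read
        # everything, which is effectively a full refresh / backfill.
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE {cursor_field} > '{cursor_value}'"


print(build_incremental_query("orders", "updated_at", last_cursor))
print(build_incremental_query("orders", "updated_at", "2024-07-01 00:00:00"))
```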

Interesting, so what happens if incremental + dedup has some issue and you need to execute a full refresh to resolve it? Do you execute the full refresh, then change back to incremental + dedup and delete the data downstream, having to load the data twice?

There is a separate process, independent of sync mode, called refreshes (https://docs.airbyte.com/operator-guides/refreshes). In those scenarios, keep the sync mode as incremental and just trigger a stream-level or connection-level refresh 🙂

I see the option in Settings, but there’s a message saying it’s not available for my connector. Is this not available in the open-source self-hosted version?

Ah okay, my version is 0.63.4

Ah yeah, you’ll need to upgrade to get refreshes. Otherwise, you can still run a clear, which is nearly the same process except that the data is first fully removed from the destination before backfilling.
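If you’d rather script the clear than click through the UI, a rough sketch of what that could look like is below. The host, connection ID, and auth are placeholders; the endpoint path should be verified against the API docs for your Airbyte version (older self-hosted versions exposed this as a "reset" on the internal Config API, while newer versions call the equivalent operation a "clear").

```python
# Rough sketch of triggering a connection-level clear/reset from a script.
# AIRBYTE_URL and CONNECTION_ID are placeholders, authentication is omitted,
# and the endpoint path must be checked against your Airbyte version's API docs.

import requests

AIRBYTE_URL = "http://localhost:8000"                    # placeholder host
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"   # placeholder connection ID

resp = requests.post(
    f"{AIRBYTE_URL}/api/v1/connections/reset",
    json={"connectionId": CONNECTION_ID},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```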

Clear removes everything from all streams on the connection though, right? It would be nice to be able to target. If I set up a stream on a separate connection as incremental + dedup to target only that stream for deletion of data downstream and backfill, can I then add it back to another connection I have set on a daily cron?

You can clear specific streams on the Status tab of a connection. There will be a button for each stream with a Clear data option.