Summary
Exploring best practices for syncing large tables in Airbyte, specifically whether it’s better to set up a connection per table for incremental syncs.
Question
Hi, I’ve used Airbyte to manually load data from MySQL to Postgres using full refresh syncs. Some tables should only require periodic refreshes, but there’s a decent subset of at least 20 tables I’d like to keep in sync daily. I’m curious what the best practices are, particularly for large tables. Is it better to set up a connection per table I’d like to keep in sync incrementally, so they can run in parallel rather than in sequence?
This topic has been created from a Slack thread to give it more visibility.
["airbyte", "syncing-large-tables", "best-practices", "connector", "incremental-sync"]
Hi PJ - Great question that ultimately boils down to your requirements.
When the streams are all in the same connection, they are processed sequentially, so there’s a single job that will run longer overall.
With a connection per table, you will achieve greater overall throughput, but your CPU and memory consumption will also increase as each connection will spin up its own jobs.
Without knowing your data volume it’s difficult to provide guidance, but a daily frequency leads me to believe you’d be okay with having two connections:
- One for the tables you want to run daily
- Another for the tables to sync infrequently (using a cron string)
The consideration then is how long the daily job takes if those are full refresh tables. If you need to increase throughput, you can split that connection up further.
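If it helps to picture the split: Airbyte’s Cron schedule type takes Quartz-style cron expressions (with a leading seconds field), so the two connections might be scheduled roughly like the sketch below. The specific times are placeholder assumptions, not a recommendation.

```python
# Quartz-style cron expressions, as used by Airbyte's "Cron" schedule type.
# The times below are arbitrary placeholders.
DAILY_CONNECTION_CRON = "0 0 2 * * ?"         # daily connection: every day at 02:00
INFREQUENT_CONNECTION_CRON = "0 0 3 ? * SUN"  # infrequent connection: Sundays at 03:00

print(DAILY_CONNECTION_CRON)
print(INFREQUENT_CONNECTION_CRON)
```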
Hi Wass, thank you for your reply. I’m dealing with some tables in the GB range, the largest being around 14 GB, that I need to keep in sync; the majority are in the MB range. I was thinking about having a connection per table and then staggering them based on a cron, so it would be easier to manually refresh if needed. It sounds like this may not be the way to go either, as I will likely run into jobs stepping on each other without grouping them. Is this correct?
That approach is fine with large enough machine(s): basically, each job requires its own default CPU and memory, so if there isn’t enough, some jobs will fail to start. The issue is it could be overkill for the amount of data you need to move.
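If you do end up staggering connections and kicking some off manually, that can also be scripted instead of clicked. A minimal sketch, assuming a self-hosted instance whose Configuration API is reachable at http://localhost:8000/api/v1 behind the default basic-auth proxy; the credentials and connection IDs below are placeholders.

```python
import requests

# Assumptions: self-hosted Airbyte reachable at localhost:8000 behind the
# default basic-auth proxy; the connection IDs below are placeholders.
AIRBYTE_API = "http://localhost:8000/api/v1"
AUTH = ("airbyte", "password")

# Hypothetical connections, e.g. one per large table split out on its own.
CONNECTIONS = {
    "orders_daily": "11111111-1111-1111-1111-111111111111",
    "events_daily": "22222222-2222-2222-2222-222222222222",
}

def trigger_sync(connection_id: str) -> dict:
    """Start a manual sync for a single connection and return the job response."""
    resp = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for name, connection_id in CONNECTIONS.items():
        job_info = trigger_sync(connection_id)
        print(name, "->", job_info.get("job", {}).get("id"))
```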
Any reason you can’t use incremental CDC sync instead of full refresh? This would resolve a lot of issues for you.
I had some large data to start and was trying to get a sense of the total volume, so I used the full refresh approach. I’ve started switching streams to Incremental now that the initial data has been moved, and that large 14 GB table only took 27 s to sync since there were only 61 new records overnight. I’m planning to add more streams to this connection over the next few days and keep an eye on the load.
I am curious about the warning message that pops up recommending deleting data when you change the sync mode for a stream. If I used full refresh to start, and the first Incremental + Dedup sync acts the same as a full refresh, why would I want to delete the data in the destination and have to re-insert all of it when changing from full refresh to Incremental + Dedup?
Refreshes are recommended when switching sync modes because the incremental method requires a cursor to track changes. When starting on Full Refresh, that cursor is not tracked or stored, so an initial refresh is required.
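To make the cursor point concrete, here is a conceptual sketch (not Airbyte’s actual implementation): an incremental read is bounded by the last saved cursor value, and a stream that has only ever run as full refresh has no saved cursor, so it has to be re-read once. The `updated_at` column is a hypothetical cursor field, not something Airbyte requires by that name.

```python
from datetime import datetime

# Conceptual sketch only: "updated_at" is a hypothetical cursor column.
def build_read_query(table: str, cursor_value: datetime | None) -> str:
    """Incremental reads only fetch rows past the saved cursor; with no cursor
    (e.g. the stream previously ran as full refresh) everything is re-read."""
    if cursor_value is None:
        return f"SELECT * FROM {table}"
    return (
        f"SELECT * FROM {table} "
        f"WHERE updated_at > '{cursor_value.isoformat()}' "
        "ORDER BY updated_at"
    )

print(build_read_query("orders", None))                  # initial backfill
print(build_read_query("orders", datetime(2024, 7, 1)))  # subsequent incremental run
```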
Interesting. So what happens if Incremental + Dedup has some issue and you need to execute a full refresh to resolve it? Do you execute the full refresh, then change back to Incremental + Dedup and delete the data downstream, having to load the data twice?
There is a separate process, independent of sync mode, called refreshes (https://docs.airbyte.com/operator-guides/refreshes). In those scenarios, keep the sync mode as incremental and just trigger a stream-level or connection-level refresh.
I see the option in Settings, but there’s a message saying it’s not available for my connector. Is this not available in the open-source self-hosted version?
Ah okay, my version is 0.63.4.
Ah yeah, you’ll need to upgrade to get refreshes. Otherwise you can still run a clear, which is nearly the same process except that data is first fully removed from the destination before backfilling.
Clear removes everything from all streams on the connection though, right? It would be nice to be able to target a single stream. If I set up a stream on a separate connection as Incremental + Dedup to target only that stream for deleting the downstream data and backfilling, can I then add it back to another connection I have set on a daily cron?
You can clear specific streams on the Status tab of a connection. There will be a button for each stream with a Clear data option.