Syncing huge tables (10B+ rows) from MySQL

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Amazon Linux 2
  • Memory / Disk: 16 GB / 50GB
  • Deployment: Docker on EC2 (c6a.2xlarge)
  • Airbyte Version: 0.40.31
  • Source name/version: MySQL / 1.0.21
  • Destination name/version: Snowflake / 0.4.47
  • Step: Sync
  • Description:

Hello Everyone,

Sorry if this has been asked and answered elsewhere.

I am looking for some advice on how to sync a 10B+ row table from MySQL to Snowflake using CDC. We have successfully synced 1B rows using CDC from MySQL but it took 24~ hours for the initial load. We only have 3 days of binlogs available and we can’t increase the retention. This means that a 10B row sync won’t complete before we run out of binlogs to continue the CDC incremental load.

The only strategy I can think of is to get Airbyte to skip the initial load of the table, and only retrieve CDC changes. This would be similar to “New Changes Only” which can be set on the MSSQL connector: Microsoft SQL Server (MSSQL) | Airbyte Documentation. If this is possible then I can split the table into multiple views and do a full-load of each view and join everything together in Snowflake.

I requested this on GitHub late last year after raising the issue on Slack: Source MySQL: Add "Data to Sync" CDC Parameter · Issue #20112 · airbytehq/airbyte · GitHub

I’m just wondering if there are any other approaches that anyone has tried to sync such large tables? I was also thinking I could setup a CDC connection, abort the initial load, and edit the state so that it will only capture the changes?

I would be happy to help with implementation if I can, although my Java skills are fairly non-existent :slight_smile:

Thanks for your help!

Hey, @adammacleod! I’d say that sounds like a great approach but these types of issues go into the backlog and are not prioritized. We are prioritizing speed and reliability improvements this quarter though :slight_smile: I think the best thing for you to do is post in Slack or directly message Slack community members who have posted about this kind of thing before. It’s much more active over there

Great, thanks for the feedback and info @natalyjazzviolin! I will give that a go :slight_smile: