- Is this your first time deploying Airbyte?: No
- OS Version / Instance: Amazon Linux 2
- Memory / Disk: 16 GB / 50 GB
- Deployment: Docker on EC2 (c6a.2xlarge)
- Airbyte Version: 0.40.31
- Source name/version: MySQL / 1.0.21
- Destination name/version: Snowflake / 0.4.47
- Step: Sync
Sorry if this has been asked and answered elsewhere.
I am looking for some advice on how to sync a 10B+ row table from MySQL to Snowflake using CDC. We have successfully synced 1B rows using CDC from MySQL, but the initial load took ~24 hours. We only have 3 days of binlogs available and we can't increase the retention, so a 10B-row sync won't complete before we run out of binlogs to continue the CDC incremental load.
The only strategy I can think of is to get Airbyte to skip the initial load of the table, and only retrieve CDC changes. This would be similar to “New Changes Only” which can be set on the MSSQL connector: Microsoft SQL Server (MSSQL) | Airbyte Documentation. If this is possible then I can split the table into multiple views and do a full-load of each view and join everything together in Snowflake.
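The view-splitting idea above is easy to script. A minimal sketch, assuming an integer primary key and placeholder table/column names (the generated DDL would be run against MySQL, and each resulting view configured as its own stream):

```python
def chunk_view_ddl(table: str, pk: str, max_id: int, n_chunks: int) -> list[str]:
    """Generate CREATE VIEW statements that slice `table` into contiguous
    primary-key ranges, so each view can be full-loaded independently
    and the pieces unioned back together in Snowflake afterwards.
    Table and column names here are illustrative placeholders."""
    step = -(-max_id // n_chunks)  # ceiling division
    ddl = []
    for i in range(n_chunks):
        lo, hi = i * step, min((i + 1) * step, max_id)
        ddl.append(
            f"CREATE VIEW {table}_part{i} AS "
            f"SELECT * FROM {table} WHERE {pk} > {lo} AND {pk} <= {hi};"
        )
    return ddl
```

Loading ranges through views keeps each stream's snapshot small enough to finish well inside the retention window, at the cost of having to reassemble the table downstream.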
I requested this on GitHub late last year after raising the issue on Slack: Source MySQL: Add "Data to Sync" CDC Parameter · Issue #20112 · airbytehq/airbyte · GitHub
I’m just wondering if there are any other approaches that anyone has tried to sync such large tables. I was also thinking I could set up a CDC connection, abort the initial load, and edit the state so that it only captures changes going forward?
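If the state-editing route is viable, it might look roughly like the sketch below. This is purely an assumption-laden illustration: the endpoint path, the `stateType`/`globalState` payload shape, and especially the `mysql_cdc_offset` keys mirror what I have seen when inspecting an existing connection's state, but the Debezium offset format is internal and version-specific, so anyone trying this should dump a real connection's state first and copy its structure exactly.

```python
import json
import urllib.request

# Assumed local Airbyte OSS API base URL -- adjust for your deployment.
AIRBYTE_API = "http://localhost:8000/api/v1"

def build_cdc_only_state(connection_id: str, binlog_file: str, binlog_pos: int) -> dict:
    """Build a hypothetical state payload pointing the CDC cursor at a
    known binlog position (e.g. from SHOW MASTER STATUS), so the next
    sync would skip the initial snapshot and replay changes only.
    WARNING: the real state schema is internal to Airbyte/Debezium and
    version-specific -- inspect an existing connection's state first."""
    return {
        "connectionId": connection_id,
        "connectionState": {
            "stateType": "global",
            "globalState": {
                "sharedState": {
                    # Debezium-style offset; exact keys are an assumption
                    "mysql_cdc_offset": {"file": binlog_file, "pos": binlog_pos},
                },
                "streamStates": [],
            },
        },
    }

def post_state(payload: dict) -> None:
    """POST the state to the (assumed) create_or_update endpoint --
    verify the path against your Airbyte version's API docs."""
    req = urllib.request.Request(
        f"{AIRBYTE_API}/state/create_or_update",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

The fragile part is the offset payload: if Debezium rejects or silently ignores a hand-written offset, the connector may fall back to a fresh snapshot, so this would need careful testing on a throwaway connection first.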
I would be happy to help with the implementation if I can, although my Java skills are fairly non-existent.
Thanks for your help!