- Is this your first time deploying Airbyte?: No
- OS Version / Instance: Amazon Linux 2
- Memory / Disk: 16 GB / 50 GB
- Deployment: Docker on EC2 (c6a.2xlarge)
- Airbyte Version: 0.40.31
- Source name/version: MySQL / 1.0.21
- Destination name/version: Snowflake / 0.4.47
- Step: Sync
Sorry if this has been asked and answered elsewhere.
I am looking for some advice on how to sync a 10B+ row table from MySQL to Snowflake using CDC. We have successfully synced 1B rows using CDC from MySQL, but the initial load took ~24 hours. We only have 3 days of binlogs available and we can't increase the retention, so a 10B-row sync won't complete before we run out of binlogs to continue the CDC incremental load.
The only strategy I can think of is to get Airbyte to skip the initial load of the table, and only retrieve CDC changes. This would be similar to “New Changes Only” which can be set on the MSSQL connector: Microsoft SQL Server (MSSQL) | Airbyte Documentation. If this is possible then I can split the table into multiple views and do a full-load of each view and join everything together in Snowflake.
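The view-splitting idea above is easy to script. A minimal sketch, assuming an integer primary key and placeholder table/column names (the generated DDL would be run against MySQL, and each resulting view configured as its own stream):

```python
def chunk_view_ddl(table: str, pk: str, max_id: int, n_chunks: int) -> list[str]:
    """Generate CREATE VIEW statements that slice `table` into contiguous
    primary-key ranges, so each view can be full-loaded independently
    and the pieces unioned back together in Snowflake afterwards.
    Table and column names here are illustrative placeholders."""
    step = -(-max_id // n_chunks)  # ceiling division
    ddl = []
    for i in range(n_chunks):
        lo, hi = i * step, min((i + 1) * step, max_id)
        ddl.append(
            f"CREATE VIEW {table}_part{i} AS "
            f"SELECT * FROM {table} WHERE {pk} > {lo} AND {pk} <= {hi};"
        )
    return ddl
```

Loading ranges through views keeps each stream's snapshot small enough to finish well inside the retention window, at the cost of having to reassemble the table downstream.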
I requested this on GitHub late last year after raising the issue on Slack: Source MySQL: Add "Data to Sync" CDC Parameter · Issue #20112 · airbytehq/airbyte · GitHub
I’m just wondering if there are any other approaches that anyone has tried to sync such large tables. I was also thinking I could set up a CDC connection, abort the initial load, and edit the state so that it only captures changes going forward?
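If the state-editing route is viable, it might look roughly like the sketch below. This is purely an assumption-laden illustration: the endpoint path, the `stateType`/`globalState` payload shape, and especially the `mysql_cdc_offset` keys mirror what I have seen when inspecting an existing connection's state, but the Debezium offset format is internal and version-specific, so anyone trying this should dump a real connection's state first and copy its structure exactly.

```python
import json
import urllib.request

# Assumed local Airbyte OSS API base URL -- adjust for your deployment.
AIRBYTE_API = "http://localhost:8000/api/v1"

def build_cdc_only_state(connection_id: str, binlog_file: str, binlog_pos: int) -> dict:
    """Build a hypothetical state payload pointing the CDC cursor at a
    known binlog position (e.g. from SHOW MASTER STATUS), so the next
    sync would skip the initial snapshot and replay changes only.
    WARNING: the real state schema is internal to Airbyte/Debezium and
    version-specific -- inspect an existing connection's state first."""
    return {
        "connectionId": connection_id,
        "connectionState": {
            "stateType": "global",
            "globalState": {
                "sharedState": {
                    # Debezium-style offset; exact keys are an assumption
                    "mysql_cdc_offset": {"file": binlog_file, "pos": binlog_pos},
                },
                "streamStates": [],
            },
        },
    }

def post_state(payload: dict) -> None:
    """POST the state to the (assumed) create_or_update endpoint --
    verify the path against your Airbyte version's API docs."""
    req = urllib.request.Request(
        f"{AIRBYTE_API}/state/create_or_update",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

The fragile part is the offset payload: if Debezium rejects or silently ignores a hand-written offset, the connector may fall back to a fresh snapshot, so this would need careful testing on a throwaway connection first.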
I would be happy to help with the implementation if I can, although my Java skills are fairly non-existent.
Thanks for your help!