Failing to ingest a "big" MySQL table

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu
  • Memory / Disk: 64 GB
  • Deployment: Docker
  • Airbyte Version: 0.39.10
  • Source name/version: MySQL 0.5.11
  • Destination name/version: Snowflake (0.4.28)
  • Step: Sync
  • Description: MySQL to Snowflake unable to ingest large table, the first sync for a CDC incremental + dedupe sync

NOTE: This was previously working, but then the Airbyte instance died and lost its binlog position, which meant it needed to perform a full-refresh sync. I was lucky to get the full-refresh sync to work a month ago, and since then it had been working fine incrementally. I just cannot get past this first initial sync! Can you please review the logs and let me know what's going wrong?

57.77 GB | 127,417,153 emitted records | no records | 5h 0m 42s

mylogs-111.txt (62.1 KB)

FYI, not a timeout issue AFAIK.

MySQL has its wait_timeout variable default value set to 28800 seconds (8 hours).

Therefore, if both sides of the connection keep the defaults, the problem should never occur, since MySQL will not time out the connection before Airbyte does.
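The arithmetic bears this out (a minimal sketch; the 5h 0m 42s duration is taken from the sync summary above):

```python
# Compare the observed sync duration against MySQL's default wait_timeout
# to rule out an idle-connection timeout (values taken from this thread).
wait_timeout_s = 8 * 3600                  # MySQL default wait_timeout: 28800 s
sync_duration_s = 5 * 3600 + 0 * 60 + 42   # 5h 0m 42s from the sync summary

# The sync ended well inside the timeout window.
print(sync_duration_s < wait_timeout_s)  # → True
```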

Hey @danieldiamond,
I’m wondering if the root cause could be on the destination side.
Do you see part of the data being written to Snowflake? Which loading method did you choose? We recommend using internal staging.
Moreover, I did not quite understand whether you are attempting an initial load again after your Airbyte instance died, or whether this problem happens for your incremental loads?

This is an issue with the initial load after the instance died. There are no issues with incremental loads.

We use AWS S3 staging for loading. Isn't that more efficient than internal staging for large volumes of data?

What is the issue from the logs? I thought it was an issue with Debezium or the source, and not necessarily on the Snowflake side.

What is the issue from the logs?

I can’t tell; there are no errors displayed. It looks like the destination gets cancelled: `2022-06-07 01:22:50 INFO i.a.w.g.DefaultReplicationWorker(cancel):441 - Cancelling destination...`

Isn’t that more efficient for loading large pieces of data instead of internal?

The recommended way is to use internal staging because it leverages Snowflake’s own library to perform the staging, which can be more efficient.

Could you please check the following:

  • Do you have partial writes to your Snowflake destination in the raw tables?
  • What is the memory consumption of your source and destination containers while the sync runs?
  • Try using internal staging to see if it changes anything.
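For the memory question above, one quick way to check is a one-shot `docker stats` snapshot while the sync is running (a diagnostic sketch; adjust for your own container names):

```shell
# Snapshot memory usage of all running containers once (no live stream),
# showing each container's name, memory usage, and memory percentage.
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
```

Running this a few times during the sync shows whether the source or destination container is approaching its memory limit before the cancellation.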

Do you mind sharing your full logs? I think you truncated them; maybe you missed an error :thinking:?