MySQL to Snowflake incremental loading fails

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: EC2 Linux AMI 2
  • Memory / Disk: M6a.xlarge instance - 4 vCPUs, 16 GiB memory, 60 Gb storage
  • Deployment: Docker
  • Airbyte Version: 0.40.9 (latest as of writing)
  • Source name/version: MySQL 0.6.14 (latest as of writing) connected through SSH tunnel
  • Destination name/version: Snowflake 0.4.38 (latest as of writing) internal staging
  • Step: During incremental sync
  • Description:

Hi.

I am trying to sync moodle data sitting in Amazon Aurora to Snowflake. I was able to do so successfully in a previous Airbyte version 0.39.19-alpha.

Syncing 7.21Gb of data (about 40 tables) is successful on the initial run.
All subsequent syncs are failing. See image below.

Here is the log for the successful initial sync:
success-logs.txt (7.0 MB)
Errors/warnings in the log that may be of interest:

  • JSON schema validation failed
  • Signalling close because record’s binlog file : mysql-bin-changelog.002503 , position : 75169383 is after target file : mysql-bin-changelog.002503 , target position : 75125447
  • The main thread is exiting while children non-daemon threads from a connector are still active. Ideally, this should not happen

Here is the log for the first failed sync:
failure-logs.txt (5.3 MB)
Errors/warnings in the log that may be of interest:

  • 2022-09-28 19:47:53 - Additional Failure Information: ScheduleActivityTaskCommandAttributes.Input exceeds size limit.
  • Repeat of the errors/warnings found in successful.logs

Even with the warnings, the initial sync is successful. However, subsequent syncs are failing. I believe there is enough CPU and RAM for this sync. The error:

ScheduleActivityTaskCommandAttributes.Input exceeds size limit.

seems to be related to temporal, but it makes no sense that the initial sync ran without an issue.

Please could someone assist me on this issue. Thank you.

Hi @jamo, looks like there are type errors in your sync. Let me look into this more and I’ll have some more ideas for you tomorrow!

@natalyjazzviolin Thank you. I just wonder why it succeeds in the initial run.

Not sure if this helps but, I’ve also run another test with 9 tables that are a few Mb in size. The incremental loading syncs take much longer that the initial, historical load, which I find odd.


Though it seems the smaller tables are succeeding.

OK a few questions -

  1. Does your cursor field have trailing zeroes?
  2. Do you have normalization enabled?

@natalyjazzviolin

  1. My cursor field should be the timemodified column. It is a timestamp and if you mean trailing zeroes for example 1647509600 then yes, fields may have trailing zeroes
  2. Correct. I’ve selected the “Normalized tabular data”, not the “Raw data (JSON)”

@natalyjazzviolin

I am not 100% sure though if the cursor field is timemodified as I am using CDC and don’t have the option of selecting the cursor field.

Interestingly, this link describes that the source does not necessarily need a suitable cursor field.

“On the other hand, CDC incremental replication reads a log of the changes that have been made to the source database and transmits these changes to the destination. Because changes are read from a transaction log when using CDC, it is not necessary for the source data to have a suitable cursor field.”

I’ve also tested the same connection on a bit older Airbyte 0.39.19-alpha instance with source MySql 0.5.11, destination Snowflake 0.4.28, and the incremental loading works.

It is really odd that you’re having this issue with only 7Gb of data. Let me get some input from my team and I’ll get back to you shortly!

Could you update to Airbyte 0.40.10? I’m getting input from my team, and we’re thinking updating could possibly fix the problem.

@natalyjazzviolin
Hope you had a great weekend.

The incremental syncs are still failing after upgrading to Airbyte 0.40.10. Connectors are the latest versions.
The initial historical sync also succeeded after failing once.

Here is a screenshot of the UI:

Here is the historical sync log that failed: (/tmp/workspace/46/0)
historical-0-fail.log (4.1 MB)

Here is the historical sync log that succeeded: (/tmp/workspace/46/1)
historical-1-success.log (7.6 MB)
Note: I’ve removed some “Records read” logs to reduce size.

Here is the incremental sync log that fails:
failure-logs.log (5.0 MB)

Could you share your schema please?

You should be able to see the whole Moodle schema here:
https://www.examulator.com/er/3.11/index.html
or:
https://docs.moodle.org/dev/Database_Schema

@natalyjazzviolin It also appears on the historical success logs in this format:

For table ‘uctohsDB.mdl_glossary’ using select statement: ‘SELECT’ id, course, name, intro

Hope this is what you’re looking for.

After discussing this with the team, I’ve escalated this to GitHub. You can follow the issue here:
https://github.com/airbytehq/airbyte/issues/17512

This looks like it is failing because of the StandardSyncInput which contains the full catalog and the state. It is hard to estimate what will be the size for this specific connection. The state is only added on the second sync, which is why the initial sync works.

Could you try a few things? These could be fixes in the short term, but we will work on making a longterm fix.

  1. Split the tables into 2 or more connections so as not to hit the maximum message size.
  2. Tweak the dynamic.config and build temporal with a bigger BlobSize

Let me know if this helps!

@natalyjazzviolin

I have about 65 tables of interest that are about 21 Gb in size.
To split the tables into 2 or more connections, I have selected only a few tables that add up to 1.74 Gb, which I believe is small, yet the incremental sync keeps failing.

This error came up in Cloud as well, and a fix has been merged:
https://github.com/airbytehq/airbyte/pull/17538

Please update Airbyte and let me know if the incremental sync works!

Hi @natalyjazzviolin

I have a critical critical issue. The connections are passing but there is a huge data integrity issue.
The counts of the records are significantly off. Please could you and the team look into this urgently.

And thank you for all the help thus far.

Hi! Please make a GitHub issue with all the details and we’ll be able to look into it!

hey @jamo there is a workaround for the issue Sync hangs with error: ScheduleActivityTaskCommandAttributes.Input exceeds size limit. · Issue #16236 · airbytehq/airbyte · GitHub

Thanks @alexnikitchuk I’ve updated to the latest version and I no longer get that error

@natalyjazzviolin I can’t seem to replicate the previous issue, but after updating to the latest version the incremental syncs still aren’t fully reliable

The syncs work at times

And fails/retries at times even though it looks like it is successful

Logs:
fail-logs.log (7.8 MB)

I’ve found an issue for the latest error you’re encountering. Looks like it happens sporadically just like in your case, and the only current fix is to reset the data and start a new sync, unfortunately.
https://github.com/airbytehq/airbyte/issues/17372

My suggestion would be to update the MySQL source to 1 or higher! As the user in the issue is also on a 6.x version.

We stick to one issue per forum post for documentation purposes. If you’d like to discuss this further, please start a new thread!