Failing to ingest a "big" MySQL table (38 GB)

  • Is this your first time deploying Airbyte?: No
  • Memory / Disk / CPU: 16 GB RAM / 25 GB disk / 6 vCPU
  • Deployment: Docker on GCP VM
  • Airbyte Version: 0.38.4-alpha
  • Source name/version: MySQL 0.5.11
  • Destination name/version: BigQuery 1.1.6
  • Step: The issue is happening during sync
  • Description: Sync of a single large MySQL table (150 million rows, ~38 GB) to BigQuery fails partway through.

Hello everyone,

So I’m trying to ingest just one table of a database, the biggest one, with 150 million rows and a size of 38 GB.

The sync goes well for about an hour and ingests around 25 million records, but then it fails with the following error:

2022-05-22 15:53:35 ERROR i.a.w.DefaultReplicationWorker(run):174 - Sync worker failed.
java.util.concurrent.ExecutionException: io.airbyte.workers.DefaultReplicationWorker$SourceException: Source cannot be stopped!

Do you have any advice for handling this amount of data?

This is not the first time I’ve struggled to ingest multi-GB Postgres and MySQL tables, and I wonder whether Airbyte is really able to load ‘big’ tables (and I’m not talking about Big Data) or whether my setup is simply not suitable for this.

Thanks a lot for your help!

logs-4987.txt (2.2 MB)

Sorry to hear that, Mehdi. I saw you’re using CDC to sync data. As a test, would it be possible to disable CDC and use the standard method with a full refresh? This would help us understand whether the problem is with CDC itself or with the table size.

Hi Marco, thanks for answering.

I’ve switched to the standard method, erased the target dataset, and launched a new sync.

But the job only lasted 43 seconds before failing with the “Source cannot be stopped” error…
logs-5139.txt (74.1 KB)

Update: I’ve created a new connection. So far it’s still running after 40 million rows ingested. Fingers crossed.

The job failed again after 3h49, with almost 68 million rows ingested.

Here are the logs (only the last 20k lines) for investigation:
tail_20k_logs-5146.txt (2.9 MB)

Thanks

From your latest logs:

2022-05-25 13:00:38 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$6):331 - Records read: 67996000 (14 GB)
2022-05-25 13:00:38 source > Exception in thread "main" java.lang.RuntimeException: java.sql.SQLException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
2022-05-25 13:00:38 source > at io.airbyte.db.jdbc.StreamingJdbcDatabase$1.tryAdvance(StreamingJdbcDatabase.java:100)
2022-05-25 13:00:38 source > at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
2022-05-25 13:00:38 source > at io.airbyte.commons.util.DefaultAutoCloseableIterator.computeNext(DefaultAutoCloseableIterator.java:38)
2022-05-25 13:00:38 source > at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
2022-05-25 13:00:38 source > at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
2022-05-25 13:00:38 source > at io.airbyte.commons.util.LazyAutoCloseableIterator.computeNext(LazyAutoCloseableIterator.java:42)
2022-05-25 13:00:38 source > at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)

Looks like the connection timed out. Are you using MySQL on AWS Aurora? Can you check the timeout settings in your database configuration?
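
If it helps, here is a minimal sketch of how you could check the usual server-side timeouts over JDBC (connection details are placeholders, and it assumes the MySQL Connector/J driver is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckMySqlTimeouts {
  public static void main(String[] args) throws Exception {
    // Placeholders: point this at the instance Airbyte reads from.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://<host>:3306/<database>", "<user>", "<password>");
         Statement stmt = conn.createStatement();
         // wait_timeout and net_write_timeout are the usual suspects when a
         // long-running streaming read drops mid-sync.
         ResultSet rs = stmt.executeQuery(
             "SHOW VARIABLES WHERE Variable_name IN "
                 + "('wait_timeout', 'interactive_timeout', 'net_read_timeout', 'net_write_timeout')")) {
      while (rs.next()) {
        System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
      }
    }
  }
}

You can also run the same SHOW VARIABLES query directly from any MySQL client if that is easier.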

It is a MySQL instance on Google Cloud SQL, but it could indeed be a timeout issue. I’ll check and let you know.

I still don’t have news about the timeout. I’ve found out that the Airbyte job might have been blocked by the cloud team, as they were worried about the high volume being requested. I’ll let you know as soon as I get more information.

Ok Mehdi, let me know when you have any updates here.

We’re experiencing the same issue with a large table from MySQL to Snowflake. Any updates?

@marcosmarxm this is most likely a CDC issue. I don’t think any Airbyte user has ever been able to successfully sync a large table from MySQL using CDC. I know the issue of hanging workers was resolved previously, but now that is leading to the error above.

Airbyte Version: 0.39.10-alpha
Source: MySQL (0.5.11)
Destination: Snowflake (0.4.28)

57.77 GB | 127,417,153 emitted records | no records | 5h 0m 42s

Added my logs here: Failing to ingest a "big" MySQL table

Hello everyone. After talking with the team in charge of the databases, it seems there is no timeout set.

I tried a new ingestion a few days ago. It went further than the last run but failed again after 102 million rows (21 GB).

This time the error seems different:

2022-06-04 13:34:32 ERROR i.a.w.DefaultReplicationWorker(run):174 - Sync worker failed.

java.util.concurrent.ExecutionException: io.airbyte.workers.DefaultReplicationWorker$DestinationException: Destination process message delivery failed

Here are the last 1,000 lines of logs:

tail_1k_logs-5553.txt (136.1 KB)

Mehdi: your logs contain the error Suppressed: io.airbyte.workers.WorkerException: Source process exit with code 137. This warning is normal if the job was cancelled.
Exit code 137 means the source process was killed because your instance is hitting OOM; please give more memory/CPU to your Airbyte instance.
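
If you are on the Docker deployment, I believe (going from memory of the Airbyte scaling docs, so please double-check the exact variable names and whether they apply to your version) the job containers’ resources can also be raised via the .env file rather than only resizing the VM, for example:

JOB_MAIN_CONTAINER_MEMORY_REQUEST=2g
JOB_MAIN_CONTAINER_MEMORY_LIMIT=8g

Otherwise, scaling the VM’s RAM and watching dmesg for OOM kills during the sync is a reasonable first step.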

Hello Marcos.

My VM currently has 16 GB of RAM; should I scale it up to 32 GB?

And do you have any documentation or advice on correctly sizing the VM hosting Airbyte?

Could some of these initial snapshot issues be down to the values in use for the following Debezium config variables:

offset.flush.timeout.ms
offset.flush.interval.ms
max.request.size

max.queue.size
max.batch.size

I’m not sure how or where these are being set in Airbyte’s use of Debezium, but if default values are being used that might be the issue, as the defaults aren’t tuned for large snapshots in my understanding.

Ref: apache kafka - Debezium is failing to snapshot big table size - Stack Overflow

Yemi, here are the variables for Debezium:
https://github.com/airbytehq/airbyte/blob/bb469418679e1605689d994a57cf59b1095c677d/airbyte-integrations/bases/debezium/src/main/java/io/airbyte/integrations/debezium/internals/DebeziumRecordPublisher.java#L122-L158

Any comments or information to help us are welcome.

Caveat: I’m no Debezium expert. :slight_smile:

The Stack Overflow answer suggests:

offset.flush.timeout.ms=60000
offset.flush.interval.ms=10000
max.request.size=10485760

“Decreasing the offset flush interval allows Kafka connect to flush the offsets more frequently, and setting a large timeout gives it more time to get an acknowledgement for the commit.”

max.queue.size = 81290
max.batch.size = 20480

“The default size for queue is 8192, which is quite low for a larger database. Bumping these configurations up helped a lot.”

Another response suggests setting offset.flush.interval.ms greater than offset.flush.timeout.ms to prevent an “Invalid call to OffsetStorageWriter flush() while already flushing…” error.

I don’t have time to test this with Airbyte at the moment, but I would suggest that anyone suffering issues with large snapshots rebuild Airbyte with the above changes and test; a rough sketch of the change follows.
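
For illustration only, here is a rough, untested sketch of the overrides discussed above, written as a plain java.util.Properties helper (where and how these get wired into Airbyte’s DebeziumRecordPublisher is an assumption on my part, not something I’ve verified):

import java.util.Properties;

public class DebeziumSnapshotTuning {

  // Applies the Debezium/Kafka Connect overrides suggested in the Stack Overflow answer.
  // The keys are standard Kafka Connect / Debezium properties; the values are the ones
  // quoted above, not values I have benchmarked myself.
  public static Properties withLargeSnapshotOverrides(Properties props) {
    props.setProperty("offset.flush.timeout.ms", "60000");
    props.setProperty("offset.flush.interval.ms", "10000");
    props.setProperty("max.request.size", "10485760");
    // Default queue size (8192) is low for a 150M-row snapshot.
    props.setProperty("max.queue.size", "81290");
    props.setProperty("max.batch.size", "20480");
    // Note: one reply above suggests keeping offset.flush.interval.ms above
    // offset.flush.timeout.ms to avoid the "already flushing" error, which
    // conflicts with the first two values; worth testing both ways.
    return props;
  }
}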

Debezium also offers an incremental snapshot mode that lets you specify how many rows go into each chunk, which might be worth investigating: Debezium connector for MySQL :: Debezium Documentation.
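
For reference, if I’m reading the Debezium docs correctly, the chunk size for incremental snapshots is controlled by the connector property below (default 1024 rows), though incremental snapshots also need a signalling mechanism to be triggered, so it wouldn’t be a drop-in change for Airbyte:

incremental.snapshot.chunk.size = 10240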

(schedule meeting with Daniel next week to test changing values in Debezium parameters)

I published a custom MySQL source connector with the updated Debezium configuration.
It failed due to a normalization issue. Posting the issue here: MySQL to Snowflake CDC Sync Fails: Race condition, raw table not found but still being written to snowflake

Some interesting things to note:

  1. This normalisation issue occurred last week when I was trying to get CDC full refresh to work, so I don’t believe it’s related to the updated Debezium spec (but it could be).
  2. I’m seeing almost an hour’s delay between the source completing and the next step:
2022-06-14 03:12:10 source > 2022-06-14 03:12:10 INFO i.a.i.s.r.AbstractDbSource(lambda$read$2):132 - Closed database connection pool.
2022-06-14 03:12:10 source > 2022-06-14 03:12:10 INFO i.a.i.b.IntegrationRunner(runInternal):171 - Completed integration: io.airbyte.integrations.base.ssh.SshWrappedSource
2022-06-14 03:12:10 source > 2022-06-14 03:12:10 INFO i.a.i.s.m.MySqlSource(main):213 - completed source: class io.airbyte.integrations.source.mysql.MySqlSource
2022-06-14 04:22:10 source > 2022-06-14 04:22:10 ERROR i.a.i.b.IntegrationRunner(lambda$watchForOrphanThreads$8):266 - Failed to interrupt children non-daemon threads, forcefully exiting NOW...
  3. That active non-daemon thread error (which I know isn’t an issue) doesn’t show up with my other connectors.