MySQL source connector performance

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: GCP VM Debian 11
  • Memory / Disk: 32GB / 500 GB
  • Deployment: Docker
  • Airbyte Version: 0.38.2-alpha
  • Source name/version: MySQL 0.5.11
  • Destination name/version: BigQuery (denormalized typed struct) 0.3.5
  • Step: during sync
  • Description:

Hi all, I have a question about data extraction performance.
We are testing the MySQL connector to extract data from AWS Aurora MySQL reader replica instance and load it into BigQuery using GCS staging. We are connecting to the database from a single VM on GCP using a SSH tunnel in AWS.
Currently the average speed of extracting data is about 1GB per 15 minutes (or, in our case, 1M rows per 20 minutes) using Full Refresh - Overwrite. It took about 9,5 hours to sync ~26M rows (~40GB). BigQuery loads are performed very fast so the majority of time is spent on pulling data from MySQL.
I don’t have any other benchmarks for this kind of sync so would like to ask:

  • Is this sync speed within expected ranges? Or looks slow?
  • If it looks slow, are there any ways to improve the performance?

I assume I could improve the performance by splitting MySQL connectors into multiple ones and run them in parallel, but I would like to understand first if there is a way to improve loading speed on a single connector.
Thanks!

1 Like

Hey @stereowind,
JDBC connectors are fetching records in batches. The size of these batches depends on the fetchSize parameters we use in query made to the source database.
We recently shipped a feature that performs a dynamic computation of the fetchSize , and your connector version has this feature. Could you please share logs of your syncs so that we can check the value of this fetchSize?

  • Is this sync speed within expected ranges? Or looks slow?

It does not look too slow from my perspective but there is definitely room for improvement on the Airbyte side to make this replication faster. We are currently focusing a lot of effort on this topic.

  • If it looks slow, are there any ways to improve the performance?

It’d be great to identify the bottleneck in your context. I’m guessing that it might be the fetchSize, but it could also be the sizing of your database replica or the available resource on your host. Are you able to evaluate the load on Aurora when the sync runs, and check the Airbyte containers’ resource consumption when the sync happens?

I could improve the performance by splitting MySQL connectors into multiple ones and run them in parallel,

This is indeed a workaround to make the parallel replication happen, but we know that it’s not an ideal solution.

@stereowind, we are still working on improving the performance of our database connectors. This issue is tracked here: Investigate the performance bottleneck of source database connectors · Issue #12532 · airbytehq/airbyte · GitHub

Based on our internal benchmarks for the MySQL connector, “1M rows per 20 minutes” is in the normal range of the current MySQL connector. There are definitely rooms for improvement in the future.

Do you have an ideal throughput in mind by the way?

@alafanechere, I did some experiments on the fetchSize (a very small value versus a very large value), and did not see significant difference in the performance on the datasets I tested. My tests were not comprehensive though. So at this moment I would say that there was not enough evidence to conclude that this fetchSize change could boost performance. It was mainly to help prevent the out-of-memory issue.

Hi @alafanechere and @tuliren, thanks for your responses!

I performed some additional tests and would like to share the results first.
Important correction: the amount of rows is actually ~21M, and the full size is around ~42GB. I probably wrote “~26M rows” by mistake.

Test #1

I tried to launch Airbyte on AWS EC2 instead of GCP VM to have Airbyte in the same VPC as our test database and connect without any tunnels. I used the same VM configuration (8 vCPU with total 16 threads, 32GB RAM, 30GB+ SSD) and tried to sync the same data (one MySQL table with ~21M rows (~42GB)) using MySQL connector v0.5.11.
The result didn’t show any significant difference. I did 1 run on AWS and 2 runs on GCP, and in fact one of the runs on GCP VM was faster, however the other one was significantly slower. I think that is probably because of the other testing load on the database that might have been run in parallel.

Airbyte AWS EC2 (full refresh - overwrite):

  1. 8h 23m 8s

Airbyte GCP Compute Engine (full refresh - overwrite):

  1. 12h 53m 6s
  2. 7h 15m 42s

Conclusion: I don’t think our network conditions impact data sync speed.

Test #2

I wanted to make sure that recently introduced dynamic fetchSize boosts performance and/or has any impact on the data sync. I ran the same sync above with two different versions of MySQL connector – v0.5.11 with dynamic fetchSize, and v0.5.9 with default fetchSize (1000?).
My test result showed that there is no noticeable impact on the connector performance – probably the same result as @tuliren mentioned.

Airbyte AWS EC2 (full refresh - overwrite):

  1. 8h 23m 8s (v0.5.11)
  2. 8h 22m 52s (v0.5.9)

Conclusion: I didn’t notice any impact of dynamic fetchSize on connector performance. Also, according to sync logs, it looks like dynamic fetchSize has no effect on actual fetch size and the connector keeps syncing data in 1000 rows batches.

What we decided to do

For now we decided to continue using our current sync method until the connector speed improves. We use embulk to sync data from Aurora MySQL to BigQuery.
I’d like to mention here that syncing the same table using embulk takes about 23 minutes, however we are only syncing certain columns (for example, omitting a couple of columns with long texts) versus Airbyte’s approach to sync all data with all columns.
Even then, the resulting dataset is around ~32GB (vs ~42GB full table). I’m yet to test syncing the whole table using embulk, but seeing how fast embulk extract and loads ~3/4th of the same data I would expect Airbyte to sync the full table in 30~40 minutes instead of 8 hours.
I know that embulk uses MySQL Connector/J JDBC Driver mysql-connector-java-5.1.44 and uses the default fetchSize=10000. Also according to the sync log, embulk used 8 local threads to fetch the data. Maybe this info can be of any help. Attaching the log of embulk sync job for reference (sensitive info redacted).
embulk_job.log (18.1 KB)


As for your questions:

Could you please share logs of your syncs so that we can check the value of this fetchSize ?

Partial log attached – the size exceeded maximum upload file size here (8MB) so I removed parts of the logs that are repeating a lot. I also redacted the log in many parts to remove sensitive information. And as mentioned above, the fetchSize doesn’t seem to affect the pipeline in any way.
airbyte-job-sync.txt (455.9 KB)

it could also be the sizing of your database replica or the available resource on your host. Are you able to evaluate the load on Aurora when the sync runs, and check the Airbyte containers’ resource consumption when the sync happens?

In my tests above both Airbyte and embulk had plenty of resources (at peak load CPU usage < 75%, RAM usage ~30%, insignificant disk space usage). Database replica was the same in all tests. I’m sure there is no problem with resource consumption.

This issue is tracked here

Thanks @tuliren ! Looking forward to future updates.

Do you have an ideal throughput in mind by the way?

At least the same speed that we are currently achieving with embulk. According to my tests it’s more than 15 times faster, so I would expect Airbyte to sync the same table (full size ~21M (~42GB)) in 1 hour at max, but preferably 30-40 minutes.
As as side note, looking forward to the functionality to choose what columns to sync for each table! That’s a critical feature for us, and would also immensely help with testing.

Thank you so much @stereowind for this thorough testing and insights about embulk! This is definitely helpful for future improvements on this connector.

As as side note, looking forward to the functionality to choose what columns to sync for each table! That’s a critical feature for us, and would also immensely help with testing.

Our team is currently working on this, feel free to subscribe to this issue to track progress on this topic.

1 Like