Hi @alafanechere and @tuliren, thanks for your responses!
I performed some additional tests and would like to share the results first.
Important correction: the number of rows is actually ~21M, and the full size is around ~42GB. I previously wrote “~26M rows” by mistake.
I launched Airbyte on AWS EC2 instead of a GCP VM so that Airbyte would be in the same VPC as our test database and could connect without any tunnels. I used the same VM configuration (8 vCPUs with 16 threads in total, 32GB RAM, 30GB+ SSD) and synced the same data (one MySQL table with ~21M rows, ~42GB) using MySQL connector v0.5.11.
The results didn’t show any significant difference. I did 1 run on AWS and 2 runs on GCP; in fact, one of the GCP runs was faster, while the other was significantly slower, probably because of other test load that may have been running against the database in parallel.
Airbyte AWS EC2 (full refresh - overwrite):
- 8h 23m 8s
Airbyte GCP Compute Engine (full refresh - overwrite):
- 12h 53m 6s
- 7h 15m 42s
Conclusion: I don’t think our network conditions impact data sync speed.
I wanted to verify whether the recently introduced dynamic fetchSize boosts performance or has any other impact on the data sync. I ran the same sync as above with two different versions of the MySQL connector: v0.5.11 with dynamic fetchSize, and v0.5.9 with the default fetchSize (1000?).
My tests showed no noticeable impact on connector performance, probably the same result that @tuliren mentioned.
Airbyte AWS EC2 (full refresh - overwrite):
- 8h 23m 8s (v0.5.11)
- 8h 22m 52s (v0.5.9)
Conclusion: I didn’t notice any impact of dynamic fetchSize on connector performance. Also, according to the sync logs, dynamic fetchSize appears to have no effect on the actual fetch size: the connector keeps syncing data in 1000-row batches.
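For context on those 1000-row batches, here is a purely illustrative back-of-the-envelope calculation (my own numbers, not taken from the connector code, and assuming exactly one server round trip per fetch) of how many fetches each setting implies for this table:

```python
# Rough estimate of how many fetch round trips a full read of the
# ~21M-row table implies at different fetch sizes (illustrative only:
# assumes one server round trip per fetch).
ROWS = 21_000_000

def round_trips(fetch_size: int) -> int:
    """Ceiling division: number of fetches needed to stream all rows."""
    return -(-ROWS // fetch_size)

print(round_trips(1_000))   # the 1000-row batches seen in the Airbyte logs
print(round_trips(10_000))  # embulk's default fetch size
```

Of course, round-trip count alone can’t explain a ~15x gap, which is consistent with the v0.5.9 vs v0.5.11 timings above being nearly identical.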
What we decided to do
For now we decided to continue using our current sync method until the connector speed improves. We use embulk to sync data from Aurora MySQL to BigQuery.
I’d like to mention that syncing the same table with embulk takes about 23 minutes; however, we only sync certain columns (for example, we omit a couple of columns containing long text), whereas Airbyte syncs all data from all columns.
Even then, the resulting dataset is around ~32GB (vs the ~42GB full table). I have yet to test syncing the whole table with embulk, but seeing how fast embulk extracts and loads ~3/4 of the same data, I would expect Airbyte to be able to sync the full table in 30-40 minutes instead of 8 hours.
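The extrapolation above can be checked with quick arithmetic on the measured times (a linear, rough estimate using only the figures from my test runs):

```python
# Quick throughput comparison from the measured runs (linear extrapolation,
# so a rough estimate only).
MB_PER_GB = 1024

airbyte_minutes = 8 * 60 + 23   # 8h 23m for the full ~42GB table
embulk_minutes = 23             # 23m for the ~32GB column subset

airbyte_rate = 42 * MB_PER_GB / airbyte_minutes  # MB/min
embulk_rate = 32 * MB_PER_GB / embulk_minutes    # MB/min

# Extrapolate embulk to the full 42GB at the same throughput:
full_table_estimate = 42 * MB_PER_GB / embulk_rate

print(f"embulk is ~{embulk_rate / airbyte_rate:.1f}x faster")
print(f"estimated full-table embulk sync: ~{full_table_estimate:.0f} min")
```

This is where the “more than 15 times faster” and “30-40 minutes” figures come from.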
I know that embulk uses the MySQL Connector/J JDBC driver (mysql-connector-java-5.1.44) with the default fetchSize=10000. Also, according to the sync log, embulk used 8 local threads to fetch the data. Maybe this info will be of some help. I’m attaching the log of the embulk sync job for reference (sensitive info redacted).
embulk_job.log (18.1 KB)
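Since embulk’s 8 local threads seem relevant, here is a minimal sketch of the general range-partitioning technique that parallel extractors of this kind use. Everything here is hypothetical illustration: `fetch_range` is a stand-in for a real per-range SELECT, and the id bounds are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(min_id: int, max_id: int, n_chunks: int):
    """Split [min_id, max_id] into n_chunks contiguous, non-overlapping ranges."""
    step = -(-(max_id - min_id + 1) // n_chunks)  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]

def fetch_range(bounds):
    lo, hi = bounds
    # A real extractor would run something like:
    #   SELECT ... FROM table WHERE id BETWEEN lo AND hi
    # Here we just return the chunk's row count.
    return hi - lo + 1

ranges = partition(1, 21_000_000, 8)  # invented id bounds, 8 worker threads
with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(fetch_range, ranges))

print(len(ranges), total)  # 8 chunks covering all ~21M ids
```

Each thread streams its own id range independently, which is presumably why embulk saturates the link far better than a single sequential cursor.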
As for your questions:
Could you please share logs of your syncs so that we can check the value of this
Partial log attached: the full log exceeded the maximum upload file size here (8MB), so I removed parts that repeat a lot and redacted sensitive information in many places. As mentioned above, fetchSize doesn’t seem to affect the pipeline in any way.
airbyte-job-sync.txt (455.9 KB)
it could also be the sizing of your database replica or the available resource on your host. Are you able to evaluate the load on Aurora when the sync runs, and check the Airbyte containers’ resource consumption when the sync happens?
In my tests above, both Airbyte and embulk had plenty of resources (at peak load: CPU usage < 75%, RAM usage ~30%, insignificant disk space usage). The database replica was the same in all tests. I’m confident resource consumption is not the problem.
This issue is tracked here
Thanks @tuliren! Looking forward to future updates.
Do you have an ideal throughput in mind by the way?
At least the same speed we are currently achieving with embulk. According to my tests, embulk is more than 15 times faster, so I would expect Airbyte to sync the same table (~21M rows, ~42GB) in 1 hour at most, but preferably in 30-40 minutes.
As a side note, I’m looking forward to the functionality to choose which columns to sync for each table! That’s a critical feature for us and would also immensely help with testing.