Sync performance

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: AWS EC2
  • Deployment: Docker
  • Airbyte Version: 0.39.10-alpha
  • Source name/version: Snowflake 0.1.12
  • Destination name/version: Snowflake 0.4.28
  • Step: sync

Hi all,
When benchmarking sync speed, I found that the average throughput is around 1.5-2 MB/s, which is noticeably lower than the network throughput, so I figured I could increase the number of worker instances to improve the speed.

Is that somehow possible?

Thank you

Hi @mike,
Changing the number of worker instances won't immediately affect the sync speed.
If you have multiple tables, you can try to parallelize the sync jobs on your Snowflake source by creating one connection per stream. This will let multiple tables sync at the same time (a rough scripting sketch follows below).
It would help if you could share a full sync log so we can try to identify which process is the bottleneck for the current throughput. Do you have another throughput metric to share that we could use as a performance benchmark?
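For illustration, here is a rough sketch of how the one-connection-per-stream setup could be scripted against the Airbyte Configuration API. The endpoint paths and payload fields are assumptions based on the OSS Configuration API, so please verify them against the API reference for your Airbyte version before using it:

```python
# Hypothetical sketch: create one Airbyte connection per stream so tables sync in parallel.
# Endpoint paths and payload fields are assumptions -- check them against the
# Configuration API docs for your Airbyte version.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"   # assumed local Docker deployment
SOURCE_ID = "<snowflake-source-id>"            # placeholders to fill in
DESTINATION_ID = "<snowflake-destination-id>"

# Ask the source to discover its catalog (the list of available streams/tables).
resp = requests.post(f"{AIRBYTE_API}/sources/discover_schema", json={"sourceId": SOURCE_ID})
resp.raise_for_status()
streams = resp.json()["catalog"]["streams"]

# Create one connection per stream, each selecting only that single table.
for entry in streams:
    entry["config"]["selected"] = True
    payload = {
        "name": f"snowflake-{entry['stream']['name']}",
        "sourceId": SOURCE_ID,
        "destinationId": DESTINATION_ID,
        "syncCatalog": {"streams": [entry]},
        "status": "active",
    }
    requests.post(f"{AIRBYTE_API}/connections/create", json=payload).raise_for_status()
```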

Hey @alafanechere,

Good idea on splitting the job into several connections, but I am benchmarking the solution with just one big table (20 million rows), as I described here.
You can see the numbers there as well, but you will find them quite scattered depending on which figure you take as the baseline.
What’s more, during this sync we only utilize <10% of the instance resources.

As for the logs, they were huge, so I removed a few thousand lines like:
Table _AIRBYTE_TMP_YWW_VALIDATION_CONFIG column _AIRBYTE_AB_ID (type VARCHAR[16777216]) -> Json type io.airbyte.protocol.models.JsonSchemaType@49c17ba4:
logs-159.txt (3.1 MB)

Ok, so according to your log it takes ~36 min to load your 6 GB table, which works out to roughly 2-3 MB/s, close to the 1.5-2 MB/s you measured.
The source-snowflake connector is part of the JDBC connector family (same base logic as Postgres, MySQL, etc.): the read operation streams results in batches of records whose size corresponds to the fetchSize parameter. This fetchSize was recently made dynamic. In your logs you can see the fetchSize being set to 100,000 rows: Updated fetch size: 100000 rows.
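To make that batching concrete, here is a small Python analogue of the pattern (the connector itself is Java/JDBC, so this is only an illustration of how a fetch size bounds memory while streaming a large result set):

```python
# Illustrative only: a Python analogue of the JDBC-style batched read the connector performs.
# A fetch size caps how many rows sit in memory at once while the whole result set streams.
def stream_in_batches(cursor, fetch_size=100_000):
    """Yield rows batch by batch instead of materializing the full result set."""
    while True:
        batch = cursor.fetchmany(fetch_size)   # at most fetch_size rows in memory at once
        if not batch:
            break
        for row in batch:
            yield row
```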
We are working this quarter on improving our database connectors, and there is room for improvement in how this fetchSize is computed to maximize throughput while preventing memory issues.
I could open an issue asking our database team to specifically investigate Snowflake read throughput, but I’d need a benchmark throughput from another tool to motivate this improvement.

That’s right, except the original table is 300 MB and the synced one is 670 MB. With these numbers the performance is an order of magnitude worse than 2 MB/s, so I’m not sure what the right number is.

If I’m getting that right, tweaking the fetchSize won’t bring an order-of-magnitude performance increase, and judging by the CPU/memory/network usage numbers, I’m guessing such an increase should be very much possible.

I’m thinking of running a more sophisticated analysis, so please do suggest any tools you have in mind

That’s right, except the original table is 300 MB and the synced one is 670 MB

From the log you shared, 6 GB are being read.

tweaking the fetchSize won’t bring an order-of-magnitude performance increase

Why do you think tweaking the fetchSize won’t improve the performance?

I’m thinking of running a more sophisticated analysis, so please do suggest any tools you have in mind

I’ve no specific tool in mind, but you could start with a Python script reading the same table from the same VM and check what throughput you achieve.
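Something along these lines could serve as a first benchmark. It assumes the snowflake-connector-python package and placeholder credentials/table name, and the per-row size estimate is crude, so treat the output as a rough throughput figure:

```python
# Rough throughput benchmark: read one table from Snowflake and report MB/s.
# Assumes `pip install snowflake-connector-python`; credentials and table name are placeholders.
import time

import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

FETCH_SIZE = 100_000  # read in batches, similar to the connector's fetchSize

start = time.monotonic()
rows = 0
bytes_read = 0

cur = conn.cursor()
try:
    cur.execute("SELECT * FROM <big_table>")
    while True:
        batch = cur.fetchmany(FETCH_SIZE)
        if not batch:
            break
        rows += len(batch)
        # Crude size estimate: length of the string representation of every value.
        bytes_read += sum(len(str(value)) for row in batch for value in row)
finally:
    cur.close()
    conn.close()

elapsed = time.monotonic() - start
print(f"{rows} rows, ~{bytes_read / 1e6:.1f} MB in {elapsed:.1f} s "
      f"(~{bytes_read / 1e6 / elapsed:.2f} MB/s)")
```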

From the log you shared, 6 GB are being read.

Yes, and the exported CSV is around 370 MB, which matches the number Snowflake gives, so that’s a mystery.

Why do you think tweaking the fetchsize won’t improve the performance?

I am not knowledgeable enough about Airbyte to offer an informed opinion, so I am just assuming that the performance gains may not be on the order of magnitude needed.

I’ve no specific tool in mind, but you could start with a Python script reading the same table from the same VM and check what throughput you achieve.

Alright, thanks!
I will report my findings here