Slow data migration with Airbyte OSS on k8s and Databricks

Summary

Data migration is slow when using Airbyte OSS on k8s with Databricks as the destination. The issue appears to stem from a hardcoded limit in the Databricks connector that applies to uncompressed data, so only small batches of compressed data are written into the table, generating numerous queries and slowing down the process.


Question

Hey folks,

I need some assistance regarding slow data migration:
:information_source: Some general details:
• We’re running Airbyte OSS on k8s
• Data source: s3 bucket with .parquet files
• Destination: Databricks
• Computation in Databricks: SQL Warehouse
TL;DR: The data migration is very slow, and neither Databricks nor k8s seems to be the bottleneck:

• Databricks:
◦ We’re using a cluster of size Medium with auto-scaling configured up to 4 nodes
◦ The cluster is running on 1 node and fairly constantly executing only 1 query at a time: small batches of ~12 MB (<100k rows) being inserted
• Pods in K8s:
◦ Job pods have a limit of 4GB of memory and 3 CPUs
◦ Looking at memory and CPU usage, neither comes close to hitting those limits:
:black_small_square: The server pod is the only one using more memory, but
:black_small_square: the job pod has a constant memory usage of ~550 MB and a CPU usage of ~1800 mCPU
:black_small_square: The source and destination pods for the jobs reach higher memory usage at times, but are nowhere near the limit
Our hypothesis is that a limit hardcoded into the connector ([airbyte-integrations/connectors/destination-databricks/src/main/kotlin/io/airbyte/integrations/destination/databricks/DatabricksDestination.kt at 477bcc4](https://github.com/airbytehq/airbyte/blob/477bcc4c7195d5c8d7ba7316b94e39d255281035/airbyte-integrations/connectors/destination-databricks/src/main/kotlin/io/airbyte/integrations/destination/databricks/DatabricksDestination.kt)), which appears to apply to uncompressed data, is causing these ~12 MB chunks of compressed data to be written into the table, generating LOTS of queries, which in turn slows the process down.
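If that hypothesis is right, the connector would buffer records until an *uncompressed* byte threshold is hit and only then flush a compressed batch, so the file that actually lands per flush is much smaller than the limit. The sketch below illustrates that pattern only; the constant name, value (scaled down to 16 MiB here), and gzip compression are assumptions for illustration, not the connector's actual code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class FlushSketch {
    // Assumed, scaled-down uncompressed flush threshold (16 MiB); the real
    // connector constant lives in DatabricksDestination.kt and may differ.
    static final long UNCOMPRESSED_FLUSH_LIMIT = 16L * 1024 * 1024;

    /** Buffers repetitive records up to the uncompressed limit, then
     *  gzip-compresses the batch, mimicking the suspected flush behavior. */
    static long[] simulateFlush() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] record = "id,timestamp,repetitive-parquet-row-payload\n".getBytes();
        long uncompressed = 0;
        while (uncompressed < UNCOMPRESSED_FLUSH_LIMIT) {
            buffer.write(record);
            uncompressed += record.length;
        }
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(buffer.toByteArray());
        }
        return new long[] { uncompressed, compressed.size() };
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = simulateFlush();
        // The threshold applies to the uncompressed size, so each flushed
        // batch compresses down to a fraction of the buffer size.
        System.out.printf("uncompressed=%d KiB, compressed=%d KiB%n",
                sizes[0] / 1024, sizes[1] / 1024);
    }
}
```

With a high compression ratio (typical for columnar/repetitive data), each flush would produce a batch far smaller than the configured limit, which would match the ~12 MB inserts observed in the SQL Warehouse.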

Does this make sense? Is there any way to work around this?
Or do you have any other suggestions of what we might be doing wrong?



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here. Click here if you want
to access the original thread.


["slow-data-migration", "airbyte-oss", "kubernetes", "databricks", "connector-issue", "data-compression"]