Slow data migration with Airbyte OSS on k8s and Databricks

Summary

Data migration is slow when using Airbyte OSS on k8s with Databricks as the destination. The issue appears to stem from a hardcoded limit in the Databricks connector that applies to uncompressed data, so only small batches of compressed data are written into the table, generating numerous queries and slowing down the process.


Question

Hey folks,

I need some assistance regarding slow data migration:
:information_source: Some general details:
• We’re running Airbyte OSS on k8s
• Data source: s3 bucket with .parquet files
• Destination: Databricks
• Computation in Databricks: SQL Warehouse
TL;DR: The data migration is very slow, and neither Databricks nor k8s seems to be the bottleneck:

• Databricks:
◦ We’re using a cluster of size Medium with auto-scaling configured up to 4 nodes
◦ The cluster is running on 1 node and fairly constantly executing only 1 query at a time: small batches of ~12 MB (<100k rows) being inserted
• Pods in K8s:
◦ Job pods have a limit of 4GB of memory and 3 CPUs
◦ Looking at memory and CPU usage, neither comes close to hitting those limits:
:black_small_square: The server pod is the only one using more memory, but
:black_small_square: the job pod has a constant memory usage of ~550 MB and a CPU usage of ~1800 mCPU
:black_small_square: The source and destination pods for the jobs reach higher memory usage at times, but are nowhere near the limit
Our hypothesis is that a limit hardcoded into the connector ([airbyte-integrations/connectors/destination-databricks/src/main/kotlin/io/airbyte/integrations/destination/databricks/DatabricksDestination.kt at 477bcc4](https://github.com/airbytehq/airbyte/blob/477bcc4c7195d5c8d7ba7316b94e39d255281035/airbyte-integrations/connectors/destination-databricks/src/main/kotlin/io/airbyte/integrations/destination/databricks/DatabricksDestination.kt)), which appears to apply to uncompressed data, is causing these ~12 MB chunks of compressed data to be written into the table, generating LOTS of queries, which in turn slows the process down.
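If that hypothesis is right, the connector would buffer records until an *uncompressed* byte threshold is hit and only then flush a compressed batch, so the file that actually lands per flush is much smaller than the limit. The sketch below illustrates that pattern only; the constant name, value (scaled down to 16 MiB here), and gzip compression are assumptions for illustration, not the connector's actual code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class FlushSketch {
    // Assumed, scaled-down uncompressed flush threshold (16 MiB); the real
    // connector constant lives in DatabricksDestination.kt and may differ.
    static final long UNCOMPRESSED_FLUSH_LIMIT = 16L * 1024 * 1024;

    /** Buffers repetitive records up to the uncompressed limit, then
     *  gzip-compresses the batch, mimicking the suspected flush behavior. */
    static long[] simulateFlush() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] record = "id,timestamp,repetitive-parquet-row-payload\n".getBytes();
        long uncompressed = 0;
        while (uncompressed < UNCOMPRESSED_FLUSH_LIMIT) {
            buffer.write(record);
            uncompressed += record.length;
        }
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(buffer.toByteArray());
        }
        return new long[] { uncompressed, compressed.size() };
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = simulateFlush();
        // The threshold applies to the uncompressed size, so each flushed
        // batch compresses down to a fraction of the buffer size.
        System.out.printf("uncompressed=%d KiB, compressed=%d KiB%n",
                sizes[0] / 1024, sizes[1] / 1024);
    }
}
```

With a high compression ratio (typical for columnar/repetitive data), each flush would produce a batch far smaller than the configured limit, which would match the ~12 MB inserts observed in the SQL Warehouse.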

Does this make sense? Is there any way to work around this?
Or do you have any other suggestions of what we might be doing wrong?



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here. Click here if you want
to access the original thread.


["slow-data-migration", "airbyte-oss", "kubernetes", "databricks", "connector-issue", "data-compression"]