Source MSSQL - initial load is very slow (CDC run)

sivankumar86 · July 1, 2022, 12:10am

Is this your first time deploying Airbyte?: Yes
OS Version / Instance: Ubuntu, Windows or MacOS… are some example remove it with yours
Memory / Disk: you can use something like 4Gb / 1 Tb
Deployment: Kubernetes deployment
Airbyte Version: 0.39.16
Source name/version: 0.4.5
Destination name/version: 0.4.29
Step: The issue is happening during sync, creating the connection or a new source?

I am working on mssql to snowflake CDC jobs. we have big tables (700000653 rows) and cdc logs are available only ~3 days. Source connector process around ~3k rows per seconds hence, it would take more than 3 days to complete initial load hence, we may miss some records in incremental process as transaction logs expires in 3 days. We may need to load bigger than this table in future.
Description:
How to improve CDC Initial load speed ?

Is there any workaround ? I would appreciate your help on this issue.

Thanks
Siva

sajarin · July 1, 2022, 3:09pm

Hey @sivankumar86, thanks for your post. As far as I know, improving source database performance is still something the team is working on: https://github.com/airbytehq/airbyte/issues/12532

Have you considered allocating more resources or increasing job parallelism by increasing the max number of sync workers? How has that impacted the load speed?

sivankumar86 · July 2, 2022, 2:26am

@sajarin Thanks for you reply. Good to know that it is in backlog. I have configured just one table per connection hence, it is going to run on single worker (correct me if I am missing) . I have configured 8GB limit for k8s container and I verified the usage and, noticed that it was using ~30% of resource (~2GB) only.

sajarin · July 6, 2022, 2:56pm

Hey @sivankumar86, I appreciate your patience with this issue. I don’t know if there’s a direct workaround for your specific problem. I think your best bet might be to play around with your kubernetes configuration and try to increase the job parallelism, if you haven’t done so already. I suspect this would increase the speed but this is only my guess. If this doesn’t work, I recommend that you create an issue on Github so that the team can track it better.

You can find more information in another thread: https://discuss.airbyte.io/t/scaling-airbyte-on-k8s-increased-job-parallelism/826
As well as in our documentation: https://docs.airbyte.com/understanding-airbyte/jobs/#worker-parallelization

Topic		Replies	Views
Source MSSQL - CDC incremental load failed when there is a batch update at source Connector Questions & Issues source-microsoft-sql-server-mssql , data-loading , connectors	2	302	March 7, 2023
Syncing huge tables (10B+ rows) from MySQL Connector Questions & Issues source-mysql , data-loading	2	394	January 31, 2023
Postgres CDC connector stuck , never stops reading from source and staging files while never copying them Connector Questions & Issues source-postgres	5	864	May 15, 2023
Improving performance of MSSQL to S3 CDC incremental syncs Connector Questions performance , airbyte , connector , incremental-sync , data-discrepancy	0	12	September 30, 2024
MySQL connector - CDC - very slow Connector Questions & Issues data-loading	4	621	April 11, 2023

Source MSSQL - initial load is very slow (CDC run)

Related topics