Source MSSQL - initial load is very slow (CDC run)

  • Is this your first time deploying Airbyte?: Yes

  • OS Version / Instance: Ubuntu, Windows or MacOS… are some example remove it with yours

  • Memory / Disk: you can use something like 4Gb / 1 Tb

  • Deployment: Kubernetes deployment

  • Airbyte Version: 0.39.16

  • Source name/version: 0.4.5

  • Destination name/version: 0.4.29

  • Step: The issue is happening during sync, creating the connection or a new source?

    I am working on mssql to snowflake CDC jobs. we have big tables (700000653 rows) and cdc logs are available only ~3 days. Source connector process around ~3k rows per seconds hence, it would take more than 3 days to complete initial load hence, we may miss some records in incremental process as transaction logs expires in 3 days. We may need to load bigger than this table in future.

  • Description:
    How to improve CDC Initial load speed ?

Is there any workaround ? I would appreciate your help on this issue.

Thanks
Siva

Hey @sivankumar86, thanks for your post. As far as I know, improving source database performance is still something the team is working on: https://github.com/airbytehq/airbyte/issues/12532

Have you considered allocating more resources or increasing job parallelism by increasing the max number of sync workers? How has that impacted the load speed?

1 Like

@sajarin Thanks for you reply. Good to know that it is in backlog. I have configured just one table per connection hence, it is going to run on single worker (correct me if I am missing) . I have configured 8GB limit for k8s container and I verified the usage and, noticed that it was using ~30% of resource (~2GB) only.

Hey @sivankumar86, I appreciate your patience with this issue. I don’t know if there’s a direct workaround for your specific problem. I think your best bet might be to play around with your kubernetes configuration and try to increase the job parallelism, if you haven’t done so already. I suspect this would increase the speed but this is only my guess. If this doesn’t work, I recommend that you create an issue on Github so that the team can track it better.

You can find more information in another thread: https://discuss.airbyte.io/t/scaling-airbyte-on-k8s-increased-job-parallelism/826
As well as in our documentation: https://docs.airbyte.com/understanding-airbyte/jobs/#worker-parallelization