Postgres CDC connector stuck, never stops reading from source, keeps staging files but never copies them

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu 20.04 (LTS) x64
  • Memory / Disk: 16 GB / 320 GB
  • Deployment: Docker
  • Airbyte Version: 0.40.26
  • Source name/version: Postgres 2.0.19
  • Destination name/version: Snowflake 0.4.42
  • Description: When running an incremental sync after the initial sync with the Postgres CDC connector, it looks like the job never stops, continuously pulling data from the replication slot. That could be a good thing if we were in a streaming/micro-batch paradigm, but something tells me this is not normal. The connector continuously puts staging files in Snowflake but never copies them to the target tables. It looks like the job will never stop running and the files will never be copied.
    Hitting cancel on the job has no effect and the job continues running indefinitely. I had to disable the Airbyte user on Snowflake to make the job crash.

Attaching the logs for investigation:
f26603af_93ab_484c_90f5_e5facffc6f14_logs_13485_txt.txt (1.3 MB)

Looks like the ingestion of the WAL is just very slow; I’m wondering how we could speed it up. What should be scaled?
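For what it’s worth, here is the kind of check I run on the source to see how far the replication slot lags behind the current WAL position (a minimal sketch, assuming psycopg2 and a logical slot named airbyte_slot; the connection string and names are placeholders to adjust to your setup):

```python
# Minimal sketch: measure how far a logical replication slot lags behind the
# current WAL position. Assumes psycopg2 is installed; the DSN and the slot
# name "airbyte_slot" are placeholders.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=mydb user=postgres password=secret")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT slot_name,
               pg_size_pretty(
                   pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
               ) AS replication_lag
        FROM pg_replication_slots
        WHERE slot_name = %s;
        """,
        ("airbyte_slot",),
    )
    for slot_name, lag in cur.fetchall():
        print(f"{slot_name}: {lag} of WAL not yet confirmed by the consumer")
conn.close()
```

If that lag keeps growing during a sync, the bottleneck is downstream of Postgres rather than the WAL reader itself.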

Hello @gchevalier :wave:

What you’re reporting may be a current limitation between a CDC source and an Airbyte destination.

  1. The sudden drop in throughput. This is caused by back pressure from the destination: the destination is taking a long time to flush data. This happens when the CDC source has a lot of tables in the catalog; Airbyte uses DEFAULT_MAX_CONCURRENT_STREAM_IN_BUFFER = 10 for Snowflake while the catalog contains many streams.
  2. Lack of frequent checkpointing in CDC. Each time, we were able to read 300-400 MB of data before the sync failed because of the above error. Since there is no frequent checkpointing in CDC, i.e. the source connector only emits one state message at the end of the sync, no state is captured and the next attempt starts over from the same position. If we emitted state more often, we would at least capture some progress and the next attempt would start from a higher position. This doesn’t guarantee a fix given how active the database is, but we would still keep making progress and also acknowledge the LSN to avoid the WAL building up (see the sketch after this list).
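To make point 2 concrete, here is a rough sketch of what more frequent checkpointing looks like at the protocol level (an illustration only, not the actual connector code; read_changes() and the LSN field are hypothetical stand-ins for what the CDC library, Debezium in the real connector, provides):

```python
# Illustration only: an Airbyte-style source that checkpoints every
# CHECKPOINT_EVERY records by emitting a STATE message carrying the last LSN
# it has read, instead of a single STATE at the very end of the sync.
import json
import sys

CHECKPOINT_EVERY = 10_000

def read_changes():
    """Hypothetical stand-in for the CDC change stream."""
    for i in range(25_000):
        yield {"id": i, "lsn": f"0/{i:08X}"}

def emit(message: dict) -> None:
    # Airbyte connectors speak newline-delimited JSON over stdout.
    sys.stdout.write(json.dumps(message) + "\n")
    sys.stdout.flush()

def main() -> None:
    last_lsn = None
    for n, change in enumerate(read_changes(), start=1):
        last_lsn = change["lsn"]
        emit({"type": "RECORD",
              "record": {"stream": "my_table", "data": change, "emitted_at": 0}})  # placeholder timestamp
        if n % CHECKPOINT_EVERY == 0:
            # Frequent checkpoint: progress survives a failed attempt, and the
            # acknowledged LSN lets Postgres recycle the WAL behind it.
            emit({"type": "STATE", "state": {"data": {"lsn": last_lsn}}})
    if last_lsn is not None:
        emit({"type": "STATE", "state": {"data": {"lsn": last_lsn}}})

if __name__ == "__main__":
    main()
```

With a state every N records, a failed attempt can resume from the last emitted LSN instead of starting over, and the WAL behind the acknowledged position can be released.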

The recommendation to solve the problem is:

  1. Upgrade to the latest version of both connectors (Postgres and Snowflake).
  2. Change the buffer parameter in the Snowflake destination to 50.
  3. Limit the number of tables to a maximum of 30 per connection when using CDC; this is a limitation between source and destination that the team is working to improve in the future (see the sketch after this list).
  4. Use staging on the destination.
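If you need to split a large catalog across several connections, the grouping is simple; a rough sketch (the table names and chunk size are placeholders, in practice you would take the stream names from the discovered catalog):

```python
# Rough sketch: split a large list of tables into groups of at most 30 so each
# Airbyte CDC connection stays under the recommended stream count.
from typing import Iterable, List

MAX_TABLES_PER_CONNECTION = 30

def chunk_tables(tables: Iterable[str],
                 size: int = MAX_TABLES_PER_CONNECTION) -> List[List[str]]:
    items = list(tables)
    return [items[i:i + size] for i in range(0, len(items), size)]

if __name__ == "__main__":
    # Placeholder catalog of ~60 tables.
    all_tables = [f"public.table_{i:03d}" for i in range(60)]
    for n, group in enumerate(chunk_tables(all_tables), start=1):
        print(f"connection_{n}: {len(group)} tables, e.g. {group[:3]} ...")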

Hello @marcosmarxm, thank you so much for getting back to me here.

  1. Upgrade to the latest version of both connectors (Postgres and Snowflake)
    I’ll do that, but I guess I’m stuck because of this other post: Upgrading from 0.40.26 to 0.41.0 on Docker - #2 by marcosmarxm

Change the buffer parameter in the Snowflake destination to 50.
Will do!

Limit the number of tables to a maximum of 30 per connection using CDC

Currently we have ~60 tables. Would that work if we have 3 connections, each with ~20 tables to sync, outputting data to the same destination?

Use staging on the destination

That is something I can investigate, but I’m not sure I understand the rationale here.

Hi @gchevalier. We have the same issue; did using 60 tables in a single connection work for you? We have ~180 tables, and I’d prefer not to have 6 connections for this one database if possible!

@marcosmarxm if you have multiple connections, do they each need their own Postgres CDC slot and publication? Is it possible for one of the slots to automatically receive events on new tables?
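For context, this is roughly how we create the slot and publication today (a minimal sketch, assuming psycopg2 and a role with the required privileges; the names are just examples):

```python
# Minimal sketch of our current setup: one logical replication slot plus a
# publication created FOR ALL TABLES. Assumes psycopg2; the DSN, slot name
# and publication name are placeholders.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=mydb user=postgres password=secret")
conn.autocommit = True  # run each statement in its own transaction
with conn.cursor() as cur:
    # FOR ALL TABLES means newly created tables are included in the
    # publication automatically; a publication that lists tables explicitly
    # is not updated when new tables appear.
    cur.execute("CREATE PUBLICATION airbyte_publication FOR ALL TABLES;")
    # One logical slot per consumer; pgoutput is one of the plugins the
    # Airbyte Postgres source supports.
    cur.execute(
        "SELECT pg_create_logical_replication_slot(%s, %s);",
        ("airbyte_slot", "pgoutput"),
    )
conn.close()
```

So my question is mainly whether each Airbyte connection would need its own slot/publication pair like this.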

Oh, and is there an issue tracking this:

  1. Limit the number of tables to a maximum of 30 per connection using CDC; this is a limitation between source and destination that the team is working to improve in the future.

Thanks very much both!
Mark

@Mark_NS Actually, what solved the issue in our case was updating the buffer parameter for the Snowflake destination to 50. We did not experiment with the other potential causes.