CDC incremental dedup history sync spends too much time

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu
  • Memory / Disk: 16Gb / 200GB
  • Deployment: Docker deployment
  • Airbyte Version: 0.39.20-alpha
  • Source name/version: airbyte/source-postgres/0.4.27
  • Destination name/version: airbyte/destination-redshift/0.3.46
  • Step: The issue is happening during sync (normalization)
  • Description: I have a large table to sync from Postgres to Redshift using CDC with the incremental dedup history mode. I understand the first snapshot will take a long time, since the table is over 70 GB. After that first sync, I expected each batch to take far less time, but that is not the case, and the bottleneck seems to be the normalization phase. Why does Redshift still scan the complete _scd table on every sync? Each synchronization takes a couple of hours, and in the meantime the replication slot accumulates data rapidly.

After the first snapshot, normalization still scans the full SCD table:
(the 0.5 billion rows are the whole table)

Even when there are no changes, it still takes a considerable amount of time:

Thank you for your help!
Br,
Ken

Hey, could you create an issue on GitHub as an improvement request, so that the team can look into it?

Also, if you have chosen incremental dedup in the connection settings, then yes, we have to scan the full table to figure out the duplicates. Also, what is the effect when it takes longer?
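To illustrate why dedup needs to look at more than the new batch, here is a toy sketch in Python. It is not Airbyte's normalization SQL; the function names and the `id` / `updated_at` fields are hypothetical and exist only for this example. Append mode touches just the incoming rows, while a naive dedup pass has to compare the new rows against everything already stored for the same key.

```python
def append_sync(history, batch):
    """Append mode: add the new rows and touch nothing else."""
    history.extend(batch)
    return len(batch)  # number of rows processed


def dedup_sync(history, batch, key="id", cursor="updated_at"):
    """Naive dedup: keep the latest row per key.

    Finding the latest version of each key requires walking the
    whole history, not only the incoming batch.
    """
    history.extend(batch)
    latest = {}
    for row in history:  # full scan of all stored rows
        k = row[key]
        if k not in latest or row[cursor] >= latest[k][cursor]:
            latest[k] = row
    return latest, len(history)  # deduped view, rows scanned


if __name__ == "__main__":
    stored = [{"id": 1, "updated_at": 1}, {"id": 2, "updated_at": 1}]
    new_rows = [{"id": 1, "updated_at": 2}]
    print(append_sync(list(stored), new_rows))
    print(dedup_sync(list(stored), new_rows)[1])
```

In this sketch the work done by `dedup_sync` grows with the size of the stored history, which is consistent with the long normalization times described above for a table with 0.5 billion rows.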

Thank you for your response!

I’m not sure whether it counts as an issue if dedup mode has to scan the full table on every sync.

The effect I am concerned about is that the replication slot accumulates a massive amount of data, up to a dangerous level (taking a lot of disk space), since every batch takes longer than I expected. Should I switch to append mode (no need to scan the full table), so that normalization in the warehouse database does not take so long and Airbyte keeps consuming the replication slot in Postgres?
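One way to keep an eye on this while testing different sync modes is to measure how much WAL the slot is holding back. The sketch below is an assumption-laden example, not something from this thread: the query uses the PostgreSQL 10+ functions `pg_current_wal_lsn()` and `pg_wal_lsn_diff()` against the `pg_replication_slots` view, and the helper names and the 10 GiB threshold are hypothetical. The Python helpers do the same arithmetic offline from two LSN strings.

```python
# Query to run on the source Postgres (PostgreSQL 10 or newer).
# It reports how many bytes of WAL each slot is still retaining.
SLOT_LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits
    and the low 32 bits of a 64-bit WAL position.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart point and the current position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slot_is_risky(current_lsn: str, restart_lsn: str,
                  threshold_bytes: int = 10 * 1024**3) -> bool:
    """Hypothetical alert rule: flag slots retaining more than the threshold."""
    return retained_wal_bytes(current_lsn, restart_lsn) > threshold_bytes


if __name__ == "__main__":
    print(retained_wal_bytes("1/0", "0/0"))
    print(slot_is_risky("3/0", "0/0"))
```

The threshold should be picked relative to the free disk space on the Postgres host; the value here is only a placeholder.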

Yeah, you could try that.
