Postgres transaction logs not cleaning up in CDC incremental mode sync to Snowflake

Summary

Postgres transaction logs are not being cleaned up as expected after each sync in CDC incremental mode from Postgres 16 to Snowflake. Postgres is running in AWS.


Question

We are using Airbyte in CDC incremental mode to sync Postgres 16 to Snowflake. We are observing that our Postgres transaction logs are not being cleaned up after each sync as much as we expect. Postgres is running in AWS. Any suggestions?



This topic has been created from a Slack thread to give it more visibility.
It is kept in read-only mode here.

["postgres", "cdc-incremental-mode", "snowflake", "transaction-logs", "aws"]

I’m facing the same issue with Postgres 15 to Redshift. I’ve upgraded to the latest connectors and Airbyte version and it’s still acting the same; I even downgraded back and I’m still seeing the same behavior.

Are your syncs randomly doing full refreshes of the data?

I think we were seeing random full refreshes. Let me post some of our findings

• Theory 1 - Debezium timeouts
◦ During moments of high load, Airbyte times out while parsing through the WAL, because it only waits 5 minutes for a relevant record
◦ It should have been waiting 20 minutes (1,200 seconds), but from the logs it clearly was not
◦ This could happen if it was searching for relevant data among many other WAL records, e.g. halfway through an 800 GB WAL backlog
◦ This would explain the syncs that never completed, which let the WAL grow over the last few weeks, and the immediate draining of the WAL once the replication slot moved ahead after we increased the timeout
◦ Q: What in the WAL successfully drained each night, and why was there so much left over? Replay should go oldest to newest, so shouldn’t the oldest segments just get purged? Were we only seeing lag based on where the sync failed because of this? (See the query sketch right after this list.)
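One way to sanity-check that last question (not something from the original thread, just a sketch assuming the slot is named `airbyte_slot` as in the output further down) is to compare the slot’s restart_lsn with its confirmed_flush_lsn: if confirmed_flush_lsn moves forward each night but restart_lsn stays far behind, the consumer is acknowledging data yet Postgres still has to retain the older WAL; if neither moves, the slot is not advancing at all.

```
-- Sketch only (assumes the Airbyte replication slot is named 'airbyte_slot'):
-- restart_lsn         = oldest WAL Postgres must keep for this slot
-- confirmed_flush_lsn = position the consumer (Debezium) has acknowledged
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS unacknowledged_wal,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))         AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'airbyte_slot';
```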

As Joey mentioned, our replication slot just won’t advance. I really thought it was the dbz timeouts, but now it seems we are “caught up” and it doesn’t time out:

```
SELECT SLOT_NAME
     , PG_CURRENT_WAL_LSN()
     , REDO_LSN
     , RESTART_LSN
     , PG_SIZE_PRETTY(PG_WAL_LSN_DIFF(PG_CURRENT_WAL_LSN(), RESTART_LSN)) AS LAG
     , ACTIVE
FROM PG_CONTROL_CHECKPOINT(), PG_REPLICATION_SLOTS;
```
```
+------------+------------------+-------------+-------------+------+------+
|slot_name   |pg_current_wal_lsn|redo_lsn     |restart_lsn  |lag   |active|
+------------+------------------+-------------+-------------+------+------+
|airbyte_slot|1DE0/894564C8     |1DE0/75FDF240|1D8D/55E49CA0|333 GB|false |
+------------+------------------+-------------+-------------+------+------+
```

<@U01MMSDJGC9> Could you please offer any insights or guide us on how to debug this further? I’m guessing this is a huge flaw with the Postgres source connector, and it’s impacting us massively.

I assume you’re using RDS Postgres too?

This is also impacting us heavily so we should work together to find the resolution. Our RDS instance is constantly autoscaling storage because the transaction logs aren’t being cleaned up.

Me too and we’ve already hit maximum storage, and with RDS you can’t downscale storage :man-facepalming:

Oh wow, that is a lot. We are not near the limit yet

We are at 3613 GiB right now

Yeah, you can’t downscale storage, you have to create a new instance and migrate

ohhh I’ve set the limit at 1000 GB, more than that and it’s coming right out of my salary :joy:

Can both of you share which versions of the source and destination connectors and the Airbyte platform you are using?

Postgres 3.6.11
Snowflake 3.11.4
Platform 0.63.10

This also occurred in previous versions, but we were hoping that upgrading would fix it.

Source - Postgres: 3.4.26
Destination - Redshift: 2.6.4
Platform - 0.62.4

Disclaimer: I’ve already tried upgrading to:
Source - Postgres: 3.6.11
Destination - Redshift: 3.4.0
Platform - 0.63.11

and it was even worse: I would see full refreshes that would, for example, load the true 50 GB of data on the first attempt, then fail and load 100 GB on the second attempt, and might then randomly load 150 GB later

Joey, could you file a GitHub issue? I’ll ask someone from the database team to take a look into it :octavia-thanks:

Yeah, I will work on that shortly and post the link here. :pray:

https://github.com/airbytehq/airbyte/issues/42917