Summary
Inquiry about support for large initial CDC syncs with WASS on Airbyte 0.64.3 and Postgres source 3.6.18, after a large Postgres->Redshift initial sync failed.
Question
Does anyone know if WASS support for large initial CDC syncs (https://airbyte.com/blog/supporting-very-large-cdc-syncs-with-wass) is available on Airbyte 0.64.3 and Postgres source 3.6.18? The blog post on WASS suggests these versions should be recent enough, but the initial sync fails for seemingly the exact reason identified in the post (it failed 38GB into a 500+GB Postgres->Redshift sync). Thank you for your help.
This topic has been created from a Slack thread to give it more visibility.
["large-initial-cdc-syncs", "wass", "airbyte-0.64.3", "postgres-source-3.6.18", "initial-sync-failure", "postgres-redshift-sync"]
This happened to us, so we downgraded to pre-WASS (3.4.26).
For some reason, WASS was breaking our initial sync with the error: `Saved offset is before replication slot's confirmed lsn. Please reset the connection, and then increase WAL retention and/or increase sync frequency to prevent this from happening in the future.`
This is even though our WAL retention is indefinite and the WAL shouldn't be too big by the time this error would be thrown.
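For anyone debugging the same error, here is a minimal diagnostic sketch (assuming psycopg2, and using a placeholder slot name `airbyte_slot` and placeholder connection details) that inspects the replication slot's LSNs and how much WAL it is holding back:

```python
# Diagnostic sketch: inspect the logical replication slot used for CDC and show
# how much WAL it is retaining. Slot name and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="your-rds-host", dbname="your_db", user="your_user", password="..."
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT slot_name,
               active,
               restart_lsn,
               confirmed_flush_lsn,
               pg_size_pretty(
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
               ) AS retained_wal
        FROM pg_replication_slots
        WHERE slot_name = %s;
        """,
        ("airbyte_slot",),  # replace with your slot name
    )
    for row in cur.fetchall():
        print(row)
```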
<@U05N5APSYBE> Thank you for sharing this! I hadn't expected to see WASS hurting the sync. I'll update this thread once the sync on Airbyte 1.0.0 runs, but if it doesn't go well, we'll see about downgrading the source. Thank you again!
Hi <@U07LR8G4C2H> and <@U05N5APSYBE>, yes, 3.6.18 contains WASS. If you feel there is a problem with it, could you file a ticket with your connection link or the sync log? We will take it from there.
Hi <@U073KSQ6Z53>! Our (new) sync has been running for 5 hours and we're observing two issues:
• The RDS OldestReplicationSlotLag is growing linearly, suggesting that WASS hasn't kicked in despite being on Airbyte 1.0.0/latest Postgres source. Is there some way to confirm this or force it to read the WAL? (See the monitoring sketch at the end of this message.)
• Only ~20GB of the >500GB have synced. Is there anything we can do to speed up the sync? Our concern (beyond it taking a long time) is the ~5 days of uncapped WAL growth.
cc <@U01MMSDJGC9> in case you have any ideas
Thank you for any ideas you might have.
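For the first bullet, one rough way to confirm whether the slot is being acknowledged is to sample the retained WAL over time; if the number only ever grows, CDC has not acknowledged anything yet. A sketch, again assuming psycopg2 plus a placeholder DSN and slot name:

```python
# Monitoring sketch (not an official Airbyte tool): print the WAL retained by the
# slot every few minutes so you can see whether acknowledgements are happening.
import time
import psycopg2

QUERY = """
    SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
    FROM pg_replication_slots
    WHERE slot_name = %s;
"""

conn = psycopg2.connect("postgresql://user:pass@your-rds-host/your_db")  # placeholder DSN

previous = None
while True:
    with conn, conn.cursor() as cur:
        cur.execute(QUERY, ("airbyte_slot",))  # placeholder slot name
        retained = int(cur.fetchone()[0])      # raises if the slot does not exist
    delta = "n/a" if previous is None else f"{retained - previous:+d} bytes"
    print(f"retained WAL: {retained / 1024**3:.2f} GiB (change since last sample: {delta})")
    previous = retained
    time.sleep(300)  # sample every 5 minutes
```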
Hi Adam, this is a little strange, as our current implementation has a 4-hour timeout for the initial sync, and then CDC kicks in.
<@U073KSQ6Z53> that's fascinating! Here's a screenshot of it running (my local time is 16:44, so 5 hours in). Is there a line in the logs we can look for to detect CDC/the WAL read kicking in?
Oh sorry, I just checked again; this timeout is 8 hours by default!
It's the last row, "Initial Load Timeout in Hours".
You can set that to a smaller value, maybe 4, to experiment with WASS.
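For reference, here is a rough sketch (not an official recipe) of changing that setting through the Airbyte Config API instead of the UI. The base URL, auth, and the field name assumed here (`replication_method.initial_load_timeout_hours`) should be verified against your deployment and the Postgres source spec for your version:

```python
# Hedged sketch: update "Initial Load Timeout in Hours" on a Postgres source via
# the Airbyte Config API. All identifiers below are placeholders/assumptions.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"           # adjust to your deployment
AUTH = ("airbyte", "password")                         # default OSS basic auth; adjust as needed
SOURCE_ID = "00000000-0000-0000-0000-000000000000"     # your Postgres source id

# Fetch the current source so the full configuration can be resubmitted.
resp = requests.post(f"{AIRBYTE_API}/sources/get", json={"sourceId": SOURCE_ID}, auth=AUTH)
resp.raise_for_status()
source = resp.json()

config = source["connectionConfiguration"]
config["replication_method"]["initial_load_timeout_hours"] = 4  # assumed field name

update = requests.post(
    f"{AIRBYTE_API}/sources/update",
    json={
        "sourceId": SOURCE_ID,
        "name": source["name"],
        "connectionConfiguration": config,
    },
    auth=AUTH,
)
update.raise_for_status()
```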
<@U073KSQ6Z53> OK, great, this is so helpful! And when it times out, nothing bad will happen as long as we can handle 8 hours of WAL backup? If so, that works for us.
Out of curiosity, is there any trick to increasing the sync throughput?
Thank you so much for your help.
Yes, once CDC kicks in, it will acknowledge the WAL entries. You should see the WAL size shrink.
Currently, the only way to increase sync throughput is to create multiple connections, letting each connection own a subset of the current set of streams. In Q4, we plan to migrate Postgres to our newly developed bulk-extract CDK, which provides concurrency and may give better throughput for each connection.
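As an illustration of the multiple-connections workaround (placeholder stream names; the connections themselves are still created in the Airbyte UI or API), a small sketch that round-robins a stream list into per-connection subsets:

```python
# Illustrative only: partition the stream (table) list so each Airbyte connection
# syncs its own subset. Very large tables may deserve a connection of their own
# rather than round-robin placement.
from itertools import cycle

streams = ["orders", "order_items", "customers", "events", "payments", "invoices"]
num_connections = 3

groups: list[list[str]] = [[] for _ in range(num_connections)]
for group, stream in zip(cycle(groups), streams):
    group.append(stream)

for i, group in enumerate(groups, start=1):
    print(f"connection {i}: {group}")
```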
> One follow-up: Since each timeout is treated like a sync failure and there are 5 retries per sync, will the sync be marked as failing and be unrecoverable after 5 timeouts?
Yue can provide a better answer, but in the latest version of Airbyte, if a sync made progress and then failed, it will retry without counting that as a hard failure.
<@U01MMSDJGC9> that’s comforting, thank you!
No amount of syncing/refreshing worked (they all threw the WAL error). We upgraded to 1.0.0 and are running a sync now to see if any of the throughput/WASS improvements resolve the issue.
Thank you <@U073KSQ6Z53>! I do see that CDC kicked in after 8 hours (twice). I also see the system making incremental progress, which is GREAT!
One follow-up: Since each timeout is treated like a sync failure and there are 5 retries per sync, will the sync be marked as failing and be unrecoverable after 5 timeouts?
In WASS, we will retry 20 times.