Stuck Syncs and Automated Full Resyncs on CDC

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu 18.04.4 LTS
  • Memory / Disk: 16 GB / 50 GB
  • Deployment: Docker
  • Airbyte Version: 0.40.3
  • Source name/version: Postgres Connector / 1.0.7
  • Destination name/version: Snowflake Destination / 0.4.34
  • Step: Sync
  • Description:

The behaviour that I am about to describe is a little bit complex, but I will try to be thorough in order to be as descriptive as I can.

Connection Info

We have configured one of our connections using the Postgres connector and the Snowflake destination. From the source database we have configured 2 tables, both with:

  • CDC replication
  • Incremental & Deduped History
  • Daily sync at 9 AM, triggered manually through the API (Airflow Operator; a minimal sketch follows this list)
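
For context, the daily trigger is roughly equivalent to the following (a minimal sketch assuming the official Airbyte provider for Airflow; the Airflow connection id and Airbyte connection UUID are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="airbyte_daily_cdc_sync",    # hypothetical DAG name
    start_date=datetime(2022, 9, 1),
    schedule_interval="0 9 * * *",      # daily at 9 AM
    catchup=False,
) as dag:
    AirbyteTriggerSyncOperator(
        task_id="trigger_cdc_sync",
        airbyte_conn_id="airbyte_api",              # Airflow connection pointing at the Airbyte server (placeholder)
        connection_id="<airbyte-connection-uuid>",  # placeholder: UUID of the CDC connection
        asynchronous=False,                         # block until the sync job finishes
        timeout=3600,
    )
```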

The source database is an Aurora Postgres instance, and for the CDC configuration we have followed your setup instructions, which means that we have created the following (a quick verification sketch follows the list):

  • A dedicated read-only user granted with the relevant privileges
  • A replication slot using the ‘pgoutput’ plugin
  • A publication registered with the 2 tables.
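
For reference, the setup can be sanity-checked with a few catalog queries (a sketch assuming psycopg2; the connection details, slot, and publication names below are placeholders for the ones we actually created):

```python
import psycopg2

conn = psycopg2.connect("host=<aurora-endpoint> dbname=<db> user=airbyte_user password=<pw>")
with conn, conn.cursor() as cur:
    # The replication slot Airbyte should own exclusively, created with the pgoutput plugin.
    cur.execute(
        "SELECT slot_name, plugin, active, confirmed_flush_lsn "
        "FROM pg_replication_slots WHERE slot_name = %s",
        ("airbyte_slot",),
    )
    print(cur.fetchone())

    # The tables registered in the publication; these should match the Airbyte catalog exactly.
    cur.execute(
        "SELECT schemaname, tablename FROM pg_publication_tables WHERE pubname = %s",
        ("airbyte_publication",),
    )
    print(cur.fetchall())
```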

Furthermore, those 2 configured tables are updated in batch operations, which means that for many days there are zero changes, but then, due to a scheduled operational process, thousands of rows are affected (inserts/updates/deletes).

Problem Description
This connection was first configured on 2022-09-08T21:00:00Z, and until 2022-09-16T21:00:00Z the syncs were working correctly, but after that date we observed some weird behaviour. All of the subsequent runs are listed below, along with their results:

  1. Succeeded | 18.43 MB | 32,033 emitted records | 32,033 committed records | 8m 18s | Sync 9:00 AM 09/18 - Those 32,033 records were never actually loaded into the table, since they cannot be found in the relevant _scd history tables.
  2. Succeeded | 28.53 GB | 14,144,163 emitted records | 14,144,163 committed records | 1h 28m 21s | Sync 9:00 AM 09/19 - An automated full resync, different from the Reset Data process since it does not drop the tables. (Practically, we are not sure how Airbyte decides to proceed with a full resync, but it seems to be linked to its inability to find the previous offset of the replication slot; see the LSN sketch after this list.)
  3. Succeeded | 18.43 MB | 32,033 emitted records | 32,033 committed records | 24m 28s | Sync 9:00 AM 09/20 - Same as the sync at 2022-09-17T21:00:00Z.
  4. Succeeded | 18.43 MB | 32,033 emitted records | 32,033 committed records | 24m 28s | Sync 9:00 AM 09/21 - Same as the sync at 2022-09-17T21:00:00Z.
  5. Succeeded | 28.53 GB | 14,144,163 emitted records | 14,144,163 committed records | 1h 28m 21s | Sync 2:45 PM 09/21 - Same as the sync at 2022-09-18T21:00:00Z. (This is the date when we identified the issue and manually re-triggered the connection.)
  6. Succeeded | 18.43 MB | 32,033 emitted records | 32,033 committed records | 24m 28s | Sync 4:14 PM 09/21 - After the prior automatic full resync we triggered another sync to check whether the connector had started tracking from the correct state, in which case we would expect 0 changes, but we got the same result as the run at 2022-09-17T21:00:00Z.
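
Based on the log line quoted later in this thread (“Saved offset is before Replication slot's confirmed_flush_lsn”), the automated full resync appears to be triggered when the LSN saved in Airbyte's state falls behind the slot's confirmed_flush_lsn. A minimal sketch of that comparison, assuming psycopg2 and a placeholder saved LSN taken from the connection state:

```python
import psycopg2

SAVED_OFFSET_LSN = "F9/A1675FD8"  # placeholder: the LSN stored in the Airbyte connection state
SLOT_NAME = "airbyte_slot"        # placeholder slot name

conn = psycopg2.connect("host=<aurora-endpoint> dbname=<db> user=airbyte_user password=<pw>")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT confirmed_flush_lsn, "
        "       pg_wal_lsn_diff(confirmed_flush_lsn, %s::pg_lsn) AS gap_bytes "
        "FROM pg_replication_slots WHERE slot_name = %s",
        (SAVED_OFFSET_LSN, SLOT_NAME),
    )
    confirmed_flush_lsn, gap_bytes = cur.fetchone()
    if gap_bytes > 0:
        # The saved offset is behind the slot's confirmed_flush_lsn, so the connector
        # cannot resume from it and a full resync would be triggered.
        print(f"Saved offset {SAVED_OFFSET_LSN} is before {confirmed_flush_lsn}: expect a full resync")
    else:
        print("Saved offset is at or ahead of confirmed_flush_lsn: expect an incremental sync")
```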

I don’t want to dive deeper into the details coming from the Airbyte log analysis; I want to leave that part to you. Our main goal with this topic is to understand the reasons behind this behaviour.

Below you can find the run logs for these dates, as well as some helpful screenshots from the Airbyte runs and Snowflake’s change aggregations.

Airbyte Run Screenshot

Snowflake Results by grouping _airbyte_emitted_at column


Logs
logs-123–17-09-2022.txt (297.2 KB)
logs-124–18-09-2022.txt (305.8 KB)
logs-125–19-09-2022-full-resync.txt (2.2 MB)
logs-126–20-09-2022.txt (294.1 KB)
logs-127–21-09-2022.txt (296.5 KB)
logs-128-21-09-2022-full-resync.txt (2.2 MB)
logs-129–21-09-2022-last-run.txt (293.9 KB)

One extra piece of information about the source database: there is a second existing replication slot, used by Fivetran connectors, which was created using the test_decoding plugin.

This report is awesome, thanks for all the info. It should be OK to have multiple replication slots, as long as the one Airbyte uses is not shared with anything else.

It looks like there have been some performance improvements for Postgres CDC in v1.0.9; could you try the new connector version and see if anything improves? Postgres | Airbyte Documentation

I find it really odd that the sync was working as expected up until 9/18.

Still reviewing the logs you have provided and don’t have a conclusive answer yet. But I did notice a bunch of lines saying this on 9/17:

2022-09-17 06:04:07 source > 2022-09-17 06:04:07 WARN i.d.c.p.PostgresStreamingChangeEventSource(maybeWarnAboutGrowingWalBacklog):377 - Received 10001 events which were all filtered out, so no offset could be committed. This prevents the replication slot from acknowledging the processed WAL offsets, causing a growing backlog of non-removeable WAL segments on the database server. Consider to either adjust your filter configuration or enable heartbeat events (via the heartbeat.interval.ms option) to avoid this situation.

It looks related to what’s described in this issue: Investigate usage of heartbeat property in debezium for CDC · Issue #15040 · airbytehq/airbyte · GitHub

I think it’s saying that if there is a discrepancy between the publication and Airbyte catalog, like if you have added two tables to the publication but only one is used in the catalog, transaction records from the missing table would not be synced or cleaned up from the Postgres WAL/XLOG. If this isn’t intentional, I would double check that your catalog is set up correctly. Also, I wonder how disk space is doing on the postgres server?
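
To put a number on the “growing backlog of non-removeable WAL segments” that warning describes, a check along these lines shows how much WAL each replication slot is currently holding back (a hedged sketch assuming psycopg2; connection details are placeholders):

```python
import psycopg2

conn = psycopg2.connect("host=<aurora-endpoint> dbname=<db> user=<monitoring-user> password=<pw>")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT slot_name, active, "
        "       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal "
        "FROM pg_replication_slots"
    )
    for slot_name, active, retained_wal in cur.fetchall():
        # A steadily growing retained_wal for the Airbyte slot means its offsets are
        # not being acknowledged between syncs, which is what the warning describes.
        print(f"{slot_name}: active={active}, retained WAL={retained_wal}")
```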

Hello @sh4sh ,

first of all, thank you for your thorough investigation.

The log line that you have attached is one that we have also seen in the past in other problematic connections using the Postgres connector.

Furthermore, I understand your assumption, but unfortunately, for all of those cases, our publications are configured only with the tables that we want to sync using Airbyte. So I could say that our Airbyte catalog is in sync with the publication configuration.

The database cluster’s disk is 252 GB, and the usage metrics are really low (not even 5% at the highest peaks).

P.S.: We created a new replication slot, publication, connector, and connection, and we upgraded the Postgres connector to 1.0.10. So far we haven’t faced any issues.

I see, thanks for the clarification and for trying with the new connector version.

I did find a reported issue that has pretty similar behaviour: Postgres CDC stops syncing new rows · Issue #14932 · airbytehq/airbyte · GitHub
If you have the time to look it over and think it is related, please comment on the issue and add a thumbs up to make sure it’s prioritized properly.

There is a debugging step mentioned in the comments about a log message that did appear in the 9/19 full-resync logs. It didn’t resolve the problem for the user who submitted the issue, but I wonder if it may be helpful in your case:

**@ameyabhope** this is why it is syncing from scratch:

`Saved offset is before Replication slot's confirmed_flush_lsn, Airbyte will trigger sync from scratch`

I suspect that your Postgres DB is purging WAL in between Airbyte jobs (perhaps to prevent filling up the disk). Could you try running Airbyte sync jobs more often and extending the auto-cancellation timeout?

I reviewed the logs but it’s still unclear to me why the data is not being replicated properly. To make sure that this is not an issue with Snowflake, would you be able to test on a different destination connector?

Hello @sh4sh,

I have checked the post and I believe that the case matches ours. So I have replied to this issue ticket and I have also added a :+1: on the post.

Now, about the WAL purging: I don’t believe that is the case, since we have not faced any related problems with the other replication slot that we have already configured (Fivetran), and, as shared above, there is plenty of space in the local storage of this specific database.

The test that you have suggested is difficult since the configured database is used for the production environment. Of course, if we face the same issue again with the new connection, I will re-post it here with more logs.

From my point of view, since there is already an open issue on GitHub, I am OK with following it to track progress.

Of course, if you have any extra insight into the issue, I would be more than happy if you shared it with me.

Again, many thanks for your support until now. :slight_smile:


Hello here,
I don’t know if this will help, but I managed to fix some issues of this kind (full resyncs) by updating the Postgres source connector’s code to trick the LSN checks. As I wrote in this issue, there was a bug in the Debezium code, fixed in the latest releases but still present in the version used by the connector, that mangled the stored LSNs together (lsn_commit and lsn_proc), making the connector think an LSN had been updated when lsn_proc is after lsn_commit (which can happen quite often, when the server still needs to retain logs from before the last committed transaction). Tricking the check to match lsn_proc instead of lsn_commit fixed this issue for me (I still have another issue where state messages are not emitted, but I don’t think the cause is the same).
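
A rough, purely illustrative sketch of the mix-up described above (this is not the actual Debezium code; the field names and numbers are only for illustration):

```python
from dataclasses import dataclass

@dataclass
class SavedOffset:
    lsn_commit: int  # LSN of the last committed transaction acknowledged by the connector
    lsn_proc: int    # LSN of the last event actually processed; can be ahead of lsn_commit

def can_resume(offset: SavedOffset, confirmed_flush_lsn: int, use_workaround: bool) -> bool:
    if use_workaround:
        # Workaround described above: compare the processed LSN, so the connector resumes
        # as long as processing has caught up with the slot's flushed position.
        return offset.lsn_proc >= confirmed_flush_lsn
    # Reported buggy behaviour: comparing the commit LSN makes the saved offset look stale
    # whenever the server still retains WAL from before the last committed transaction,
    # which triggers a needless full resync.
    return offset.lsn_commit >= confirmed_flush_lsn

offset = SavedOffset(lsn_commit=0xF9A1670000, lsn_proc=0xF9A1675FD8)
print(can_resume(offset, confirmed_flush_lsn=0xF9A1675FD8, use_workaround=False))  # False -> full resync
print(can_resume(offset, confirmed_flush_lsn=0xF9A1675FD8, use_workaround=True))   # True  -> incremental
```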

Hello @shrodingers ,

that’s great to hear!

Actually, what you say makes a lot of sense! Just to clarify: does the Debezium version used by the Postgres connector in versions >1.0.7 contain the fix, or do we have to wait for a future upgrade?

Well, it so happens that I hadn’t checked the Postgres connector’s docs this week: this should be resolved in connector version 1.0.13 and above (the issue was fixed in Debezium release 1.9.4, and that connector version bumps Debezium to 1.9.6). So versions >= 1.0.13 should fix this issue @gianniskod


Wow, this is incredibly helpful, thank you! When you are both able to try the newer Postgres source connector (currently at v1.0.14), please reply in the GitHub issue to let everybody know whether it helped or things are still not working.

Hello, @shrodingers and @sh4sh ,

just a minor update on our use case. After upgrading the connector to v1.0.14 on the same connection we had before, we are facing a new problem.

Practically, the connection seems not to work at all, since it is unable to read any changes, and as a result every sync brings 0 changes. We tried to:

  1. Recreate the replication and publication.
  2. Full resync the data in the connection.
  3. Keep the connection running on a daily basis for the last 4 days.

In all cases we got the same result: 0 changes. From the logs, I believe the most concerning part is:

2022-10-16 06:02:49 source > First LSN 'LSN{F9/A1675FD8}' received
2022-10-16 06:02:49 source > WAL resume position 'LSN{F9/A1675FD8}' discovered
2022-10-16 06:02:49 source > Connection gracefully closed
2022-10-16 06:02:49 source > Connection gracefully closed
2022-10-16 06:02:49 source > Initializing PgOutput logical decoder publication
2022-10-16 06:02:49 source > Requested thread factory for connector PostgresConnector, id = cms named = keep-alive
2022-10-16 06:02:49 source > Creating thread debezium-postgresconnector-cms-keep-alive
2022-10-16 06:02:49 source > Processing messages
2022-10-16 06:05:21 source > Message with LSN 'LSN{F9/A1675FD8}' arrived, switching off the filtering
2022-10-16 06:06:01 source > Received 10001 events which were all filtered out, so no offset could be committed. This prevents the replication slot from acknowledging the processed WAL offsets, causing a growing backlog of non-removeable WAL segments on the database server. Consider to either adjust your filter configuration or enable heartbeat events (via the heartbeat.interval.ms option) to avoid this situation.
[the same warning is repeated 10 more times between 06:06:26 and 06:15:31]

These logs indicate that the connector, despite receiving change events, is somehow unable to process them, and as a result zero changes are applied.
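
One workaround pattern the Debezium warning hints at, which addresses the offset-acknowledgement problem rather than the missing rows themselves, is to keep a tiny heartbeat table in the publication (and in the Airbyte catalog) and touch it regularly, so that at least one event per batch is not filtered out. A hedged sketch, assuming psycopg2; the table name and credentials are placeholders, and this is not something the thread confirms as a fix:

```python
import psycopg2

conn = psycopg2.connect("host=<aurora-endpoint> dbname=<db> user=<rw-user> password=<pw>")
with conn, conn.cursor() as cur:
    # One-time setup (the table must also be added to the publication and to the
    # connection's catalog): a single-row table that gets touched on a schedule.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS airbyte_heartbeat ("
        "  id int PRIMARY KEY, beat_at timestamptz NOT NULL)"
    )
    # Run this upsert periodically (e.g. every few minutes via cron or Airflow) so the
    # slot always has at least one event the connector does not filter out.
    cur.execute(
        "INSERT INTO airbyte_heartbeat (id, beat_at) VALUES (1, now()) "
        "ON CONFLICT (id) DO UPDATE SET beat_at = EXCLUDED.beat_at"
    )
```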

I am not sure whether it’s related to Source postgres CDC state messages missing · Issue #16991 · airbytehq/airbyte · GitHub, which @shrodingers posted before, but the upgrade to the newer version didn’t fix anything.

I have attached the full log export for further investigation.
zero_changes_incident_sync_16_10.txt (175.0 KB)

P.S.: Our current workaround is, again, to create a fresh new connection and hope that this fixes the issue.

That doesn’t seem to be related to my issue, since this one seems to be directly tied to Debezium here… But it is indeed weird, since it behaves as if the tables were not in the include list of the Debezium connector (the WAL is written, but the connector can’t seem to acknowledge it for some reason…). Do you see the same behaviour when setting REPLICA IDENTITY to FULL on your tables? (It’s merely a guess about the cause, not a certainty.)

We have not tested such a case. Currently, we have REPLICA IDENTITY set to DEFAULT, which is the suggested way to distinguish rows, since it tracks the primary key. I would expect that to be enough.
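
For completeness, here is a small sketch for checking (and, if we decide to test the suggestion, changing) the replica identity of the two synced tables (assuming psycopg2; table names and credentials are placeholders):

```python
import psycopg2

TABLES = ["public.table_a", "public.table_b"]  # placeholders for the 2 CDC tables

conn = psycopg2.connect("host=<aurora-endpoint> dbname=<db> user=<admin-user> password=<pw>")
with conn, conn.cursor() as cur:
    for table in TABLES:
        # relreplident: 'd' = DEFAULT (primary key), 'f' = FULL, 'n' = NOTHING, 'i' = USING INDEX
        cur.execute("SELECT relreplident FROM pg_class WHERE oid = %s::regclass", (table,))
        print(table, cur.fetchone()[0])
        # Uncomment to test the REPLICA IDENTITY FULL suggestion:
        # cur.execute(f"ALTER TABLE {table} REPLICA IDENTITY FULL")
```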