Summary
The user is inquiring about the behavior of the CDC mechanism in the MongoDB connector, specifically regarding the handling of deleted records from the source.
Question
Hello all,
We recently switched to the latest MongoDB connector and because of our volume of data, CDC is the only ingestion mode suitable.
However, I’m surprised to read that the CDC mechanism doesn’t yet flag deleted records from source and just deletes them in the destination. I thought that was one of the main benefit of CDC.
Am I missing anything (set up, config, etc.) or is this correct ?
Based on the information provided in the knowledge sources, Airbyte currently removes deleted rows from the final table in the destination as part of the deduplication process for CDC database sources. This means that if a row is deleted in your source (like MongoDB), it will not be present in the final table in your destination (like Redshift).
This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want
to access the original thread.
Join the conversation on Slack
["mongodb-connector", "cdc-mechanism", "deleted-records", "ingestion-mode", "deduplication"]
is the quoted answer from http://kapa.ai|kapa.ai?
We use CDC incremental append to sync from MySQL/Postgres, and with that we keep track of all changes, including deletions.
Thanks for your answer. Yes correct, this is the answer I received from http://kapa.ai|kapa.ai.
We currently use the incremental sync mode “append + deduped”, and I noticed in the log that Airbyte CDC deletes deleted record from source :
46004
2024-07-11 14:09:01 destination > INFO type-and-dedupe i.a.c.d.j.JdbcDatabase(executeWithinTransaction$lambda$1):48 done executing query within transaction: delete from "raw_xxxx"."xxxx_xxxx" where "_ab_cdc_deleted_at" is not null;```
Maybe if we switch to sync mode "append" (without deduped), we would have the deleted records with "_ab_cdc_deleted_at" not null ? But without deduped, I'm worried that we'll end up with a lot of dupplicated records.
Maybe if we switch to sync mode “append” (without deduped), we would have the deleted records with “_ab_cdc_deleted_at” not null ?
exactly, this is what we get
But without deduped, I’m worried that we’ll end up with a lot of dupplicated records.
there are some duplicated and historical records, but in our case, we can deduplicate them according to our needs with a transformation tool (eg. dbt). For us, it’s important to keep track of the changes vs. getting snapshots.
So when we want the current state of the table, we deduplicate and only keep the latest state of each row with something like partition by primary_key(s) order by _ab_cdc_log_file desc, _ab_cdc_log_pos desc, _ab_cdc_updated_at desc… might this also work for you…?
Thank you very much !
> might this also work for you…?
I’m not sure that would work for us, given that the scope of data is large with very frequent changes and at the same time hard deleted records in source should be quite rare. We’ll investigate, thank you very much for your feedbacks !