Question about Incremental Append + Dedup

slack-user-airbyte · May 14, 2024, 6:24pm

Summary

An Airbyte user is experiencing VM crashes when syncing from MongoDB to Snowflake using V2 connectors with CDC-driven sync. They want to know how Airbyte handles deduplication for a stream and whether it loads all data into memory to look up duplicates.

Question

Question about Incremental Append + Dedup.

TL;DR: How does Airbyte do deduping for a stream? Does it load all the data into memory to look up duplicates?

I have a connection set up to sync from a MongoDB collection to Snowflake. I keep running into a problem where the VM I am running on crashes. This specific connection is the cause: all my other connections work fine, and the VM never crashes when I disable this one.

I am using V2 connectors (thus CDC-driven sync).

How does dedup work in this mode? From the logs I see that many more GBs of data are being “loaded” than the source actually contains. How does incremental append + dedup differ (in the background) from the non-deduping incremental append mode? If deduping loads more data into memory (or does whatever else might be causing the VM to crash), a simple fix for me would be to not dedup and just append.

This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here.


Tags: incremental-append, dedup, mongodb-connector, snowflake-connector, v2-connectors, cdc-driven-sync

slack-user-airbyte · May 14, 2024, 6:52pm

Dedup usually happens on the data already in your destination. Airbyte uses dbt (the deprecated method) or views/queries to do it. You can read more here: https://docs.airbyte.com/release_notes/upgrading_to_destinations_v2/#what-is-destinations-v2
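
To make the "dedup runs as a query in the destination" idea concrete, here is a minimal sketch. It is not Airbyte's actual SQL: SQLite stands in for Snowflake, and the table `raw_events`, its columns, and the sample rows are all hypothetical. It only illustrates how append keeps every raw row, while append + dedup keeps the latest row per primary key, with the database engine doing the work.

```python
import sqlite3

# Hypothetical raw records: (primary key, payload, cursor value).
RAW_ROWS = [
    (1, "alice-v1", 100),
    (2, "bob-v1", 100),
    (1, "alice-v2", 200),
    (2, "bob-v2", 150),
    (3, "carol-v1", 120),
]


def load_raw(conn):
    """Create the hypothetical raw table and append every record to it."""
    conn.execute(
        "CREATE TABLE raw_events (id INTEGER, payload TEXT, updated_at INTEGER)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", RAW_ROWS)


def append_only(conn):
    """Append mode: every version of every record stays in the table."""
    return conn.execute(
        "SELECT id, payload, updated_at FROM raw_events ORDER BY id, updated_at"
    ).fetchall()


def append_dedup(conn):
    """Append + dedup: a query keeps only the newest row per primary key."""
    query = """
        SELECT id, payload, updated_at
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY id ORDER BY updated_at DESC
                   ) AS rn
            FROM raw_events
        )
        WHERE rn = 1
        ORDER BY id
    """
    return conn.execute(query).fetchall()


# SQLite is only a stand-in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
load_raw(conn)

print(len(append_only(conn)))  # 5
print(append_dedup(conn))  # [(1, 'alice-v2', 200), (2, 'bob-v2', 150), (3, 'carol-v1', 120)]
```

In this sketch the caller only sends SQL and reads back the result. Ranking and filtering the rows is done by the database engine.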

slack-user-airbyte · May 14, 2024, 6:52pm

Hi,
Thanks for the reply. I was half expecting that.
Though you say “usually”, does that mean there are exceptions to this?

slack-user-airbyte · May 14, 2024, 6:52pm

Sorry, it always happens in the destination.
