Question about Incremental Append + Dedup


An Airbyte user is experiencing VM crashes when syncing from MongoDB to Snowflake using V2 connectors with CDC-driven sync. They want to know how Airbyte handles deduplication for a stream and whether it loads all data into memory to look up duplicates.


TL;DR: How does Airbyte dedupe a stream? Does it load all the data into memory to look up duplicates?

I have a connection set up to sync from a MongoDB collection to Snowflake. The VM I am running on keeps crashing, and this specific connection seems to be the cause: all my other connections work fine, and the VM never crashes when I disable this one.

I am using V2 connectors (thus CDC-driven sync).

I am wondering how the dedup in this mode works. From the logs I see that many more GBs of data are being “loaded” than are actually contained in the source. How does incremental append with dedup differ (in the background) from the non-deduping incremental append mode? If deduping loads more data into memory (or does whatever else might be causing the VM to crash), a simple fix for me would be to skip the dedup and just append.

This topic has been created from a Slack thread to give it more visibility.

["incremental-append", "dedup", "mongodb-connector", "snowflake-connector", "v2-connectors", "cdc-driven-sync"]

Dedup usually happens on data that is already in your destination, and Airbyte uses dbt (a deprecated method) or views/queries to dedup it. You can read more here:

Hi,
Thanks for the reply. I was half expecting that.
You say “usually”, though. Does that mean there are exceptions to this?

Sorry, I should have said it always happens in the destination.
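
For anyone reading later, here is a minimal sketch of what "dedup with a query in the destination" can look like. It is not Airbyte's actual SQL: the table `raw_orders`, its columns, and the use of `updated_at` as the cursor are all hypothetical, and SQLite stands in for Snowflake so the snippet runs anywhere. The point is only that the raw table keeps every appended record, while the deduped result is computed by the database's own query engine rather than by holding all rows in the sync process's memory.

```python
import sqlite3

# Hypothetical raw table: every synced record is appended as-is,
# including multiple versions of the same primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "new", 100), (2, "new", 101), (1, "paid", 105), (1, "shipped", 110)],
)

# Illustrative dedup query (an assumption, not Airbyte's real statement):
# keep only the most recent row per primary key, ranked by the cursor column.
DEDUP_SQL = """
SELECT id, status, updated_at
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM raw_orders
)
WHERE rn = 1
ORDER BY id
"""

appended = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
deduped = conn.execute(DEDUP_SQL).fetchall()

print(appended)  # 4 rows in the append-only table
print(deduped)   # [(1, 'shipped', 110), (2, 'new', 101)]
```

In this sketch, plain incremental append corresponds to stopping at `raw_orders`, and the dedup variant adds the extra query on top of it. Whether that extra step explains a crash on the VM depends on where the query actually runs, which the snippet does not model.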