Data Sync Issue with Airbyte on AWS EC2

Summary

The user is experiencing a data sync issue with Airbyte on AWS EC2: the number of records synced and the volume of data loaded double each day, and the sync duration grows with them. The numbers reported by Airbyte do not match the actual data in Snowflake, which stays at the expected row count.


Question

Hello everyone, I’m using Airbyte on AWS EC2 to replicate data from MongoDB to Snowflake.
I run a daily sync at 12 AM using the Incremental - Append + Deduped mode. When I look at the completed syncs, the number of records synced and the amount of data loaded double every day compared with the previous day. The sync also takes twice as long as the one before.
For example, I have a collection with 600K rows. On the first day, I get 600K records, 700MB and a 5-minute sync. The day after, I get 1.2M records, 1.4GB and a 10-minute sync. The curious thing is that when I look at my Snowflake table, I still have 600K rows; only the numbers in Airbyte seem off.
I looked at the logs, and Airbyte really does seem to read 600K records on the first day, 1.2M on the second day and 2.4M on the third day, even though the collection only contains 600K records.
Has anyone encountered this before?



This topic has been created from a Slack thread to give it more visibility.

["data-sync-issue", "airbyte-aws-ec2", "mongodb-snowflake", "incremental-sync", "sync-duration", "data-discrepancy"]

It looks like the sync pulls the entire set of records even though your replication mode is incremental. I would check whether the cursor columns used to identify incremental changes are correct.
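
To make the cursor suggestion concrete, here is a minimal sketch of how a cursor-based incremental read is expected to behave. It is a simplified illustration, not Airbyte's actual implementation; the function `incremental_read` and the `updated_at` cursor field are hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta


def incremental_read(records, cursor_field, state):
    """Return the records newer than the saved cursor value, plus the new state.

    When `state` is None (no cursor saved yet), every record is returned,
    which is what a full read looks like.
    """
    if state is None:
        selected = list(records)
    else:
        selected = [r for r in records if r[cursor_field] > state]
    # Advance the cursor to the largest value seen; keep the old one if nothing was read.
    new_state = max((r[cursor_field] for r in selected), default=state)
    return selected, new_state


base = datetime(2023, 1, 1)
# Hypothetical collection: 5 documents with an `updated_at` cursor field.
collection = [{"_id": i, "updated_at": base + timedelta(seconds=i)} for i in range(5)]

# First sync: no saved state, so all documents are read.
first_batch, state = incremental_read(collection, "updated_at", None)

# One new document arrives before the second sync.
collection.append({"_id": 5, "updated_at": base + timedelta(days=1)})

# Second sync: only the document past the saved cursor is read.
second_batch, state = incremental_read(collection, "updated_at", state)
```

If the cursor field is missing, never saved, or does not increase with updates, the filter in the second sync has nothing reliable to compare against, and the sync reads far more than the changed documents.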

Thanks for the answer. It doesn’t seem to pull just the entire set of records, though: the number of records read doubles every day.
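
One possible way to reconcile the numbers in this thread is that deduplication collapses records sharing a primary key, so the final table can stay at the same size no matter how many copies are read. The sketch below illustrates that idea only; it assumes `_id` is the primary key, uses a hypothetical `append_and_dedup` helper, and does not reproduce Airbyte's internals or explain why the extra records are read in the first place.

```python
def append_and_dedup(final_table, emitted_records, primary_key="_id"):
    """Upsert emitted records into `final_table`, keyed by primary key (last one wins)."""
    for record in emitted_records:
        final_table[record[primary_key]] = record
    return final_table


# Hypothetical collection of 6 documents standing in for the 600K-row collection.
collection = [{"_id": i, "value": i * 10} for i in range(6)]
final_table = {}

# Day 1: every document is emitted once.
day1 = list(collection)
append_and_dedup(final_table, day1)
rows_after_day1 = len(final_table)

# Day 2: assume, as in the logs described above, that twice as many records are read.
day2 = collection + collection
append_and_dedup(final_table, day2)
rows_after_day2 = len(final_table)

print(len(day1), rows_after_day1)  # records read vs. rows in the final table, day 1
print(len(day2), rows_after_day2)  # records read vs. rows in the final table, day 2
```

The count of records read grows from day to day while the deduplicated table keeps one row per `_id`, which matches the symptom of inflated Airbyte counters next to a stable Snowflake row count.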