Sync from Salesforce produced duplicate records

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Container-Optimized OS (COS) on GCP
  • Memory / Disk: 4 GB / 1 TB
  • Deployment: Docker
  • Airbyte Version: 0.36.3-alpha
  • Source name/version: Salesforce@1.0.10
  • Destination name/version: GCS@0.2.9
  • Step: The issue is happening during sync (Full Refresh | Append)
  • Description:

I recently synced data from a Salesforce instance to my GCS bucket. The sync was successful, and three shards were created (2022_08_18_1660843157436_0.csv, 2022_08_18_1660843157436_1.csv, and 2022_08_18_1660843157436_2.csv).

Later on I found two identical records (except for the “_airbyte_ab_id” and “_airbyte_emitted_at” columns) in the shard 2022_08_18_1660843157436_0.csv.

I queried the Salesforce instance with the Id of that record and it returned only one row.
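
In case it helps anyone reproduce the check, this is roughly what I ran: a minimal Python sketch that scans the downloaded shards for rows that are identical once the Airbyte metadata columns are ignored (the filenames and column names are the ones from my sync):

```python
import csv
from collections import Counter

# Shard files downloaded locally from the GCS bucket (names from my sync).
SHARDS = [
    "2022_08_18_1660843157436_0.csv",
    "2022_08_18_1660843157436_1.csv",
    "2022_08_18_1660843157436_2.csv",
]
# Columns Airbyte adds per record; they differ even for duplicated payloads.
METADATA_COLUMNS = {"_airbyte_ab_id", "_airbyte_emitted_at"}

counts = Counter()
for shard in SHARDS:
    with open(shard, newline="") as f:
        for row in csv.DictReader(f):
            # Key each row by its payload only, ignoring Airbyte metadata.
            key = tuple(
                (col, val)
                for col, val in sorted(row.items())
                if col not in METADATA_COLUMNS
            )
            counts[key] += 1

for key, n in counts.items():
    if n > 1:
        print(f"{n}x duplicate: {dict(key)}")
```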

I read this post, but it's different from my case:

  • The OP was using incremental sync, while I was using Full Refresh | Append.
  • The OP found errors in the logs, but I read through my logs and found no errors.

Can you help me understand what could cause this issue?

Airbyte guarantees at-least-once delivery of records, so duplicates can happen in raw tables. This is unlikely with Full Refresh | Overwrite, but it can happen with Full Refresh | Append. Can you confirm both records are completely identical? If you hash the JSON from the _airbyte_data column, do both produce the same hash?
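
A minimal sketch of that check in Python, assuming the CSV output keeps the payload in an _airbyte_data column as a JSON string (shard name taken from this thread):

```python
import csv
import hashlib
from collections import defaultdict

by_digest = defaultdict(list)
with open("2022_08_18_1660843157436_0.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Hash the raw JSON payload; identical payloads hash identically.
        digest = hashlib.sha256(row["_airbyte_data"].encode("utf-8")).hexdigest()
        by_digest[digest].append(row["_airbyte_ab_id"])

for digest, ids in by_digest.items():
    if len(ids) > 1:
        print(f"payload hash {digest[:12]} shared by records {ids}")
```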

Thanks for the reply @marcosmarxm. Yes, they are completely identical and they return the same hash. Just to confirm: can duplicates happen within the same sync session?

Because Salesforce is a complex connector I need to investigate further, but in theory this shouldn't happen, except perhaps if the source provider (Salesforce) sends the duplicate to Airbyte. I say this because Airbyte receives the payload, extracts the records (in summary, it does a for-loop over the payload response), and sends each record to the destination.
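
Roughly this pattern, as an illustration only (this is not the actual Salesforce connector code):

```python
from typing import Any, Dict, Iterable, List

# Illustration of the pass-through described above, not real connector code:
# each record in each response page is forwarded exactly as received, so a
# duplicate coming from the source becomes a duplicate in the destination.
def read_records(pages: Iterable[List[Dict[str, Any]]]) -> Iterable[Dict[str, Any]]:
    for page in pages:        # one page per API response from the source
        for record in page:   # the "for-loop from the payload response"
            yield record      # nothing here deduplicates records
```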

For what stream/endpoint is this happening?
(As a test) can you create another connection, run it to a different destination table, and see whether the duplicate record repeats?

As you said, you didn't find anything in the logs, so we need to try some alternatives to isolate the problem.

The same thing is happening to me.

In my case:
The source is Salesforce 1.0.27
The destination is BigQuery 1.1.14
And the sync mode is Full Refresh | Overwrite

I’ve tried changing the destination to another BigQuery dataset, but I’m getting the same problem.
I’ve also hashed _airbyte_data in the raw table, and the duplicate rows are identical.
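
For reference, my hash check looked roughly like this (a sketch with the google-cloud-bigquery client; the table name is a placeholder for my raw table):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table name; substitute your own raw table.
query = """
SELECT
  TO_HEX(MD5(_airbyte_data)) AS payload_hash,
  COUNT(*) AS n
FROM `my_project.my_dataset._airbyte_raw_my_stream`
GROUP BY payload_hash
HAVING n > 1
"""

for row in client.query(query):
    print(f"{row.n} rows share payload hash {row.payload_hash}")
```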