[source-amazon-ads] Incremental Deduped + History sync duplication

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: Debian 10 Buster
  • Memory / Disk: 7.5GB / 30GB
  • Deployment: Docker
  • Airbyte Version: 0.40.9
  • Source name/version: Amazon Ads 0.1.22
  • Destination name/version: BigQuery 1.2.4
  • Step: During/After Sync.
  • Description:

For the stream named “sponsored_products_report_stream”, I have set the sync mode to “Incremental Deduped + history” with a daily sync schedule. However, looking at the output shows duplication occurring in the _airbyte_raw destination table. The _airbyte_data field within each record of the _airbyte_raw table has the following structure:

{
    "reportDate": STRING,
    "profileId": NUMBER,
    "recordType": STRING,
    "updatedAt": STRING,
    "metric": OBJECT
}

I have set up a materialized view within BigQuery to normalize this object and the metric property into a single table with individual columns for each field. The query used for this normalization has been attached:
normalization.txt (6.4 KB)
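
For reference, the gist of the flattening looks something like the sketch below (this is not the attached file, just an illustration; the project and dataset names are placeholders):

-- Rough sketch only: flatten _airbyte_data from the raw table into columns.
-- `my-project.my_dataset` is a placeholder for the actual BigQuery location.
SELECT
  JSON_EXTRACT_SCALAR(_airbyte_data, '$.reportDate')               AS report_date,
  CAST(JSON_EXTRACT_SCALAR(_airbyte_data, '$.profileId') AS INT64) AS profile_id,
  JSON_EXTRACT_SCALAR(_airbyte_data, '$.recordType')               AS record_type,
  JSON_EXTRACT_SCALAR(_airbyte_data, '$.updatedAt')                AS updated_at,
  JSON_EXTRACT(_airbyte_data, '$.metric')                          AS metric,  -- still a JSON string
  _airbyte_emitted_at
FROM `my-project.my_dataset._airbyte_raw_sponsored_products_report_stream`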

Querying this materialized view for a specific record type on a specific date (e.g. report_date = “2022-10-09” AND record_type = “campaigns”) returns two rows for each “campaign”, each emitted by a sync on a different date.
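
A quick check along the lines of the sketch below confirms the duplication; the table name is a placeholder for the materialized view above, and pulling campaignId out of “metric” is my assumption about the stream’s identifier:

-- Rough check for duplicates on one date/record type (placeholder names).
SELECT
  JSON_EXTRACT_SCALAR(metric, '$.campaignId') AS campaign_id,
  COUNT(*) AS copies
FROM `my-project.my_dataset.sponsored_products_normalized`
WHERE report_date = '2022-10-09'
  AND record_type = 'campaigns'
GROUP BY campaign_id
HAVING COUNT(*) > 1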

By my understanding, the “Incremental Deduped + history” sync mode should update the original records, refreshing the “updatedAt” field within _airbyte_data; instead, it just seems to add duplicate records on the next sync without touching the old ones. The duplication compounds with every sync: for today’s date I see 1 record (correct, since only one sync has occurred today), for yesterday there are 2 records (2 syncs), for 2 days ago there are 3 records (3 syncs), and so on.
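
As a stopgap I can deduplicate in the view myself by keeping only the most recently emitted copy of each record, roughly as sketched below, but my expectation was that the sync mode would handle this. The partition key here is my guess at the stream’s natural key, not something confirmed by the connector docs:

-- Rough workaround sketch: keep only the latest emitted copy per assumed key.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY report_date, profile_id, record_type,
                   JSON_EXTRACT_SCALAR(metric, '$.campaignId')
      ORDER BY _airbyte_emitted_at DESC
    ) AS rn
  FROM `my-project.my_dataset.sponsored_products_normalized`
)
WHERE rn = 1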

I have only tested this on “sponsored_products_report_stream”, but I imagine the same is occurring across all report streams for this source, since they all follow the same data structure.

Hello there! You are receiving this message because none of your fellow community members has stepped in to respond to your topic post. (If you are a community member and you are reading this response, feel free to jump in if you have the answer!) As a result, the Community Assistance Team has been made aware of this topic and will be investigating and responding as quickly as possible.
Some important considerations that will help you get your issue solved faster:

  • It is best to use our topic creation template; if you haven’t yet, we recommend posting a follow-up with the requested information. With that information, the team will be able to search for similar connector and platform issues and troubleshoot your specific question or problem more quickly.
  • Make sure to upload the complete log file; a common investigation roadblock is that sometimes the error for the issue happens well before the problem is surfaced to the user, and so having the tail of the log is less useful than having the whole log to scan through.
  • Be as descriptive and specific as possible; when investigating it is extremely valuable to know what steps were taken to encounter the issue, what version of connector / platform / Java / Python / docker / k8s was used, etc. The more context supplied, the quicker the investigation can start on your topic and the faster we can drive towards an answer.
  • We in the Community Assistance Team are glad you’ve made yourself part of our community, and we’ll do our best to answer your questions and resolve the problems as quickly as possible. Expect to hear from a specific team member as soon as possible.

Thank you for your time and attention.
Best,
The Community Assistance Team

Hi @ale, thank you for your patience, I was out of the office yesterday. Let me look into this and I hope to have some ideas for you soon!


Hi @natalyjazzviolin, did you manage to get anywhere with this?

Yes - I have a few questions for you!

  1. What are you using as the cursor field?

  2. Have you made changes to the schema? If so, you’ll need to run a full refresh-overwrite sync as the incremental sync mode is not able to handle source schema changes yet.

Here’s the doc on the known limitations of this sync mode:
https://docs.airbyte.com/understanding-airbyte/connections/incremental-deduped-history/#known-limitations

Thanks!

  1. We haven’t changed this, so the default cursor is likely the one in use. When I select the drop-down to the left of the stream, the “cursor field” column is blank and there doesn’t seem to be any way for me to adjust it.

  2. No schema changes have been made. There also doesn’t seem to be any way for me to adjust this.

OK, thanks for that info! Could you please send me the complete logs? If we don’t find anything wrong in them, I’ll escalate this to GitHub!

Hi @natalyjazzviolin, apologies for the delay, I was OOO.

Please find the most recent sync log attached. I have double-checked, and this specific sync also led to duplication. Note that this sync also includes the replication of two other streams (sponsored_brands_report_stream and sponsored_brands_video_report_stream).

fecfc023_e2b2_4c0f_806a_3adf7767b53e_logs_26_txt.txt (764.3 KB)

Hi @natalyjazzviolin, hope you’re well! I was wondering if you’ve had a chance to look into the above? Thanks!!

Hi @natalyjazzviolin – Will this issue be addressed any further?

Hi! I apologize for the delay - we had a lot of inquiries for Hacktoberfest, but now should be getting back to the normal rhythm of things! I’ve made a GitHub issue for your ticket here:
https://github.com/airbytehq/airbyte/issues/18905

It’s been triaged to the correct team and I’ll inquire to see who can follow up on it!