[source-amazon-ads] Incremental Deduped + History sync duplication

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: Debian 10 Buster
  • Memory / Disk: 7.5GB / 30GB
  • Deployment: Docker
  • Airbyte Version: 0.40.9
  • Source name/version: Amazon Ads 0.1.22
  • Destination name/version: BigQuery 1.2.4
  • Step: During/After Sync.
  • Description:

For the stream named “sponsored_products_report_stream”, I have set the sync mode to “Incremental Deduped + history” with a daily sync schedule. However, looking at the output shows duplication occurring in the _airbyte_raw destination table. The _airbyte_data field within each record of the _airbyte_raw table has the following structure:

{
    "reportDate": STRING,
    "profileId": NUMBER,
    "recordType": STRING,
    "updatedAt": STRING,
    "metric": OBJECT
}

I have set up a materialized view within BigQuery to normalize this object and the metric property into a single table with individual columns for each field. The query used for this normalization has been attached:
normalization.txt (6.4 KB)
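
For reference, the gist of the flattening looks something like the sketch below (this is not the attached file, just an illustration; the project and dataset names are placeholders):

-- Rough sketch only: flatten _airbyte_data from the raw table into columns.
-- `my-project.my_dataset` is a placeholder for the actual BigQuery location.
SELECT
  JSON_EXTRACT_SCALAR(_airbyte_data, '$.reportDate')               AS report_date,
  CAST(JSON_EXTRACT_SCALAR(_airbyte_data, '$.profileId') AS INT64) AS profile_id,
  JSON_EXTRACT_SCALAR(_airbyte_data, '$.recordType')               AS record_type,
  JSON_EXTRACT_SCALAR(_airbyte_data, '$.updatedAt')                AS updated_at,
  JSON_EXTRACT(_airbyte_data, '$.metric')                          AS metric,  -- still a JSON string
  _airbyte_emitted_at
FROM `my-project.my_dataset._airbyte_raw_sponsored_products_report_stream`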

Querying this materialized view for a specific record type on a specific date (e.g. report_date = “2022-10-09” AND record_type = “campaigns”) returns two rows for each “campaign”, each emitted by a sync on a different date.
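
A quick check along the lines of the sketch below confirms the duplication; the table name is a placeholder for the materialized view above, and pulling campaignId out of “metric” is my assumption about the stream’s identifier:

-- Rough check for duplicates on one date/record type (placeholder names).
SELECT
  JSON_EXTRACT_SCALAR(metric, '$.campaignId') AS campaign_id,
  COUNT(*) AS copies
FROM `my-project.my_dataset.sponsored_products_normalized`
WHERE report_date = '2022-10-09'
  AND record_type = 'campaigns'
GROUP BY campaign_id
HAVING COUNT(*) > 1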

By my understanding, the “Incremental Deduped + history” sync mode should update the original records, refreshing the “updatedAt” field within _airbyte_data; instead, it just seems to add duplicate records on the next sync without touching the old ones. The duplication compounds with every sync: for today’s date I see 1 record (correct, since only one sync has occurred today), for yesterday there are 2 records (2 syncs), for 2 days ago there are 3 records (3 syncs), and so on.
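
As a stopgap I can deduplicate in the view myself by keeping only the most recently emitted copy of each record, roughly as sketched below, but my expectation was that the sync mode would handle this. The partition key here is my guess at the stream’s natural key, not something confirmed by the connector docs:

-- Rough workaround sketch: keep only the latest emitted copy per assumed key.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY report_date, profile_id, record_type,
                   JSON_EXTRACT_SCALAR(metric, '$.campaignId')
      ORDER BY _airbyte_emitted_at DESC
    ) AS rn
  FROM `my-project.my_dataset.sponsored_products_normalized`
)
WHERE rn = 1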

I have only tested this on “sponsored_products_report_stream”, but I imagine the same is occurring across all report streams for this source, since they all follow the same data structure.

Hello there! You are receiving this message because none of your fellow community members has stepped in to respond to your topic post. (If you are a community member and you are reading this response, feel free to jump in if you have the answer!) As a result, the Community Assistance Team has been made aware of this topic and will be investigating and responding as quickly as possible.
Some important considerations that will help you get your issue solved faster:

  • It is best to use our topic creation template; if you haven’t yet, we recommend posting a follow-up with the requested information. With that information, the team will be able to search for similar connector and platform issues and troubleshoot your specific question or problem more quickly.
  • Make sure to upload the complete log file; a common investigation roadblock is that sometimes the error for the issue happens well before the problem is surfaced to the user, and so having the tail of the log is less useful than having the whole log to scan through.
  • Be as descriptive and specific as possible; when investigating it is extremely valuable to know what steps were taken to encounter the issue, what version of connector / platform / Java / Python / docker / k8s was used, etc. The more context supplied, the quicker the investigation can start on your topic and the faster we can drive towards an answer.
  • We in the Community Assistance Team are glad you’ve made yourself part of our community, and we’ll do our best to answer your questions and resolve the problems as quickly as possible. Expect to hear from a specific team member as soon as possible.

Thank you for your time and attention.
Best,
The Community Assistance Team

Hi @ale, thank you for your patience, I was out of the office yesterday. Let me look into this and I hope to have some ideas for you soon!


Hi @natalyjazzviolin, did you manage to get anywhere with this?

Yes - I have a few questions for you!

  1. What are you using as the cursor field?

  2. Have you made changes to the schema? If so, you’ll need to run a full refresh-overwrite sync as the incremental sync mode is not able to handle source schema changes yet.

Here’s the doc on the known limitations of this sync mode:
https://docs.airbyte.com/understanding-airbyte/connections/incremental-deduped-history/#known-limitations

Thanks!

  1. We haven’t changed this, so the default cursor is likely the one in use. When I select the drop-down to the left of the stream, the “cursor field” column is blank and there doesn’t seem to be any way for me to adjust it.

  2. No schema changes have been made. There also doesn’t seem to be any way for me to adjust this.

OK, thanks for that info! Could you please send me the complete logs? If we don’t find anything wrong in them, I’ll escalate this to GitHub!

Hi @natalyjazzviolin, apologies for the delay, I was OOO.

Please find the most recent sync log attached. I have double-checked, and this specific sync also led to duplication. Note that this sync also includes the replication of two other streams (sponsored_brands_report_stream and sponsored_brands_video_report_stream).

fecfc023_e2b2_4c0f_806a_3adf7767b53e_logs_26_txt.txt (764.3 KB)

Hi @natalyjazzviolin, hope you’re well! I was wondering if you’ve had a chance to look into the above? Thanks!!

Hi @natalyjazzviolin – Will this issue be addressed any further?

Hi! I apologize for the delay - we had a lot of inquiries for Hacktoberfest, but now should be getting back to the normal rhythm of things! I’ve made a GitHub issue for your ticket here:
https://github.com/airbytehq/airbyte/issues/18905

It’s been triaged to the correct team and I’ll inquire to see who can follow up on it!