Duplicate records in incremental sync parquet files after full refresh

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu VM
  • Deployment: Kubernetes
  • Airbyte Version: 0.39.3-alpha
  • Source name/version: Salesforce 1.0.2
  • Destination name/version: S3 0.2.13
  • Step: The issue is happening during sync before parquet files are created on S3
  • Description: Hi team, we have a situation where we reset an Airbyte connection programmatically using an Argo Workflows template. The result is a cleared state in the Airbyte DB for that connection and the removal of all corresponding files on S3, which is effectively a full refresh.

So far so good. After those files are deleted and the external table built on top of them on S3 is dropped, we sync this connection again. With a cleared state, the first sync acts as a full refresh, so we obtain all records currently present in our Salesforce source. That also works fine: we can compare the result set in the parquet files with what we see in Salesforce, and it all checks out.
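
For reference, the reset step is basically a call to Airbyte's connection reset endpoint followed by an S3 cleanup. This is a simplified sketch of what our Argo template runs; the host, connection ID, bucket, and prefix are placeholders, and the exact endpoint path may differ between Airbyte versions:

import boto3
import requests

AIRBYTE_URL = "http://airbyte-server:8001/api/v1"              # placeholder host/port
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"          # placeholder connection ID

# Clear the connection state in the Airbyte DB (assumes the open-source
# Config API's /connections/reset endpoint; adjust for your version).
resp = requests.post(f"{AIRBYTE_URL}/connections/reset",
                     json={"connectionId": CONNECTION_ID})
resp.raise_for_status()

# Remove all previously synced files so the next sync is effectively a full refresh.
s3 = boto3.resource("s3")
bucket = s3.Bucket("my-airbyte-bucket")                         # placeholder bucket
bucket.objects.filter(Prefix="salesforce/").delete()            # placeholder prefix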

Now this is where the problems start for us. We have an Argo workflow scheduled to run hourly, and it always syncs this connection in incremental mode because that is what is set in the config file for this connection. Since these are incremental syncs, we expect only updated records to be synced. However, what we're noticing is that for a very small number of records a parquet file is created that contains a record already synced in the full refresh after the connection was reset, and it is identical to the one already synced; even the SystemModstamp is the same, which should never be the case for any record. So, for example, across 10 subsequent incremental syncs, some records constantly reoccur in every sync, yet they are identical to every previous run all the way back to the original full-refresh result set. Please advise!
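
To show how we spot the duplicates, this is roughly the check we run against the synced files. It's a simplified sketch assuming pandas with pyarrow and a local copy of the parquet files; the path is illustrative, Id is Salesforce's primary key column, and SystemModstamp is the cursor field mentioned above:

import glob
import pandas as pd  # needs pyarrow (or fastparquet) installed for read_parquet

# Read every parquet file produced by the full refresh plus the incremental runs.
# The path is a placeholder for wherever the S3 files were downloaded locally.
files = sorted(glob.glob("downloaded_parquet/*.parquet"))
df = pd.concat(
    (pd.read_parquet(f).assign(_source_file=f) for f in files),
    ignore_index=True,
)

# With a correct incremental sync we would expect (Id, SystemModstamp) to be
# unique across files: a re-synced record should at least carry a newer cursor.
dupes = df[df.duplicated(subset=["Id", "SystemModstamp"], keep=False)]
print(dupes.sort_values(["Id", "_source_file"])[["Id", "SystemModstamp", "_source_file"]])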

Hi @dean, would you be able to update Airbyte to the latest version and double-check that the connectors are up to date as a first step? I’m looking into this for you!

Hi @natalyjazzviolin,

I just tried to upgrade to 0.40.3 and I'm getting this error in Argo CD, which we use to sync deployments from GitHub to Kubernetes:

ComparisonError

rpc error: code = Unknown desc = kustomize build /tmp/git@github.com_firebolt-analytics_de-gitops/airbyte/dev failed exit status 1: Error: no matches for Id apps_v1_Deployment|~X|airbyte-scheduler; failed to find unique target for patch apps_v1_Deployment|airbyte-scheduler

Does this mean you guys changed something about the airbyte-scheduler in the new version?

I'm attaching our .env and kustomization.yaml files, which worked for us for every upgrade until now; in the meantime I'll keep investigating.

kustomization.txt (1.4 KB)
env.txt (2.3 KB)

Thanks for looking into this, Natalie! Also, your jazz sounds great :smiley:

UPDATE: removed airbyte-scheduler from kustomization.yaml as well as from set-resource-limits.yaml, but I'm still getting errors, with the bootloader saying:

Error from server (BadRequest): container “airbyte-bootloader-container” in pod “airbyte-bootloader-7lf5r” is waiting to start: image can’t be pulled

I'm assuming you introduced some changes to the bootloader as well since 0.39.3?

UPDATE: added this to kustomization.yaml:

  - name: airbyte/cron
    newTag: 0.40.3
    newName: 231290928314.dkr.ecr.us-east-1.amazonaws.com/airbyte/cron

and this to set-resource-limits.yaml:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airbyte-cron
spec:
  template:
    spec:
      containers:
        - name: airbyte-cron-container
          resources:
            limits:
              cpu: 0.5
              memory: 128Mi

Hopefully once I push these images I’ll be able to sync and everything will work…
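
In case it helps anyone else, pushing the images into our private ECR registry (the one referenced by newName above) is just a pull/tag/push. A rough sketch of the script I use, assuming Docker and an authenticated AWS CLI, with an illustrative image list; the ECR repositories need to exist already:

import subprocess

REGISTRY = "231290928314.dkr.ecr.us-east-1.amazonaws.com"  # from kustomization.yaml above
TAG = "0.40.3"
IMAGES = ["airbyte/cron", "airbyte/bootloader", "airbyte/server"]  # illustrative subset

# Log in to ECR (assumes the AWS CLI is configured for this account/region).
login = subprocess.run(
    ["aws", "ecr", "get-login-password", "--region", "us-east-1"],
    check=True, capture_output=True, text=True,
)
subprocess.run(
    ["docker", "login", "--username", "AWS", "--password-stdin", REGISTRY],
    input=login.stdout, check=True, text=True,
)

# Pull each upstream image, retag it for the private registry, and push it.
for image in IMAGES:
    src = f"{image}:{TAG}"
    dst = f"{REGISTRY}/{image}:{TAG}"
    subprocess.run(["docker", "pull", src], check=True)
    subprocess.run(["docker", "tag", src, dst], check=True)
    subprocess.run(["docker", "push", dst], check=True)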

UPDATE: after a lot of wrestling with the k8s deployment, I was finally able to deploy the latest version.

Salesforce source shows version 1.0.13

Cockroach DB source is set to 0.1.16

S3 Destination shows 0.3.14

However, we're still unable to achieve pod stability; the bootloader pod seems to be recreated every few seconds for some reason, so I'll get back to you once we manage to handle that one.

Thank you for all the follow-ups - it's great that you were finally able to update! We are working on making deployment to Kubernetes easier and more streamlined, but it definitely still needs improvement.

I see we’ve got an open issue regarding upgrading/bootloader pod changes:
https://github.com/airbytehq/airbyte/issues/9133

Could you try deleting or renaming the bootloader pod?
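
If you want to script it rather than use kubectl, something along these lines with the Kubernetes Python client should work. This is just a sketch; the namespace is an assumption, so replace it with wherever Airbyte is deployed, and note that Argo CD (or a re-apply) will recreate the pod afterwards:

from kubernetes import client, config

# Assumes local kubeconfig access to the cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "de"  # assumption: replace with the namespace Airbyte is deployed in

# Find the bootloader pod by name prefix and delete it so Argo CD can recreate it.
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    if pod.metadata.name.startswith("airbyte-bootloader"):
        v1.delete_namespaced_pod(name=pod.metadata.name, namespace=NAMESPACE)
        print(f"deleted {pod.metadata.name}")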

P.S. thank you for listening to the jazz, I appreciate it!

Hi @natalyjazzviolin,

I finally arrived at the issue you referred to last, and it's the first bullet point there, regarding the Multi-Attach error for the server pod:

  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Normal   Scheduled           8m1s                 default-scheduler        Successfully assigned de/airbyte-server-64cd88ff8c-97xx4 to ip-10-10-112-5.ec2.internal
  Warning  FailedAttachVolume  7m59s                attachdetach-controller  Multi-Attach error for volume "pvc-e405467c-aaf3-4068-a759-61ee9100bc4a" Volume is already used by pod(s) airbyte-cron-c68b7946b-dj4qx
  Warning  FailedMount         3m43s                kubelet                  Unable to attach or mount volumes: unmounted volumes=[airbyte-volume-configs], unattached volumes=[gcs-log-creds-volume kube-api-access-fpp22 airbyte-volume-configs]: timed out waiting for the condition
  Warning  FailedMount         85s (x2 over 5m58s)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[airbyte-volume-configs], unattached volumes=[airbyte-volume-configs gcs-log-creds-volume kube-api-access-fpp22]: timed out waiting for the condition

I am looking into this with another team member who is very experienced with Kubernetes and hope to have some insight for you soon!

Got some ideas:

There seem to be some long-standing issues with slow volume attach/detach. Could you try these two things:

  1. Try resetting the master node
  2. Try this guide (I've sketched its general idea below):
    https://blog.mayadata.io/recover-from-volume-multi-attach-error-in-on-prem-kubernetes-clusters
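
The core of that guide is to find the stale VolumeAttachment that still pins the volume to the old node and remove it once you're sure nothing is using it anymore. Here is a rough sketch with the Kubernetes Python client, using the PV name from your events; please verify the old pod is gone before deleting anything:

from kubernetes import client, config

config.load_kube_config()
storage = client.StorageV1Api()

PV_NAME = "pvc-e405467c-aaf3-4068-a759-61ee9100bc4a"  # from the FailedAttachVolume event

# List every VolumeAttachment and show which node still claims the volume.
for va in storage.list_volume_attachment().items:
    if va.spec.source.persistent_volume_name == PV_NAME:
        print(va.metadata.name, "on node", va.spec.node_name, "attached:", va.status.attached)
        # Only after confirming the old pod is gone and the node has released the volume:
        # storage.delete_volume_attachment(name=va.metadata.name)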