Running too many syncs concurrently causing non-explicit failures

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: bottlerocket-aws-k8s-1.24-x86_64-v1.12.0-6ef1139f on c5a.xlarge instance (AMD chip)
  • Memory / Disk: 8 GB / EBS (Elastic Block Store)
  • Deployment: Kubernetes EKS 1.24 via Helm (chart version: 0.43.22)
  • Airbyte Version: 0.40.29
  • Source name/version: MSSQL 0.4.28 + Postgres 1.0.42
  • Destination name/version: Snowflake 0.4.44
  • Step: Sync
  • Additional config:
    • 2Gi
    • pod-sweeper.resources.requests.memory: 128Mi
    • server.resources.requests.memory: 4Gi
    • temporal.resources.requests.memory: 512Mi
    • webapp.resources.requests.memory: 512Mi
    • worker.resources.requests.memory: 1Gi
    • MAX_*_WORKERS: 10
  • Description:

We’re seeing a bunch of failures when we attempt to run many jobs concurrently. Running any one of them individually (or just a couple at a time) works fine, but if we queue up, say, 20 jobs (we’re seeing up to 10 running pods per pod type [read, write, etc.]), then a handful will fail and a handful will succeed. The jobs range in size from ~1GB to 28GB. We’ve combed through the logs (including the airbyte-worker) extensively and can’t find anything obvious about what is failing.

Here are some of the failures we’re seeing to try and narrow it down:

  • In terms of pod errors, the only thing we’re seeing is some of the destination-snowflake-write-* pods going to an Init:Error status. When this happens, it’s always associated with an io.airbyte.workers.exception.WorkerException: kubectl cp failed with exit code 1 (see the next bullet point for more info/logs) in the actual sync/airbyte-worker logs. However, the destination-snowflake-write-* logs don’t show anything out of order (these logs include both the successfully running + the init-error write logs): snow__20230209.txt (46.6 KB)

  • There are two main errors we’re seeing in the sync logs themselves:

    • io.airbyte.workers.exception.WorkerException: kubectl cp failed with exit code 1. This is generally the first failure of the bunch and the error we’re seeing most frequently. Example log: kube_cp.txt (156.9 KB)
    • io.airbyte.workers.exception.WorkerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "io.airbyte.workers.process.KubePortManagerSingleton.take()" is null. This one is the least common error we’re seeing and generally only occurs once other jobs have started failing/might be a symptom of other issues. Example log: invoke.txt (63.1 KB)

Another unintended consequence is that after a while, everything starts failing. Sometimes new pods won’t even get spun up. This is probably a different issue (i.e., a memory leak), but we do what everyone else has been doing in this scenario and restart the airbyte-worker. We have to run a script to do this daily to keep things from completely going off the rails.
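
For reference, the daily restart itself is simple; below is a sketch of one way to automate it as a Kubernetes CronJob (the names, namespace, image, and schedule are illustrative, and the ServiceAccount it references would need RBAC permission to patch the worker deployment):

```yaml
# Sketch: restart the airbyte-worker deployment once a day.
# All names here are illustrative; adjust namespace, schedule, and
# serviceAccountName to your setup.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-airbyte-worker
  namespace: airbyte
spec:
  schedule: "0 4 * * *"              # every day at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: worker-restarter   # needs patch rights on deployments
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.24
              command:
                - kubectl
                - rollout
                - restart
                - deployment/airbyte-worker
                - --namespace=airbyte
```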

Given that the syncs run fine individually, it seems to be either a scaling/resource issue or something like the temporal worker not handling concurrency well. In terms of scaling/resources, we’ve tried adding memory, dedicating nodes, etc., but we don’t feel like we have a clear window into what’s going on, and the logs aren’t giving us much to go on either. We’re happy to try additional resource config changes, upping the EC2 size, etc.; we just need some guidance on what is happening and what warning/error signs would give us this insight.
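
For reference, here is roughly how the memory requests listed above sit in our values.yaml. This is a sketch only: the key layout just mirrors the dotted names from the config section, and exposing the MAX_*_WORKERS variables via worker.extraEnv is an assumption about how the chart passes env vars rather than a verified default:

```yaml
# Sketch of the Helm values behind the config listed above (chart 0.43.22).
pod-sweeper:
  resources:
    requests:
      memory: 128Mi
server:
  resources:
    requests:
      memory: 4Gi
temporal:
  resources:
    requests:
      memory: 512Mi
webapp:
  resources:
    requests:
      memory: 512Mi
worker:
  resources:
    requests:
      memory: 1Gi
  # Assumed wiring for the MAX_*_WORKERS settings; verify against the chart.
  extraEnv:
    - name: MAX_SYNC_WORKERS
      value: "10"
    - name: MAX_CHECK_WORKERS
      value: "10"
    - name: MAX_DISCOVER_WORKERS
      value: "10"
    - name: MAX_SPEC_WORKERS
      value: "10"
```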

Hello there! You are receiving this message because none of your fellow community members has stepped in to respond to your topic post. (If you are a community member and you are reading this response, feel free to jump in if you have the answer!) As a result, the Community Assistance Team has been made aware of this topic and will be investigating and responding as quickly as possible.
Some important considerations that will help you get your issue solved faster:

  • It is best to use our topic creation template; if you haven’t yet, we recommend posting a follow-up with the requested information. With that information the team will be able to more quickly search for similar issues with connectors and the platform, and troubleshoot your specific question or problem.
  • Make sure to upload the complete log file; a common investigation roadblock is that sometimes the error for the issue happens well before the problem is surfaced to the user, and so having the tail of the log is less useful than having the whole log to scan through.
  • Be as descriptive and specific as possible; when investigating it is extremely valuable to know what steps were taken to encounter the issue, what version of connector / platform / Java / Python / docker / k8s was used, etc. The more context supplied, the quicker the investigation can start on your topic and the faster we can drive towards an answer.
  • We in the Community Assistance Team are glad you’ve made yourself part of our community, and we’ll do our best to answer your questions and resolve the problems as quickly as possible. Expect to hear from a specific team member as soon as possible.

Thank you for your time and attention.
The Community Assistance Team

It seems like this reported GitHub issue could explain the kubectl cp failed errors, and potentially the Init:Error status on the write pods, too.

Hey @luke @marcosmarxm
We’re experiencing pretty much the same issue here.
We pushed the latest Airbyte version, which includes the fix mentioned in this reported GitHub issue; that seemed to resolve a lot of the kubectl cp failures.

However, we’re still experiencing an intermittent airbyte platform outage where the worker loses the ability to schedule job pods. Once we restart the worker deployment, things return to normal.

io.airbyte.workers.exception.WorkerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "io.airbyte.workers.process.KubePortManagerSingleton.take()" is null

I believe this has something to do with K8s not being able to allocate ports to the worker (more in this Slack thread).
We’re going to try and increase the number of replicas in the Worker deployment from 2->5 which should allow for more concurrent sync jobs :crossed_fingers:
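
For anyone following along, the change itself is just a Helm values bump; a sketch is below, assuming the chart exposes worker.replicaCount (key name unverified). Since each worker process keeps its own pool of job ports, more replicas should also mean more ports available for concurrent job pods:

```yaml
# Sketch: scale the airbyte-worker deployment from 2 to 5 replicas.
# Key path assumed; check the chart's values.yaml for the exact name.
worker:
  replicaCount: 5
```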

Would be interested to hear your opinions :smiley:
