- Is this your first time deploying Airbyte?: No
- OS Version / Instance: bottlerocket-aws-k8s-1.24-x86_64-v1.12.0-6ef1139f on c5a.xlarge instance (AMD chip)
- Memory / Disk: 8 GB / EBS (Elastic Block Store)
- Deployment: Kubernetes EKS 1.24 via Helm (chart version: 0.43.22)
- Airbyte Version: 0.40.29
- Source name/version: MSSQL 0.4.28 + Postgres 1.0.42
- Destination name/version: Snowflake 0.4.44
- Step: Sync
- Additional config (see the Helm sketch at the end of this post for how these are applied):
- global.jobs.resources.requests.memory: 2Gi
- pod-sweeper.resources.requests.memory: 128Mi
- server.resources.requests.memory: 4Gi
- temporal.resources.requests.memory: 512Mi
- webapp.resources.requests.memory: 512Mi
- worker.resources.requests.memory: 1Gi
- MAX_*_WORKERS: 10
- Description:
We’re seeing a bunch of failures when we attempt to run many jobs concurrently. Running any one of them individually (or just a couple at a time) works fine, but if we queue up, say, 20 jobs (we’re seeing up to 10 running pods per pod type [read, write, etc.]), then a handful will fail and a handful will succeed. The jobs range in size from ~1 GB to 28 GB. We’ve combed through the logs (including the `airbyte-worker` logs) extensively and can’t find anything obvious about what is failing.
Here are some of the failures we’re seeing to try and narrow it down:
- In terms of pod errors, the only thing we’re seeing is some of the `destination-snowflake-write-*` pods going to an `Init:Error` status. When this happens, it’s always associated with an `io.airbyte.workers.exception.WorkerException: java.io.IOException: kubectl cp failed with exit code 1` (see the next bullet for more info/logs) in the actual sync/`airbyte-worker` logs. However, the `destination-snowflake-write-*` pod logs don’t show anything out of order (these logs include both the successfully running and the Init:Error write pods): snow__20230209.txt (46.6 KB)
- There are two main errors we’re seeing in the sync logs themselves:
  - `io.airbyte.workers.exception.WorkerException: java.io.IOException: kubectl cp failed with exit code 1`. This is generally the first failure of the bunch and the error we’re seeing most frequently. Example log: kube_cp.txt (156.9 KB)
  - `io.airbyte.workers.exception.WorkerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "io.airbyte.workers.process.KubePortManagerSingleton.take()" is null`. This one is the least common error we’re seeing and generally only occurs once other jobs have started failing, so it might be a symptom of other issues. Example log: invoke.txt (63.1 KB)
- Another unintended consequence is that after a while, everything starts failing; sometimes new pods won’t even get spun up. This is probably a separate issue (i.e. a memory leak), but we do what everyone else has been doing in this scenario and restart the `airbyte-worker`. We have to run a script that does this daily to keep things from completely going off the rails (a minimal sketch of that restart job is below).
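To be concrete about that daily band-aid, here’s a minimal sketch of the restart job, assuming a default Helm install where the worker deployment is named `airbyte-worker` in an `airbyte` namespace (names are assumptions, not exact values from our cluster):

```bash
#!/usr/bin/env bash
# Daily band-aid: restart the Airbyte worker to work around the gradual
# "everything starts failing" state described above. Deployment and namespace
# names assume a default "airbyte" Helm release; adjust to your install.
set -euo pipefail

kubectl -n airbyte rollout restart deployment/airbyte-worker
kubectl -n airbyte rollout status deployment/airbyte-worker --timeout=5m

# Optionally clean up leftover failed job pods (e.g. the Init:Error write pods)
kubectl -n airbyte delete pods --field-selector=status.phase=Failed
```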
Given that the syncs run fine individually, it seems to be either a scaling/resource issue or something like the Temporal worker not handling concurrency well. On the scaling/resources side, we’ve tried adding memory, dedicating nodes, etc., but we don’t feel like we have a clear window into what’s going on, and the logs aren’t giving us much to go on either. We’re happy to try additional resource config changes, upping the EC2 instance size, etc.; we just need some guidance on what is actually happening and what warning/error signs would give us that insight.
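For completeness, here is roughly how the overrides listed under "Additional config" above are applied via Helm. The release name, namespace, and the `worker.extraEnv` path used for the `MAX_*_WORKERS` variables are illustrative assumptions rather than exact values from our setup:

```bash
# Rough sketch of applying the memory requests and worker-count overrides
# listed above (chart 0.43.22). Release/namespace names and the worker.extraEnv
# path for MAX_*_WORKERS are assumptions for illustration only.
helm upgrade airbyte airbyte/airbyte --version 0.43.22 -n airbyte \
  --set global.jobs.resources.requests.memory=2Gi \
  --set pod-sweeper.resources.requests.memory=128Mi \
  --set server.resources.requests.memory=4Gi \
  --set temporal.resources.requests.memory=512Mi \
  --set webapp.resources.requests.memory=512Mi \
  --set worker.resources.requests.memory=1Gi \
  --set "worker.extraEnv[0].name=MAX_SYNC_WORKERS" \
  --set-string "worker.extraEnv[0].value=10"
  # ...and likewise for MAX_CHECK_WORKERS, MAX_DISCOVER_WORKERS, MAX_SPEC_WORKERS
```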