I’m not sure the new config can be easily applied to an existing cluster. What do you think of running uninstall and then install?
Last week, when we tried to upgrade Airbyte, we uninstalled and reinstalled with abctl, and it filled our server memory and crashed it, so we haven’t tried that route with this new release yet.
Hm. The install crashed the server? or running syncs after the install? How much memory does the server have?
I currently don’t have access to that server (which crashed a week ago), but it was the production server and was amply provisioned.
<@U07FH2Y34A1>, I’m working with Nivi, <@U04GW684E2V>, and we are curious: when one syncs a CDC Airbyte connection, which pods are involved in actually executing the work? The worker plus what else? We keep getting a Java heap space error and have tried several configuration tweaks, but we can’t seem to get past it despite setting memory to 2+ GB. The error we get is always something like this:
- [ASYNC WORKER INFO] Pool queue size: 0, Active threads: 0
2024-10-03 21:40:00 source > Terminating due to java.lang.OutOfMemoryError: Java heap space
Things we have tried are:
- adjusting values.yaml (see below)
- updating the Airbyte Postgres tables: actor_definition for the mssql server source, and the general connection record (see below)
-- SQL Server source definition
psql -U airbyte -d db-airbyte -t -A -c "UPDATE actor_definition SET resource_requirements='{\"jobSpecific\": [{\"jobType\": \"sync\", \"resourceRequirements\": {\"cpu_limit\": \"2\", \"cpu_request\": \"2\", \"memory_limit\": \"10Gi\", \"memory_request\": \"2Gi\"}}]}' WHERE id = 'b5ea17b1-f170-46dc-bc31-cc744ca984c1';"
-- general connection information for the failing CDC connection
psql -U airbyte -d db-airbyte -t -A -c "UPDATE connection SET resource_requirements = '{\"cpu_limit\": \"0.5\", \"cpu_request\": \"0.5\", \"memory_limit\": \"10Gi\", \"memory_request\": \"1Gi\"}' WHERE id = '7a826998-1439-4a42-81ed-5a929a0774d6';"
How do we determine which config point/setting is not sufficient to overcome the cdc sync java heap space error?
Do connection-level settings take precedence over source-specific actor definitions? We are blocked from using Airbyte with CDC for this particular use case and are hoping it is a config point we can adjust to get to the finish line.
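For reference, here is one way we could double-check what is actually stored after those updates (just a sketch, reusing the same IDs and psql flags as the UPDATE statements above):
psql -U airbyte -d db-airbyte -t -A -c "SELECT resource_requirements FROM actor_definition WHERE id = 'b5ea17b1-f170-46dc-bc31-cc744ca984c1';"
psql -U airbyte -d db-airbyte -t -A -c "SELECT resource_requirements FROM connection WHERE id = '7a826998-1439-4a42-81ed-5a929a0774d6';"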
btw- We are on the latest abctl local install version v0.18.0
abctl status
INFO Using Kubernetes provider:
Provider: kind
Kubeconfig: /root/.airbyte/abctl/abctl.kubeconfig
Context: kind-airbyte-abctl
SUCCESS Found Docker installation: version 24.0.5
SUCCESS Existing cluster 'airbyte-abctl' found
INFO Found helm chart 'airbyte-abctl'
Status: deployed
Chart Version: 1.1.0
App Version: 1.1.0
INFO Found helm chart 'ingress-nginx'
Status: deployed
Chart Version: 4.11.2
App Version: 1.11.2
INFO Airbyte should be accessible via http://localhost:8000
values.yaml file
global:
  edition: "community"
  jobs:
    resources:
      limits:
        cpu: 1000m
        memory: 12Gi ## e.g. 500m
      requests:
        cpu: 500m
        memory: 2Gi
  env_vars:
    HTTP_IDLE_TIMEOUT: 1800s
webapp:
  ingress:
    annotations:
      kubernetes.io/ingress.class: internal
      nginx.ingress.kubernetes.io/proxy-body-size: 16m
      nginx.ingress.kubernetes.io/proxy-send-timeout: 1800
      nginx.ingress.kubernetes.io/proxy-read-timeout: 1800
airbyte-bootloader:
  resources:
    limits:
      cpu: 1000m
      memory: 12Gi ## e.g. 500m
    requests:
      cpu: 500m
      memory: 2Gi
worker:
  enabled: true
  # -- Number of worker replicas
  replicaCount: 1
  image:
    # -- The repository to use for the airbyte worker image.
    repository: airbyte/worker
    # -- The pull policy to use for the airbyte worker image.
    pullPolicy: IfNotPresent
  ## Worker resource requests and limits
  ## ref: http://kubernetes.io/docs/user-guide/compute-resources/
  ## We usually recommend not to specify default resources and to leave this as a conscious
  ## choice for the user. This also increases chances charts run on environments with little
  ## resources, such as Minikube. If you do want to specify resources, uncomment the following
  ## lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  resources:
    # -- The resources limits for the worker container
    limits:
      memory: 10Gi
      cpu: 500m
    # -- The requested resources for the worker container
    requests:
      memory: 2Gi
      cpu: 250m
There can be a few pods involved in the lifecycle of a connector, but the main one doing the sync work starts with replication-job-
Which pod is giving you out of memory errors?
The error we receive doesn’t indicate which pod, as far as I can tell.
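One way to narrow it down is to check the last terminated state of the job pods (a sketch, reusing the docker-exec-into-kind pattern used elsewhere in this thread; note that a Java heap OutOfMemoryError inside the connector JVM may not show up as a Kubernetes OOMKilled, since the heap limit can be hit below the container limit):
docker exec -it airbyte-abctl-control-plane kubectl -n airbyte-abctl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'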
<@U07FH2Y34A1> - Would adjusting the values.yaml for the “workload-launcher” section affect the replication-job- pod’s Java heap memory? Or is a database update needed to tweak the workload-launcher memory settings?
What does docker exec -it airbyte-abctl-control-plane kubectl -n airbyte-abctl get pods show?
The global.jobs.resources section affects the job pod resources. I’m less familiar with the database updates approach.
But if you update the global.jobs.resources values, you might need to restart the server and workload-launcher pods to pick up the new settings. We should try to track down one of the failing pods, though, and see what settings it has.
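A minimal sketch of that restart, assuming the deployments carry the same names as the pods listed below (airbyte-abctl-server and airbyte-abctl-workload-launcher):
docker exec -it airbyte-abctl-control-plane kubectl -n airbyte-abctl rollout restart deployment airbyte-abctl-server airbyte-abctl-workload-launcher
docker exec -it airbyte-abctl-control-plane kubectl -n airbyte-abctl rollout status deployment airbyte-abctl-server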
NAME READY STATUS RESTARTS AGE
airbyte-abctl-airbyte-bootloader 0/1 Completed 0 3d19h
airbyte-abctl-connector-builder-server-5c6d48b574-tpm7h 1/1 Running 4 (3d19h ago) 3d22h
airbyte-abctl-cron-5ddb45bc4d-swvcz 1/1 Running 4 (3d19h ago) 3d22h
airbyte-abctl-pod-sweeper-pod-sweeper-7cbbf9cf6d-gfczc 1/1 Running 8 (3d19h ago) 4d21h
airbyte-abctl-server-579f67894b-9m9ht 1/1 Running 5 (2d ago) 3d22h
airbyte-abctl-temporal-d858d6866-zt6s4 1/1 Running 5 (3d19h ago) 3d22h
airbyte-abctl-webapp-7f5b9b7654-8bmdg 1/1 Running 12 (3d19h ago) 3d22h
airbyte-abctl-worker-5989b87ccf-2q59w 1/1 Running 1 (3d19h ago) 3d19h
airbyte-abctl-workload-api-server-d64449cb8-czc2z 1/1 Running 4 (3d19h ago) 3d22h
airbyte-abctl-workload-launcher-65854658f9-d57k9 1/1 Running 4 (3d19h ago) 3d22h
airbyte-db-0 1/1 Running 8 (3d19h ago) 4d21h
airbyte-minio-0 1/1 Running 8 (3d19h ago) 4d21h
replication-job-13291-attempt-0 0/3 Completed 0 7m39s
replication-job-13292-attempt-0 0/3 Completed 0 7m39s
Whenever I’ve made changes to configs I have restarted the docker container
I don’t see any failed pods in there. Looks like replication-job-13292-attempt-0 is the most recent job. Did that succeed?
You could look at the details of that job with
docker exec -it airbyte-abctl-control-plane kubectl -n airbyte-abctl describe pod replication-job-13292-attempt-0
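If you only care about the resource settings that pod actually got, piping the same describe output through grep should be enough (a sketch):
docker exec -it airbyte-abctl-control-plane kubectl -n airbyte-abctl describe pod replication-job-13292-attempt-0 | grep -A 3 -E 'Limits|Requests'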
It will be a few days before the data is in the large-delta CDC scenario.
<@U07FH2Y34A1> the following is the only mention of the replication pod that I see in the logs of the failing job:
2024-10-04 23:08:22 INFO i.a.w.l.p.KubePodClient(launchReplication):84 - Launching replication pod: replication-job-13157-attempt-0 with containers:
2024-10-04 23:08:22 INFO i.a.w.l.p.KubePodClient(launchReplication):85 - [source] image: airbyte/source-mssql:4.1.14 resources: ResourceRequirements(claims=[], limits={memory=12Gi, cpu=1000m}, requests={memory=2Gi, cpu=500m}, additionalProperties={})
2024-10-04 23:08:22 INFO i.a.w.l.p.KubePodClient(launchReplication):86 - [destination] image: airbyte/destination-snowflake:3.5.0 resources: ResourceRequirements(claims=[], limits={memory=12Gi, cpu=1000m}, requests={memory=2Gi, cpu=500m}, additionalProperties={})
2024-10-04 23:08:22 INFO i.a.w.l.p.KubePodClient(launchReplication):87 - [orchestrator] image: airbyte/container-orchestrator:1.1.0 resources: ResourceRequirements(claims=[], limits={memory=12Gi, cpu=1000m}, requests={memory=2Gi, cpu=500m}, additionalProperties={})
Hm, ok, thanks. So it seems like the workload-launcher is starting the job containers with lots of memory (12Gi per container). Sounds like it’s the source (mssql) container that is running out of memory.
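If you want to watch per-container memory while a replication pod is running, kubectl top can help (a sketch; it assumes metrics-server is installed in the kind cluster, which it may not be by default, and the pod name here is just the one from earlier in the thread):
docker exec -it airbyte-abctl-control-plane kubectl -n airbyte-abctl top pod replication-job-13292-attempt-0 --containers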
Debugging the memory usage of the mssql connector is outside my area of knowledge. Maybe <@U02TQLBLDU4> can point you in the right direction?