Airbyte job failing due to lack of resources on K8s cluster

Summary

An Airbyte job extracting data from the SpaceX API fails on a K8s cluster that has memory available, even though no resource limits have been set.


Question

Hi everyone!

I have installed Airbyte on my K8s cluster, which has 3 machines with 16GB of RAM each.

The cluster also runs other services such as MinIO, Postgres, etc.

However, on Airbyte, when I try to run a simple job extracting data from the SpaceX API, it fails due to lack of resources…

I observed the nodes, and at least one of them has pretty much all of its memory free… so why is Airbyte failing?
I haven't set any resource limitations.



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here.

check this thread
https://airbytehq.slack.com/archives/C021JANJ6TY/p1728562294190379?thread_ts=1728561176.853139&cid=C021JANJ6TY

Probably you need to adjust the job limits/requests if you use Airbyte 1.0 or newer.

I do have the latest version, indeed. However, according to what I saw in the documentation, if you don't specify resource limits it should use whatever it needs, no?

Not sure where the issue lies…
Also, I was reading about some variables that can be set for the workers here:
https://docs.airbyte.com/operator-guides/scaling-airbyte

Kind of lost in different directions…

<@U07LD7NUGG3> did you set these ENV variables? I think Java by default only uses 25% of the memory available. I set these in global.env_vars and haven't had any problems. I'm running on GKE Autopilot, though, and am using an external database. Airbyte spins up 3 pods for each job, so you have to make sure you have enough capacity to accommodate those; that of course depends on how many concurrent jobs you have running.

```
JOB_MAIN_CONTAINER_CPU_REQUEST: "500m"
JOB_MAIN_CONTAINER_MEMORY_REQUEST: "2Gi"
```
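
For reference, this is roughly where those would sit in the Helm values; a minimal sketch, assuming the global.env_vars mechanism mentioned above (the limit variables are optional, and all numbers are placeholders):

```
global:
  env_vars:
    # Requests applied to the job pods Airbyte launches per sync
    JOB_MAIN_CONTAINER_CPU_REQUEST: "500m"
    JOB_MAIN_CONTAINER_MEMORY_REQUEST: "2Gi"
    # Optional hard caps; omit them to let job pods burst into free node capacity
    JOB_MAIN_CONTAINER_CPU_LIMIT: "1"
    JOB_MAIN_CONTAINER_MEMORY_LIMIT: "4Gi"
```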

I think HPA may also struggle with Java as, apparently, it cannot see the actual memory consumption.

So if I need to set those variables, then I will be adding a limit to the resources it can use, instead of leaving it to decide based on available resources? What is the point then? It's weird. Plus, all pods seem to be created on one of the nodes when at least one other node has plenty of resources available.

I don't know. I have come across, in many sources, this notion that Kubernetes has problems with Java applications; apparently something to do with memory being retained and not released. I know I had problems with a normal k8s cluster before, where I also didn't set any resource requests. Airbyte basically detonated the whole cluster when it crashed one of the nodes. I haven't had any issues since running on GKE Autopilot and setting resource requests.

Also, what is the difference between the env variables and setting the resources the k8s way? I am currently using k3s.

AFAIK they both do the same thing. I don't recall if there's a resource request/limit setting for job containers. The env_vars above apply to the pods Airbyte launches for sync jobs.
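
To illustrate the difference, here's a sketch (all values are placeholders): a chart-level resources block uses the native k8s mechanism and sizes the long-running deployment itself, while the JOB_MAIN_CONTAINER_* env vars size the short-lived pods launched per sync:

```
# Native k8s requests for the long-running worker deployment itself
worker:
  resources:
    requests:
      cpu: "250m"
      memory: "2Gi"

# Env vars Airbyte reads to size the short-lived job pods it launches
global:
  env_vars:
    JOB_MAIN_CONTAINER_CPU_REQUEST: "500m"
    JOB_MAIN_CONTAINER_MEMORY_REQUEST: "2Gi"
```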

In the charts code, there is a place where the configmap is configured (env-configmap.yaml#L99-L102):
https://github.com/airbytehq/airbyte-platform/blob/09ed30a11be72ae15f523904f58f5548a58d2b93/charts/airbyte/templates/env-configmap.yaml#L99-L102
and later those values are used as environment variables in different deployments (deployment.yaml#L225-L244):
https://github.com/airbytehq/airbyte-platform/blob/09ed30a11be72ae15f523904f58f5548a58d2b93/charts/airbyte-workload-launcher/templates/deployment.yaml#L225-L244
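
In recent chart versions those configmap entries can also be fed from values under global.jobs.resources instead of raw env vars; a sketch, assuming that key exists in your chart version (check its values.yaml):

```
global:
  jobs:
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
      limits:
        cpu: "1"
        memory: "4Gi"
```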

As long as it works, any method is good

Thank you. This is great. :pray: I will try it out, even though I still find it weird to have to set minimum resources, since it can lead to wasted resources :frowning:

Keep in mind that not all sync jobs are created equal. I have Salesforce accounts where a full sync is ~1 million total records across all objects, and others where it’s 100s of millions. Airbyte and k8s don’t know this (and I often don’t until the first sync either!). But the pods have to be scheduled before any of that happens.

You can set the requests lower and not set limits (or set them very high). If there are available resources on the node, the container can use more than its request, up to the limit. But it also means that k8s may schedule pods onto a node that doesn't have enough room for them to grow (because you told it that it could). In these cases, the workload is likely to be slow (CPU throttling) or to fail (out of memory/OOM). This is similar to how the --low-resource-mode flag works in abctl.
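
As a concrete sketch of that trade-off (placeholder numbers), a requests-only spec like this schedules easily but relies on the node having spare headroom at runtime:

```
resources:
  requests:          # what the scheduler reserves when placing the pod
    cpu: "250m"
    memory: "1Gi"
  # No limits: the container may burst into any free capacity on its node,
  # but if the node fills up it gets throttled (CPU) or OOM-killed (memory).
```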

It’s also worth noting that things like database connections can vary wildly based on the size of the schema being loaded (e.g. if there are millions of tables). Similarly the size of the actual payloads and how they’re formatted can vary as well. And finally how the connector is built (e.g. CDK vs. low-code/builder). In CDK connectors, there are a lot more ways to shoot your own foot off and bloat memory usage (or inefficiently parse deeply-nested JSON, for example). Most of the normal CDK features are well-optimized; it’s just the reality that you’re adding a layer of custom code on top of those components that can’t always be controlled for.

Some k8s platforms (for example GKE Autopilot) also have requirements on the minimums for resources as well as the ratio of CPU/memory requested. These platforms will often mutate your settings if they’re non-compliant—so they may not be what you think they are if you’re not watching the logs.

Keep in mind that you can also override default resource requirements for specific connections or connector types as well, but you’ll need to see what is best for your workload.

So there are a ton of options, but not a lot that are quite auto-magical… there are just a lot of things that the platform (and k8s) don't know in advance of a job being run. So it can require some tuning for specific use cases.

But if the worker settings are set at the whole-deployment level, then this will lead to wasted resources if those minimums are only needed for one particular case among many.

That is why I am trying to understand the best way to configure it. In theory, you should even be able to set this on a per-job basis when you are configuring it.

You can, but since this gets technical as to how k8s works under the hood, they don't surface it in the UI; instead, they require you to set it in the Airbyte database:
https://docs.airbyte.com/operator-guides/configuring-connector-resources
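
The doc boils down to writing a JSON blob into a resource_requirements column; a hedged sketch of the kind of statement it describes (table/column names and the exact JSON keys may differ between versions, and the UUID is a placeholder):

```
-- Override resources for a single connection
UPDATE connection
SET resource_requirements = '{"cpu_request": "1", "cpu_limit": "2", "memory_request": "1Gi", "memory_limit": "2Gi"}'
WHERE id = '<your-connection-id>';
```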

Do be aware that some of us have had an issue where these limits aren't always respected, so there's an open GitHub issue related to it: https://github.com/airbytehq/airbyte/issues/42921. Not sure if it's present in the latest build.

Note that for connector-level definitions, you can actually specify it per jobType the connector runs (e.g. sync, get_spec, check_connection, discover_schema, etc.).
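
For that, the resource_requirements JSON on the connector definition takes a jobSpecific list; roughly like this sketch, based on the same doc (field names may vary by version, and all numbers are placeholders):

```
{
  "jobSpecific": [
    {
      "jobType": "sync",
      "resourceRequirements": {
        "cpu_request": "1",
        "cpu_limit": "2",
        "memory_request": "1Gi",
        "memory_limit": "2Gi"
      }
    }
  ]
}
```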

I have tried to set some configurations for CPU and memory, but they do not seem to be applied after I run the helm upgrade.

The process still fails with the message:
0/3 nodes are available: 1 Insufficient cpu, 3 Insufficient memory. Preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
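
That message means the scheduler is comparing the new pod's requests against each node's allocatable capacity minus what is already *requested* by other pods, not against actual usage, which is why a node that looks idle can still refuse the pod. You can inspect the reserved-versus-allocatable numbers with something like this (the namespace and pod name are placeholders):

```
# Shows each node's "Allocated resources" (sum of requests) vs. its capacity
kubectl describe nodes | grep -A 8 "Allocated resources"

# Shows the scheduling events explaining why the pending pod was refused
kubectl describe pod <pending-pod-name> -n airbyte
```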

All requests seem to be set to 1 CPU, as seen in the image:

The orchestrator appears to request 1 CPU and 2Gi of memory.
Source and destination requests are set to 200m CPU and 1Gi of memory.

In my configuration, the workload launcher has requests of 256Mi memory and 1 CPU.

My worker has requests of 250m CPU and 2Gi memory.

Not understanding what is failing here. :confused:

<@U07LD7NUGG3> Are you setting these directly in your values.yaml? Can you provide just those sections of the config?

This is the worker section of my values.yaml