Resource leak in airbyte-worker

adam · August 18, 2022, 11:35pm

Is this your first time deploying Airbyte?: No
OS Version / Instance: Amazon Linux 2
Memory / Disk: 32 GB RAM
Deployment: k8s
Airbyte Version: 0.39.35
Source name/version: many
Destination name/version: redshift
Step: Sync

About once a month, our Airbyte deployment gets completely hosed - jobs start experiencing errors launching new pods and syncs can’t run anymore…until airbyte-worker is restarted. Looking at airbyte-worker while it was in this state today, we noticed two interesting things:

It was using a large amount of RAM - upwards of 8 GB
The main java process for airbyte-worker had 29821 running threads!
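
For anyone who wants to check the thread count on their own worker: on Linux the kernel reports it in the `Threads:` field of /proc/<pid>/status. Here is a minimal Python sketch, assuming Python can run somewhere that sees the worker's PID (for example on the node). The function names are made up for illustration and are not part of Airbyte.

```python
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Return the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def thread_count(pid: int) -> int:
    """Read the current thread count of a running process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())
```

Logging `thread_count(pid)` for the airbyte-worker java process every few minutes should show whether the count climbs steadily or jumps around specific syncs.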

When this occurs, new jobs are not able to spawn the threads they need in airbyte-worker:

2022-08-18 06:00:56 source > [0.017s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
2022-08-18 06:00:56 source > Error occurred during initialization of VM
2022-08-18 06:00:56 source > java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

The overall VM has plenty of resources available, and k8s is not killing the container due to reaching limits.

After restarting airbyte-worker, it’s happily running using approximately 40 threads.

I haven’t noticed any similar issue reports (might have missed them though!) or recent changes that may address this. Has anyone else experienced something similar? Any suggestions for things to try or gather next time this occurs (or changes to make before then!)? We’ve seen this happen 3 times now in our production cluster, across multiple airbyte versions.

adam · August 22, 2022, 7:08pm

My summary has an error: our VM has 64 GB of RAM total, not 32.

marcosmarxm · August 25, 2022, 2:58am

Thanks for the investigation, @adam. I asked the team to look into what resource usage is expected for the worker and server.

adam · August 25, 2022, 4:36pm

Hi @marcosmarxm,

Just to be clear, resource utilization is only part of the problem. The other problem is that airbyte-worker stops running after a number of weeks, until it is restarted. (The resource issue of course appears to be related to that.) Is that something you’re seeing in cloud, or hearing about from other users? We’ve thought about adding actual k8s limits to airbyte-worker, but are concerned that hitting those limits would result in data loss for any in-progress syncs during the worker restart.

adam · August 25, 2022, 7:07pm

I’ve tracked this down to a problem in the S3 file writing for logs - the threads are blocking on a lock that isn’t being released, and very few logs are actually making it to S3. Almost all of our threads are from the log4j2 appender for S3. Looks like we have verbose mode enabled for the appender in log4j2.xml, so I’ll see if I can find anything interesting in the local container logs.

davinchia · August 25, 2022, 9:37pm

Hi @adam, early Airbyte engineer here. We deploy Cloud multiple times a day in CD fashion so this is a blind spot for us. Looking at our dev envs that have Airbyte deployments running for ~1-2 weeks, I don’t really see a memory leak pattern. Of course, these deployments have much less load.

Would love to work with you to dig into your findings. I worked on the logging layer so I’m familiar with what you are saying. It is possible all of this is related to the logger layer. Can you share more about what you’ve found about the logging layer?

adam · August 25, 2022, 9:59pm

Happy to! I reached out to you on slack in case you want the actual thread dump file (it’s too big to upload here). It shows almost 3k pairs (so 6k total) of “LoggingEventCache-publish-thread” and “TimePeriodBasedBufferMonitor-publish-trigger”. Those come from the log4j2-appender library: GitHub - bluedenim/log4j-s3-search: Log4j appender with S3, Azure, Google Cloud, and search publishing. I can’t find any meaningful log output to figure out what those threads are blocked on, unfortunately. I do see some locks in this code, but I’m not seeing why we’d be getting stuck on them.
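
In case it helps others compare their own dumps, the tally of thread names described above can be reproduced with a short Python sketch. It assumes a jstack-style text dump where each thread entry starts with the thread name in double quotes at the beginning of a line. The function name and the suffix-stripping rule are my own choices, not anything from Airbyte or the appender library.

```python
import re
from collections import Counter

# A thread entry in a jstack-style dump begins with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# Drop a trailing counter such as "-3" so pooled threads group together.
TRAILING_ID = re.compile(r"[-_ ]?\d+$")


def count_threads_by_name(dump_text: str) -> Counter:
    """Count threads in a text thread dump, grouped by normalized name."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            counts[TRAILING_ID.sub("", match.group("name"))] += 1
    return counts
```

Calling `count_threads_by_name(text).most_common(10)` on the dump contents lists the ten most numerous thread groups, which makes a leak from a single library stand out quickly.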

I also found that airbyte-worker seemingly stops flushing logs to S3 shortly after it is started. airbyte-server writes them out continuously. I’ll include a screenshot below from our production S3 bucket, where airbyte has been configured to write to since June. The airbyte-server prefix has orders of magnitude more data compared to this, without the gaps in time.

Our dev environment shows the same issue with logs not making it to S3, although we haven’t seen the thread count quite explode to the same magnitude there (at least not to the point where things locked up).
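
To put numbers on the gaps in the airbyte-worker log prefix, one option is to collect the last-modified timestamps of the log objects (however you list the bucket) and look for long stretches with no writes. This is a rough Python sketch; the function name and the one-hour default threshold are arbitrary assumptions for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_gap=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive timestamps are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]
```

Running the same check against the airbyte-server prefix gives a baseline for what continuous log flushing looks like.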

Kevin_Han · August 28, 2022, 11:20pm

Working on the same exact problem here.

A connector sync runs, the worker is unable to flush its logs, and the worker then OOMs on any subsequent run.

davinchia · August 29, 2022, 9:54pm

Hey Kevin, which Airbyte deployment are you running?

We are working on something for flushing logs that should be released in the next week or so. See Provide a way to clean up old data · Issue #15567 · airbytehq/airbyte · GitHub. This should help with the space issue.

The OOM issue is more concerning to me. Can you provide more details?

adam · October 31, 2022, 4:03pm

Hi @davinchia - just checking if you have any updates on your investigation.

danieldiamond · November 1, 2022, 7:11am

For the record we experience the same. Our docker instance randomly stalls and hangs and we don’t realise because syncs don’t fail. We’re forced to restart our instance every couple of weeks for no reason.

berosen · December 13, 2022, 8:03pm

Has anyone found a solution for this that doesn’t include reboots?

leoalmeidasant · December 27, 2022, 6:00pm

I’ve created a CronJob (in k8s) that triggers rollouts of the worker, server, and temporal deployments. We have been dealing with this issue since August and have never found a solution.

Lydia_Lim · December 30, 2022, 9:42am

Same issue here. We noticed the way the memory usage creeps up seems to follow a predictable pattern, but we have not identified the root cause either.
Screenshot 2022-12-30 at 5.41.33 PM

ojaved-equip · February 26, 2023, 2:41am

Does anyone have updates on this issue? We still experience it from time to time and have not had much luck identifying the root cause.

adam · March 2, 2023, 5:04pm

My current workaround for this was to enable the container orchestrator (requires a little bit of customization to deploy with the helm charts, but is doable via extra env) and add a memory limit on airbyte-worker. Container orchestrator allows airbyte-worker restarts without impacting any sync activity, and the limit (I went with 2 GB) forces the airbyte-worker pod to restart before hitting this issue.

HMubaireek · March 6, 2023, 7:54am

@adam Hello Adam. Can you please write up the steps you took to solve the issue (the changes to the helm chart)?

rcheatham-q · March 30, 2023, 4:28pm

+1 to @HMubaireek’s request, can you please share how you customized the helm chart to enable container orchestrators?

adam · April 3, 2023, 6:26pm

From our values.yaml:

worker:
  resources:
    limits:
      memory: 2Gi
  image:
    tag: ${airbyte_version}
  containerOrchestrator:
    enabled: true
  extraEnv:
    - name: CONTAINER_ORCHESTRATOR_ENABLED
      valueFrom:
        configMapKeyRef:
          key: CONTAINER_ORCHESTRATOR_ENABLED
          name: airbyte-airbyte-env
    - name: STATE_STORAGE_S3_BUCKET_REGION
      valueFrom:
        configMapKeyRef:
          key: S3_LOG_BUCKET_REGION
          name: airbyte-airbyte-env
    - name: STATE_STORAGE_S3_REGION
      valueFrom:
        configMapKeyRef:
          key: S3_LOG_BUCKET_REGION
          name: airbyte-airbyte-env
    - name: STATE_STORAGE_S3_BUCKET_NAME
      value: ${state_bucket}
    - name: STATE_STORAGE_S3_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: ACCESS_KEY_ID
          name: aws-iam-creds
    - name: STATE_STORAGE_S3_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: ACCESS_KEY_SECRET
          name: aws-iam-creds
