Resource leak in airbyte-worker

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Amazon Linux 2
  • Memory / Disk: 32 GB RAM
  • Deployment: k8s
  • Airbyte Version: 0.39.35
  • Source name/version: many
  • Destination name/version: redshift
  • Step: Sync

About once a month, our Airbyte deployment gets completely hosed - jobs start experiencing errors launching new pods and syncs can’t run anymore…until airbyte-worker is restarted. Looking at airbyte-worker while it was in this state today, we noticed two interesting things:

  1. It was using a large amount of RAM - upwards of 8 GB
  2. The main java process for airbyte-worker had 29821 running threads!

When this occurs, new jobs are not able to spawn the threads they need in airbyte-worker:

2022-08-18 06:00:56 source > [0.017s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
2022-08-18 06:00:56 source > Error occurred during initialization of VM
2022-08-18 06:00:56 source > java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

The overall VM has plenty of resources available, and k8s is not killing the container due to reaching limits.

After restarting airbyte-worker, it’s happily running with approximately 40 threads.

I haven’t noticed any similar issue reports (might have missed them though!) or recent changes that may address this. Has anyone else experienced something similar? Any suggestions for things to try or gather next time this occurs (or changes to make before then!)? We’ve seen this happen 3 times now in our production cluster, across multiple airbyte versions.


My summary has an error: our VM has 64 GB of RAM total, not 32.


Thanks for the investigation @adam. I’ve asked the team to look into what resources are expected for the worker and server.

Hi @marcosmarxm,

Just to be clear, resource utilization is only part of the problem. The other problem is that airbyte-worker stops working after a number of weeks, until it is restarted. (The resource issue of course appears to be related to that.) Is that something you’re seeing in Cloud, or hearing about from other users? We’ve thought about adding actual k8s limits to airbyte-worker, but are concerned that hitting those limits would cause data loss for any in-progress syncs during the worker restart.

I’ve tracked this down to a problem with the S3 log writing: the threads are blocked on a lock that isn’t being released, and very few logs are actually making it to S3. Almost all of our threads come from the log4j2 appender for S3. It looks like we have verbose mode enabled for that appender in log4j2.xml, so I’ll see if I can find anything interesting in the local container logs.

Hi @adam, early Airbyte engineer here. We deploy Cloud multiple times a day in CD fashion, so this is a blind spot for us. Looking at our dev envs that have had Airbyte deployments running for ~1-2 weeks, I don’t really see a memory leak pattern. Of course, these deployments have much less load.

Would love to work with you to dig into your findings. I worked on the logging layer so I’m familiar with what you are saying. It is possible all of this is related to the logger layer. Can you share more about what you’ve found about the logging layer?

Happy to! I reached out to you on Slack in case you want the actual thread dump file (it’s too big to upload here). It shows almost 3k pairs (so 6k threads total) of “LoggingEventCache-publish-thread” and “TimePeriodBasedBufferMonitor-publish-trigger”. Those come from the log4j2 appender library log4j-s3-search (GitHub: bluedenim/log4j-s3-search, a Log4j appender with S3, Azure, Google Cloud, and search publishing). I can’t find any meaningful log output to figure out what those threads are blocked on, unfortunately. I do see some locks in that code, but I’m not seeing why we’d be getting stuck on them.

I also found that airbyte-worker seemingly stops flushing logs to S3 shortly after it is started, whereas airbyte-server writes them out continuously. I’ll include a screenshot below from our production S3 bucket, which Airbyte has been configured to write logs to since June. The airbyte-server prefix has orders of magnitude more data than this, without the gaps in time.

Our dev environment shows the same issue with logs not making it to S3, although we haven’t seen the thread count quite explode to the same magnitude there (at least not to the point where things locked up).

Working on the same exact problem here.

A connector sync runs, logs fail to flush from the worker, and the worker will OOM on any subsequent run.

Hey Kevin, which Airbyte deployment are you running?

We are working on something for flushing logs that should be released in the next week or so; see “Provide a way to clean up old data” (airbytehq/airbyte issue #15567 on GitHub). This should help with the space issue.

The OOM issue is more concerning to me. Can you provide more details?

Hi @davinchia - just checking if you have any updates on your investigation.

For the record, we experience the same. Our Docker instance randomly stalls and hangs, and we don’t realise it because syncs don’t fail. We’re forced to restart our instance every couple of weeks for no identifiable reason.

Has anyone found a solution for this that doesn’t involve restarts?

I’ve created a CronJob (in k8s) that executes rollout restarts of the worker, server, and temporal deployments. We have been hitting this issue since August, and we never found any real solution.
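
For anyone wanting to try the same workaround, here is a minimal sketch of what such a CronJob could look like. It assumes a ServiceAccount named airbyte-restarter already exists with RBAC permission to restart deployments, and the deployment names, image, and schedule below are illustrative assumptions, not details taken from the post above:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: airbyte-periodic-restart
spec:
  schedule: "0 4 * * 0"                 # illustrative: every Sunday at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: airbyte-restarter   # assumed to have rights to restart deployments
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest        # any image that ships kubectl works
              command:
                - /bin/sh
                - -c
                - |
                  kubectl rollout restart deployment/airbyte-worker
                  kubectl rollout restart deployment/airbyte-server
                  kubectl rollout restart deployment/airbyte-temporal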

Same issue here. We noticed the way the memory usage creeps up seems to follow a predictable pattern, but we have not identified the root cause either.
[Screenshot: memory usage graph showing the recurring creep-up pattern]

Anyone have updates on this issue? We still experience it from time to time without much luck of identifying root cause.

My current workaround for this was to enable the container orchestrator (requires a little bit of customization to deploy with the helm charts, but is doable via extra env) and add a memory limit on airbyte-worker. Container orchestrator allows airbyte-worker restarts without impacting any sync activity, and the limit (I went with 2 GB) forces the airbyte-worker pod to restart before hitting this issue.


@adam Hello Adam. Can you please share the steps you took to solve the issue (the changes to the helm chart)?

+1 to @HMubaireek’s request, can you please share how you customized the helm chart to enable container orchestrators?

From our values.yaml:

worker:
  resources:
    limits:
      memory: 2Gi
  image:
    tag: ${airbyte_version}
  containerOrchestrator:
    enabled: true
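  # The STATE_STORAGE_S3_* entries below point the container orchestrator at an S3
  # bucket where it persists job state; per the discussion above, this is what lets
  # airbyte-worker restart without interrupting in-flight syncs.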
  extraEnv:
    - name: CONTAINER_ORCHESTRATOR_ENABLED
      valueFrom:
        configMapKeyRef:
          key: CONTAINER_ORCHESTRATOR_ENABLED
          name: airbyte-airbyte-env
    - name: STATE_STORAGE_S3_BUCKET_REGION
      valueFrom:
        configMapKeyRef:
          key: S3_LOG_BUCKET_REGION
          name: airbyte-airbyte-env
    - name: STATE_STORAGE_S3_REGION
      valueFrom:
        configMapKeyRef:
          key: S3_LOG_BUCKET_REGION
          name: airbyte-airbyte-env
    - name: STATE_STORAGE_S3_BUCKET_NAME
      value: ${state_bucket}
    - name: STATE_STORAGE_S3_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: ACCESS_KEY_ID
          name: aws-iam-creds
    - name: STATE_STORAGE_S3_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: ACCESS_KEY_SECRET
          name: aws-iam-creds