Resource leak in airbyte-worker

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Amazon Linux 2
  • Memory / Disk: 32 GB RAM
  • Deployment: k8s
  • Airbyte Version: 0.39.35
  • Source name/version: many
  • Destination name/version: redshift
  • Step: Sync

About once a month, our Airbyte deployment gets completely hosed: jobs start failing to launch new pods and syncs can’t run anymore until airbyte-worker is restarted. Looking at airbyte-worker while it was in this state today, we noticed two interesting things:

  1. It was using a large amount of RAM: upwards of 8 GB
  2. The main java process for airbyte-worker had 29821 running threads!

When this occurs, new jobs are not able to spawn the threads they need in airbyte-worker:

2022-08-18 06:00:56 source > [0.017s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
2022-08-18 06:00:56 source > Error occurred during initialization of VM
2022-08-18 06:00:56 source > java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

The overall VM has plenty of resources available, and k8s is not killing the container due to reaching limits.

After restarting airbyte-worker, it’s happily running using approximately 40 threads.
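For reference, here’s roughly how you can check thread counts from inside the JVM: a minimal sketch that groups live threads by name prefix (the class and the prefix-stripping regex are mine, not anything Airbyte ships):

    import java.util.Map;
    import java.util.TreeMap;

    // Sketch: group live JVM threads by name prefix to spot runaway growth.
    public class ThreadCensus {
        public static void main(String[] args) {
            Map<String, Integer> counts = new TreeMap<>();
            for (Thread t : Thread.getAllStackTraces().keySet()) {
                // Strip a trailing "-<n>" so numbered threads collapse into one bucket.
                String prefix = t.getName().replaceAll("-\\d+$", "");
                counts.merge(prefix, 1, Integer::sum);
            }
            counts.forEach((name, n) -> System.out.printf("%6d  %s%n", n, name));
        }
    }

In the wedged state this makes it obvious which thread names dominate.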

I haven’t noticed any similar issue reports (might have missed them though!) or recent changes that may address this. Has anyone else experienced something similar? Any suggestions for things to try or gather next time this occurs (or changes to make before then!)? We’ve seen this happen 3 times now in our production cluster, across multiple airbyte versions.

My summary has an error: our VM has 64 GB of RAM total, not 32.

Thanks for the investigation @adam! I’ve asked the team to look into what the expected resource usage is for the worker and server.

Hi @marcosmarxm,

Just to be clear, resource utilization is only part of the problem. The other problem is that airbyte-worker stops working after a number of weeks, until it is restarted. (The resource issue of course appears to be related to that.) Is that something you’re seeing in Cloud, or hearing about from other users? We’ve thought about adding actual k8s limits to airbyte-worker, but are concerned that hitting those limits would cause data loss for any in-progress syncs when the worker restarts.

I’ve tracked this down to a problem in the S3 file writing for logs: the threads are blocking on a lock that isn’t being released, and very few logs are actually making it to S3. Almost all of our threads come from the log4j2 appender for S3. It looks like we have verbose mode enabled for the appender in log4j2.xml, so I’ll see if I can find anything interesting in the local container logs.
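In case it’s useful to others, this is the kind of check I used to see what the threads were parked on: a sketch using the standard ThreadMXBean API (nothing here is specific to Airbyte or the appender library):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch: report threads that are BLOCKED or WAITING and the lock involved.
    public class BlockedThreadReport {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                Thread.State state = info.getThreadState();
                if (state == Thread.State.BLOCKED || state == Thread.State.WAITING) {
                    System.out.printf("%s [%s] on %s held by %s%n",
                            info.getThreadName(), state,
                            info.getLockName(),        // lock identity, if any
                            info.getLockOwnerName());  // null when no owner is known
                }
            }
        }
    }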

Hi @adam, early Airbyte engineer here. We deploy Cloud multiple times a day in CD fashion, so this is a blind spot for us. Looking at our dev envs that have Airbyte deployments running for ~1-2 weeks, I don’t really see a memory leak pattern. Of course, those deployments have much less load.

Would love to work with you to dig into your findings. I worked on the logging layer, so I’m familiar with what you are describing, and it’s possible all of this is related to it. Can you share more about what you’ve found in the logging layer?

Happy to! I reached out to you on Slack in case you want the actual thread dump file (it’s too big to upload here). It shows almost 3k pairs (so ~6k threads total) of “LoggingEventCache-publish-thread” and “TimePeriodBasedBufferMonitor-publish-trigger”. Those come from the log4j2-appender library, bluedenim/log4j-s3-search on GitHub (a Log4j appender with S3, Azure, Google Cloud, and search publishing). I can’t find any meaningful log output to figure out what those threads are blocked on, unfortunately. I do see some locks in this code, but I’m not seeing why we’d be getting stuck on them.
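If anyone wants to reproduce the tally on their own dump, here’s a quick sketch that counts jstack thread names by prefix (dump.txt is just wherever you saved the jstack output):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: tally thread names in a jstack dump, collapsing trailing "-<n>" suffixes.
    public class DumpTally {
        public static void main(String[] args) throws IOException {
            Pattern name = Pattern.compile("^\"([^\"]+)\"");  // jstack lines start with "thread name"
            Map<String, Integer> counts = new TreeMap<>();
            for (String line : Files.readAllLines(Paths.get("dump.txt"))) {
                Matcher m = name.matcher(line);
                if (m.find()) {
                    counts.merge(m.group(1).replaceAll("-\\d+$", ""), 1, Integer::sum);
                }
            }
            counts.forEach((k, v) -> System.out.printf("%6d  %s%n", v, k));
        }
    }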

I also found that airbyte-worker seemingly stops flushing logs to S3 shortly after it starts, while airbyte-server writes them out continuously. I’ll include a screenshot below from our production S3 bucket, which Airbyte has been configured to write to since June. The airbyte-server prefix has orders of magnitude more data than this, without the gaps in time.
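If you want to check the same thing in your own bucket without clicking around the console, here’s a sketch using the AWS SDK for Java v2 that lists the newest objects under a prefix (bucket and prefix names are placeholders for your own log config):

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

    // Sketch: print the most recent log objects under the worker's log prefix
    // to see whether the appender is still flushing. Bucket/prefix are placeholders.
    public class LogFlushCheck {
        public static void main(String[] args) {
            try (S3Client s3 = S3Client.create()) {
                ListObjectsV2Request req = ListObjectsV2Request.builder()
                        .bucket("my-airbyte-logs")   // placeholder: your log bucket
                        .prefix("job-logging/")      // placeholder: your worker log prefix
                        .build();
                s3.listObjectsV2Paginator(req).contents().stream()
                        .sorted((a, b) -> b.lastModified().compareTo(a.lastModified()))
                        .limit(20)
                        .forEach(o -> System.out.println(o.lastModified() + "  " + o.key()));
            }
        }
    }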

Our dev environment shows the same issue with logs not making it to S3, although we haven’t seen the thread count quite explode to the same magnitude there (at least not to the point where things locked up).

Working on the same exact problem here.

A connector sync runs, logs fail to flush from the worker, and the worker will OOM on any subsequent run.

Hey Kevin, which Airbyte deployment are you running?

We are working on something for flushing logs that should be released in the next week or so; see “Provide a way to clean up old data” (airbytehq/airbyte#15567 on GitHub). This should help with the space issue.

The OOM issue is more concerning to me. Can you provide more details?