About once a month, our Airbyte deployment gets completely hosed - jobs start hitting errors when launching new pods and syncs can't run anymore…until airbyte-worker is restarted. Looking at airbyte-worker while it was in this state today, we noticed two interesting things:
It was using a large amount of RAM - upwards of 8 GB
The main java process for airbyte-worker had 29821 running threads!
When this occurs, new jobs are not able to spawn the threads they need in airbyte-worker:
2022-08-18 06:00:56 source > [0.017s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
2022-08-18 06:00:56 source > Error occurred during initialization of VM
2022-08-18 06:00:56 source > java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
The overall VM has plenty of resources available, and k8s is not killing the container due to reaching limits.
After restarting airbyte-worker, it’s happily running using approximately 40 threads.
I haven't noticed any similar issue reports (might have missed them though!) or recent changes that may address this. Has anyone else experienced something similar? Any suggestions for things to try or gather next time this occurs (or changes to make before then!)? We've seen this happen 3 times now in our production cluster, across multiple Airbyte versions.
Just to be clear, resource utilization is only part of the problem. The other problem is that airbyte-worker stops functioning after a number of weeks, until it is restarted. (The resource issue of course appears to be related to that.) Is that something you're seeing in Cloud, or hearing about from other users? We've thought about adding actual k8s limits to airbyte-worker, but are concerned that hitting those limits would mean losing any in-progress syncs when the worker restarts.
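For reference, what we had in mind is just a Helm values override on the worker, roughly like the sketch below (this assumes the chart exposes worker.resources - the key paths and the limit value are illustrative, so double-check against your chart version):

```yaml
# Hypothetical values.yaml override for airbyte-worker
# (assumes the chart exposes worker.resources; verify against your chart version)
worker:
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi   # illustrative; a limit low enough that the pod restarts before the leak locks everything up
```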
I’ve tracked this down to a problem in the S3 file writing for logs - the threads are blocking on a lock that isn’t being released, and very few logs are actually making it to S3. Almost all of our threads are from the log4j2 appender for S3. Looks like we have verbose mode enabled for the appender in log4j2.xml, so I’ll see if I can find anything interesting in the local container logs.
Hi @adam, early Airbyte engineer here. We deploy Cloud multiple times a day in CD fashion, so this is a blind spot for us. Looking at our dev envs that have Airbyte deployments running for ~1-2 weeks, I don't really see a memory leak pattern. Of course, these deployments have much less load.
Would love to work with you to dig into your findings. I worked on the logging layer, so I'm familiar with what you are describing, and it's possible all of this is related to that layer. Can you share more about what you've found?
Happy to! I reached out to you on Slack in case you want the actual thread dump file (it's too big to upload here). It shows almost 3k pairs (so 6k total) of "LoggingEventCache-publish-thread" and "TimePeriodBasedBufferMonitor-publish-trigger" threads. Those come from the log4j2 appender library: GitHub - bluedenim/log4j-s3-search: Log4j appender with S3, Azure, Google Cloud, and search publishing. I can't find any meaningful log output to figure out what those threads are blocked on, unfortunately. I do see some locks in this code, but I'm not seeing why we'd be getting stuck on them.
I also found that airbyte-worker seemingly stops flushing logs to S3 shortly after it is started, while airbyte-server writes them out continuously. I'll include a screenshot below from our production S3 bucket, which Airbyte has been configured to write to since June. The airbyte-server prefix has orders of magnitude more data than this one, without the gaps in time.
Our dev environment shows the same issue with logs not making it to S3, although we haven’t seen the thread count quite explode to the same magnitude there (at least not to the point where things locked up).
For the record, we experience the same. Our Docker instance randomly stalls and hangs, and we don't realise it because syncs don't fail. We're forced to restart our instance every couple of weeks for no apparent reason.
I've created a CronJob (in k8s) that executes rollout restarts for the worker, server and temporal deployments. We have been dealing with this issue since August, and we never found any solution.
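In case it helps anyone, a rough sketch of what that CronJob looks like - the deployment names, schedule, image and ServiceAccount are illustrative, and the ServiceAccount needs RBAC permission to patch the Airbyte deployments:

```yaml
# Rough sketch of a weekly restart CronJob; names, schedule and RBAC are illustrative,
# and the ServiceAccount must be allowed to patch the Airbyte deployments
apiVersion: batch/v1
kind: CronJob
metadata:
  name: airbyte-rollout-restart
spec:
  schedule: "0 4 * * 0"   # once a week, before the worker tends to lock up
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: airbyte-restarter
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - kubectl rollout restart deployment/airbyte-worker deployment/airbyte-server deployment/airbyte-temporal
```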
Same issue here. We noticed that the memory usage creeps up in what seems to be a predictable pattern, but we have not identified the root cause either.
My current workaround for this was to enable the container orchestrator (it requires a little bit of customization to deploy with the Helm charts, but is doable via extra env) and add a memory limit on airbyte-worker. The container orchestrator allows airbyte-worker to restart without impacting any in-flight sync activity, and the limit (I went with 2 GB) forces the airbyte-worker pod to restart before hitting this issue.
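Roughly what that looks like in our Helm values, in case it's useful - this assumes the chart exposes worker.extraEnv and that the orchestrator is toggled via the CONTAINER_ORCHESTRATOR_ENABLED env var, so double-check both against your Airbyte/chart version:

```yaml
# Sketch of the worker override enabling the container orchestrator
# (assumes worker.extraEnv is supported and CONTAINER_ORCHESTRATOR_ENABLED is the right
# toggle for your Airbyte version); the 2 GB memory limit goes under worker.resources
# as sketched earlier in the thread
worker:
  extraEnv:
    - name: CONTAINER_ORCHESTRATOR_ENABLED
      value: "true"
```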