Airbyte deployment issue on GCP CE

Summary

Workers stop functioning intermittently in Airbyte deployment on GCP CE, requiring container cleanup and recreation. Seeking guidance on resolving the issue and implementing a health check service for monitoring service availability.


Question

Hi Team,

We’ve encountered an issue with our Airbyte deployment on GCP CE. Randomly, our workers seem to stop functioning while the application UI remains operational. Syncing and connection setups halt intermittently. We’ve observed that a workaround involves cleaning all containers and recreating them, after which functionality resumes as expected.

Could you provide guidance on resolving this issue? The error details can be found https://docs.google.com/document/d/1mg067RhSh5gZ46sytjVasEbD13qKFLXO-K2Q4Imopf0/edit|here.

Additionally, we’re in need of a health check service to monitor service availability. We’d like confirmation that the health check API (https://reference.airbyte.com/reference/get_health|link) will alert us to any service outages. Alternatively, do you have suggestions on how we can detect if a worker or specific container is running but unresponsive?



This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want
to access the original thread.

Join the conversation on Slack

["airbyte-deployment", "gcp-ce", "workers", "container", "issue", "health-check", "service-availability"]

I would recommend checking the resources of the job deployments. These can be configured by the JOB_MAIN_CONTAINER_* environment variables, or by updating the resource_requirements field in the database for a specific connection.

Sometimes the main Worker container (which spins up and monitors the actual sync worker jobs) also needs higher limits, so you may want to consider increasing those as well