Recovering from Temporal Workflow Deadlock in Self-hosted Airbyte on Docker

Summary

The user hit an apparent deadlock in Temporal workflows on a self-hosted Airbyte deployment running on Docker, which left the API and webapp unresponsive. Cancelling all running workflows brought Airbyte back, but existing connections now fail to sync. They are asking how to recover from this state.


Question

Hey all
I’m self-hosting Airbyte on Docker, and recently it went into what looks like a deadlock in the Temporal workflows. We were getting a lot of “Not enough hosts to serve the request” errors in the Temporal logs, and nothing responded (neither the API nor the webapp).

We increased the instance size, tried to scale the workers, and tried to reduce requests to the server, all with no luck. We ended up going directly into the Temporal container and cancelling all running workflows. That brought Airbyte back to life, but now all existing connections fail to sync with the exception io.temporal.client.WorkflowNotFoundException: workflowId='connection_manager_<connection_id>', obviously because we cancelled their respective Temporal ConnectionManagerWorkflows. New connections work as expected.

Does anyone know a way of recovering from this state?



This topic has been created from a Slack thread to give it more visibility.

["self-hosting", "docker", "temporal-workflows", "deadlock", "connection-failure", "recovery"]

I’m trying to reset the cancelled workflows with the following command:

temporal workflow reset --workflow-id="connection_manager_<connection_id>" --type="LastContinuedAsNew" --run-id="<id_of_the_canceled_run>" --reason "fix"

However, the logs show “GetHistory page size is larger than threshold”.
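
Since every existing connection has its own connection_manager_<connection_id> workflow, the same reset would have to be repeated once per connection. Below is a minimal sketch of a wrapper around the exact command quoted above. The function name reset_connection_manager and the DRY_RUN variable are hypothetical and not part of Airbyte or the Temporal CLI. The wrapper only assembles the command; it does nothing about the “GetHistory page size is larger than threshold” error, and whether a reset actually restores the connection is not confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the reset command from the post for one connection.
# DRY_RUN defaults to 1, so the command is only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"

reset_connection_manager() {
  local connection_id="$1"   # Airbyte connection ID
  local run_id="$2"          # run ID of the cancelled workflow run

  # Same flags as the command quoted in the post.
  local cmd=(temporal workflow reset
    --workflow-id "connection_manager_${connection_id}"
    --run-id "${run_id}"
    --type LastContinuedAsNew
    --reason "fix")

  if [ "${DRY_RUN}" = "1" ]; then
    # Print the command so it can be reviewed before running it for real.
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Calling reset_connection_manager "<connection_id>" "<id_of_the_canceled_run>" with the default DRY_RUN=1 prints the command for review; setting DRY_RUN=0 runs it, which assumes the temporal CLI is on the PATH (for example, inside the Temporal container).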