On the 26th of April we hit an error where all of our syncs stopped working. They were stuck in either a “pending” or a “running” state for about 8 days.
Any attempt to restart a sync through the web UI left it stuck in the “pending” state until it was cancelled.
DevOps managed to fix the problem with docker-compose restart, and it hasn’t happened since.
What bothers us is that nothing indicated everything had been stopped for that long. One of our data scientists noticed that a specific set of data hadn’t been updated in over a week, and that’s what prompted me to check the web UI.
Is there any kind of mechanism available to alert us when it’s stuck like that? My boss suggested asking for a webhook that fires on sync completion, but anything else that gets the same result is welcome.
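To make the request concrete, here is a rough sketch of the kind of check we would otherwise have to script ourselves. Everything in it is an assumption for illustration: the `SyncJob` record shape, the status strings, and the 6-hour threshold are made up and not taken from any real API. It only shows the logic of flagging jobs that have sat in “pending” or “running” for too long.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SyncJob:
    # Hypothetical record shape, for illustration only.
    name: str
    status: str            # e.g. "pending", "running", "succeeded"
    updated_at: datetime   # last time the job changed state


def find_stuck_jobs(jobs, now, max_age=timedelta(hours=6)):
    """Return jobs that have been pending/running for longer than max_age."""
    return [
        job for job in jobs
        if job.status in ("pending", "running")
        and now - job.updated_at > max_age
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Made-up sample data: one job stuck for 8 days, one recently queued,
    # one that finished long ago.
    jobs = [
        SyncJob("orders", "running", now - timedelta(days=8)),
        SyncJob("users", "pending", now - timedelta(minutes=10)),
        SyncJob("events", "succeeded", now - timedelta(days=9)),
    ]
    for job in find_stuck_jobs(jobs, now):
        # In practice this would send an email or chat alert instead.
        print(f"STUCK: {job.name} ({job.status})")
```

Running something like this from cron would work, but it depends on being able to read job status and timestamps from somewhere, which is why a built-in alert or webhook would be preferable.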