Summary
The user is experiencing discrepancies with the temporal_workflow_failure_total metric when monitoring failed connections. They are looking for alternative metrics to verify failed connections and seeking advice on resolving the issue.
Question
Hello community.
I hope someone has experience with the following issue (I already asked in <#C01AHCD885S|>, but would like to hear from a living being).
We are using temporal_workflow_failure_total to monitor failed connections. From time to time it reports a value indicating that some connections are in a failed state, but there are none. There is not even a failed attempt.
Is there perhaps another metric we could use to verify failed connections?
The bot suggested clearing the Temporal database, but if a cleanup were really needed, the “failed” connections would not have resolved by themselves (the problem can persist for 2 days and then just resolve).
Thank you in advance!
This topic has been created from a Slack thread to give it more visibility.