Deleted connection still being retried again and again

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: GKE Container Optimized Linux
  • Memory / Disk: 4 GB / 1 TB
  • Deployment: Kubernetes on GKE
  • Airbyte Version: 0.39.42
  • Source name/version: Intercom
  • Destination name/version: PubSub
  • Step: Deleted connection still being retried by Temporal
  • Description:
    We have a deleted connection in Airbyte that is somehow still being retried (and failing) again and again. Is there a way to clean up the Temporal schedule state/tasks for a given connection ID? I am not sure which tables to look at or what to drop/delete. I am happy to write a custom snippet against the Temporal DB and drop/disable it myself if you can point me to an example in the code of how things are keyed by connection ID. I am not sure how or when it got into this state.
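
For what it's worth, this is the kind of cleanup I had in mind, a rough sketch assuming tctl can reach the Temporal frontend directly and that Airbyte uses the default namespace (the workflow ID is the one that shows up in the temporal logs for that connection; terminate goes through the frontend, so it should not need DB credentials):

export TEMPORAL_CLI_ADDRESS=<temporal_pod_ip>:7233
# inspect the stray connection manager workflow first
tctl --namespace default workflow describe --workflow_id "<wf_id from temporal_logs for that connection>"
# then terminate it so it stops being retried
tctl --namespace default workflow terminate --workflow_id "<wf_id from temporal_logs for that connection>" --reason "connection deleted in Airbyte"

I do not know whether something on the Airbyte side would simply re-create the workflow afterwards, but it would at least confirm whether the retries come from that single workflow execution.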

I have checked that the connection is deleted and no longer shows up; using its ID explicitly in the URL shows:
[image]
The last sync was cancelled two weeks ago, when the connection was deleted.

I checked the Airbyte connection and jobs tables and they reflect things properly: there are no updates for that connection ID after the deletion date.
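For reference, these are roughly the checks I ran, a sketch assuming the default Airbyte config/jobs database schema (table and column names may differ between versions, and <airbyte_db>/<connection_id> are placeholders):

# the connection row itself; deleted connections should end up with status "deprecated" as far as I can tell
psql -h localhost -U <user> -d <airbyte_db> -c "SELECT id, status, updated_at FROM connection WHERE id = '<connection_id>';"
# the last few jobs for that connection (scope holds the connection id for sync jobs)
psql -h localhost -U <user> -d <airbyte_db> -c "SELECT id, status, created_at, updated_at FROM jobs WHERE scope = '<connection_id>' ORDER BY created_at DESC LIMIT 5;"

Both look as expected on our side: nothing changes after the deletion date.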

The connection manager workflow logs look like this:

The workflow is retried ~30 times a second, which both produces a huge amount of noisy logs and gradually grows the size of the Temporal database.

I tried using tctl, but it is giving me problems connecting to the DB (we run Google's cloud_sql_proxy in another container inside the pod and just connect to localhost):

TEMPORAL_CLI_SHOW_STACKS=1 TEMPORAL_CLI_ADDRESS=<temporal_pod_ip>:7233 tctl admin wf delete --db_engine postgres --db_address localhost --db_port 5432 --username <user> --password <pwd> --workflow_id "<wf_id from temporal_logs for that connection>"
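
Presumably the DB connection problem is that, from my workstation, localhost does not resolve to the cloud_sql_proxy sidecar. One option might be to run the same command from inside the temporal pod itself; a sketch, assuming the temporal image ships with tctl and the proxy listens on 5432:

kubectl -n <airbyte_namespace> exec -it <temporal_pod> -c <temporal_container> -- /bin/sh
# from inside the pod, localhost:5432 is the cloud_sql_proxy sidecar and localhost:7233 is the Temporal frontend
TEMPORAL_CLI_SHOW_STACKS=1 TEMPORAL_CLI_ADDRESS=localhost:7233 tctl admin wf delete --db_engine postgres --db_address localhost --db_port 5432 --username <user> --password <pwd> --workflow_id "<wf_id from temporal_logs for that connection>"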

Any help/pointers in stopping this appreciated.

cc @lgomezm

This fiasco continued daily after I posted; then, randomly, we started seeing more errors on Temporal, things like context deadline exceeded,

but things were still working fine. Then today we started seeing this error:
io.temporal.worker.NonDeterministicException: Failure handling event 60 of type 'EVENT_TYPE_MARKER_RECORDED' during replay

and then this: Not enough hosts to serve the request

and then this: consistent query buffer is full, cannot accept new consistent queries

Now none of the syncs are running fine. We tried bringing the stack down and back up a few times, but no luck.

Any idea what is going on, or pointers on what to debug in Temporal? Appreciate it.
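
For anyone else hitting this, this is roughly what I have been checking so far, a sketch assuming the stock Airbyte Kubernetes deployment where Temporal runs as a single airbyte-temporal pod (pod and namespace names below are placeholders):

# are the temporal pods up, and do restarted ones show anything useful before crashing?
kubectl -n <airbyte_namespace> get pods | grep -i temporal
kubectl -n <airbyte_namespace> logs <temporal_pod> --previous | tail -n 200
# basic health check against the Temporal frontend
TEMPORAL_CLI_ADDRESS=<temporal_pod_ip>:7233 tctl cluster health

From what I understand, "Not enough hosts to serve the request" usually means one of Temporal's internal services (history/matching) has dropped out of the membership ring, so pod restarts and readiness seemed like the first things to rule out, but I may be off here.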

Hello there! You are receiving this message because none of your fellow community members has stepped in to respond to your topic post. (If you are a community member and you are reading this response, feel free to jump in if you have the answer!) As a result, the Community Assistance Team has been made aware of this topic and will be investigating and responding as quickly as possible.
Some important considerations that will help you get your issue solved faster:

  • It is best to use our topic creation template; if you haven’t yet, we recommend posting a follow-up with the requested information. With that information the team will be able to search for similar issues with connectors and the platform and troubleshoot your specific question or problem more quickly.
  • Make sure to upload the complete log file; a common investigation roadblock is that sometimes the error for the issue happens well before the problem is surfaced to the user, and so having the tail of the log is less useful than having the whole log to scan through.
  • Be as descriptive and specific as possible; when investigating it is extremely valuable to know what steps were taken to encounter the issue, what version of connector / platform / Java / Python / docker / k8s was used, etc. The more context supplied, the quicker the investigation can start on your topic and the faster we can drive towards an answer.
  • We in the Community Assistance Team are glad you’ve made yourself part of our community, and we’ll do our best to answer your questions and resolve the problems as quickly as possible. Expect to hear from a specific team member as soon as possible.

Thank you for your time and attention.
Best,
The Community Assistance Team

Just as an FYI, we ended up creating a completely new namespace for Airbyte, with a new DB and a new log bucket, and exported/imported the configuration from the old instance. While we would have loved to figure out what went wrong, in the interest of time that seemed the better option. Thanks to anyone taking the time to read this vague post, appreciate it.

Hey @pras, sorry for the delay, and sorry that you had to stand up a completely new Airbyte instance to make it work. I think there have been other reports of similar behavior from other users, and I don’t believe we’ve traced the issue yet. If it happens again, please reply here or make a new post so we can escalate the issue.

If other community members here are experiencing similar issues, please post your experience below or in a new thread (if this thread is closed), thanks!