Source creation failure in self-managed OSS instance on GKE

Summary

Source creation fails in a self-managed OSS instance on GKE that uses Google Cloud Storage for storage and logs. The UI returns a 504 "unknown error" while the check pod is initializing, and the source never becomes available in Airbyte.


Question

Does anyone know why source creation fails constantly? I'm running a self-managed OSS instance on Google Kubernetes Engine, using Google Cloud Storage for persistent storage and logs. I've attached two screenshots here; one shows the error in the UI. When I click the "Set up source" button, it seems to trigger creation of a new pod, but the pod takes some time to come online. During pod initialization the UI throws a 504 "unknown error". The pod that checks the data source succeeds, but the status doesn't get saved anywhere, and the source is not available to use in Airbyte.

Any idea why this happens and how to solve this?



This topic has been created from a Slack thread to give it more visibility.

["source-creation", "self-managed", "OSS-instance", "GKE", "Google-Cloud-Storage", "pod-initialization", "504-error", "source-unavailable", "Airbyte"]

How are you exposing your GKE cluster? And are you putting any auth in front of it (e.g. IAP or the like)?

Most commonly this happens because the auth proxy you've set up, whether an HTTP(S) LB or your own nginx, has a default HTTP timeout on the backend connection. Because GKE needs to spin up the node and then wait for the actual workload to run, the connection check is usually where this shows up. (You don't see it in real sync runs because there isn't a client waiting on the backend; it's all happening in the background.)

More than likely you'll find that raising the HTTP backend timeout in your LB or reverse proxy will solve the problem. You usually want to set this suitably high for your worst-case workload (e.g. 600 or 1200 seconds). That seems like a long time, but keep in mind that some sources respond slowly, and in the case of database backends, listing thousands of schemas can take a while.
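To illustrate with one common setup (not the one in this thread, where a GCP Gateway is used): if the proxy in front of the webapp happens to be ingress-nginx, the knobs are the proxy timeout annotations. A minimal sketch with placeholder names and host:

```yaml
# Hypothetical ingress-nginx Ingress for the Airbyte webapp.
# The proxy-*-timeout annotations raise the backend timeout so the
# connection-check request isn't cut off at the default 60s.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airbyte-webapp            # placeholder name
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  ingressClassName: nginx
  rules:
    - host: airbyte.example.com   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: airbyte-airbyte-webapp-svc  # placeholder; your chart's webapp Service
                port:
                  number: 80
```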

There are more specific cases of this; for example, if you're using an auth proxy, you may also have to adjust timeouts there.

<@U035912NS77> thanks for the heads up. I've only created a gateway to expose both the Temporal and Airbyte UIs publicly. Everything else is defaults, so either set by the Airbyte Helm charts or GKE Autopilot. Really appreciate the help; this gives me a place to start looking.

I have an IAP proxy in front of the Temporal UI, but the Airbyte UI has no additional auth in front of it. Anyway, I think I have to look into timeouts for the Airbyte server connection.

Are you using a GCP load balancer for the proxy? If so, you'll find the timeout in the backend config in the UI (or in the BackendConfig object if you're provisioning it via YAML). Otherwise, refer to the docs for your proxy of choice and they should get you there.
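For concreteness, a minimal sketch of the BackendConfig route for a classic GKE Ingress; the names and ports are placeholders, and timeoutSec is the field being discussed:

```yaml
# Hypothetical BackendConfig raising the backend service timeout
# on a GKE Ingress-managed load balancer.
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: airbyte-webapp-timeout   # placeholder name
spec:
  timeoutSec: 600                # allow for worst-case connection checks
---
# The Service opts in to the BackendConfig via annotation.
apiVersion: v1
kind: Service
metadata:
  name: airbyte-webapp           # placeholder; your chart's webapp Service
  annotations:
    cloud.google.com/backend-config: '{"default": "airbyte-webapp-timeout"}'
spec:
  selector:
    app: airbyte-webapp
  ports:
    - port: 80
      targetPort: 8080
```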

<@U035912NS77> I'm using the Gateway API and defining ingress as HTTPRoutes. They're configured a bit differently and don't use BackendConfig; I think they use policies, so I'll refer to the docs. I think I just have to create a policy for the Airbyte UI route that sets a longer timeout. Really appreciate your help, thank you.

Cool. If you run into any other issues, feel free to PM; we're full Google stack (GKE Autopilot, Cloud SQL for the external DB, Cloud Storage for logs/state, Secret Manager, IAP for access... very similar to you, except that we use an HTTP(S) LB + IAP for access to the UI/API). So there's a good chance we've run into similar things all around.

Update: I was able to get this resolved. Just in case someone stumbles into this issue later, the solution was exactly what <@U035912NS77> mentioned above. You have to define the backend timeout for the Airbyte webapp ingress.

Since I'm using the new Gateway API and running on GKE, the solution was to create a policy that targets the airbyte-webapp service. The relevant documentation is here:

https://cloud.google.com/kubernetes-engine/docs/how-to/configure-gateway-resources#configure_timeout
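For anyone who lands here later, here is roughly what such a policy looks like per the linked GKE docs: a GCPBackendPolicy whose default.timeoutSec raises the load balancer's backend timeout for the targeted Service. Treat this as a sketch; the name, namespace, and Service name are placeholders for whatever your Helm release creates:

```yaml
# Hypothetical GCPBackendPolicy raising the backend timeout for the
# Airbyte webapp Service behind a GKE Gateway / HTTPRoute.
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: airbyte-webapp-timeout        # placeholder name
  namespace: airbyte                  # placeholder namespace
spec:
  default:
    timeoutSec: 600                   # allow slow connection checks
  targetRef:
    group: ""
    kind: Service
    name: airbyte-airbyte-webapp-svc  # placeholder; your webapp Service
```

Apply it in the same namespace as the Service; the Gateway controller then reconciles the timeout onto the backend service it provisions.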