Schema Discovery Timeout Issue on Airbyte Server in Kubernetes Deployment

Summary

An Airbyte server (version 0.63.3) deployed on a Google Kubernetes Engine cluster is hitting schema discovery timeouts, leading to 502 errors when configuring the File Source connector and a custom REST API connector. The user questions the need for a full schema discovery at each sync and reports an unsuccessful attempt to configure schema discovery timeouts via the Helm charts.


Question

Hello everyone,
We are using Airbyte version 0.63.3 deployed on a Google Cloud Kubernetes Engine cluster. I have discovered that the Airbyte server in this version is too impatient when it comes to schema discovery. I am trying to an Excel file that is generated by a web service. The generation of this Excel file takes more than one minute. The web service is a developed by a third party so I have no control on it. When I try to configure the File Source connector for this Excel file, after 1 minute I get the generic error “Airbyte server is temporarily unavailable (502)”. However, the Kubernetes Pod running File Source schema discovery continues to run. By the time this Pod finishes schema discovery the Airbyte server has completely forgot about it. Therefore I simply cannot configure this Source.
I had a very similar problem with a REST API custom connector that we built. I had to split the set of the source streams into batches of no more than 30 streams, because the schema discovery took too long, and I was hitting this 502 error.
Another thing that I do not understand is why the Airbyte server needs to do a full schema discovery at the beginning of each sync. I already did the schema discovery when I configured the connection. Now I just want to sync that connection. Why go through the process of schema discovery again?
I tried to configure these settings https://docs.airbyte.com/enterprise-setup/scaling-airbyte#schema-discovery-timeouts on our Kubernetes deployment using Helm charts. However, Airbyte completely ignores them.
Thank you.



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here.

["schema-discovery-timeout", "airbyte-server", "kubernetes-deployment", "file-source-connector", "rest-api-connector", "502-error", "schema-discovery", "helm-charts"]

I believe the environment variables you mentioned only apply to the built-in nginx proxy, which is most likely not used in your deployment.

I’m assuming you’re using an HTTP/S Load Balancer on GCP, and in that case you need to set the backend service timeout on the load balancer. For our connectors, 600-1200 seconds has been suitable, but connection checks and schema discovery are both the slowest tasks (especially when Autopilot needs to spin up a new node before the new pods can be scheduled).
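If it helps, this is roughly how we bump the timeout with gcloud. The backend service name below is a placeholder; GKE generates one per Ingress backend, so list them first:

```shell
# List the backend services GKE created for the Ingress
gcloud compute backend-services list

# Raise the backend service timeout (name is a placeholder for your own)
gcloud compute backend-services update k8s-be-airbyte-server \
  --global \
  --timeout=600
```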

I have a WIP version trying to trick Helm into deploying the LB directly (by pre-creating some backend and frontend config objects up front and referencing them via annotations), but it’s a bit of a hack right now. Until that’s cleaner, I would recommend just managing the LB separately and making the change directly.
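For reference, the less hacky route on GKE is to attach a BackendConfig to the server’s Service via an annotation; the GKE Ingress controller then applies the timeout to the backend service it generates. A sketch (the resource names and namespace are assumptions, and the Service name the Helm chart generates may differ):

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: airbyte-server-backendconfig
  namespace: airbyte            # assumed namespace
spec:
  timeoutSec: 600               # backend service timeout in seconds
---
apiVersion: v1
kind: Service
metadata:
  name: airbyte-airbyte-server-svc   # check the name your chart generates
  namespace: airbyte
  annotations:
    cloud.google.com/backend-config: '{"default": "airbyte-server-backendconfig"}'
```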

Yes, you are correct. I am using a Google Cloud Load Balancer set up via the Kubernetes Ingress. I have edited the values.yaml file and added this:

```yaml
server:
  extraEnv:
    - name: HTTP_IDLE_TIMEOUT
      value: 20m
    - name: READ_TIMEOUT
      value: 30m
```
and modified the Google Cloud Backend timeout from 30 seconds to 5 minutes.

And now it seems to be working.

interesting, I wonder if there was a recent change there . . . I haven’t previously had any luck with those settings ending up on the LB

You may want to look at your LB config and see if it actually shows the backend service timeout correctly or if you just got lucky and had a warm start (available node capacity) when you tested, which usually makes the checks quite fast.
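To check, something like this should print the configured timeout (again, the backend service name is a placeholder):

```shell
gcloud compute backend-services describe k8s-be-airbyte-server \
  --global \
  --format="value(timeoutSec)"
```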

No, no. The values.yaml settings apply only to the environment variables of the managed airbyte-server pod.

The timeout at the Load Balancer I had to configure manually.

Ah, I see. I haven’t had any further issues with the 502s once that was set, so you’ll likely be good now.

It was extremely frustrating.

For our deployment, we’re using a private GKE Autopilot cluster, shared VPC, HTTP/S LB, Cloud NAT for stable outbound IP, Cloud SQL for Postgres, Identity-Aware Proxy for access control . . . basically the whole Google stack. So feel free to reach out if you run into any other issues, there’s a good chance we’ve seen them before :slightly_smiling_face:

Thank you very much. We are using a Standard GKE cluster (not Autopilot) and the database is hosted by the airbyte-db container.

I’m also using GKE and getting what I think is a timeout when fetching a certain source schema.

I’ve added:

```yaml
server:
  extraEnv:
    - name: HTTP_IDLE_TIMEOUT
      value: 20m
    - name: READ_TIMEOUT
      value: 30m
```

and also timeouts to nginx

```yaml
kind: Ingress
metadata:
  name: prod-airbyte-ingress
  namespace: prod-airbyte
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300s"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300s"
```
It always fails after exactly 1 minute, so I’m not sure what the deal is yet. Still digging.
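One possibly relevant detail: as far as I know, the ingress-nginx timeout annotations take a plain number of seconds (no `s` suffix), so a value like `"300s"` may be rejected and nginx falls back to its 60-second default, which would match the failure at exactly 1 minute:

```yaml
annotations:
  nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
  nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
```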

I’m not using NGINX but a classic Google Cloud Load Balancer. I’ve applied the same Airbyte server settings as you did and configured my backend with a timeout of 5 minutes. For me this solved the timeout issues.