Summary
The user is facing an issue when trying to create a source or destination on a GKE Autopilot private cluster behind a Shared VPC with IAP. The error received is `Server temporarily unavailable (http.502.my4YrLdeHmndZBTKh9j1Kr)`. The user has made the expected changes (NEG annotation, Ingress, IAP, host-project firewall rules), and the connection-check logs show successful socat connections, yet the operation still fails. They are looking for insights from others with a similar setup.
Question
Hello, I’m using a GKE Autopilot private cluster on a new (but not our first) Airbyte deployment (current app version 0.57.2, deployed via Helm with chart version 0.64.81). This is the first setup where we’ve been behind a Shared VPC (so the Airbyte project isn’t the VPC host project, just a service project). We’re using Cloud NAT for a stable outbound IP and Identity-Aware Proxy (IAP) for auth.
Overall, things deployed pretty smoothly by making only the following changes (a rough sketch of the manifests follows the list):

- Added the `cloud.google.com/neg: '{"ingress": true}'` annotation to the `airbyte-webapp-svc` Service
- Created an Ingress (external LB) for the `airbyte-webapp-svc` Service with HTTPS termination set up
- Enabled IAP for `airbyte-webapp-svc`
- Created a firewall rule in the VPC host project to allow traffic through to the Ingress LB
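For concreteness, here’s roughly what the first three changes look like as manifests. The hostnames, secret names, ports, and selectors are placeholders from my setup (and the host-project firewall rule was created outside Kubernetes), so treat this as a sketch rather than exact config:

```yaml
# Sketch only: names, host, secrets, ports, and selectors are placeholders.
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: airbyte-webapp-backendconfig
spec:
  iap:
    enabled: true
    oauthclientCredentials:
      secretName: iap-oauth-client      # OAuth client ID/secret for IAP
  timeoutSec: 300                       # raised LB backend timeout (the suggestion below; no effect for me)
---
apiVersion: v1
kind: Service
metadata:
  name: airbyte-webapp-svc
  annotations:
    cloud.google.com/neg: '{"ingress": true}'   # container-native LB via NEGs
    cloud.google.com/backend-config: '{"default": "airbyte-webapp-backendconfig"}'
spec:
  selector:
    app.kubernetes.io/name: webapp      # match your chart's webapp labels
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airbyte-webapp-ingress
spec:
  tls:
    - hosts:
        - airbyte.example.com
      secretName: airbyte-tls           # HTTPS termination at the LB
  rules:
    - host: airbyte.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: airbyte-webapp-svc
                port:
                  number: 80
```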
From there, everything seemed to work fine: IAP forced auth, Airbyte detected that it was secured, and the webapp loads and interacts correctly, including updating connector versions (which confirms that outbound Cloud NAT is working).
The only thing that doesn’t seem to work is that when I try to create a source OR destination, regardless of type, I get the following error:
`Server temporarily unavailable (http.502.my4YrLdeHmndZBTKh9j1Kr)`
After about 50 tries my BigQuery destination worked once, but I haven’t been able to get it to work since. I saw some notes like [this thread](https://discuss.airbyte.io/t/kubernetes-check-connection-issues/594/18) suggesting increasing the timeout of the created LB (the `timeoutSec` in the sketch above), which has no effect for me (and in theory pod-to-pod communication shouldn’t be going through the Ingress LB anyway). The logs on the connection-check workload always look the same, like this (in this case a Mailchimp API key connector, to reduce variables, but it happens with all of them):
```
Using existing AIRBYTE_ENTRYPOINT: python /airbyte/integration_code/main.py
Waiting on CHILD_PID 7
PARENT_PID: 1
2024/04/11 20:00:54 socat[8] N reading from and writing to stdio
2024/04/11 20:00:54 socat[8] N opening connection to AF=2 10.1.0.74:9032
2024/04/11 20:00:54 socat[8] N successfully connected from local address AF=2 10.1.1.146:45958
2024/04/11 20:00:54 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2024/04/11 20:00:54 socat[7] N reading from and writing to stdio
2024/04/11 20:00:54 socat[7] N opening connection to AF=2 10.1.0.74:9033
2024/04/11 20:00:54 socat[7] N successfully connected from local address AF=2 10.1.1.146:43238
2024/04/11 20:00:54 socat[7] N starting data transfer loop with FDs [0,1] and [5,5]
EXIT_STATUS: 0
2024/04/11 20:01:34 socat[7] N socket 1 (fd 0) is at EOF
2024/04/11 20:01:34 socat[8] N socket 1 (fd 0) is at EOF
2024/04/11 20:01:34 socat[7] N socket 2 (fd 5) is at EOF
2024/04/11 20:01:34 socat[7] N exiting with status 0
2024/04/11 20:01:34 socat[8] N socket 2 (fd 5) is at EOF
2024/04/11 20:01:34 socat[8] N exiting with status 0
Terminated
```
I'm not seeing any logs indicating resource constraints on the pods (though they're also very short-lived for connection checks).
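In case it is resources despite the quiet logs: my understanding is that the check-job pods’ requests/limits can be pinned through the worker’s environment. The variable names below come from Airbyte’s EnvConfigs, so this is an assumption to verify against chart 0.64.81, not something I’ve confirmed fixes anything:

```yaml
# values.yaml sketch -- env var names assumed from Airbyte's EnvConfigs;
# verify they are honored by chart 0.64.81 before relying on this.
worker:
  extraEnv:
    - name: CHECK_JOB_MAIN_CONTAINER_CPU_REQUEST
      value: "500m"
    - name: CHECK_JOB_MAIN_CONTAINER_CPU_LIMIT
      value: "1"
    - name: CHECK_JOB_MAIN_CONTAINER_MEMORY_REQUEST
      value: "512Mi"
    - name: CHECK_JOB_MAIN_CONTAINER_MEMORY_LIMIT
      value: "1Gi"
```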
Is anyone else using a similar GKE Autopilot setup with any insights?
---
This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1712872234158139) if you want to access the original thread.
[Join the conversation on Slack](https://slack.airbyte.com)
<sub>
["gke-autopilot", "private-cluster", "shared-vpc", "iap", "source-destination", "http-502-error", "ingress-lb", "firewall-rules", "connection-check", "pod-communication"]
</sub>