Moving large data from PostgreSQL Aurora to BigQuery on Kubernetes

Summary

The user is facing challenges moving around 17GB of data from PostgreSQL Aurora in AWS to BigQuery using Airbyte on Kubernetes. They are seeking advice on how to move more than 20GB of data from a PostgreSQL table.


Question

Hello team, I have a problem moving around 17GB of data from a PostgreSQL Aurora database in AWS to BigQuery. My Airbyte deployment is on Kubernetes.
I’ve seen in the documentation that up to 500GB of data can be moved. Has anyone faced this problem, or how have you solved moving more than 20GB from a PostgreSQL table?



This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here.


["postgresql-aurora", "bigquery", "airbyte", "kubernetes", "data-migration"]

I am using the PostgreSQL Aurora engine in AWS as the source, and our destination is Google BigQuery. I have checked all the configurations and they look correct.

We gained a few insights from our environment. We resolved this issue by incrementally increasing the resource limits in our Kubernetes environment, but we noticed another problem: raising the limits affects the sizing of the virtual machines and can increase costs. So, instead of just tweaking the limits on every pod, we tried a different strategy: extract the bulk of the information in one initial sync, and have the later jobs extract smaller portions of data.
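For reference, the per-job container resources in an Airbyte Kubernetes deployment are controlled through the `JOB_MAIN_CONTAINER_*` environment variables. The snippet below is a minimal sketch; the numbers are examples rather than recommendations, and where these variables are set (Helm values, a kustomize overlay, or a `.env` file) depends on how Airbyte is deployed.

```yaml
# Minimal sketch: environment variables that size Airbyte job pods
# (the source/destination containers spawned per sync).
# Values are illustrative only; tune them for your cluster.
JOB_MAIN_CONTAINER_CPU_REQUEST: "1"
JOB_MAIN_CONTAINER_CPU_LIMIT: "2"
JOB_MAIN_CONTAINER_MEMORY_REQUEST: "2Gi"
JOB_MAIN_CONTAINER_MEMORY_LIMIT: "4Gi"
```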

What’s the error when trying to move this data?

I have this log, but I didn’t find the actual error in it.

And the log isn’t really helpful in identifying what the error is.

I can add more logs:

```
  "failureOrigin" : "source",
  "internalMessage" : "Source process exited with non-zero exit code 3",
  "externalMessage" : "Something went wrong within the source connector",
  "metadata" : {
    "attemptNumber" : 4,
    "jobId" : 29984,
    "connector_command" : "read"
  },
  "stacktrace" : "io.airbyte.workers.internal.exception.SourceException: Source process exited with non-zero exit code 3\n\tat io.airbyte.workers.general.BufferedReplicationWorker.readFromSource(BufferedReplicationWorker.java:369)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsyncWithHeartbeatCheck$3(BufferedReplicationWorker.java:243)\n\tat java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1589)\n",
  "timestamp" : 1706027760665
} ]
```

That’s odd - it looks like the GCS storage times out in the first log. I’m mostly on AWS, but I’ve seen that happen intermittently on other projects.

Exit code 3 might point to a path not being found… does the BigQuery destination have options for the filename pattern? For long-running jobs I had an issue with files being overwritten, which was resolved by adding the {timestamp} variable to the pattern.
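If the destination (or its GCS staging configuration) exposes a file name pattern option, the suggestion above would look roughly like this. The key name `file_name_pattern` and the extra placeholders are assumptions for illustration; only {timestamp} is taken from the comment above, so check the connector docs for the exact field and supported variables.

```yaml
# Illustrative only: a staging file name pattern that includes {timestamp}
# so long-running jobs don't overwrite earlier files.
file_name_pattern: "{date}_{timestamp}_{part_number}"
```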

I’d double-check permissions on the GCP side as well; they can be a pain.
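As a starting point for that permissions check, these are roles commonly granted to the service account used by the BigQuery destination. This is an assumption based on typical setups, not an authoritative list; verify against the connector documentation for your version.

```yaml
# Sketch: roles often granted to the Airbyte service account for a BigQuery
# destination. "service_account_roles" is an illustrative key, not an Airbyte
# setting.
service_account_roles:
  - roles/bigquery.dataEditor   # create and write tables in the target dataset
  - roles/bigquery.user         # run load/query jobs
  - roles/storage.objectAdmin   # read/write the staging bucket (GCS staging only)
```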