Replicating Large Database and Scaling Airbyte for Data Warehousing

Summary

The user is facing issues replicating a large database using Airbyte and is looking for suggestions on how to improve the process and scale Airbyte for data warehousing purposes.


Question

Hey team, quick question: we have a large database (12 TB+) that we are trying to replicate to another Postgres DB. Replicating 8 of our largest tables failed on Airbyte, so we ended up using Postgres-to-Postgres logical replication instead.

We are planning to use Airbyte everywhere, essentially building a data warehouse, but replication seems to take a very long time, or at least we can't tell whether data is really being added. What can you suggest for moving this data, and how can we scale Airbyte beyond what is listed here? https://docs.airbyte.com/operator-guides/scaling-airbyte

I appreciate your response.



This topic has been created from a Slack thread to give it more visibility. It will be in read-only mode here.

["large-database", "replication", "postgres", "data-warehouse", "scaling", "airbyte"]

Resources and workers are the big constraints. But it’s a bit more nuanced with database sources, deployment methods, and what your destination is.

How were you deployed for your test? (e.g. Kubernetes on GKE/EKS, something else)

And what were the highlights of the config you were using (e.g. number of replicas for key containers, CPU/memory settings, etc.)?

Hi Justin,
Thanks for your response, and sorry for the late reply; I was out the last 2 days.

> what your destination is.

Our destination is a Postgres DB.

> How were you deployed for your test?

EKS

> And what were the highlights of the config you were using (e.g. number of replicas for key containers, CPU/memory settings, etc.)?

Yes:

Workers:
3 replicas
```yaml
limits:
  cpu: "2"
  memory: 12Gi
requests:
  cpu: 1500m
  memory: 8Gi
```
Server:
2 replicas
```yaml
limits:
  cpu: "5"
  memory: 24Gi
requests:
  cpu: "3"
  memory: 18Gi
```
Temporal:
```yaml
limits:
  cpu: "2"
  memory: 12Gi
requests:
  cpu: 1500m
  memory: 8Gi
```
webapp:
```yaml
limits:
  cpu: "2"
  memory: 12Gi
requests:
  cpu: 1500m
  memory: 8Gi
```
JOB_RESOURCES:
```yaml
limits:
  cpu: "5"
  memory: 25Gi
requests:
  cpu: "5"
  memory: 25Gi
```
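
(For reference: on Kubernetes the per-sync job pods are typically sized through the JOB_MAIN_CONTAINER_* environment variables rather than the worker deployment's own limits. A minimal sketch of how the JOB_RESOURCES values above might be expressed, assuming a plain ConfigMap-based deployment, is below; the ConfigMap name and exact keys should be checked against your Airbyte version and chart.)

```yaml
# Hypothetical excerpt of the Airbyte env ConfigMap (name and keys assumed;
# verify against your deployment). These variables size the per-sync job pods
# that actually move the data.
apiVersion: v1
kind: ConfigMap
metadata:
  name: airbyte-env
data:
  JOB_MAIN_CONTAINER_CPU_REQUEST: "5"
  JOB_MAIN_CONTAINER_CPU_LIMIT: "5"
  JOB_MAIN_CONTAINER_MEMORY_REQUEST: "25Gi"
  JOB_MAIN_CONTAINER_MEMORY_LIMIT: "25Gi"
```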

The tables with 200+ million records are not responding to any changes in syncing at all. Is the syncing all or nothing? I was hoping there would at least be some level of micro-batch commits so we could see data arriving, but for this test we did not see any replicated data at all…

Also, the lag and size of the Postgres replication slot continue to grow, which led us to believe that the workers and the spawned pods are not running fast enough, even though we set the JOBS_* values in the ConfigMaps.
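
(One knob that is separate from the JOB_* pod sizing is how many syncs each worker will run in parallel; if the replication slot lag keeps growing, worker concurrency may be worth checking alongside pod resources. A rough Helm values sketch is below, assuming the community Airbyte chart; the key paths and the MAX_* variables should be verified against your chart version.)

```yaml
# Hypothetical Helm values excerpt (paths assumed; verify against your chart version).
worker:
  replicaCount: 3              # number of worker pods picking up jobs
  extraEnv:
    - name: MAX_SYNC_WORKERS   # max concurrent sync jobs per worker pod
      value: "10"
    - name: MAX_CHECK_WORKERS  # max concurrent connection-check jobs per worker pod
      value: "5"
```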