Replicating Large Database and Scaling Airbyte for Data Warehousing

Summary

The user is facing issues replicating a large database using Airbyte and is looking for suggestions on how to improve the process and scale Airbyte for data warehousing purposes.


Question

Hey team, quick question: we have a large database (12 TB+) that we are trying to replicate to another Postgres DB. Replicating 8 of our largest tables failed on Airbyte, so we ended up using Postgres-to-Postgres logical replication instead.

We are planning to use Airbyte everywhere, essentially building a data warehouse, but replication seems to take a very long time, or at least we can't tell whether data is really being added. What can you suggest for moving this data, and how can we scale Airbyte beyond what is listed here? https://docs.airbyte.com/operator-guides/scaling-airbyte

I appreciate your response.



This topic has been created from a Slack thread to give it more visibility. It will be in read-only mode here.

["large-database", "replication", "postgres", "data-warehouse", "scaling", "airbyte"]

Resources and workers are the big constraints. But it’s a bit more nuanced with database sources, deployment methods, and what your destination is.

How were you deployed for your test? (e.g. Kubernetes on GKE/EKS, something else)

And what were the highlights of the config you were using (e.g. number of replicas for key containers, CPU/memory settings, etc.)?

Hi Justin,
Thanks for your response, and sorry for the late reply; I was out the last 2 days.

> what your destination is.

Our destination is a Postgres DB.

> How were you deployed for your test?

EKS

> And what were the highlights of the config you were using (e.g. number of replicas for key containers, CPU/memory settings, etc.)?

Yes:

Workers:
3 replicas
```yaml
limits:
  cpu: "2"
  memory: 12Gi
requests:
  cpu: 1500m
  memory: 8Gi
```
Server:
2 replicas
```yaml
limits:
  cpu: "5"
  memory: 24Gi
requests:
  cpu: "3"
  memory: 18Gi
```
Temporal:
```yaml
limits:
  cpu: "2"
  memory: 12Gi
requests:
  cpu: 1500m
  memory: 8Gi
```
webapp:
```yaml
limits:
  cpu: "2"
  memory: 12Gi
requests:
  cpu: 1500m
  memory: 8Gi
```
JOB_RESOURCES:
```yaml
limits:
  cpu: "5"
  memory: 25Gi
requests:
  cpu: "5"
  memory: 25Gi
```
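
(For reference: on Kubernetes the per-sync job pods are typically sized through the JOB_MAIN_CONTAINER_* environment variables rather than the worker deployment's own limits. A minimal sketch of how the JOB_RESOURCES values above might be expressed, assuming a plain ConfigMap-based deployment, is below; the ConfigMap name and exact keys should be checked against your Airbyte version and chart.)

```yaml
# Hypothetical excerpt of the Airbyte env ConfigMap (name and keys assumed;
# verify against your deployment). These variables size the per-sync job pods
# that actually move the data.
apiVersion: v1
kind: ConfigMap
metadata:
  name: airbyte-env
data:
  JOB_MAIN_CONTAINER_CPU_REQUEST: "5"
  JOB_MAIN_CONTAINER_CPU_LIMIT: "5"
  JOB_MAIN_CONTAINER_MEMORY_REQUEST: "25Gi"
  JOB_MAIN_CONTAINER_MEMORY_LIMIT: "25Gi"
```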

The tables with 200+ million records are not responding to any changes in syncing at all. Is the syncing all or nothing? I was hoping there would at least be some level of micro-batch commits so we could see data arriving, but for this test we did not see any replicated data at all…

Also, the lag and size of the Postgres replication slot continue to grow, which led us to believe that the workers and the spawned pods are not running fast enough, even though we set the JOBS_* values in the ConfigMaps.
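
(One knob that is separate from the JOB_* pod sizing is how many syncs each worker will run in parallel; if the replication slot lag keeps growing, worker concurrency may be worth checking alongside pod resources. A rough Helm values sketch is below, assuming the community Airbyte chart; the key paths and the MAX_* variables should be verified against your chart version.)

```yaml
# Hypothetical Helm values excerpt (paths assumed; verify against your chart version).
worker:
  replicaCount: 3              # number of worker pods picking up jobs
  extraEnv:
    - name: MAX_SYNC_WORKERS   # max concurrent sync jobs per worker pod
      value: "10"
    - name: MAX_CHECK_WORKERS  # max concurrent connection-check jobs per worker pod
      value: "5"
```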