I’m running a self-hosted Airbyte deployment with docker-compose on an EC2 instance.
It looks like my connections have a very small batch size:
2022-06-10 07:40:33 INFO i.a.w.g.DefaultReplicationWorker(lambda$getReplicationRunnable$6):334 - Records read: 535000 (131 MB)
2022-06-10 07:40:33 INFO i.a.w.g.DefaultReplicationWorker(lambda$getReplicationRunnable$6):334 - Records read: 536000 (131 MB)
Memory usage on the EC2 instance is also only around 19%.
From the docs, it seems this may need to be configured with the two parameters below in the .env file.
But it’s not clear to me how these should be set (playing around has not had any significant effect).
What should these values look like? Or is something else limiting the batch size?
- Is this your first time deploying Airbyte?: Yes
- OS Version / Instance: Ubuntu
- Memory / Disk: 32GB or so
- Deployment: Docker on EC2
- Airbyte Version: 0.39.4
- Source name/version: Postgres 0.4.16
- Destination name/version: Postgres 0.3.20
- Step: Unknown
These values should have the same structure as the memory requests and limits usually defined in docker-compose files.
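To make the expected format concrete, here is a minimal Python sketch of how a docker-style memory string (an integer followed by a unit suffix, such as `512m` or `2g`) maps to bytes. The function name `parse_docker_memory` is hypothetical, and the parser only handles the single-letter suffixes `b`, `k`, `m`, and `g`. It is a simplified subset for illustration, not how Docker or Airbyte actually parses these values.

```python
import re

# Multipliers for single-letter size suffixes (binary units).
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_docker_memory(value: str) -> int:
    """Convert a memory string such as '512m' or '2g' to a byte count.

    Hypothetical helper: accepts only <integer><b|k|m|g>, case-insensitive.
    """
    match = re.fullmatch(r"(\d+)([bkmg])", value.strip().lower())
    if match is None:
        raise ValueError(f"not a recognized memory value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


if __name__ == "__main__":
    for example in ("512m", "2g"):
        print(example, "->", parse_docker_memory(example), "bytes")
```

A value like `2g` would therefore request 2 GiB for the job container, well within the 32 GB available on the instance described above.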
You can set
I’m not sure that this will improve the throughput of your Postgres sync though. I would suggest upgrading your source Postgres connector to the latest version. We recently improved this connector to use a dynamic fetch size; it was previously hardcoded to 1000 records.
You’ll find an interesting related discussion here about source database performance. The discussion is around MySQL but the core logic is exactly the same for Postgres.
It still seems to have a very low throughput:
2022-06-21 23:00:57 destination > 2022-06-21 23:00:57 INFO i.a.i.d.r.InMemoryRecordBufferingStrategy(lambda$flushAll$1):84 - Flushing mytable: 104 records (24 MB)
2022-06-21 23:01:02 destination > 2022-06-21 23:01:02 INFO i.a.i.d.r.InMemoryRecordBufferingStrategy(lambda$flushAll$1):84 - Flushing mytable: 70 records (24 MB)
2022-06-21 23:01:03 destination > 2022-06-21 23:01:03 INFO i.a.i.d.r.InMemoryRecordBufferingStrategy(lambda$flushAll$1):84 - Flushing mytable: 130 records (24 MB)
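One thing these lines suggest is that each flush is roughly the same size (24 MB) no matter how many records it holds, so the low record count may simply reflect very large rows. As a rough check, the sketch below parses the flush lines and computes the average record size. The regular expression and the function name `average_record_size_kb` are my own, and it assumes the logged "MB" figure means 1024 KB.

```python
import re

# Matches the "Flushing <stream>: <n> records (<size> MB)" part of a log line.
FLUSH_RE = re.compile(
    r"Flushing (?P<stream>\S+): (?P<records>\d+) records \((?P<mb>\d+) MB\)"
)


def average_record_size_kb(log_lines):
    """Return the average record size in KB across all flush lines found."""
    total_records = 0
    total_mb = 0
    for line in log_lines:
        match = FLUSH_RE.search(line)
        if match:
            total_records += int(match.group("records"))
            total_mb += int(match.group("mb"))
    if total_records == 0:
        raise ValueError("no flush lines with records found")
    # Assumption: 1 MB as logged is treated as 1024 KB.
    return total_mb * 1024 / total_records


LOG_LINES = [
    "2022-06-21 23:00:57 destination > 2022-06-21 23:00:57 INFO i.a.i.d.r.InMemoryRecordBufferingStrategy(lambda$flushAll$1):84 - Flushing mytable: 104 records (24 MB)",
    "2022-06-21 23:01:02 destination > 2022-06-21 23:01:02 INFO i.a.i.d.r.InMemoryRecordBufferingStrategy(lambda$flushAll$1):84 - Flushing mytable: 70 records (24 MB)",
    "2022-06-21 23:01:03 destination > 2022-06-21 23:01:03 INFO i.a.i.d.r.InMemoryRecordBufferingStrategy(lambda$flushAll$1):84 - Flushing mytable: 130 records (24 MB)",
]

if __name__ == "__main__":
    print(f"average record size: {average_record_size_kb(LOG_LINES):.1f} KB")
```

For the three lines above this comes out to roughly 240 KB per record (72 MB over 304 records), which would explain why so few records fit into each flush.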
But the question of how to set memory parameters is solved.
I’ll keep an eye on the postgres improvements coming this quarter and see if that helps us.
We’re also looking to move to CDC which may have better throughput.