Reducing size of raw data table in Airbyte Cloud

Summary

An Airbyte Cloud user asks how to reduce the size of the raw data table in the airbyte_internal schema so that it does not outgrow an RDS PostgreSQL destination, and whether deleting raw data would affect replication.


Question

Hello community,
I’m using Airbyte Cloud to replicate data from a MySQL database to a PostgreSQL database with the CDC feature.
Airbyte creates an airbyte_internal schema to store the raw data, which is expected. But the table in that schema is very large, and it is going to be too big for the RDS PostgreSQL destination database.
Do you know if there’s a way to reduce the size of the raw tables in the airbyte_internal schema?
If I delete some data from those tables, will it badly affect the data replication?

Thank you in advance for your help.



This topic has been created from a Slack thread to give it more visibility.

["airbyte-cloud", "mysql-connector", "postgresql-connector", "cdc", "airbyte-internal-schema", "table-size", "data-replication"]

Hi William
Thank you for your answer, that confirms what I thought.
I’ll truncate old data that is no longer updated in the source to reclaim space.

Thank you again 🙂

As I understand it, that data is used to create the final tables. As long as you have final tables enabled, you should be able to truncate old data from the raw__stream tables in the airbyte_internal schema (at least that’s what they are called in the Snowflake destination). Those tables get so big because they store, as JSON, every change that has occurred in the final table since the historical sync (again, at least for Snowflake). In theory, once a sync has fully completed, you could truncate them.
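To make the pruning idea concrete, here is a minimal Python sketch that only builds a parameterized DELETE statement for one raw table; it does not connect to anything. The column names _airbyte_loaded_at and _airbyte_extracted_at are assumptions about the raw table layout, and public_raw__stream_orders is a hypothetical table name, so check both against your own airbyte_internal schema before running anything like this.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_prune_statement(
    raw_table: str,
    keep_days: int,
    now: Optional[datetime] = None,
) -> Tuple[str, tuple]:
    """Return (sql, params) that delete old, already-processed raw rows.

    Assumed columns (verify in your destination):
      _airbyte_loaded_at    -- set once a row has been applied to the final table
      _airbyte_extracted_at -- when the row was read from the source
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    sql = (
        f"DELETE FROM {quote_ident('airbyte_internal')}.{quote_ident(raw_table)} "
        "WHERE _airbyte_loaded_at IS NOT NULL "
        "AND _airbyte_extracted_at < %s"
    )
    # The %s placeholder is left for the database driver to bind safely.
    return sql, (cutoff,)


if __name__ == "__main__":
    # Hypothetical raw table name, for illustration only.
    statement, params = build_prune_statement("public_raw__stream_orders", keep_days=30)
    print(statement)
    print(params)
```

The resulting pair could be handed to a PostgreSQL driver's execute call, ideally between syncs rather than during one. Note that in PostgreSQL a DELETE only marks rows as dead: a later VACUUM makes the space reusable by the table, while returning it to the operating system generally takes VACUUM FULL or a similar table rewrite.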

The raw tables will keep growing as changes occur until you deal with them. It’s essentially Incremental Append mode in JSON form: each sync loads all of its data into the raw table, and then the typing phase (and deduping, if you have it enabled) uses that data to update the final table.
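The relationship between the raw table and the final table can be illustrated with a toy in-memory model in Python. This is a simplified sketch, not Airbyte's implementation: the field names (extracted_at, loaded, data) and the function apply_raw_to_final are made up for illustration, the real typing and deduping step runs inside the destination, and CDC deletes are not modeled.

```python
def apply_raw_to_final(final, raw_rows, primary_key="id"):
    """Fold unprocessed raw rows into the final table, keeping the latest version per key."""
    for row in sorted(raw_rows, key=lambda r: r["extracted_at"]):
        if row["loaded"]:
            continue  # already applied to the final table in an earlier run
        final[row["data"][primary_key]] = row["data"]
        row["loaded"] = True
    return final


# Raw table: append-only, one JSON-like record per change.
raw = [
    {"extracted_at": 1, "loaded": False, "data": {"id": 1, "status": "new"}},
    {"extracted_at": 2, "loaded": False, "data": {"id": 1, "status": "paid"}},
    {"extracted_at": 3, "loaded": False, "data": {"id": 2, "status": "new"}},
]

final = apply_raw_to_final({}, raw)
# Three raw rows collapse into two final rows (latest version of each id).

# Prune raw rows that have already been applied to the final table.
raw = [r for r in raw if not r["loaded"]]

# A later sync appends a new change; the final table still updates correctly.
raw.append({"extracted_at": 4, "loaded": False, "data": {"id": 2, "status": "shipped"}})
final = apply_raw_to_final(final, raw)
print(final)
```

In this model the final table does not depend on raw rows that were already applied, which is why pruning them leaves it unchanged. Whether that holds for a given destination and sync mode is worth confirming before deleting anything.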