Data types used by Airbyte when syncing to AWS Redshift

Summary

Airbyte uses the NUMERIC(38,9) and VARCHAR(65535) data types when syncing data to AWS Redshift. The user asks why these types are chosen and whether the resulting table sizes need any optimization.


Question

Hi everyone!
Newbie question here. I noticed that Airbyte creates columns with the NUMERIC(38,9) and VARCHAR(65535) data types when syncing data to AWS Redshift. Could someone please explain why these particular types are used? My concern is that the tables become very large quickly, and I’m not sure if I need to take any action to optimize this or if it’s standard practice for handling various data types across systems.
Any advice would be greatly appreciated!



This topic was created from a Slack thread to give it more visibility. It is read-only here; the original discussion lives in Slack.


Often types are mapped to create a more explicit common dialect, since different systems define types differently (or don’t implement a certain type at all).

Thus source connectors map data to what are called “Airbyte types” internally, which then map to explicit types in the destination.

The mapping from internal Airbyte types is typically noted in the Destination docs, in this case for Redshift:
https://docs.airbyte.com/integrations/destinations/redshift#data-type-map
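
As a rough illustration of what that map looks like, here's a sketch in Python. The string and number entries match the columns described in the question; the other entries are indicative only, so treat the linked docs as authoritative:

```python
# Illustrative sketch of an Airbyte-type -> Redshift-type map.
# "string" and "number" match the columns described in the question;
# defer to the Redshift destination docs for the full, authoritative map.
AIRBYTE_TO_REDSHIFT = {
    "string": "VARCHAR(65535)",   # widest VARCHAR Redshift allows
    "number": "NUMERIC(38,9)",    # wide fixed-point to fit most sources
    "integer": "BIGINT",
    "boolean": "BOOLEAN",
    "date": "DATE",
    "timestamp_with_timezone": "TIMESTAMPTZ",
}

def redshift_type(airbyte_type: str) -> str:
    """Resolve an Airbyte type to a Redshift column type, defaulting to VARCHAR."""
    return AIRBYTE_TO_REDSHIFT.get(airbyte_type, "VARCHAR(65535)")
```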

The VARCHAR length specifier doesn’t affect storage; it only sets an upper limit. Storage is the number of bytes in your input string plus 1 byte of overhead for strings shorter than 127 bytes, or 2 bytes for strings of 127 bytes or more. So a 10-byte string takes 11 bytes to store, and a 20,000-byte string takes 20,002 bytes.
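
As a quick sanity check, here's that overhead rule expressed as a small Python function (a sketch of the rule as stated above, not Redshift's actual internals):

```python
def varchar_storage_bytes(value: str) -> int:
    """Approximate stored size of a VARCHAR value under the rule above:
    raw bytes + 1 byte of overhead below 127 bytes, + 2 bytes at or above."""
    raw = len(value.encode("utf-8"))
    return raw + (1 if raw < 127 else 2)

assert varchar_storage_bytes("a" * 10) == 11          # 10-byte string -> 11 bytes
assert varchar_storage_bytes("a" * 20_000) == 20_002  # 20,000 bytes -> 20,002 bytes
```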

The storage of a NUMERIC column is fixed-size: 2 * ceil((precision + 3) / 4) + 1 bytes, which comes to 23 bytes in this case (assuming the quotient is rounded up, which is likely how it works in practice).
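
Worked out in Python, rounding the quotient up:

```python
import math

def numeric_storage_bytes(precision: int) -> int:
    """Fixed storage for NUMERIC(precision, scale): 2 * ceil((precision + 3) / 4) + 1."""
    return 2 * math.ceil((precision + 3) / 4) + 1

assert numeric_storage_bytes(38) == 23  # NUMERIC(38,9) -> 23 bytes per value
```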

So these aren’t particularly large storage formats, and are likely meant to maximize compatibility of the source-Airbyte-destination types. You can always choose different types if needed in downstream modeling.
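
If the wide defaults bother you downstream, one common pattern is a view that casts the raw columns to narrower types. A minimal sketch using psycopg2 (the cluster, schema, table, and column names here are all hypothetical):

```python
import psycopg2  # standard PostgreSQL driver; Redshift speaks the same wire protocol

# Hypothetical raw table landed by Airbyte; the casts narrow the wide defaults.
NARROWED_VIEW = """
CREATE OR REPLACE VIEW analytics.orders_typed AS
SELECT
    CAST(order_id AS BIGINT)        AS order_id,  -- was NUMERIC(38,9)
    CAST(status   AS VARCHAR(32))   AS status,    -- was VARCHAR(65535)
    CAST(amount   AS DECIMAL(12,2)) AS amount     -- was NUMERIC(38,9)
FROM raw.orders;
"""

with psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                      port=5439, dbname="dev", user="etl", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(NARROWED_VIEW)
```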

If you’re implementing custom connectors, make sure you set the types appropriately: for example, don’t use number when you mean integer, since the former requires much more storage than the latter. An overview of Airbyte’s supported types is available at https://docs.airbyte.com/understanding-airbyte/supported-data-types#the-types. When using certified or community sources, you’re largely at the mercy of the original author, but you can always open issues or pull requests if you think a type assignment is incorrect.
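
For example, in a connector's stream schema (Airbyte stream schemas are JSON Schema), the difference between number and integer is a single word, but it changes what the destination column can be. A sketch with a hypothetical stream:

```python
# Sketch of a stream's JSON Schema in a custom connector (field names hypothetical).
# Declaring "integer" instead of "number" lets the destination use an integer
# column rather than the wide NUMERIC(38,9) default.
users_stream_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},    # integer -> integer column (e.g. BIGINT)
        "score": {"type": "number"},  # number  -> NUMERIC(38,9)
        "email": {"type": "string"},  # string  -> VARCHAR(65535)
        "created_at": {"type": "string", "format": "date-time"},
    },
}
```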

Thank you so much for the detailed explanation! This clears up a lot of the confusion I had around how Airbyte handles type mapping and storage in Redshift.
I now understand that the storage formats aren’t necessarily as large as I initially thought, which eases my concerns about the table sizes growing too quickly.
I’ll definitely keep your advice in mind about choosing the right types in downstream modeling, and I’ll also check out the documentation you linked to make sure everything is optimized correctly. I’m not working on custom connectors at the moment, but if I do, I’ll be sure to implement proper type settings to avoid unnecessary storage overhead.
Thanks again for the thorough response—it was exactly what I was hoping for!