Summary
Airbyte uses NUMERIC(38,9)
and VARCHAR(65535)
data types when syncing data to AWS Redshift. The user is concerned about the table size and optimization.
Question
Hi everyone!
Newbie question here. I noticed that Airbyte creates columns with the NUMERIC(38,9)
and VARCHAR(65535)
data types when syncing data to AWS Redshift. Could someone please explain why these particular types are used? My concern is that the tables become very large quickly, and I’m not sure if I need to take any action to optimize this or if it’s standard practice for handling various data types across systems.
Any advice would be greatly appreciated!
This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want
to access the original thread.
Join the conversation on Slack
["airbyte", "syncing-data", "aws-redshift", "data-types", "optimization"]
Often types are mapped to create a more explicit common dialect, since different systems define types differently (or don’t implement a certain type at all).
Thus source connectors map data to what are called “Airbyte types” internally, which then map to explicit types in the destination.
The mapping from internal Airbyte types is typically noted in the Destination docs, in this case for Redshift:
https://docs.airbyte.com/integrations/destinations/redshift#data-type-map
The VARCHAR
length specifier doesn’t affect storage (just sets a limit); thus the storage is the number of bytes in your input string plus 1 byte for strings < 127 bytes, and 2 bytes for strings >= 127 bytes. So a 10-byte string would take 11 bytes to store, and a 20,000 byte string would take 20,002 bytes to store.
The storage of the NUMERIC
column is fixed size, and in that case would be 2 x ( (precision + 3) / 4) + 1
— or 23 bytes in this case (at least assuming you ceil the quotient, which I would guess is how it works in reality).
So these aren’t particularly large storage formats, and are likely meant to maximize compatibility of the source-Airbyte-destination types. You can always choose different types if needed in downstream modeling.
If you’re implementing custom connectors, make sure you set the types appropriately . . . e.g. don’t use number
when you mean integer
(as the former will require much more storage than the latter). An overview of Airbyte’s supported types is available https://docs.airbyte.com/understanding-airbyte/supported-data-types#the-types|here. When using certified or community sources, you’re likely at the mercy of the original author . . . but you can obviously put in issues or merge requests if you think something is incorrect in those type assignments.
Thank you so much for the detailed explanation! This clears up a lot of the confusion I had around how Airbyte handles type mapping and storage in Redshift.
I now understand that the storage formats aren’t necessarily as large as I initially thought, which eases my concerns about the table sizes growing too quickly.
I’ll definitely keep your advice in mind about choosing the right types in downstream modeling, and I’ll also check out the documentation you linked to make sure everything is optimized correctly. I’m not working on custom connectors at the moment, but if I do, I’ll be sure to implement proper type settings to avoid unnecessary storage overhead.
Thanks again for the thorough response—it was exactly what I was hoping for!