Unable to fetch schema when having lots of avro files in S3 source

melkerohrman · May 16, 2022, 12:43pm

Is this your first time deploying Airbyte?: Yes
OS Version / Instance: MacOS
Memory / Disk: 16GB
Deployment: Kubernetes
Airbyte Version: 0.35.0
Source name/version: S3 0.1.13
Destination name/version: Redshift 0.3.33
Step: Fetching schema
Description:
When fetching the schema from my S3 bucket with just a few files, there is no issue; it takes about 1 minute.

For my use case I need to load thousands of Avro files from S3 into Redshift. When I set up the source against the higher data volume, fetching the schema takes forever; I interrupted the process after 2 hours. I assume it scans through all the files in my bucket to determine the schema. It would be nice if there was a way to check only the first file, or to provide a sample file to get the schema from, as the current implementation only supports low data volumes.
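To illustrate why a single file could be enough: an Avro object container file stores its writer schema in the file header, under the `avro.schema` metadata key, so the schema can be read without touching the data blocks. The following is a minimal sketch of that idea using only the standard library. It is not how the Airbyte S3 connector works internally, and the function names are made up for this example.

```python
import io
import json


def read_long(stream):
    """Decode one Avro long (zigzag, variable-length) from a binary stream."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("unexpected end of Avro header")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed Avro bytes/string value."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_schema(stream):
    """Return the writer schema stored in an Avro container file header."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-count block terminates the metadata map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)  # byte size is not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return json.loads(meta["avro.schema"])
```

The stream can be any binary file-like object, for example a local sample file opened with `open(path, "rb")` or an `io.BytesIO` wrapping the first bytes of an object downloaded from S3.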

alafanechere · May 16, 2022, 6:50pm

Hey @melkerohrman,
Thank you for the feedback, I shared it with our product team.
As a workaround, I would suggest setting the path prefix parameter (before your first sync) so that the schema is inferred from a specific subset of files. Once the discovery is done, you can remove this prefix.

You can also set the schema manually in the configuration; this will make the discovery step much faster.
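The path prefix workaround can be sketched as two versions of the same source configuration: one narrowed to a small folder for discovery, and one with the prefix cleared again for the actual sync. The field names below (`dataset`, `path_pattern`, `format`, `provider.path_prefix`) are assumed from the S3 source spec of that era and should be checked against your connector version. The bucket, dataset, and folder names are hypothetical.

```python
import copy


def with_path_prefix(config, prefix):
    """Return a copy of the source config with provider.path_prefix set."""
    updated = copy.deepcopy(config)
    updated.setdefault("provider", {})["path_prefix"] = prefix
    return updated


# Hypothetical S3 source config; key names are an assumption, not a verified spec.
base_config = {
    "dataset": "events",
    "path_pattern": "**/*.avro",
    "format": {"filetype": "avro"},
    "provider": {"bucket": "my-bucket", "path_prefix": ""},
}

# Step 1: discover the schema against a small subset of files only.
discovery_config = with_path_prefix(base_config, "events/sample/")

# Step 2: once discovery is done, clear the prefix again for the full sync.
sync_config = with_path_prefix(discovery_config, "")
```

This only works if the files under the sample prefix share the schema of the rest of the bucket, since the schema inferred during discovery is then applied to every file in the sync.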

melkerohrman · May 17, 2022, 7:08am

Thanks, I will go for your first suggestion as a workaround.

My schema has some values with ["String", "null"], and it seems like this is not supported when manually setting the schema, even though it's a valid format.
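One possible explanation, stated here as an assumption and not confirmed in this thread, is that the manual schema field of the S3 source expects a flat JSON mapping of column name to a JSON Schema type (for example `{"id": "integer", "name": "string"}`), not an Avro schema, so an Avro union such as `["string", "null"]` would not be accepted as a value. Under that assumption, a nullable union can be collapsed to its non-null branch before the mapping is written. The helper names below are hypothetical.

```python
import json

# Avro type name -> JSON Schema type name (complex types are flattened).
AVRO_TO_JSON_TYPE = {
    "boolean": "boolean",
    "int": "integer",
    "long": "integer",
    "float": "number",
    "double": "number",
    "string": "string",
    "bytes": "string",
    "enum": "string",
    "fixed": "string",
    "array": "array",
    "record": "object",
    "map": "object",
}


def json_type(avro_type):
    """Map one Avro field type to a single JSON Schema type name."""
    if isinstance(avro_type, list):  # union, e.g. ["string", "null"]
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError(f"cannot flatten union {avro_type!r}")
        return json_type(non_null[0])
    if isinstance(avro_type, dict):  # complex type, e.g. {"type": "array", ...}
        return json_type(avro_type["type"])
    # References to named types defined elsewhere are not handled (KeyError).
    return AVRO_TO_JSON_TYPE[avro_type]


def flat_schema(avro_record):
    """Build a {column: json_type} mapping from an Avro record schema."""
    return {f["name"]: json_type(f["type"]) for f in avro_record["fields"]}


avro_record = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": ["string", "null"]},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
    ],
}

# JSON string that could be pasted into the manual schema field.
schema_string = json.dumps(flat_schema(avro_record))
```

Note that the nullability information is lost in this flattening; whether the connector then treats such columns as nullable is not something this sketch can guarantee.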

alafanechere · May 17, 2022, 9:21pm

My schema has some values with ["String", "null"] and it seems like this is not supported when manually setting the schema even if it's a valid format.

Could you please share the schema you tried to set manually and the error you found?

marcosmarxm · July 13, 2022, 12:00am

Hi there from the Community Assistance team.
We’re letting you know about an issue we discovered with the back-end process we use to handle topics and responses on the forum. If you experienced a situation where you posted the last message in a topic that did not receive any further replies, please open a new topic to continue the discussion. In addition, if you’re having a problem and find a closed topic on the subject, go ahead and open a new topic on it and we’ll follow up with you. We apologize for the inconvenience, and appreciate your willingness to work with us to provide a supportive community.
