Unable to fetch schema when having lots of Avro files in S3 source

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: MacOS
  • Memory / Disk: you can use something like 16GB
  • Deployment: Kubernetes
  • Airbyte Version: 0.35.0
  • Source name/version: S3 0.1.13
  • Destination name/version: Redshift 0.3.33
  • Step: Fetching schema
  • Description:
    When fetching the schema from my S3 bucket with just a few files, there is no issue; it takes about 1 minute.

For my use case I need to load thousands of Avro files from S3 into Redshift. When setting up the source with a higher data volume, fetching the schema takes forever; I interrupted the process after 2 hours. I assume it scans through all the files in my bucket to determine the schema. It would be nice to have a way to check only the first file, or to provide a sample file, to get the schema instead, as the current implementation only supports low data volumes.
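
To illustrate why a single sample file could be enough: an Avro object container file stores its writer schema as JSON in the file header, under the `avro.schema` metadata key, so the schema can be read without touching any data blocks. Below is a minimal sketch using only the Python standard library. It is not how the Airbyte S3 connector works internally (that is not stated in this thread), and the function names are made up for illustration.

```python
import json


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded) from a binary stream."""
    shift = 0
    result = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of stream while reading a long")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    # Undo zigzag encoding.
    return (result >> 1) ^ -(result & 1)


def read_bytes(stream):
    """Decode one Avro bytes/string value: a long length followed by that many bytes."""
    length = read_long(stream)
    return stream.read(length)


def read_avro_schema(stream):
    """Return the writer schema (as a dict) from an Avro container file header."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the metadata map
            break
        if count < 0:  # negative count: the block's byte size follows
            count = -count
            read_long(stream)  # byte size, not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    return json.loads(metadata["avro.schema"])


# Usage (hypothetical local path):
# with open("sample.avro", "rb") as f:
#     print(read_avro_schema(f))
```

Only the header is read, so the cost does not depend on the size of the file or on how many other files sit in the bucket.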

Hey @melkerohrman,
Thank you for the feedback; I shared it with our product team.
As a workaround, I would suggest setting the path prefix parameter (before your first sync) so that the schema is inferred from a specific subset of files. Once the discovery is done, you can remove the prefix.
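
The effect of the prefix can be pictured with a small sketch. It assumes the prefix behaves like a plain S3 key prefix, i.e. a "starts with" match on the object key; the keys and the helper name below are hypothetical.

```python
# Hypothetical object keys; a real bucket would hold thousands of these.
keys = [
    "events/2022/01/01/part-0000.avro",
    "events/2022/01/01/part-0001.avro",
    "events/2022/01/02/part-0000.avro",
    "events/2022/02/01/part-0000.avro",
]


def keys_under_prefix(all_keys, prefix):
    """Return the keys a plain prefix filter would keep (a startswith match)."""
    return [key for key in all_keys if key.startswith(prefix)]


# Narrow prefix while the schema is being discovered...
discovery_keys = keys_under_prefix(keys, "events/2022/01/01/")
# ...then an empty prefix (no restriction) for the actual syncs.
sync_keys = keys_under_prefix(keys, "")

print(len(discovery_keys), "of", len(sync_keys), "files used for discovery")
```

This only works as intended if the files under the narrow prefix share the schema of the rest of the bucket.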

You can also set the schema manually in the configuration, which will make the discovery step much faster.
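
A rough sketch of how such a manual schema could be derived from an existing Avro schema follows. It assumes the connector's schema option takes a JSON string that maps each column name to a single JSON Schema type name (for example `{"id": "integer"}`); that format is an assumption here and should be checked against the connector documentation for your version. Under that assumption an Avro union such as `["null", "string"]` has to be collapsed to its one non-null type. The record, the field names, and the helper names are hypothetical.

```python
import json

# Avro type name -> JSON Schema type name (a simplification for illustration).
AVRO_TO_JSON_TYPE = {
    "string": "string",
    "bytes": "string",
    "enum": "string",
    "fixed": "string",
    "int": "integer",
    "long": "integer",
    "float": "number",
    "double": "number",
    "boolean": "boolean",
    "record": "object",
    "map": "object",
    "array": "array",
}


def flatten_type(avro_type):
    """Collapse an Avro field type to one JSON Schema type name."""
    if isinstance(avro_type, list):
        # Nullable union: keep the single non-null branch.
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError(f"cannot flatten union {avro_type!r}")
        avro_type = non_null[0]
    if isinstance(avro_type, dict):
        # Complex or logical type: look at its underlying "type".
        avro_type = avro_type["type"]
    return AVRO_TO_JSON_TYPE[avro_type]


def manual_schema(avro_schema):
    """Build a flat {column: type} JSON string from an Avro record schema."""
    columns = {f["name"]: flatten_type(f["type"]) for f in avro_schema["fields"]}
    return json.dumps(columns)


# Hypothetical Avro record with two nullable fields.
example = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": ["null", "string"]},
        {"name": "score", "type": ["double", "null"]},
    ],
}

schema_json = manual_schema(example)
print(schema_json)
```

Note that flattening drops the nullability information, so this is only a starting point for a hand-written schema.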

Thanks, I will go with your first suggestion as a workaround.

My schema has some values with [“String”, “null”], and it seems this is not supported when setting the schema manually, even though it is a valid format.

Could you please share the schema you tried to set manually and the error you found?