Error loading avro files with S3 source and schema discovery failures

Summary

The user is facing schema discovery failures when loading Avro files with the S3 source in Airbyte. Discovery fails with `KeyError: 'logical_type'`, raised while the Avro parser tries to report an unrecognized Avro logical type. The user notes that `fastavro` can read the same files and schemas correctly in Python.


Question

Hi, I’m trying to load Avro files with the S3 source, and I’m getting schema discovery failures with errors like this:

```
    return await self.get_parser().infer_schema(self.config, file, self.stream_reader, self.logger)
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/avro_parser.py", line 68, in infer_schema
    json_schema = {
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/avro_parser.py", line 69, in <dictcomp>
    field["name"]: AvroParser._convert_avro_type_to_json(avro_format, field["name"], field["type"])
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/avro_parser.py", line 129, in _convert_avro_type_to_json
    raise ValueError(f"{avro_field['logical_type']} is not a valid Avro logical type")
KeyError: 'logical_type'
```
... after which there's a lot more of the same and finally a discovery failure.
This is with source-s3 version 4.5.11.

If I download the files, `fastavro` can read the files (and their schemas) just fine in Python. Am I missing anything?

<br>

---

This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1712557994937269) if you want to access the original thread.


<sub>
["avro-files", "s3-source", "schema-discovery", "logical-type", "fastavro"]
</sub>

Got to the bottom of the direct issue (a bug in the way the error is raised), and submitted a PR: https://github.com/airbytehq/airbyte/pull/36888
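For readers following along, here is a minimal sketch of the kind of bug the traceback points at. It is not the actual CDK code: the function names and the small mapping are hypothetical stand-ins. The only facts taken from the thread are the failing `raise` line and the `KeyError: 'logical_type'`; that the schema stores the attribute under `logicalType` comes from the Avro specification.

```python
# Hypothetical, simplified stand-in for the converter; not the real CDK implementation.
# Illustrative mapping only; the real connector's mapping may differ.
KNOWN_LOGICAL_TYPES = {
    "date": {"type": "string", "format": "date"},
    "uuid": {"type": "string"},
}


def convert_buggy(avro_field: dict) -> dict:
    """Mimics the failing pattern: the error message indexes a key that does not exist."""
    if avro_field["logicalType"] not in KNOWN_LOGICAL_TYPES:
        # Avro schemas use "logicalType", so this f-string lookup raises
        # KeyError before the intended ValueError can be constructed.
        raise ValueError(f"{avro_field['logical_type']} is not a valid Avro logical type")
    return KNOWN_LOGICAL_TYPES[avro_field["logicalType"]]


def convert_fixed(avro_field: dict) -> dict:
    """Same check, but the message uses the key that is actually present."""
    logical_type = avro_field["logicalType"]
    if logical_type not in KNOWN_LOGICAL_TYPES:
        raise ValueError(f"{logical_type} is not a valid Avro logical type")
    return KNOWN_LOGICAL_TYPES[logical_type]
```

With a field such as `{"type": "string", "logicalType": "custom-thing"}`, the first variant fails with `KeyError: 'logical_type'` and hides the offending type, while the second raises a `ValueError` that names it.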

Thank you for the pull request!

Hopefully we can get this in quickly. Have you tried running the connector pointing to local / pre-release / forked CDK to see if it helps?

<@U069EMNRPA4> heya, thanks for the review! I did try it with a local build
`airbyte-ci connectors --name=source-s3 --use-local-cdk build`
and the result was as expected, and as you mentioned in your comment (I could see the “offending” field).

Your question in the GitHub comment is on point, though: fastavro can load the file that caused the original issue for me, and ideally Airbyte should be able to do that too, i.e. handle logical types other than the [official ones](https://fastavro.readthedocs.io/en/latest/logical_types.html), because the schema carries enough information about which basic Avro type each such field is encoded as.
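The behaviour asked for here matches the Avro specification, which says implementations must ignore unknown logical types when reading and should use the underlying Avro type. A rough sketch of that fallback, assuming a dict-shaped field with a primitive `type`; the function name and both mappings are hypothetical and are not the CDK's API:

```python
# Hypothetical sketch of "fall back to the underlying type"; not the CDK's code.
# Both mappings are illustrative and intentionally incomplete.
PRIMITIVE_TO_JSON = {
    "null": "null",
    "boolean": "boolean",
    "int": "integer",
    "long": "integer",
    "float": "number",
    "double": "number",
    "bytes": "string",
    "string": "string",
}

LOGICAL_TO_JSON = {
    "date": {"type": "string", "format": "date"},
    "uuid": {"type": "string"},
}


def convert_with_fallback(avro_field: dict) -> dict:
    """Map an Avro field definition to a JSON schema fragment.

    Recognized logical types get a dedicated mapping; unknown or absent
    logical types fall back to the underlying primitive type.
    """
    logical_type = avro_field.get("logicalType")
    if logical_type in LOGICAL_TO_JSON:
        return LOGICAL_TO_JSON[logical_type]
    # Unknown logical type: ignore it and describe the field by its base type.
    return {"type": PRIMITIVE_TO_JSON[avro_field["type"]]}
```

Under these assumptions, a field like `{"type": "long", "logicalType": "my-custom-type"}` would be described by its `long` base type rather than failing discovery.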

This ^^ would be an even better outcome, and would love it!

True. You should file an issue with an example of a file that fails but should pass, I’ll ask the team to take a look at it.

Cheers <@U069EMNRPA4>, I’ve filed an issue here: https://github.com/airbytehq/airbyte/issues/36918
I cannot attach an Avro file there, but I added a script that generates a simple one, and I'm attaching the example here, just in case.

Let me know if further elaboration or clarification is needed!

Not to muddy the waters, but there seem to be other issues with loading Avro logical types as well, including some official logical types, unless I’m doing something wrong (which wouldn’t totally surprise me, but I think I’ve checked everything I could).

Opened an issue for that: https://github.com/airbytehq/airbyte/issues/36920

Quickly eyeballed your other PRs and issues. Some great work there, let’s push through it. :heart:

I’ll show the two Avro issues to the team tomorrow, and we’ll see what’s going on. I think your typo PR is a no-brainer merge, but for the additional types support and the latest bug, I'm not yet sure what’s going on there.

Note: this still needs work. We improved our ability to run CI on contributions, but we should enable it to run CDK unit tests too (cc <@U02UC3T12PJ>), and then rebase, run CI, and potentially merge.