Parsing errors with Avro format in Google Drive connector

Summary

The user gets parsing errors from the Google Drive connector (source-google-drive) when it is configured with the Avro format and tries to parse a PDF file. Other formats work. The traceback ends in a UnicodeDecodeError, raised while fastavro tries to decode bytes from the file as UTF-8.
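To illustrate the last line of the traceback: Python raises this kind of UnicodeDecodeError whenever bytes that are not valid UTF-8 are decoded as text, which is likely what happens when a binary file is read as if it held Avro strings. A minimal sketch; the sample bytes and the helper name `describe_decode_failure` are made up for illustration and are not taken from Flax.pdf or from the Airbyte CDK.

```python
from typing import Optional


def describe_decode_failure(raw: bytes) -> Optional[str]:
    """Return a short description of the UTF-8 decode error, or None if decoding works."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return f"byte {raw[exc.start]:#x} at position {exc.start}: {exc.reason}"
    return None


# Hypothetical sample: 0xd3 announces a two-byte UTF-8 sequence,
# but the next byte (a space) is not a valid continuation byte.
sample = b"abc\xd3 def"
message = describe_decode_failure(sample)
print(message)  # byte 0xd3 at position 3: invalid continuation byte
```

The point is only that the error describes the bytes being read, not the Google Drive connection or credentials.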


Question

Hi everyone!

I’m trying out Airbyte with Google Drive and the Avro format, but I keep getting parsing errors when it tries to parse a PDF.
I know I’m connected successfully because other formats work. Is there something I’m missing?

Thanks!

Error (full error in thread)
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/avro_parser.py", line 151, in parse_records
    avro_reader = fastavro.reader(fp)
  File "fastavro/_read.pyx", line 1086, in fastavro._read.reader.__init__
  File "fastavro/_read.pyx", line 1013, in fastavro._read.file_reader.__init__
  File "fastavro/_read.pyx", line 743, in fastavro._read._read_data
  File "fastavro/_read.pyx", line 616, in fastavro._read.read_record
  File "fastavro/_read.pyx", line 727, in fastavro._read._read_data
  File "fastavro/_read.pyx", line 473, in fastavro._read.read_map
  File "fastavro/_read.pyx", line 301, in fastavro._read.read_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 3: invalid continuation byte



This topic has been created from a Slack thread to give it more visibility.

["parsing-errors", "avro-format", "google-drive-connector", "pdf", "unicode-decode-error"]

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/avro_parser.py", line 151, in parse_records
    avro_reader = fastavro.reader(fp)
  File "fastavro/_read.pyx", line 1086, in fastavro._read.reader.__init__
  File "fastavro/_read.pyx", line 1013, in fastavro._read.file_reader.__init__
  File "fastavro/_read.pyx", line 743, in fastavro._read._read_data
  File "fastavro/_read.pyx", line 616, in fastavro._read.read_record
  File "fastavro/_read.pyx", line 727, in fastavro._read._read_data
  File "fastavro/_read.pyx", line 473, in fastavro._read.read_map
  File "fastavro/_read.pyx", line 301, in fastavro._read.read_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 3: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py", line 95, in _check_parse_record
    record = next(iter(parser.parse_records(stream.config, file, self.stream_reader, logger, discovered_schema=None)))
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/avro_parser.py", line 161, in parse_records
    raise RecordParseError(FileBasedSourceError.ERROR_PARSING_RECORD, filename=file.uri, lineno=line_no) from exc
airbyte_cdk.sources.file_based.exceptions.RecordParseError: Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. Contact Support if you need assistance.
filename=Flax.pdf lineno=0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py", line 64, in check_availability_and_parsability
    self._check_parse_record(stream, file, logger)
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py", line 102, in _check_parse_record
    raise CheckAvailabilityError(FileBasedSourceError.ERROR_READING_FILE, stream=stream.name, file=file.uri) from exc
airbyte_cdk.sources.file_based.exceptions.CheckAvailabilityError: Error opening file. Please check the credentials provided in the config and verify that they provide permission to read files. Contact Support if you need assistance.
stream=binary file=Flax.pdf

Are you using source-google-drive? We could start by upgrading the CDK to a recent fastavro release to see if it helps.

Is it one specific PDF that fails, or any PDF?

Yes, using source-google-drive.

Any PDF or JPEG fails.

It might be a misunderstanding though.

I want to sync files from Google Drive to something like S3.

Is that possible?

It looks like Avro is for sending structured binary data. Not sure if a raw PDF/JPEG fits that.
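One way to check that: Avro object container files start with the four bytes "Obj" followed by 0x01, while PDF and JPEG files start with their own signatures, so the first few bytes of a file show whether an Avro parser has any chance of reading it. A rough sketch using only the standard library; `sniff_file_type` is a hypothetical helper, not part of Airbyte or fastavro.

```python
# File signatures ("magic bytes") found at the very start of each format.
AVRO_MAGIC = b"Obj\x01"       # Avro object container file
PDF_MAGIC = b"%PDF-"          # PDF document
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG image


def sniff_file_type(header: bytes) -> str:
    """Guess the file type from the first bytes of a file (hypothetical helper)."""
    if header.startswith(AVRO_MAGIC):
        return "avro"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


# Made-up headers for illustration; with a real file you would pass
# the result of open(path, "rb").read(8) instead.
print(sniff_file_type(b"%PDF-1.7\n"))           # pdf
print(sniff_file_type(b"Obj\x01\x04\x14avro"))  # avro
```

If a file does not report "avro" here, selecting the Avro format for it would be expected to fail, whatever the fastavro version.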

edit: for additional context

I want to get Google Drive files and process them myself. I don’t know in advance what the files will be.