Hi team, I am currently using Airbyte to read CSV files from an S3 bucket.
I am getting an error if a file contains only a header row.
Is there a configuration workaround we can use to skip header-only files?
Traceback (most recent call last):
File "/airbyte/integration_code/main.py", line 13, in <module>
launch(source, sys.argv[1:])
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 129, in launch
for message in source_entrypoint.run(parsed_args):
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 120, in run
for message in generator:
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 123, in read
raise e
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 114, in read
yield from self._read_stream(
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 159, in _read_stream
for record in record_iterator:
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 248, in _read_full_refresh
for record in records:
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 452, in read_records
yield from super().read_records(sync_mode, cursor_field, stream_slice, stream_state)
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 346, in read_records
file_reader = self.fileformatparser_class(self._format, self._get_master_schema())
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 235, in _get_master_schema
raise RuntimeError(
RuntimeError: Detected mismatched datatype on column 'Count', in file 'test/Report.csv'. Should be 'integer', but found 'string'.
The file is supposed to have a header row followed by rows of data.
When the data syncs for the first time, the file has rows after the header. Based on my observations so far, I am guessing that the values in those rows determine the inferred data types.
When a file has only a header row, I see the error above.
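To illustrate my guess, here is a minimal sketch (not from our pipeline, just an assumption about how the underlying pyarrow CSV reader behaves) comparing type inference for a CSV with data rows, a header-only CSV, and a completely empty file:

import io
import pyarrow.csv as pa_csv

# CSV with data rows: pyarrow infers the column types from the values,
# e.g. "Count" comes back as an integer type.
with_rows = io.BytesIO(b"Name,Count\nfoo,1\nbar,2\n")
print(pa_csv.read_csv(with_rows).schema)

# Header-only CSV: there are no values to infer from, so the inferred
# schema will not match the one inferred from the file above.
header_only = io.BytesIO(b"Name,Count\n")
print(pa_csv.read_csv(header_only).schema)

# Completely empty file: pyarrow raises ArrowInvalid("Empty CSV file"),
# which looks like the second traceback I am pasting below.
empty = io.BytesIO(b"")
try:
    pa_csv.read_csv(empty)
except Exception as exc:
    print(type(exc).__name__, exc)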
I am not able to upload the CSV files here, so I am pasting the mock data as plain text below.
2022-12-07 20:16:04 source > Encountered an exception while reading stream rebates_h
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 114, in read
yield from self._read_stream(
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 179, in _read_stream
for record in record_iterator:
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 237, in _read_incremental
for record_counter, record_data in enumerate(records, start=1):
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 535, in read_records
yield from self._read_from_slice(file_reader, stream_slice)
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 326, in _read_from_slice
for record in file_reader.stream_records(f):
File "/airbyte/integration_code/source_s3/source_files_abstract/formats/csv_parser.py", line 154, in stream_records
streaming_reader = pa_csv.open_csv(
File "pyarrow/_csv.pyx", line 1273, in pyarrow._csv.open_csv
File "pyarrow/_csv.pyx", line 1137, in pyarrow._csv.CSVStreamingReader._open
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Empty CSV file
2022-12-07 20:16:04 source > Finished syncing rebates_h
2022-12-07 20:16:04 source > SourceS3 runtimes:
Syncing stream rebates_h 0:03:56.341347
2022-12-07 20:16:04 source > Empty CSV file
I have empty files matching the path pattern, but I also have non-empty files.
Sorry for the delay. I'm not sure this problem has been fixed yet; it seems the S3 source connector does not handle empty CSV files very well. The issue is on our team's backlog, but it is unclear when it will be addressed. If you have ideas on how to handle empty files better, we encourage anyone to contribute to our connectors. Otherwise, the only workaround in the interim is to avoid feeding empty files through the S3 source. Hope this helps, and I'm happy to answer further questions.
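As a rough illustration of that workaround, here is a minimal sketch (the bucket, prefixes, and function names are hypothetical, and it assumes boto3 is available) that could run before a sync to move empty or header-only CSVs out of the prefix the connector reads:

import boto3

# Hypothetical names for illustration only.
BUCKET = "my-bucket"
PREFIX = "test/"
QUARANTINE_PREFIX = "skipped/"

s3 = boto3.client("s3")

def has_data_rows(bucket: str, key: str) -> bool:
    """Return True if the CSV object has at least one row after the header."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # Count non-blank lines; an empty or header-only file has fewer than two.
    lines = [ln for ln in body.decode("utf-8", errors="replace").splitlines() if ln.strip()]
    return len(lines) > 1

def quarantine_empty_csvs() -> None:
    """Move empty or header-only CSVs out of the synced prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".csv"):
                continue
            if not has_data_rows(BUCKET, key):
                new_key = QUARANTINE_PREFIX + key.split("/")[-1]
                s3.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": key}, Key=new_key)
                s3.delete_object(Bucket=BUCKET, Key=key)

if __name__ == "__main__":
    quarantine_empty_csvs()

This downloads each object in full to keep the sketch simple; for large files you would probably only inspect the first couple of lines instead.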