Source S3 - Issue reading CSV when only header row present

Hi team, I am currently using Airbyte to read CSV files from an S3 bucket.
I am getting an error when a file has only a header row.
Is there a configuration workaround we can use to skip files that contain only headers?

Traceback (most recent call last):
  File "/airbyte/integration_code/main.py", line 13, in <module>
    launch(source, sys.argv[1:])
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 129, in launch
    for message in source_entrypoint.run(parsed_args):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 120, in run
    for message in generator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 123, in read
    raise e
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 114, in read
    yield from self._read_stream(
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 159, in _read_stream
    for record in record_iterator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 248, in _read_full_refresh
    for record in records:
  File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 452, in read_records
    yield from super().read_records(sync_mode, cursor_field, stream_slice, stream_state)
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 346, in read_records
    file_reader = self.fileformatparser_class(self._format, self._get_master_schema())
  File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 235, in _get_master_schema
    raise RuntimeError(
RuntimeError: Detected mismatched datatype on column 'Count', in file 'test/Report.csv'. Should be 'integer', but found 'string'.

Hey, the error doesn’t seem to say that the file is empty; rather, it says a row has mismatched data, right? Am I missing something?

Hey Harshith,

The file is supposed to have a header followed by rows of data.

When the data syncs for the first time, the file has rows after the header. Based on my observations so far, I am guessing that the data in those rows determines each column's data type.

When the file has only a header row, I am seeing the error above.

I am not able to upload the CSV files here, so I am pasting the mock data in plain text below.

File for which the replication is working:
id,first_name,last_name,email,gender,ip_address,count
1,John,Doe,johndoe@hubpages.com,Male,127.0.0.1,1
2,Jane,Doe,janedoe@walmart.com,Female,127.0.0.1,1

File for which the replication is failing:
id,first_name,last_name,email,gender,ip_address,count
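
A quick way to see how the type inference differs is to run both mock files through pyarrow locally, which the connector's CSV parser uses under the hood. This is a minimal sketch; the file names are hypothetical stand-ins for the two files above:

import pyarrow as pa
import pyarrow.csv as pa_csv

# Hypothetical file names standing in for the two mock files above.
for path in ("report_with_rows.csv", "report_header_only.csv"):
    try:
        table = pa_csv.read_csv(path)
        # Print the schema pyarrow infers; with data rows present,
        # 'count' can be inferred as an integer, while the header-only
        # file has no values to infer from.
        print(path, "->", table.schema)
    except pa.ArrowInvalid as exc:
        # pyarrow raises this for a completely empty (zero-byte) file.
        print(path, "->", exc)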

PS:

  • The file we are dealing with is system-generated, and it contains only the header row when there are no rows of data

Hey, I was able to reproduce this and have created an issue for it. Please give a +1 on the issue https://github.com/airbytehq/airbyte/issues/13171 and follow it there.

Hi @Shubham_Kalloli @harshith

Any updates on this? I am running into this issue as well.

2022-12-07 20:16:04 source > Encountered an exception while reading stream rebates_h
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 114, in read
    yield from self._read_stream(
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 179, in _read_stream
    for record in record_iterator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 237, in _read_incremental
    for record_counter, record_data in enumerate(records, start=1):
  File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 535, in read_records
    yield from self._read_from_slice(file_reader, stream_slice)
  File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 326, in _read_from_slice
    for record in file_reader.stream_records(f):
  File "/airbyte/integration_code/source_s3/source_files_abstract/formats/csv_parser.py", line 154, in stream_records
    streaming_reader = pa_csv.open_csv(
  File "pyarrow/_csv.pyx", line 1273, in pyarrow._csv.open_csv
  File "pyarrow/_csv.pyx", line 1137, in pyarrow._csv.CSVStreamingReader._open
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Empty CSV file
2022-12-07 20:16:04 source > Finished syncing rebates_h
2022-12-07 20:16:04 source > SourceS3 runtimes:
Syncing stream rebates_h 0:03:56.341347
2022-12-07 20:16:04 source > Empty CSV file

My path pattern matches some empty files, but it also matches non-empty files.
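
For reference, the “Empty CSV file” error from pyarrow is easy to reproduce locally when an object is zero bytes (a minimal sketch; pa_csv is pyarrow.csv, the same module the connector's parser calls):

import io
import pyarrow.csv as pa_csv

# pyarrow refuses to open a completely empty stream and raises:
#   pyarrow.lib.ArrowInvalid: Empty CSV file
pa_csv.open_csv(io.BytesIO(b""))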

Hey @boggdan,

Sorry for the delay. I’m not sure that this problem has been fixed; it seems the S3 source connector does not handle empty CSV files very well. The issue is on our team’s backlog, but it is unclear when it will be addressed. If you have ideas on how to handle empty files better, we encourage anyone to contribute to our connectors. Otherwise, the only workaround in the interim is to avoid feeding empty files through the S3 source. Hope this helps, and I’m happy to answer further questions.
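
If it helps, one rough interim workaround is to sweep header-only (or empty) CSVs out of the prefix the connector reads before each sync. This is only a sketch with boto3, using hypothetical bucket and prefix names, not something the connector does for you:

import boto3

BUCKET = "my-bucket"        # hypothetical bucket name
SOURCE_PREFIX = "test/"     # prefix the Airbyte S3 source reads from
SKIP_PREFIX = "skipped/"    # where header-only/empty files get parked

s3 = boto3.client("s3")

# Note: list_objects_v2 returns at most 1000 keys per call; a real job
# would paginate.
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=SOURCE_PREFIX).get("Contents", []):
    key = obj["Key"]
    if not key.endswith(".csv"):
        continue
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    # A file with at most one line (the header) has no data rows.
    if len(body.splitlines()) <= 1:
        s3.copy_object(
            Bucket=BUCKET,
            Key=SKIP_PREFIX + key,
            CopySource={"Bucket": BUCKET, "Key": key},
        )
        s3.delete_object(Bucket=BUCKET, Key=key)
        print(f"moved header-only/empty file: {key}")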