Hi team, I am currently using Airbyte to read CSV files from an S3 bucket.
I am getting an error if a file contains only a header row.
Is there a configuration workaround we can use to skip header-only files?
Traceback (most recent call last):
File "/airbyte/integration_code/main.py", line 13, in <module>
launch(source, sys.argv[1:])
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 129, in launch
for message in source_entrypoint.run(parsed_args):
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 120, in run
for message in generator:
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 123, in read
raise e
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 114, in read
yield from self._read_stream(
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 159, in _read_stream
for record in record_iterator:
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 248, in _read_full_refresh
for record in records:
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 452, in read_records
yield from super().read_records(sync_mode, cursor_field, stream_slice, stream_state)
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 346, in read_records
file_reader = self.fileformatparser_class(self._format, self._get_master_schema())
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 235, in _get_master_schema
raise RuntimeError(
RuntimeError: Detected mismatched datatype on column 'Count', in file 'test/Report.csv'. Should be 'integer', but found 'string'.
The file is supposed to have a header row followed by rows of data.
When the data syncs for the first time, the file has rows after the header. Based on my observations so far, I am guessing that the values in those rows determine the inferred data types.
When a file has only a header row, I see the error above.
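To illustrate my guess, here is a minimal sketch (not from our pipeline, just an assumption about how the underlying pyarrow CSV reader behaves) comparing type inference for a CSV with data rows, a header-only CSV, and a completely empty file:

import io
import pyarrow.csv as pa_csv

# CSV with data rows: pyarrow infers the column types from the values,
# e.g. "Count" comes back as an integer type.
with_rows = io.BytesIO(b"Name,Count\nfoo,1\nbar,2\n")
print(pa_csv.read_csv(with_rows).schema)

# Header-only CSV: there are no values to infer from, so the inferred
# schema will not match the one inferred from the file above.
header_only = io.BytesIO(b"Name,Count\n")
print(pa_csv.read_csv(header_only).schema)

# Completely empty file: pyarrow raises ArrowInvalid("Empty CSV file"),
# which looks like the second traceback I am pasting below.
empty = io.BytesIO(b"")
try:
    pa_csv.read_csv(empty)
except Exception as exc:
    print(type(exc).__name__, exc)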
I am not able to upload the CSV files here, so I am pasting the mock data as plain text below.
2022-12-07 20:16:04 source > Encountered an exception while reading stream rebates_h
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 114, in read
yield from self._read_stream(
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 179, in _read_stream
for record in record_iterator:
File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 237, in _read_incremental
for record_counter, record_data in enumerate(records, start=1):
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 535, in read_records
yield from self._read_from_slice(file_reader, stream_slice)
File "/airbyte/integration_code/source_s3/source_files_abstract/stream.py", line 326, in _read_from_slice
for record in file_reader.stream_records(f):
File "/airbyte/integration_code/source_s3/source_files_abstract/formats/csv_parser.py", line 154, in stream_records
streaming_reader = pa_csv.open_csv(
File "pyarrow/_csv.pyx", line 1273, in pyarrow._csv.open_csv
File "pyarrow/_csv.pyx", line 1137, in pyarrow._csv.CSVStreamingReader._open
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Empty CSV file
2022-12-07 20:16:04 source > Finished syncing rebates_h
2022-12-07 20:16:04 source > SourceS3 runtimes:
Syncing stream rebates_h 0:03:56.341347
2022-12-07 20:16:04 source > Empty CSV file
I have empty files matching the path pattern, but I also have non-empty files.
Sorry for the delay. I'm not sure this problem has been fixed yet; it seems the S3 source connector does not handle empty CSV files very well. The issue is on our team's backlog, but it is unclear when it will be addressed. If you have ideas on how to handle empty files better, we encourage anyone to contribute to our connectors. Otherwise, the only workaround in the interim is to avoid feeding empty files through the S3 source. Hope this helps, and I'm happy to answer further questions.
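As a rough illustration of that workaround, here is a minimal sketch (the bucket, prefixes, and function names are hypothetical, and it assumes boto3 is available) that could run before a sync to move empty or header-only CSVs out of the prefix the connector reads:

import boto3

# Hypothetical names for illustration only.
BUCKET = "my-bucket"
PREFIX = "test/"
QUARANTINE_PREFIX = "skipped/"

s3 = boto3.client("s3")

def has_data_rows(bucket: str, key: str) -> bool:
    """Return True if the CSV object has at least one row after the header."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # Count non-blank lines; an empty or header-only file has fewer than two.
    lines = [ln for ln in body.decode("utf-8", errors="replace").splitlines() if ln.strip()]
    return len(lines) > 1

def quarantine_empty_csvs() -> None:
    """Move empty or header-only CSVs out of the synced prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".csv"):
                continue
            if not has_data_rows(BUCKET, key):
                new_key = QUARANTINE_PREFIX + key.split("/")[-1]
                s3.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": key}, Key=new_key)
                s3.delete_object(Bucket=BUCKET, Key=key)

if __name__ == "__main__":
    quarantine_empty_csvs()

This downloads each object in full to keep the sketch simple; for large files you would probably only inspect the first couple of lines instead.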