Cannot read the S3 parquet file created by S3 destination connector

  • Is this your first time deploying Airbyte?: No / Yes

  • OS Version / Instance: MacOS

  • Memory / Disk: you can use something like 4Gb / 1 Tb

  • Deployment: Docker

  • Airbyte Version: 0.39.20-alpha

  • Source name/version: S3 0.1.15

  • Destination name/version: S3 0.3.10

  • Step: The issue is happening during sync

  • Description:

I was doing a test run to read a Parquet file written by the Airbyte S3 destination using Dremio. To do that, I put a sample .parquet file in my S3 source folder and ran a sync to another folder. Then I tried to read both the source file and the copied file from S3 using Dremio.

The source file is readable, but the file copied by Airbyte is not. From the Dremio logs I got this error:
Unable to coerce from the file’s data type “timestamp” to the column’s data type “int64” in table “2022_07_11_1657578941501_0. parquet”, column “_ab_source_file_last_modified.member0”

I attached the schema that is shown in my connector settings.
So should "_ab_source_file_last_modified" be a string?

Also, in the Dremio file preview feature, I can see that "_ab_source_file_last_modified" is detected as an Object. So what should the type of "_ab_source_file_last_modified" be?

Why is the parquet file created by Airbyte not readable? The file and screenshots are attached. Any help is appreciated.
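
To make the shape of the problem concrete, here is a minimal sketch of how the struct value from the error could be collapsed into a single string per row after reading the file. It is not a fix for the connector. It assumes each row carries the column as a dict with the keys member0 (a datetime) and member1 (a string), of which at most one is set; the function name coalesce_last_modified and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def coalesce_last_modified(value):
    """Collapse a {"member0": datetime, "member1": str} struct into one string.

    Assumption: at most one of the two members is populated per row.
    """
    if value is None:
        return None
    ts = value.get("member0")
    if ts is not None:
        # Treat naive datetimes as UTC (an assumption), then render as ISO 8601.
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        return ts.astimezone(timezone.utc).isoformat()
    # Fall back to the string branch, which may also be None.
    return value.get("member1")


# Hypothetical sample rows, not taken from the real file.
rows = [
    {"member0": datetime(2022, 7, 11, 22, 35, 41, tzinfo=timezone.utc), "member1": None},
    {"member0": None, "member1": "2022-07-11T22:35:41Z"},
    None,
]
for row in rows:
    print(coalesce_last_modified(row))
```

With a single string column instead of a struct, a reader would no longer have to pick between the timestamp and string branches.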

File: Slack
Thread: Slack

These are the schema details I got from pyarrow.parquet:

  • _airbyte_ab_id: string not null
  • _airbyte_emitted_at: timestamp[ms, tz=UTC] not null
  • extra: double
  • mta_tax: double
  • VendorID: int32
  • ehail_fee: int32
  • trip_type: double
  • RatecodeID: double
  • tip_amount: double
  • fare_amount: double
  • DOLocationID: int32
  • PULocationID: int32
  • payment_type: double
  • tolls_amount: double
  • total_amount: double
  • trip_distance: double
  • passenger_count: double
  • store_and_fwd_flag: string
  • _ab_source_file_url: string
  • congestion_surcharge: double
  • lpep_pickup_datetime: string
  • improvement_surcharge: double
  • lpep_dropoff_datetime: string
  • _ab_source_file_last_modified: struct<member0: timestamp[us, tz=UTC], member1: string>
  • child 0, member0: timestamp[us, tz=UTC]
  • child 1, member1: string
  • _airbyte_additional_properties: map<string, string ('_airbyte_additional_properties')>
  • child 0, _airbyte_additional_properties: struct<key: string not null, value: string not null> not null
  •   child 0, key: string not null
  •   child 1, value: string not null
  • -- schema metadata --
  • parquet.avro.schema: '{"type":"record","name":"s3_taxi_data","fields":[{"' + 1750
  • writer.model.name: 'avro'
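
The member0/member1 struct looks like what parquet-avro produces for an Avro union with more than one non-null branch. As a rough sketch, the Avro schema stored in the parquet.avro.schema metadata could be scanned for such unions. The schema string in the output above is truncated, so SAMPLE_SCHEMA below is a hypothetical stand-in rather than the real schema, and union_fields and describe are illustrative helper names.

```python
import json


def describe(branch):
    """Return a short label for one branch of an Avro union."""
    if isinstance(branch, dict):
        # Prefer the logical type (e.g. timestamp-micros) when present.
        return branch.get("logicalType", branch.get("type"))
    return branch


def union_fields(avro_schema_json):
    """Map field name -> non-null branch labels, for multi-branch unions only."""
    schema = json.loads(avro_schema_json)
    result = {}
    for field in schema.get("fields", []):
        field_type = field["type"]
        if not isinstance(field_type, list):
            continue  # not a union
        branches = [b for b in field_type if b != "null"]
        if len(branches) > 1:
            result[field["name"]] = [describe(b) for b in branches]
    return result


# Hypothetical schema, only shaped like the one in the metadata above.
SAMPLE_SCHEMA = json.dumps({
    "type": "record",
    "name": "s3_taxi_data",
    "fields": [
        {"name": "_airbyte_ab_id", "type": "string"},
        {"name": "VendorID", "type": ["null", "int"]},
        {"name": "_ab_source_file_last_modified",
         "type": ["null",
                  {"type": "long", "logicalType": "timestamp-micros"},
                  "string"]},
    ],
})

print(union_fields(SAMPLE_SCHEMA))
# → {'_ab_source_file_last_modified': ['timestamp-micros', 'string']}
```

Any field reported here would show up in Parquet as a struct with one member per branch, which matches the column named in the Dremio error.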

Hey, I would suggest creating a GitHub issue for this so that the team can look into it.