Issue loading a Latin-1 encoded file from SFTP

Hi team,

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu
  • Memory / Disk: 2 vcpus, 8 GiB memory
  • Deployment: Docker
  • Airbyte Version: 0.36.3-alpha
  • Source name/version: File (0.2.10)
  • Destination name/version: Snowflake (0.4.24)
  • Step: The issue is happening during sync
  • Description:

I am having an issue loading a file encoded in ISO-8859-1 (it seems similar to "Source S3: error with encoding reading csv file", Issue #9059 in airbytehq/airbyte on GitHub).

My source is a CSV file on an SFTP server.

The first data row contains the character é, which is encoded as 0xE9.

I set my CSV reader options to {"encoding":"latin-1"}, but the encoding option does not seem to be used. I still get this error in the log:
2022-05-11 18:52:14 source > Failed to read data of PERMONLY at scp://Fuze_BI_PermOnly.CSV: UnicodeDecodeError('utf-8', b'"PERM_INV_NO","INVOICE_DATE","CLIENT_NO","CLIENT_NAME","PO","CANDIDATE_NAME","MANDATE","SALARY","BILLED_AMT","NET_MARGIN_PCT","TERRITORY","SALES_REP","SALES_REP_NAME","SALES_REP2","SALESREP2_NAME","SALES_PAID","RECRUITER","RECRUITER_NAME","RECRUITER_PAID","RECRUITER_2","RECRUITER_2_NAME","NOTES"\r\n"043573","2019/01/15","01022","AESP GROUP "," ","Vincent Godard "," Op\xe9rateur De Machine CNC BEAM"," .00"," 1365.00"," .00","QR","ME ","6443-CHENIER, MARIE-EVE "," "," "," ","PB","Pamela Badran "," "\r\n', 416, 417, 'invalid continuation byte')
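
For readers hitting the same error, here is a minimal sketch of what the log line means. It uses only one field copied from the failing row (not the real file) and shows that the byte 0xE9 cannot be decoded as UTF-8 but decodes cleanly as Latin-1. The variable names are illustrative only.

```python
# One field from the failing row, as raw bytes: "é" is stored as the single byte 0xE9.
raw = b"Op\xe9rateur De Machine CNC BEAM"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # 0xE9 announces a multi-byte UTF-8 sequence, but the next byte ("r")
    # is not a continuation byte, so decoding stops here.
    utf8_error = exc

# ISO-8859-1 (Latin-1) maps every byte to one character, so 0xE9 becomes "é".
decoded = raw.decode("latin-1")

print(utf8_error.reason)
print(decoded)
```

The two numbers in the log (416, 417) are the start and end offsets of the offending byte inside the chunk that was read.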

NB1: I am able to read the file correctly with pandas using this same CSV reader option.
NB2: When I manually convert my file to UTF-8, it works fine (but that is not an option for production).
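
For reference, the manual conversion mentioned in NB2 could look roughly like the sketch below. It assumes the file has already been downloaded locally; the function name `transcode_to_utf8` and the sample file names are made up for illustration, and the demo writes a tiny stand-in file rather than the real CSV.

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Re-encode a text file as UTF-8, streaming line by line."""
    # newline="" leaves the original \r\n line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo with a small stand-in file (hypothetical content, not the real export).
workdir = tempfile.mkdtemp()
latin1_path = os.path.join(workdir, "sample_latin1.csv")
utf8_path = os.path.join(workdir, "sample_utf8.csv")

with open(latin1_path, "wb") as f:
    f.write(b'"MANDATE"\r\n"Op\xe9rateur"\r\n')

transcode_to_utf8(latin1_path, utf8_path)

with open(utf8_path, "rb") as f:
    converted = f.read()
print(converted)
```

This is only a workaround: it adds an extra step before every sync, which is why it is not suitable for production.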

Hello Lopez, this looks like an issue with the connector when it tries to open the file using the smart_open library.

Hi Marco,

I think I found one issue: when I changed the encoding option in the source, I did NOT refresh the source schema in the connection. I believe that is why it was still trying to use utf-8.

Now I am stuck at refreshing the source schema: it freezes (for more than 1 hour) even after I dropped most of my file (I kept only the top 5 rows …).

I am not sure how to move forward.

If I try to rebuild the connection from scratch, I get the error message below:

Is there somewhere I can look for a detailed log?

Could you try editing these lines:

https://github.com/airbytehq/airbyte/blob/07ae81cbd6693b7b306a338202e14e2a8680655e/airbyte-integrations/connectors/source-file/source_file/client.py#L115-L116

        return smart_open.open(uri, transport_params=transport_params, mode=mode, encoding="latin-1")

        return smart_open.open(self.full_url, mode=mode, encoding="latin-1")

and build locally and test it?
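
Hardcoding "latin-1" as above is only meant as a quick test. As a rough sketch of the idea behind it (this is not the connector's actual code), the encoding could instead be taken from the reader options JSON and passed to the open call. Here `encoding_from_reader_options` and `open_text` are hypothetical helpers, and the built-in `open()` stands in for `smart_open.open()`, which also accepts an `encoding` keyword.

```python
import json
import os
import tempfile


def encoding_from_reader_options(reader_options_json, default="utf-8"):
    """Hypothetical helper: read "encoding" from the reader options JSON string."""
    options = json.loads(reader_options_json) if reader_options_json else {}
    return options.get("encoding", default)


def open_text(path, reader_options_json):
    """Hypothetical helper: open a text file with the configured encoding."""
    # Built-in open() is used here as a stand-in for smart_open.open().
    encoding = encoding_from_reader_options(reader_options_json)
    return open(path, mode="r", encoding=encoding, newline="")


# Demo with a small Latin-1 stand-in file (not the real export).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "wb") as f:
    f.write(b'"Op\xe9rateur"\r\n')

with open_text(sample_path, '{"encoding":"latin-1"}') as f:
    first_line = f.readline()
print(first_line)
```

With empty reader options the helper falls back to utf-8, which matches the behavior seen in the log above.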