Problem in Salesforce chunk decoding

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu, MacOS
  • Memory / Disk: 8 GB / 500 GB
  • Deployment: Docker
  • Airbyte Version: 0.39.34-alpha
  • Source name/version: Salesforce / 1.0.10
  • Destination name/version: BigQuery / 1.1.11
  • Step: during sync
  • Description:

I am currently trying to sync text that contains Japanese from Salesforce.

The imported text came out as garbled characters, so we investigated.
First of all, my text data was being decoded as ISO-8859-1. This is unexpected behavior (decoding as UTF-8 is expected).

UTF-8 is a variable-length encoding, and characters such as Japanese take multiple bytes (e.g. ‘ち’ is \xe3\x81\xa1).
If the data is split into fixed-size chunks of 1024 bytes and each chunk is decoded on its own, a chunk boundary can fall in the middle of a character. That chunk then cannot be decoded as UTF-8 and fails with the error "unexpected end of data".
This is especially likely when multiple languages are mixed, because the byte length per character varies.
We therefore think the affected text is decoded as ISO-8859-1 instead, which produces the garbled characters.
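
To illustrate the failure mode described above, here is a minimal sketch in Python. It is not the connector's code: the function name decode_chunks_naively, the tiny chunk sizes, and the fallback logic are assumptions made only to reproduce the effect on a small input.

```python
# Minimal reproduction sketch (hypothetical helper, not the connector's code).
# Each 'ち' is 3 bytes in UTF-8, so the sample below is 12 bytes long.
text = "ち" * 4
data = text.encode("utf-8")


def decode_chunks_naively(raw: bytes, chunk_size: int) -> str:
    """Decode each fixed-size chunk on its own, falling back to ISO-8859-1."""
    parts = []
    for start in range(0, len(raw), chunk_size):
        chunk = raw[start:start + chunk_size]
        try:
            parts.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            # Assumed fallback, mirroring the behavior reported in the log.
            parts.append(chunk.decode("ISO-8859-1"))
    return "".join(parts)


# Boundaries aligned with character boundaries: the text survives.
print(decode_chunks_naively(data, 3) == text)
# Boundaries that cut through a character: the text is garbled.
print(decode_chunks_naively(data, 4) == text)
```

With a chunk size of 3 every chunk holds exactly one character, so nothing breaks. With a chunk size of 4 the first chunk ends one byte into the second character, which is the same situation a 1024-byte boundary creates in real data.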

The actual error log is as follows:
Falling back to ISO-8859-1 encoding. Error: 'utf-8' codec can't decode byte 0xe6 in position 16793: unexpected end of data

The response is decoded immediately after chunking, chunk by chunk, and then saved to tmp_file.
Wouldn’t it be better to save all the unprocessed data first and then decode it?
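
As a sketch of what I mean, and only as one possible approach, Python's standard library has an incremental decoder that keeps an incomplete multi-byte sequence buffered until the next chunk arrives. The helper names split_bytes and decode_chunks_incrementally below are hypothetical and are not taken from the connector.

```python
import codecs
from typing import Iterable, List


def split_bytes(raw: bytes, chunk_size: int) -> List[bytes]:
    """Hypothetical stand-in for a chunked HTTP response body."""
    return [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)]


def decode_chunks_incrementally(chunks: Iterable[bytes], encoding: str = "utf-8") -> str:
    """Decode chunks in order, carrying partial characters across boundaries."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    parts = []
    for chunk in chunks:
        # Trailing bytes of an unfinished character stay inside the decoder.
        parts.append(decoder.decode(chunk))
    # final=True raises if the stream really does end mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


original = "こんにちは, Airbyte"
chunks = split_bytes(original.encode("utf-8"), 4)
print(decode_chunks_incrementally(chunks) == original)
```

Writing the raw bytes to the temporary file and decoding the whole file afterwards would give the same result. The incremental decoder just avoids holding everything undecoded in memory.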

Hi, @Nakachi! There’s an open issue related to decoding non-ASCII characters here. I’ve linked your post to it and will follow up with the database team to make sure it is on our roadmap!

The open issue: