- Is this your first time deploying Airbyte?: No
- OS Version / Instance: Ubuntu, MacOS
- Memory / Disk: 8 GB / 500 GB
- Deployment: Docker
- Airbyte Version: 0.39.34-alpha
- Source name/version: Salesforce / 1.0.10
- Destination name/version: BigQuery / 1.1.11
- Step: during sync
- Description:
I am trying to sync Japanese text from Salesforce.
The imported text came out garbled, so I investigated.
First of all, my text data was decoded as ISO-8859-1. This is unexpected behavior (decoding as UTF-8 is expected).
Characters such as Japanese ones are encoded with a variable number of bytes in UTF-8 (e.g. ‘ち’ is \xe3\x81\xa1).
If the data is split into fixed 1024-byte chunks and each chunk is decoded on its own, a chunk boundary can fall in the middle of a multi-byte character, and decoding that chunk fails with the error "unexpected end of data".
This is especially likely when multiple languages are mixed.
Therefore, I think the text is being decoded as ISO-8859-1 instead, which produces the garbled characters.
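
To illustrate the suspected cause, here is a minimal sketch that reproduces the same error outside of Airbyte. It is not the connector's code: `decode_with_fallback` is a hypothetical helper that only mimics the "try UTF-8, then fall back to ISO-8859-1" behavior suggested by the log.

```python
def decode_with_fallback(chunk):
    """Decode one chunk as UTF-8, falling back to ISO-8859-1 on failure.

    Hypothetical helper for illustration; returns (text, error_reason).
    """
    try:
        return chunk.decode("utf-8"), None
    except UnicodeDecodeError as err:
        # ISO-8859-1 maps every byte to a character, so this never fails,
        # but multi-byte UTF-8 characters come out garbled.
        return chunk.decode("ISO-8859-1"), err.reason


text = "ち" * 400                 # each 'ち' is 3 bytes in UTF-8
data = text.encode("utf-8")       # 1200 bytes in total
first_chunk = data[:1024]         # 1024 = 3 * 341 + 1, so the cut is mid-character

decoded, reason = decode_with_fallback(first_chunk)
print(reason)         # unexpected end of data
print(len(decoded))   # 1024 (one garbled character per byte)
```

With a 1023-byte chunk the same text decodes cleanly, which is why the problem only shows up for some records.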
The actual error log is as follows:
Falling back to ISO-8859-1 encoding. Error: 'utf-8' codec can't decode byte 0xe6 in position 16793: unexpected end of data
The response to the request is decoded chunk by chunk immediately after chunking and then saved to tmp_file, but wouldn’t it be better to save all of the unprocessed data first and then decode it?
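
As a rough sketch of what I mean (not a patch against the connector), the two functions below show possible approaches. Both names are hypothetical, and `chunks` stands in for whatever yields the 1024-byte pieces of the response body. The first writes the raw bytes to a temporary file and decodes only when reading it back; the second keeps decoding per chunk but uses Python's incremental decoder, which buffers an incomplete multi-byte sequence until the next chunk arrives.

```python
import codecs
import os
import tempfile


def save_raw_then_decode(chunks, encoding="utf-8"):
    """Write undecoded bytes to a temp file, then decode the whole file."""
    with tempfile.NamedTemporaryFile(mode="wb", delete=False) as tmp:
        for chunk in chunks:
            tmp.write(chunk)          # no decoding at this stage
        path = tmp.name
    try:
        # newline="" keeps line endings exactly as they were received
        with open(path, "r", encoding=encoding, newline="") as f:
            return f.read()
    finally:
        os.remove(path)


def decode_incrementally(chunks, encoding="utf-8"):
    """Decode chunk by chunk without breaking multi-byte characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)  # trailing partial bytes are held back
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if data ends mid-character
    if tail:
        yield tail


original = "ち" * 400 + " mixed with English and 日本語"
raw = original.encode("utf-8")
pieces = [raw[i:i + 1024] for i in range(0, len(raw), 1024)]

assert save_raw_then_decode(pieces) == original
assert "".join(decode_incrementally(pieces)) == original
```

Either way, no chunk is decoded in isolation, so the ISO-8859-1 fallback should not be triggered just because of where a chunk boundary falls.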