Problem in salesforce chunk decoding

Nakachi-S · July 7, 2022, 8:31am

Is this your first time deploying Airbyte?: No
OS Version / Instance: Ubuntu, MacOS
Memory / Disk: 8Gb / 500 Gb
Deployment: Docker
Airbyte Version: 0.39.34-alpha
Source name/version: Salesforce / 1.0.10
Destination name/version: BigQuery / 1.1.11
Step: during sync
Description:

I am currently trying to sync Japanese text from Salesforce.

The imported text was garbled, so I investigated.
It turned out that my text data was being decoded as ISO-8859-1. This is unexpected behavior (decoding as UTF-8 is expected).

https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py#L51-L66

          def decode(self, chunk):
              """
              Most Salesforce instances use UTF-8, but some use ISO-8859-1.
              By default, we'll decode using UTF-8, and fallback to ISO-8859-1 if it doesn't work.
              See implementation considerations for more details https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/implementation_considerations.htm
              """
              if self.encoding == DEFAULT_ENCODING:
                  try:
                      decoded = chunk.decode(self.encoding)
                      return decoded
                  except UnicodeDecodeError as e:
                      self.encoding = "ISO-8859-1"
                      self.logger.info(f"Could not decode chunk. Falling back to {self.encoding} encoding. Error: {e}")
                      return self.decode(chunk)
              else:
                  return chunk.decode(self.encoding)

Characters such as Japanese ones are encoded with a variable number of bytes in UTF-8 (e.g. ‘ち’ is \xe3\x81\xa1).
If the response is split into fixed-size chunks of 1024 bytes and each chunk is decoded on its own, a chunk boundary can fall in the middle of a multi-byte character, and that chunk fails to decode with the error "unexpected end of data".
This is especially likely when multiple languages are mixed, because the byte length per character varies.
I therefore think that the connector falls back to ISO-8859-1 at that point, and that this is what produces the garbled characters.
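
To illustrate the point above, here is a minimal sketch in plain Python. It is not the connector's code: `split_chunks`, `decode_per_chunk` and `decode_incrementally` are hypothetical helpers, and the chunk size of 4 is chosen only so that the boundary lands inside a character. It compares decoding each chunk independently (with the same kind of ISO-8859-1 fallback) against the standard library's incremental decoder, which buffers an incomplete trailing byte sequence until the next chunk arrives.

```python
import codecs


def split_chunks(data: bytes, size: int) -> list:
    """Cut raw bytes into fixed-size pieces, ignoring character boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def decode_per_chunk(chunks) -> str:
    """Decode every chunk on its own, falling back to ISO-8859-1 on failure.

    This only mimics the behaviour described in the post; it is an
    illustration, not the actual connector implementation.
    """
    encoding = "utf-8"
    parts = []
    for chunk in chunks:
        try:
            parts.append(chunk.decode(encoding))
        except UnicodeDecodeError:
            # A character cut in half at the chunk boundary triggers the fallback,
            # and every later chunk is then decoded with the wrong encoding too.
            encoding = "ISO-8859-1"
            parts.append(chunk.decode(encoding))
    return "".join(parts)


def decode_incrementally(chunks) -> str:
    """Decode chunks with a stateful decoder that keeps partial characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if the stream really ends in the middle of a character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


text = "ちち"                       # 2 characters, 3 bytes each in UTF-8
chunks = split_chunks(text.encode("utf-8"), 4)

garbled = decode_per_chunk(chunks)     # mojibake, not equal to the original
correct = decode_incrementally(chunks) # identical to the original text
```

With this input the first chunk ends after the first byte of the second ‘ち’, so per-chunk decoding falls back and returns garbled text, while the incremental decoder returns the original string.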

The actual error log is as follows:
Falling back to ISO-8859-1 encoding. Error: 'utf-8' codec can't decode byte 0xe6 in position 16793: unexpected end of data

Each chunk of the response is decoded immediately after chunking and then saved to tmp_file.
Wouldn’t it be better to save all of the raw, undecoded data first and decode it afterwards?
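
A rough sketch of what I mean, assuming the chunks arrive as `bytes`. The function name `save_raw_then_decode` is made up for illustration and is not part of the connector. The raw bytes are written to the temporary file unchanged, and decoding happens only when the complete file is read back, so chunk boundaries no longer matter.

```python
import os
import tempfile


def save_raw_then_decode(chunks, encoding: str = "utf-8") -> str:
    """Write raw chunks to a temp file in binary mode, then decode the whole file.

    Hypothetical helper: a sketch of the proposal, not the connector's code.
    """
    # Binary mode: nothing is decoded while the response is being downloaded.
    with tempfile.NamedTemporaryFile(mode="wb", delete=False) as tmp:
        for chunk in chunks:
            tmp.write(chunk)
        path = tmp.name
    try:
        # Text mode with an explicit encoding decodes the file as one stream;
        # newline="" keeps the line endings exactly as they were received.
        with open(path, "r", encoding=encoding, newline="") as f:
            return f.read()
    finally:
        os.remove(path)


raw = "Id,Name\n001,ちば\n".encode("utf-8")
# 5-byte chunks deliberately cut through multi-byte characters.
pieces = [raw[i:i + 5] for i in range(0, len(raw), 5)]
decoded = save_raw_then_decode(pieces)
```

If the data really is ISO-8859-1, a `UnicodeDecodeError` raised while reading the whole file would still be a reasonable place to retry with the fallback encoding, since at that point the failure is not caused by a chunk boundary.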

natalyjazzviolin · July 7, 2022, 11:41am

Hi, @Nakachi! There’s an open issue related to decoding non-ascii characters here. I’ve linked your post to it and will follow up with the database team to make sure it is on our roadmap!

marcosmarxm · July 26, 2022, 4:55pm

The open issue: https://github.com/airbytehq/airbyte/issues/14659
