Salesforce UnicodeDecodeError

  • Is this your first time deploying Airbyte?: No
  • OS Version: debian-10
  • Memory / Disk: 16GB
  • Deployment: Docker
  • Airbyte Version: 0.36.1-alpha
  • Source name/version: Salesforce 1.0.3
  • Destination name/version: BigQuery 1.1.1
  • Step: Salesforce fails to sync
  • Description:
    After running a Salesforce incremental sync for two weeks, everything worked fine until yesterday, when I encountered the same error in production and staging. After retrying, the staging sync recovered on its own, but this didn’t work for production.
    I have already retried production more than 10 times with different sync modes (incremental, full refresh, etc.), but nothing worked. I also reset the incremental sync and started from scratch, but that didn’t help either. As a last resort, I deleted the Airbyte connection and all BigQuery data and ran everything from scratch, but the error was still there.
    Please note that production and staging fetch from the same Salesforce environment and only write to different BigQuery projects. Also, I have one Airbyte instance for both staging and prod.
    See the logs below
    logs-529.txt (3.5 MB)

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 8379: unexpected end of data

@i-zaykov could you try updating the Salesforce connector to the latest version?

@marcosmarxm I also just updated to the latest version of the Airbyte server, but that didn’t help. The connection looks like this:

Airbyte Version: 0.36.1-alpha
Source: Salesforce (1.0.3)
Destination: BigQuery (1.1.1)

@marcosmarxm any idea what might be going wrong? :slight_smile:

@i-zaykov do the two connections you showed in the picture use the same source (API token, start date) and write to two different BigQuery databases?

yes, exactly @marcosmarxm Same Salesforce connection but different BigQuery projects - staging and prod.

To be honest, this problem doesn’t make much sense to me :sweat_smile:
I don’t see how the same connector to the same destination can fail in one case but not the other.
Would it be possible to disable staging and only try to sync prod?
Do you have any other connections to your BigQuery prod destination?

@marcosmarxm I also find it super strange, because the sync was working just fine for two weeks and then failed over Easter.

Would it be possible to disable staging and only try to sync prod?

Yes, tried that but didn’t work.

Do you have any other connections to your BigQuery prod destination?

Yes, four more connections that are set up the same way the Salesforce one is, and they have been working fine so far.

Additionally, I tried creating a new connection to split Salesforce production and staging, but did not succeed. I also tried it the other way around by creating new credentials for BigQuery, but the result was the same.

@marcosmarxm is there something else we can try?

The same issue arose for me going from Salesforce to BigQuery. It happened after upgrading to 0.36.1-alpha and saving the connection to include one more Salesforce object. That logically triggered a reset, which is now failing:

2022-04-21 21:19:21 source > Syncing stream: CampaignMember 
2022-04-21 21:19:23 source > Sleeping 0.501 seconds while waiting for Job: CampaignMember/7500900000GVrpSAAT to complete. Current state: InProgress
2022-04-21 21:19:24 source > Sleeping 0.502718281828459 seconds while waiting for Job: CampaignMember/7500900000GVrpSAAT to complete. Current state: InProgress
2022-04-21 21:19:24 source > Encountered an exception while reading stream SourceSalesforce
Traceback (most recent call last):
  File "/airbyte/integration_code/source_salesforce/source.py", line 109, in read
    yield from self._read_stream(
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 159, in _read_stream
    for record in record_iterator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 215, in _read_incremental
    for record_counter, record_data in enumerate(records, start=1):
  File "/airbyte/integration_code/source_salesforce/streams.py", line 364, in read_records
    for count, record in self.read_with_chunks(self.download_data(url=job_full_url)):
  File "/airbyte/integration_code/source_salesforce/streams.py", line 276, in download_data
    data_file.writelines(self.filter_null_bytes(chunk.decode("utf-8")))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5406: unexpected end of data
2022-04-21 21:19:24 source > 'utf-8' codec can't decode byte 0xc3 in position 5406: unexpected end of data
Traceback (most recent call last):
  File "/airbyte/integration_code/main.py", line 13, in <module>
    launch(source, sys.argv[1:])
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 127, in launch
    for message in source_entrypoint.run(parsed_args):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 118, in run
    for message in generator:
  File "/airbyte/integration_code/source_salesforce/source.py", line 126, in read
    raise e
  File "/airbyte/integration_code/source_salesforce/source.py", line 109, in read
    yield from self._read_stream(
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 159, in _read_stream
    for record in record_iterator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 215, in _read_incremental
    for record_counter, record_data in enumerate(records, start=1):
  File "/airbyte/integration_code/source_salesforce/streams.py", line 364, in read_records
    for count, record in self.read_with_chunks(self.download_data(url=job_full_url)):
  File "/airbyte/integration_code/source_salesforce/streams.py", line 276, in download_data
    data_file.writelines(self.filter_null_bytes(chunk.decode("utf-8")))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5406: unexpected end of data

@wisse yeah, quite strange indeed. Does it happen with all objects or only the new one?
@marcosmarxm any thought? :slight_smile:

I got a similar error.

Set up Salesforce → Snowflake connection for the first time today and tried to sync about 20 objects. All objects worked except one. I isolated the failed object into its own connection and tried a few more times with different sync modes and start dates. Same errors.

Airbyte version: 0.35.67-alpha
Snowflake version: 0.4.24
Salesforce version: 1.0.4

logs-2772.txt (55.1 KB)

2022-04-22 19:22:50 source > Encountered an exception while reading stream SourceSalesforce
Traceback (most recent call last):
  File "/airbyte/integration_code/source_salesforce/source.py", line 109, in read
    yield from self._read_stream(
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 159, in _read_stream
    for record in record_iterator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 248, in _read_full_refresh
    for record in records:
  File "/airbyte/integration_code/source_salesforce/streams.py", line 364, in read_records
    for count, record in self.read_with_chunks(self.download_data(url=job_full_url)):
  File "/airbyte/integration_code/source_salesforce/streams.py", line 276, in download_data
    data_file.writelines(self.filter_null_bytes(chunk.decode("utf-8")))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 2476: unexpected end of data
2022-04-22 19:22:50 source > 'utf-8' codec can't decode byte 0xc2 in position 2476: unexpected end of data
Traceback (most recent call last):
  File "/airbyte/integration_code/main.py", line 13, in <module>
    launch(source, sys.argv[1:])
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 127, in launch
    for message in source_entrypoint.run(parsed_args):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 118, in run
    for message in generator:
  File "/airbyte/integration_code/source_salesforce/source.py", line 126, in read
    raise e
  File "/airbyte/integration_code/source_salesforce/source.py", line 109, in read
    yield from self._read_stream(
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 159, in _read_stream
    for record in record_iterator:
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/abstract_source.py", line 248, in _read_full_refresh
    for record in records:
  File "/airbyte/integration_code/source_salesforce/streams.py", line 364, in read_records
    for count, record in self.read_with_chunks(self.download_data(url=job_full_url)):
  File "/airbyte/integration_code/source_salesforce/streams.py", line 276, in download_data
    data_file.writelines(self.filter_null_bytes(chunk.decode("utf-8")))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 2476: unexpected end of data
2022-04-22 19:22:51 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):305 - Total records read: 0 (0 bytes)

Two of the older objects, which previously always worked (CampaignMember and Opportunities), are failing.

Did anyone find a cause, solution or workaround yet?

Not on my end at least, @wisse.
@marcosmarxm can you advise on next steps?

Sorry for the long delay, I was OOO. @i-zaykov I tried to use the Opportunity and Campaign objects but wasn’t able to reproduce the error. Which objects are you trying to sync? From which version did you upgrade?

With my limited knowledge, I set up VSCode to attach a debugger to the main.py read command. Here is what I found:

In airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py, line 276, the chunk ends with \xc3, which is only the first byte of the two-byte UTF-8 encoding of ü (\xc3\xbc). So splitting the byte stream into chunks and then decoding each chunk as UTF-8 on its own causes the issue.
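That failure mode is easy to reproduce with the standard library, and an incremental decoder shows one general way around it (a minimal sketch of the technique, not the connector's actual code):

```python
import codecs

# 'ü' encodes to two bytes in UTF-8: b'\xc3\xbc'
data = "Müller".encode("utf-8")

# Split the byte stream so the first chunk ends with the lone lead byte
# b'\xc3' -- decoding each chunk independently then fails, just like
# the connector's chunk.decode("utf-8") call.
chunks = [data[:2], data[2:]]
try:
    for chunk in chunks:
        chunk.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # same "unexpected end of data" error as in the logs

# An incremental decoder buffers the dangling lead byte until the next
# chunk arrives, so chunk boundaries no longer matter.
decoder = codecs.getincrementaldecoder("utf-8")()
text = "".join(decoder.decode(chunk) for chunk in chunks)
assert text == "Müller"
```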

@marcosmarxm does this provide enough information to replicate this issue?

@i-zaykov @Shanhui-Daybreak Looking through the code on GitHub and doing my own testing, this issue can be avoided by downgrading the Salesforce connector to 1.0.2.

Since the issue only occurs when a multi-byte character happens to be split across chunks, it looked like it appeared randomly, but it was actually introduced here: 🐛 [On call / 171] Source Salesforce: fixed the bug when `Bulk` fetch … · airbytehq/airbyte@644ace4 · GitHub
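For anyone patching locally in the meantime, another standard-library route is to hand the binary stream to `io.TextIOWrapper`, which buffers partial multi-byte sequences internally. This is only a sketch under the assumption that the response body is available as a binary file-like object (for example the `raw` attribute of a streaming `requests` response); it is not the fix that later shipped in the connector:

```python
import io

def read_utf8_stream(raw: io.BufferedIOBase) -> str:
    """Decode a binary stream as UTF-8 without assuming that read
    boundaries line up with character boundaries."""
    return io.TextIOWrapper(raw, encoding="utf-8").read()

# Any binary file-like object works; BytesIO stands in for an HTTP body.
body = io.BytesIO("Müller".encode("utf-8"))
assert read_utf8_stream(body) == "Müller"
```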


Thanks! I downgraded it to 1.0.2 and it worked.

Thanks @wisse for identifying the part of the code causing the bug.

I opened a GitHub issue (Source Salesforce: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 8379: unexpected end of data · Issue #12372 · airbytehq/airbyte · GitHub) asking the team to fix this in the next version.


@wisse thank you so much! That worked out for me as well.