Source S3 (CSV) - Throwing Error When Columns contain JSON strings

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: EC2 t2.medium (Linux/UNIX)
  • Memory / Disk: 4GB / 50GB
  • Deployment: Docker
  • Airbyte Version: 0.40.3
  • Source name/version: S3 (0.1.26)
  • Destination name/version: Redshift (0.3.51)
  • Step: The issue is happening during sync.
  • Description: I have a sync from the Amazon Advertising API to an S3 bucket. Files are separated into ‘folders’ by stream name and are encoded as CSV files. I have opted for root-level normalization for the files.
    When moving the files from the S3 bucket to Redshift, some streams do not seem to work. I am receiving an error pyarrow.lib.ArrowInvalid: CSV parse error: Expected 7 columns, got 28 (or variations of that).

My S3 settings for the CSV format are:
Delimiter: ,
Infer Datatypes: false
Quote Character: "
Escape Character: \
Encoding: utf8
Double Quote: true
Allow Newlines in Values: false
Additional Reader Options: {"strings_can_be_null": true, "quoted_strings_can_be_null": true, "null_values": ["NA", "NULL", ""], "auto_dict_encode": true}
Advanced Options: none
Block Size: 10000
Manually Enforced Data Schema: {}

I’ve tried different combinations of infer datatypes, escape characters, double quotes etc, but none have worked so far.
I’ve attached a sample below of the CSV I’m trying to load (stripped of any identifying data)

_airbyte_ab_id,_airbyte_emitted_at,metric,profileId,recordType,reportDate,updatedAt
c97a7d6d-df00-4b6f-aeb3-a342da984a8f,1667868014696,"{""attributedUnitsOrdered30d"":""0"",""asin"":null,""attributedConversions7dSameSKU"":""0"",""attributedSales7dOtherSKU"":null,""targetingExpression"":""similar-product"",""attributedSales1dSameSKU"":""0"",""attributedUnitsOrdered14d"":""0"",""attributedUnitsOrdered7dOtherSKU"":null,""attributedConversions14d"":""0"",""campaignName"":""Campaign Name"",""attributedSales14dSameSKU"":""0"",""targetingText"":""similar-product"",""attributedConversions30d"":""0"",""attributedSales14d"":""0"",""attributedSales7dSameSKU"":""0"",""impressions"":""529"",""attributedConversions1dSameSKU"":""0"",""attributedSales30dOtherSKU"":null,""attributedUnitsOrdered30dOtherSKU"":null,""attributedSales1dOtherSKU"":null,""attributedSales30d"":""0"",""attributedUnitsOrdered7d"":""0"",""attributedConversions14dSameSKU"":""0"",""attributedConversions30dSameSKU"":""0"",""attributedUnitsOrderedNewToBrand14d"":""0"",""attributedConversions7d"":""0"",""attributedSales30dSameSKU"":""0"",""adGroupId"":""123456789"",""attributedOrdersNewToBrand14d"":""0"",""attributedSales1d"":""0"",""attributedUnitsOrdered14dOtherSKU"":null,""otherAsin"":null,""attributedSalesNewToBrand14d"":""0"",""clicks"":""0"",""attributedUnitsOrdered1dOtherSKU"":null,""attributedConversions1d"":""0"",""targetId"":""123456"",""currency"":""USD"",""campaignId"":""12345678"",""adId"":null,""sku"":null,""attributedSales7d"":""0"",""targetingType"":""TARGETING_EXPRESSION"",""cost"":""0"",""attributedSales14dOtherSKU"":null,""attributedUnitsOrdered1d"":""0"",""adGroupName"":""Ad group""}",12345678,targets,20221101,2022-11-07T16:40:14-08:00
ae1db8d7-6682-4535-b6d9-439ff55cedff,1667868014697,"{""attributedUnitsOrdered30d"":""0"",""asin"":null,""attributedConversions7dSameSKU"":""0"",""attributedSales7dOtherSKU"":null,""targetingExpression"":""asin=\""B012345678\"""",""attributedSales1dSameSKU"":""0"",""attributedUnitsOrdered14d"":""0"",""attributedUnitsOrdered7dOtherSKU"":null,""attributedConversions14d"":""0"",""campaignName"":""Campaign Name"",""attributedSales14dSameSKU"":""0"",""targetingText"":""asin=\""B012345678\"""",""attributedConversions30d"":""0"",""attributedSales14d"":""0"",""attributedSales7dSameSKU"":""0"",""impressions"":""2"",""attributedConversions1dSameSKU"":""0"",""attributedSales30dOtherSKU"":null,""attributedUnitsOrdered30dOtherSKU"":null,""attributedSales1dOtherSKU"":null,""attributedSales30d"":""0"",""attributedUnitsOrdered7d"":""0"",""attributedConversions14dSameSKU"":""0"",""attributedConversions30dSameSKU"":""0"",""attributedUnitsOrderedNewToBrand14d"":""0"",""attributedConversions7d"":""0"",""attributedSales30dSameSKU"":""0"",""adGroupId"":""123456789"",""attributedOrdersNewToBrand14d"":""0"",""attributedSales1d"":""0"",""attributedUnitsOrdered14dOtherSKU"":null,""otherAsin"":null,""attributedSalesNewToBrand14d"":""0"",""clicks"":""0"",""attributedUnitsOrdered1dOtherSKU"":null,""attributedConversions1d"":""0"",""targetId"":""123456"",""currency"":""USD"",""campaignId"":""12345678"",""adId"":null,""sku"":null,""attributedSales7d"":""0"",""targetingType"":""TARGETING_EXPRESSION"",""cost"":""0"",""attributedSales14dOtherSKU"":null,""attributedUnitsOrdered1d"":""0"",""adGroupName"":""Ad group""}",12345678,targets,20221101,2022-11-07T16:40:14-08:00
dd32b491-0d8f-43bf-a8a3-7a1c969c6dde,1667868014697,"{""attributedUnitsOrdered30d"":""0"",""asin"":null,""attributedConversions7dSameSKU"":""0"",""attributedSales7dOtherSKU"":null,""targetingExpression"":""asin=\""B012345678\"""",""attributedSales1dSameSKU"":""0"",""attributedUnitsOrdered14d"":""0"",""attributedUnitsOrdered7dOtherSKU"":null,""attributedConversions14d"":""0"",""campaignName"":""Campaign Name"",""attributedSales14dSameSKU"":""0"",""targetingText"":""asin=\""B012345678\"""",""attributedConversions30d"":""0"",""attributedSales14d"":""0"",""attributedSales7dSameSKU"":""0"",""impressions"":""3"",""attributedConversions1dSameSKU"":""0"",""attributedSales30dOtherSKU"":null,""attributedUnitsOrdered30dOtherSKU"":null,""attributedSales1dOtherSKU"":null,""attributedSales30d"":""0"",""attributedUnitsOrdered7d"":""0"",""attributedConversions14dSameSKU"":""0"",""attributedConversions30dSameSKU"":""0"",""attributedUnitsOrderedNewToBrand14d"":""0"",""attributedConversions7d"":""0"",""attributedSales30dSameSKU"":""0"",""adGroupId"":""123456789"",""attributedOrdersNewToBrand14d"":""0"",""attributedSales1d"":""0"",""attributedUnitsOrdered14dOtherSKU"":null,""otherAsin"":null,""attributedSalesNewToBrand14d"":""0"",""clicks"":""0"",""attributedUnitsOrdered1dOtherSKU"":null,""attributedConversions1d"":""0"",""targetId"":""123456"",""currency"":""USD"",""campaignId"":""12345678"",""adId"":null,""sku"":null,""attributedSales7d"":""0"",""targetingType"":""TARGETING_EXPRESSION"",""cost"":""0"",""attributedSales14dOtherSKU"":null,""attributedUnitsOrdered1d"":""0"",""adGroupName"":""Ad group""}",12345678,targets,20221101,2022-11-07T16:40:14-08:00

Error Log is below:

  File "pyarrow/ipc.pxi", line 683, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 7 columns, got 28: "22592105-2943-4830-9941-2d7d58297ff5",1667789430582,"{""attributedSales30dOtherSKU"":null,""att ...
,retryable=<null>,timestamp=1667955164669], io.airbyte.config.FailureReason@3fec4313[failureOrigin=source,failureType=<null>,internalMessage=io.airbyte.workers.general.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 1,externalMessage=Something went wrong within the source connector,metadata=io.airbyte.config.Metadata@6e73b620[additionalProperties={attemptNumber=0, jobId=352, connector_command=read}],stacktrace=java.util.concurrent.CompletionException: io.airbyte.workers.general.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 1
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1807)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: io.airbyte.workers.general.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 1
	at io.airbyte.workers.general.DefaultReplicationWorker.lambda$getReplicationRunnable$6(DefaultReplicationWorker.java:364)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
	... 3 more
,retryable=<null>,timestamp=1667955165003]]]

Any ideas as to how I can get this working?

Hey,

Thanks for the question. Have you tried manually defining the schema? Given that one of the columns is an object, it seems to be parsing the commas in the json as delimiters which is causing the error.

I also found this post in our forums where a user reported a similar problem: https://discuss.airbyte.io/t/help-setting-up-s3-source/539

Hi Sajarin,

Thanks for the reply! I did see that post. Have just tried defining a custom schema as below:

{"_airbyte_ab_id":"string",
"_airbyte_emitted_at":"string",
"metric":"object",
"profileId":"string",
"recordType":"string",
"reportDate":"string",
"updatedAt":"string"}

Just keeping everything else as a string for simplicity.

The resulting schema in Airbyte is below

Still no luck unfortunately. The connector is reading the commas in the object as delimiting columns. I have a feeling this might just be a drawback of CSV, but if you have any other ideas let me know! Will attempt some other formats in the meantime.

I think this is due to your current csv encoding. A snippet from your current csv:

"{""attributedUnitsOrdered30d"":""0"",

You have the following config:

Quote Character: "
Escape Character: \

In order to parse with that config, your csv needs to be written as:

"{\"attributedUnitsOrdered30d\":\"0\",

We’re successfully parsing csvs containing objects this way - it’s definitely doable. Your current csv just can’t be escaped properly. You could try setting " as the escape character, but I’m not sure if you can have the same character as the escape and quote character. Otherwise, can you control the csv writer at all?

Thanks Adam! I actually managed to get it working by setting the escape character to be a backtick ` (it wasn’t letting me have that field blank for some reason).

The issue seemed to be that the CSV writer was marking a single quote in some of those embedded strings as \"", but then the reader was reading it as a literal " character followed by a quote. Ignoring the escape character in conjunction with the treat double quotes as single quote option ended up doing the trick.

Hello coleydev, it’s been a while without an update from us. Are you still having problems or did you find a solution?