Help setting up s3 source

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: Ubuntu
  • Memory / Disk: 8 GB / 30 GB
  • Deployment: docker
  • Airbyte Version: 0.35.65-alpha
  • Source name/version: s3
  • Destination name/version: redshift
  • Step: sync
  • Description: Need help figuring out configuration settings for s3 source.

I am trying to set up Airbyte for the first time. I have an S3 source and a Redshift destination configured, but I am running into problems during sync. I have tried a bunch of different settings but can't figure it out.

The current error is…

pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 9

I assume this is because of some config setting. There are no columns per se in these files; they are JSONL files.

My configs…

encoding: utf-8
Delimiter: ,
Quote Char: "
Advanced Options: {"column_names": ["test_column"] }

The files I am trying to process are kept in this folder structure (in the s3 bucket)…

service-events/dt=2021-10-03/staging-firehose-datalake-stream-5-2021-10-03-16-45-26-b4fc185e-9133-4aaf-b6d4-f0616f0f2812

Here is content of one of these files…

{"userId":"615630dc81cd91175ee075fc","sourceLanguage":"en","targetLanguage":"de","sourcePhrase":"a bit","sourceRoot":"bit","sourcePrefix":"a","targetPhrase":"ein Bisschen","incrementalViews":1,"eventName":"translationViewed"}
{"userId":"615630dc81cd91175ee075fc","sourceLanguage":"en","targetLanguage":"de","sourcePhrase":"job","sourceRoot":"job","targetPhrase":"Arbeit","incrementalViews":1,"eventName":"translationViewed"}
{"userId":"615630dc81cd91175ee075fc","sourceLanguage":"en","targetLanguage":"de","sourcePhrase":"information","sourceRoot":"information","targetPhrase":"Information","incrementalViews":1,"eventName":"translationViewed"}
{"userId":"615630dc81cd91175ee075fc","sourceLanguage":"en","targetLanguage":"de","sourcePhrase":"example","sourceRoot":"example","targetPhrase":"Beispiel","incrementalViews":1,"eventName":"translationViewed"}
{"userId":"615630dc81cd91175ee075fc","sourceLanguage":"en","targetLanguage":"de","sourcePhrase":"greatly","sourceRoot":"greatly","targetPhrase":"sehr","incrementalViews":1,"eventName":"translationViewed"}
{"userId":"615630dc81cd91175ee075fc","sourceLanguage":"en","targetLanguage":"de","sourcePhrase":"example","sourceRoot":"example","targetPhrase":"Beispiel","incrementalViews":0,"eventName":"translationViewed"}
{"userId":"615630dc81cd91175ee075fc","sourceLanguage":"en","targetLanguage":"de","sourcePhrase":"day","sourceRoot":"day","targetPhrase":"Tag","incrementalViews":1,"eventName":"translationViewed"}
{"userId":"615630dc81cd91175ee075fc","sourceLanguage":"en","targetLanguage":"de","sourcePhrase":"functionality","sourceRoot":"functionality","targetPhrase":"Funktionalität","incrementalViews":1,"eventName":"translationViewed"}

Hi @arunsivasankaran,
The root cause of your problem is that your input files are not in CSV format but in JSONL format. The S3 source connector does not yet support this format.
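For reference, JSONL (newline-delimited JSON) means one complete JSON object per line. A minimal Python sketch of how such content is parsed (this is a standalone illustration, not connector code; `parse_jsonl` is a hypothetical helper name):

```python
import json

def parse_jsonl(text):
    """Parse JSONL content: each non-empty line is one JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Two lines shaped like the sample records above:
sample = (
    '{"userId":"615630dc81cd91175ee075fc","eventName":"translationViewed"}\n'
    '{"userId":"615630dc81cd91175ee075fc","eventName":"translationViewed"}\n'
)
records = parse_jsonl(sample)  # one dict per line
```

A CSV reader has no notion of this line-per-record structure, which is why it chokes on the commas inside each object.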
There was a WIP PR that tried to implement a JSON parser: 🎉 Source S3: add json parser by Tutuchan · Pull Request #9117 · airbytehq/airbyte · GitHub

Feel free to take over if you want to finish this integration!

Can I not treat the data like a one-column CSV?

I would not recommend it, but you could try to use \n as the delimiter.

Does the delimiter not mean the separator between each column? Or am I misunderstanding the setting? I want to treat each line in the file as a single column, with the JSON string as the value.

A delimiter delimits columns: if you use `,` you will get 9 columns, since the first sample line contains 8 commas.
As your file is not a CSV, a hack could be to use \n so that each line is treated as both a single column and a single record. I'm not sure this works, since \n also delimits lines, and hence records. You'll also have to change the quoting setting and probably set the schema manually to something like: {"data": "string"}.

In a nutshell, through some hacking with the reader parameters you might achieve your desired output, but this is not really something I'd recommend.