- Is this your first time deploying Airbyte?: Yes
- OS Version / Instance: AWS
- Memory / Disk: t2.large
- Deployment: Docker compose
- Airbyte Version: 0.42
- Source name/version: s3
- Destination name/version: s3
- Step: CSV file parsing
- Description:
Hello,
A parsing error occurs when reading a CSV file from an S3 bucket.
The cause was that the value of one of the fields in the CSV was a single double-quote character ("), so the parser came up one field value short (47 columns instead of the expected 48).
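For illustration (the actual error is in the log below), here is a minimal local repro of what I think is happening, using pyarrow directly with made-up column names and values rather than the real file:

```python
import io

import pyarrow as pa
import pyarrow.csv as pacsv

# Made-up 3-column CSV. The lone double quote in the last row opens a
# quoted field that swallows the following delimiter, so the parser
# sees one column less than expected.
data = b'a,b,c\n1,2,3\n4,",6\n'

try:
    pacsv.read_csv(io.BytesIO(data))
except pa.ArrowInvalid as exc:
    print(exc)  # e.g. "CSV parse error: Expected 3 columns, got 2: ..."
```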
--- log
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 48 columns, got 47: 106409 1678502377545 1678502377954 PE NULL ES-PE ES-MX NULL HE_DTV_W21P_AFADATAA 03.34.65 W21P …
,retryable=,timestamp=1678843540912], io.airbyte.config.FailureReason@e9730d4[failureOrigin=source,failureType=,internalMessage=Source didn’t exit properly - check the logs!,externalMessage=Something went wrong within the source connector,metadata=io.airbyte.config.Metadata@103fd213[additionalProperties={attemptNumber=0, jobId=25, connector_command=read}],stacktrace=io.airbyte.workers.internal.exception.SourceException: Source didn’t exit properly - check the logs!
at io.airbyte.workers.general.DefaultReplicationWorker.lambda$readFromSrcAndWriteToDstRunnable$7(DefaultReplicationWorker.java:437)
at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: io.airbyte.workers.exception.WorkerException: Source process exit with code 1. This warning is normal if the job was cancelled.
at io.airbyte.workers.internal.DefaultAirbyteSource.close(DefaultAirbyteSource.java:158)
at io.airbyte.workers.general.DefaultReplicationWorker.lambda$readFromSrcAndWriteToDstRunnable$7(DefaultReplicationWorker.java:435)
… 4 more
---
Is there any way to solve this? I need a way to skip rows when a parsing error occurs, or an option to fill the missing field values with null.
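To clarify the kind of "skip" behaviour I mean: outside Airbyte, pyarrow itself can do this via invalid_row_handler in ParseOptions (in recent pyarrow versions). A rough sketch with made-up data, not the connector's actual code:

```python
import io

import pyarrow.csv as pacsv

data = b'a,b,c\n1,2,3\n4,",6\n7,8,9\n'

def handle_invalid_row(row):
    # "row" is a pyarrow InvalidRow (expected_columns, actual_columns,
    # number, text). Returning "skip" drops the row instead of failing
    # the whole read; returning "error" keeps the current behaviour.
    print(f"skipping row {row.number}: {row.text!r}")
    return "skip"

table = pacsv.read_csv(
    io.BytesIO(data),
    parse_options=pacsv.ParseOptions(invalid_row_handler=handle_invalid_row),
)
print(table.to_pydict())  # e.g. {'a': [1, 7], 'b': [2, 8], 'c': [3, 9]}
```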
For reference, PySpark seems to fill the missing values with nulls in this situation.
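For example, something like this (hypothetical bucket and path) reads the same kind of file without failing, because Spark's default PERMISSIVE mode pads short rows with nulls:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# PERMISSIVE is Spark's default CSV mode: rows with too few fields get
# the missing columns filled with null instead of failing the read.
df = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .csv("s3a://my-bucket/path/to/file.csv")  # hypothetical bucket/path
)
df.show()
```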