Struggling with S3 to Snowflake connection

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Linux
  • Memory / Disk: 128GB / 1TB
  • Deployment: Docker
  • Airbyte Version: 0.40.9
  • Source name/version: S3 - 0.1.21
  • Destination name/version: Snowflake 0.4.38
  • Step: During sync
  • Description:
    Currently unable to get a connection from an S3 bucket to Snowflake to sync, though it has worked in the past. We have three sources from the same bucket, each reading from a different folder prefix. These were originally created under slightly older versions of Airbyte and the source connector (0.39.7, I think), but the instance and connectors have since been upgraded.
    For a few weeks now these have not worked, and it’s unclear why. These are the problems I can see so far:
  1. The logs won’t load. If you click on the status of the run, it just hangs until, after a while, it brings up an error message and you have to refresh the page. This also means the logs can’t be found on the EC2 instance, since there’s no way to get the path/ID of the log file.


    (Screenshot: the error shown after a while of trying to load one of the failed status logs.)

  2. Can’t cancel the sync. The cancel action doesn’t respond; the sync just stays in the same state.

  3. The sync won’t complete; there have been several failed attempts.

One potential issue might be that there are a lot of small files, which could be causing problems.
Number of files per folder (i.e. per source) = 79,323, with a total size of 2.1 GB
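
A rough boto3 sketch like the one below (bucket and prefix names are placeholders) can double-check those per-prefix counts and sizes:

```python
# Minimal sketch to verify the "lots of small files" theory: count objects
# and total size per prefix. Bucket and prefixes below are placeholders.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

def prefix_stats(bucket: str, prefix: str):
    count, total_bytes = 0, 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            count += 1
            total_bytes += obj["Size"]
    return count, total_bytes

for prefix in ["folder-a/", "folder-b/", "folder-c/"]:  # placeholder prefixes
    n, size = prefix_stats("my-source-bucket", prefix)
    print(f"{prefix}: {n} files, {size / 1e9:.2f} GB")
```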

The syncs have been running for a couple of days without success, which seems odd given that the data size is not that big and there are plenty of resources available on the EC2 (CPU, memory and disk space usage are all low).

So I need to:

  1. Be able to find and query the logs in some way to dive into the problem (see the sketch after this list)
  2. Solve the problem :slight_smile:
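
One thing I might try for point 1 is pulling the logs straight from Airbyte’s internal Config API rather than through the UI. This is only a sketch built on assumptions: that the server API is reachable on the EC2 at port 8001, and that the jobs/list and jobs/get endpoints behave as they did around 0.40.x; the connection ID is a placeholder.

```python
# Rough sketch, not an official client: fetch sync-job logs from Airbyte's
# internal Config API so they can be read even when the UI hangs.
# Assumes the airbyte-server API is reachable on port 8001 (adjust host/port)
# and that the jobs/list and jobs/get endpoints match this Airbyte version.
import requests

API = "http://localhost:8001/api/v1"
CONNECTION_ID = "<your-connection-uuid>"  # placeholder

# List sync jobs for the connection.
jobs = requests.post(
    f"{API}/jobs/list",
    json={"configTypes": ["sync"], "configId": CONNECTION_ID},
).json()["jobs"]
latest_job_id = jobs[0]["job"]["id"]

# Fetch the job detail, which should include per-attempt log lines.
detail = requests.post(f"{API}/jobs/get", json={"id": latest_job_id}).json()
for attempt in detail["attempts"]:
    print(f"--- attempt {attempt['attempt']['id']} ---")
    for line in attempt["logs"]["logLines"]:
        print(line)
```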

(Edit)
I’ve recreated the connector so that I could follow the logs from the start. It’s been hanging on this for a while. Last message is:
2022-09-27 08:49:12 destination > 2022-09-27 08:49:12 INFO i.a.i.d.s.StagingConsumerFactory(lambda$onStartFunction$1):134 - Preparing tmp tables in destination completed.
Will update if it progresses.
6235d81b_dbe6_43bd_8c5e_30222856eaec_logs_22058_txt.txt (16.1 KB)

Hi @kyle-mackenzie-indee, sorry to hear you’re having all these issues!

Re logs not loading: Could you take a look at the console log in dev tools and see what’s being logged in there?

I think these types of errors may happen if Airbyte isn’t authenticated correctly. I know it’s a super simple ask, but could you try logging out and restarting Airbyte?

I’ve recreated one of the sources and connections and updated my original post with the early status logs, though no errors yet.

Below is the result in the Chrome dev tools console of trying to load the status logs of one of the others. The first two entries in yellow are just from my extensions, I think.

Maybe relevant, maybe separate: if I upgrade a connector and/or the Airbyte version, do existing connections, sources and destinations automatically start using the new versions, or do they need to be recreated each time?

This looks like a connection issue related to the health of your deployment. A few ideas:

  1. Could you try redeploying Airbyte?
  2. Are there server logs?
  3. This may be a server timeout issue; could you see if you can extend the server timeout?

What would a redeployment involve? Simply restarting the docker-compose stack (as with an upgrade), or a fresh install that wipes everything on the EC2?
I’ve done the former without success.
If the latter, is there an easy and clean way which keeps the existing sources, destinations and connections?

I tried recreating the connector as well. It seems to be running, but it’s incredibly slow: 2 days to sync 19 MB.

Grabbed some logs. I’ve only kept the first 160 or so lines, since the rest was 2,000,000+ rows of the same “Detected mismatched datatype on column” message.
I feel like this could be what’s causing the log-loading problems for the connection that’s been allowed to run for a long time.
8d17fe3d_00c9_4d8e_b253_7d0897d5d073_logs_1211_txt.txt (34.2 KB)
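
For what it’s worth, one offline way to check that theory (sketch only; the glob path is a placeholder and it assumes the source files are plain CSVs) is to let pyarrow infer each file’s schema and flag columns whose inferred types disagree between files, which appears to be what triggers those messages:

```python
# Hypothetical diagnostic: infer the schema of a sample of the CSV files with
# pyarrow and report columns whose inferred types differ between files.
import glob
from collections import defaultdict

import pyarrow.csv as pv

types_seen = defaultdict(set)
for path in glob.glob("sample_files/*.csv")[:200]:  # local sample of the S3 files
    schema = pv.read_csv(path).schema
    for field in schema:
        types_seen[field.name].add(str(field.type))

for column, types in sorted(types_seen.items()):
    if len(types) > 1:
        print(f"{column}: inferred as {sorted(types)}")
```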

Also tried with the longer timeout, but the issue at the moment seems to be the incredibly slow speed (it possibly does hit the timeout eventually).

Couldn’t get this working, and due to timescales I had to give up and switch to another solution.

Sorry to hear you’ve had to switch to another solution! If you are still open to debugging this, there are several options in the S3 source connector that can be tweaked and might be behind the errors:

  1. Double quote toggle
  2. New line toggle

Another thing that can be configured is the additional reader options field. It accepts custom parameters:
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions
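
For example (the column names below are made up), pinning problem columns to an explicit type via ConvertOptions is the kind of thing those reader options control, and it avoids per-file type-inference disagreements:

```python
# Sketch of what those reader options do under the hood; column names here
# are hypothetical. Forcing the offending columns to string sidesteps
# "mismatched datatype" issues caused by per-file type inference.
import pyarrow as pa
import pyarrow.csv as pv

convert_options = pv.ConvertOptions(
    column_types={"order_id": pa.string(), "amount": pa.string()}  # hypothetical columns
)
table = pv.read_csv("example.csv", convert_options=convert_options)
print(table.schema)
```

If I remember correctly, the additional reader options field takes a JSON string that is passed through to these pyarrow options, so something like {"column_types": {"order_id": "string"}} may achieve the same thing from the connector config; please treat that as an assumption and check the docs for your connector version.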

If you have time to upload an example of the data you’re sourcing from S3, that would help!