Difficulty using File connector to download Excel xls file over https

Summary

User is facing difficulty using the File connector to download an Excel xls file over https. They have tried different files and web servers but encountered https errors and format issues. Seeking advice on how to resolve this common use case.


Question

Hi folks,

I’m having difficulty using using the File connector. Specifically, downloading an Excel xls file over https.

I’ve tried a couple of different files, differing web servers, differing structures. I’m getting an https error in one case and was told by support in one case the file format may be an issue and in the other that the problem would be looked at by others.

This is a really basic, common use case so I’m frustrated that such a promising tool seems unable to easily do this. The options for setting up the file connector are few in number so I don’t think it’s user stupidity. Although it could be! :slightly_smiling_face:

Here’s an example file I’m trying to download - https://www.derby.gov.uk/media/derbycitycouncil/contentassets/documents/licences/taxis/vehicles/dcc-private-hire-vehicle-public-register-april2024.xls|https://www.derby.gov.uk/media/derbycitycouncil/contentassets/documents/licences/taxi[…]icles/dcc-private-hire-vehicle-public-register-april2024.xls

Any advice is greatly appreciated.



This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want to access the original thread.

Join the conversation on Slack

["file-connector", "excel-xls", "https", "download", "format-issues"]

<@U0705PYALMN> are you using Airbyte Cloud or local?

Can you try the following?

  1. Convert .xls to .xlsx -> :white_check_mark::x:?
  2. Store the excel file in a GDrive folder -> :white_check_mark::x:?
  3. Simplify your excel: 1 tab, 1 column, 3 rows -> :white_check_mark::x:?

I’ve tried the conversion and it doesn’t work. As regards reformatting the file that’s not practical. I need Airbyte to do this autonomously.

Here’s some detail.

Google drive links to file uploaded in both xls and xlsx formats:
https://docs.google.com/spreadsheets/d/1BgpLakMqrLP8ut-XTdULAbCVh2NPlk8y/edit?usp=drive_link&amp;ouid=116900388473101210455&amp;rtpof=true&amp;sd=true|dcc-private-hire-vehicle-public-register-april2024.xlsx
https://docs.google.com/spreadsheets/d/1Z3FDIeunjJA1pPeFJI99-7PSy9EYaiM0/edit?usp=drive_link&amp;ouid=116900388473101210455&amp;rtpof=true&amp;sd=true|dcc-private-hire-vehicle-public-register-april2024.xls
Steps I’ve carried out:

  1. Uploaded both an xls and xlsx to two different files stored on Google Drive. Permissions set to allow anyone with the link access.
  2. Setup a new connection to the file using the file connector (xls file). Failed. Error “Configuration check failed. … Failed to load please use the Official Google Sheets Source connector”
  3. Setup a new connection to the file using the file connector (xlsx file). Failed. Same error.
  4. Setup a new conection using the Google Sheets connector to the xlsx file
  5. Failed. Detailed error below.

Reformatting request - This isn’t practical. It necessitates downloading a file manually, modifying the file manually, saving it as an xlsx, uploading it to Google Drive etc. All of this defeats the purpose of using an ELT tool like Airbyte to automate the process. Sorry, I don’t wish to sound mean but it should be obvious this isn’t a solution anyone can use.

Security comment - When using the Google sheets connector I have to authenticate and give permission to See and download all your Google Drive files.. For a commercial use case this isn’t practical. I’ve set permissions on the file to make it viewable to anyone with the link. That means Airbyte can see the file without the need for full access to all other files. Why is this not enough? It results in a far more secure solution for everyone.

Configuration check failed
Config error:

Learn more
Internal message: Config error:
Failure origin: source
Failure type: config_error

Stacktrace
File “/usr/local/lib/python3.9/site-packages/googleapiclient/_helpers.py”, line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File “/usr/local/lib/python3.9/site-packages/googleapiclient/http.py”, line 938, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://sheets.googleapis.com/v4/spreadsheets/1BgpLakMqrLP8ut-XTdULAbCVh2NPlk8y?includeGridData=false&amp;alt=json returned “This operation is not supported for this document”. Details: “This operation is not supported for this document”>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File “/airbyte/integration_code/main.py”, line 8, in <module>
run()
File “/airbyte/integration_code/source_google_sheets/run.py”, line 15, in run
launch(source, sys.argv[1:])
File “/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py”, line 235, in launch
for message in source_entrypoint.run(parsed_args):
File “/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py”, line 115, in run
yield from map(AirbyteEntrypoint.airbyte_message_to_string, self.check(source_spec, config))
File “/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py”, line 139, in check
check_result = self.source.check(self.logger, config)
File “/airbyte/integration_code/source_google_sheets/source.py”, line 62, in check
raise AirbyteTracedException(
airbyte_cdk.utils.traced_exception.AirbyteTracedException: Config error:

Sorry, forgot to mention. I’m using Airbyte Cloud.

But are those files, excel files or google sheet? cause there are specific connector for each format

The error log dump provided is from the Google sheets connector

Here’s a screenshot of the error shown when trying to use the file connector to download via http.

Hey Kevin,

Really appreciate your help on this. I’m trying my very best to see what I can do also.

I’ve taken two source Excel files, both different formats, converted them to CSV and uploaded them to Google Drive. Both files don’t import. Attached a screen showing the error. Same error on both.

The files on Google drive are:

https://drive.google.com/file/d/1c3h4W0Ab3lfAaXInZ_Qh4r64jBfNVGZ9/view?usp=sharing

https://drive.google.com/file/d/1K6zSJbdRRO2Px__swtJRQfIzDQarhwrO/view?usp=sharing

Both files have full public access. I’ve also authenticated my Google Drive account in Airbyte. The Google authentication passed within Airbyte.

I’m really really lost with this. I figured I’d just get a couple of csv files uploaded so I can use them as a source and get onto evaluating other areas including transformation and uploading to a Mongo database.

For the sake of troubleshooting can you try to:
• Download the file
• Transform it into a .xlsx file
• Place the file in a google drive folder
• Create a source connector with excel
• in the URL the path should look like this: https://drive.google.com/drive/u/0/folders/XXXXXX
cf. screenshot

If this work then at least you know it works so you can:
• Keep this solution and find a way to automate the manual steps
• Troubleshoot the next steps
If this does not work then it is weird cause for me it does work

if you use the Google Drive connector then you need to create a stream

Also Airbyte does not like files with empty rows or titles in csv based on my experience with the tool.

Might be another reason why this one does not work: https://drive.google.com/file/d/1K6zSJbdRRO2Px__swtJRQfIzDQarhwrO/view