Summary
User requests enhancements to the Google Drive source connector for an S3 destination, including adding file_id and file_url to metadata and enabling raw file extraction from Google Drive.
Question
Hi,
We’re using the Google Drive source with an S3 destination and need to implement some critical enhancements to the connector. Here’s what we’re looking to achieve:
• Add file_id and file_url to Metadata:
- Currently, the connector fetches details like file content, file name, and file path, but it doesn’t include file_id or file_url. file_name is currently used as the primary key, but we’d like to switch to file_id as it’s more robust.
- We’ve identified changes needed in the airbyte_cdk[file-based] to include these additional fields.
- Question: Will modifying the generic airbyte_cdk[file-based] to support these fields affect other connectors relying on it?
• Enable Raw File Extraction:
The current setup uses the Unstructured platform for text extraction, but I’d like to be able to:
- Extract tables from Excel files.
- Extract both tables and images from PDFs.
- Store raw files (binary data) from Google Drive directly to S3 without any processing.
- Question: Is there a way to configure the connector to extract and store raw files instead of processed text?
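As an illustration of the two requests above, here is a minimal Python sketch. It is not the airbyte_cdk[file-based] schema or the connector's code: FileRecord, build_record, and raw_object_key are hypothetical names, the input is assumed to be a dict shaped like a Google Drive API v3 file resource (with id, name, and optionally webViewLink), and both the fallback URL pattern and the S3 key layout are assumptions.

```python
from dataclasses import dataclass


# Hypothetical record shape for illustration only; the real
# airbyte_cdk[file-based] record schema may differ.
@dataclass(frozen=True)
class FileRecord:
    file_id: str    # proposed primary key (instead of file_name)
    file_url: str
    file_name: str
    file_path: str


def build_record(drive_file: dict, file_path: str) -> FileRecord:
    """Build a record from a dict shaped like a Drive API v3 file resource."""
    file_id = drive_file["id"]
    # Use webViewLink when it was returned; otherwise fall back to the
    # common viewer URL pattern (an assumption, not a guaranteed format).
    file_url = (
        drive_file.get("webViewLink")
        or f"https://drive.google.com/file/d/{file_id}/view"
    )
    return FileRecord(
        file_id=file_id,
        file_url=file_url,
        file_name=drive_file["name"],
        file_path=file_path,
    )


def raw_object_key(record: FileRecord, prefix: str = "raw") -> str:
    """Hypothetical S3 key layout for an unprocessed copy of the file."""
    return f"{prefix}/{record.file_id}/{record.file_name}"


if __name__ == "__main__":
    # Two files with the same name in different folders.
    first = build_record({"id": "abc123", "name": "report.pdf"}, "reports/report.pdf")
    second = build_record({"id": "xyz789", "name": "report.pdf"}, "archive/report.pdf")
    for rec in (first, second):
        print(rec.file_id, rec.file_url, raw_object_key(rec))
```

In this sketch, two files that share a file_name still produce distinct records and distinct object keys because both are keyed on file_id. Writing the raw bytes to S3 would still need a download step and an upload step, which are left out here.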
This topic has been created from a Slack thread to give it more visibility.