Issue with Azure Blob source connector for extracting PDF/DOCX/PPTX contents

Summary

The user is facing an issue with the Azure Blob source connector when trying to fetch PDF/DOCX/PPTX contents. Fetching CSV contents works, but fetching other file types fails.


Question

Hello!
Reporting here as we’re trying to set up the Azure Blob source connector to extract unstructured file type data (https://airbyte.com/blog/airbyte-now-supports-extracting-text-from-documents), mainly PDF and DOCX files.
:white_check_mark: Fetching PDF/DOCX/PPTX contents from Google Drive using the provided connector works (source-gdrive v0.0.3)
:white_check_mark: Fetching CSV contents from Azure Blob using the provided connector (source-azureblobstorage v0.2.3) works
:x: Fetching PDF/DOCX/PPTX contents from Azure Blob using the provided connector (source-azureblobstorage v0.2.3) fails
We’re new to Airbyte, so apologies if we overlooked something obvious.
Thanks!



This topic has been created from a Slack thread to give it more visibility.

["azure-blob-source-connector", "pdf", "docx", "pptx", "fetching", "failure"]