Summary
A user gets a parsing error while extracting data from a PDF file in OneDrive. The generic error text points to a possible mismatch between the configured file type and the actual file type, or to an unparseable record. The embedded message shows that NLTK could not find the `punkt_tab` resource and suggests downloading it with the NLTK Downloader.
Question
Hello, I’m working with OneDrive and encountered the following error while trying to extract data from a file:
```
{
  "_airbyte_raw_id": "9a1e42a1-f8a7-4aa0-ae5a-89f145858346",
  "_airbyte_extracted_at": 1732719459974,
  "_airbyte_generation_id": 0,
  "_airbyte_meta": {
    "changes": [],
    "sync_id": 22848019
  },
  "_airbyte_data": {
    "content": null,
    "document_key": "OneDrive.pdf",
    "_ab_source_file_parse_error": "Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. Contact Support if you need assistance.\nfilename=OneDrive.pdf message=\n**********************************************************************\n Resource \u001B[93mpunkt_tab\u001B[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \u001B[31m>>> import nltk\n >>> nltk.download('punkt_tab')\n \u001B[0m\n For more information see: <https://www.nltk.org/data.html>\n\n Attempted to load \u001B[93mtokenizers/punkt_tab/english/\u001B[0m\n\n Searched in:\n - '/nonexistent/nltk_data'\n - '/usr/local/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/local/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n**********************************************************************\n",
    "_ab_source_file_last_modified": "2024-11-14T11:45:15.000000Z",
    "_ab_source_file_url": "OneDrive.pdf"
  }
}
```
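
The `_ab_source_file_parse_error` value is hard to read because it still contains ANSI color codes (the `\u001B[93m` sequences) and a long search-path list. A minimal sketch, assuming the record has already been loaded with `json.loads` so the escapes are real characters, could strip the color codes and pull out the missing resource name and the directories that were searched. The function name `summarize_parse_error` and the shortened sample string are illustrative only and are not part of Airbyte or NLTK.

```python
import re

# Matches ANSI color sequences such as ESC[93m and ESC[0m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def summarize_parse_error(message: str) -> dict:
    """Extract the missing NLTK resource and searched paths from an error string.

    Hypothetical helper for reading the log above; it only does text matching.
    """
    clean = ANSI_ESCAPE.sub("", message)
    resource = re.search(r"Resource\s+(\S+)\s+not found", clean)
    paths = re.findall(r"^\s*-\s*'([^']+)'", clean, flags=re.MULTILINE)
    return {
        "resource": resource.group(1) if resource else None,
        # The log lists some directories twice; keep the first occurrence of each.
        "searched_paths": list(dict.fromkeys(paths)),
    }


# Shortened version of the message from the record above.
sample = (
    "filename=OneDrive.pdf message=\n"
    " Resource \x1b[93mpunkt_tab\x1b[0m not found.\n"
    " Searched in:\n"
    " - '/nonexistent/nltk_data'\n"
    " - '/usr/local/nltk_data'\n"
    " - '/usr/local/share/nltk_data'\n"
    " - '/usr/local/share/nltk_data'\n"
)

summary = summarize_parse_error(sample)
print(summary["resource"])
print(summary["searched_paths"])
```

Read this way, the record says the tokenizer data `punkt_tab` was not present in any of the listed `nltk_data` directories when the PDF was parsed.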
Has anyone encountered this issue before? Any guidance on resolving the parsing problem and ensuring files are extracted correctly would be greatly appreciated.
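
For reference, the error text itself recommends `nltk.download('punkt_tab')`. A minimal sketch of that step, assuming you control the Python environment where the parsing runs (for example a custom image; the post does not establish whether this is possible for the managed connector), might first check for the resource and only download it when it is missing. The function name `ensure_nltk_resource` and its parameters are hypothetical. The `nltk_module` argument exists only so the logic can be exercised without network access.

```python
def ensure_nltk_resource(
    resource_id="punkt_tab",
    resource_path="tokenizers/punkt_tab",
    download_dir=None,
    nltk_module=None,
):
    """Download an NLTK resource only if it cannot be found.

    Returns False if the resource was already present, True if a
    download was attempted. Hypothetical helper, not an Airbyte API.
    """
    if nltk_module is None:
        # Imported lazily so the function can be defined without NLTK installed.
        import nltk as nltk_module

    try:
        # nltk.data.find raises LookupError when the resource is missing,
        # which is the error shown in the record above.
        nltk_module.data.find(resource_path)
        return False
    except LookupError:
        # download_dir=None lets NLTK pick its default location. Passing one
        # of the searched directories from the log is an assumption that the
        # directory is writable in your environment.
        nltk_module.download(resource_id, download_dir=download_dir)
        return True
```

Calling `ensure_nltk_resource()` with no arguments would use the real `nltk` package; passing something like `download_dir="/usr/local/share/nltk_data"` targets one of the directories listed under "Searched in" in the error.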
<br>
---
This topic was created from a Slack thread to give it more visibility.
It is in read-only mode here. [Click here](https://airbytehq.slack.com/archives/C027KKE4BCZ/p1732722082466189) if you want to access the original thread.
[Join the conversation on Slack](https://slack.airbyte.com)
<sub>
['onedrive', 'data-extraction', 'parsing-error', 'nltk', 'punkt_tab']
</sub>