OneDrive Data Extraction Error

Summary

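A user hit a parsing error while extracting data from a PDF file through the OneDrive source. The generic message blames either a mismatch between the configured and actual file type or an unparseable record, but the detail attached to it is an NLTK lookup failure: the `punkt_tab` resource was not found in any of the searched `nltk_data` directories, and NLTK suggests obtaining it with `nltk.download('punkt_tab')`.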
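
A minimal sketch of how the error detail could be inspected and acted on. It assumes you have the JSON-decoded value of `_ab_source_file_parse_error` as a Python string, so `\u001B` is a real escape character and `\n` a real newline. The last helper additionally assumes NLTK is installed and that you can run code in the environment doing the parsing, which may not be possible for a managed connector. The helper names are hypothetical; `nltk.data.find` and `nltk.download` are standard NLTK calls.

```python
import re

# Matches ANSI color sequences such as "\x1b[93m" that appear in the error text.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def missing_nltk_resource(error_text):
    """Return the name of the missing NLTK resource, or None if none is reported."""
    clean = ANSI_ESCAPE.sub("", error_text)
    match = re.search(r"Resource\s+(\S+)\s+not found", clean)
    return match.group(1) if match else None


def searched_paths(error_text):
    """Return the directories NLTK reported searching, in order, without repeats."""
    clean = ANSI_ESCAPE.sub("", error_text)
    paths = re.findall(r"^\s*-\s*'([^']+)'\s*$", clean, flags=re.MULTILINE)
    return list(dict.fromkeys(paths))


def ensure_resource(resource, lookup_path):
    """Download `resource` only if NLTK cannot already find `lookup_path`."""
    import nltk  # imported lazily so the parsing helpers work without NLTK

    try:
        nltk.data.find(lookup_path)
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure.
        return nltk.download(resource)
```

For the error in this thread, `missing_nltk_resource` would yield `punkt_tab`, and the matching call would be `ensure_resource("punkt_tab", "tokenizers/punkt_tab")`. The download has to land in one of the directories NLTK searches. NLTK also honors the `NLTK_DATA` environment variable if none of the listed paths is writable.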


Question

Hello, I’m working with OneDrive and encountered the following error while trying to extract data from a file:

```
{
  "_airbyte_raw_id": "9a1e42a1-f8a7-4aa0-ae5a-89f145858346",
  "_airbyte_extracted_at": 1732719459974,
  "_airbyte_generation_id": 0,
  "_airbyte_meta": {
    "changes": [],
    "sync_id": 22848019
  },
  "_airbyte_data": {
    "content": null,
    "document_key": "OneDrive.pdf",
    "_ab_source_file_parse_error": "Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. Contact Support if you need assistance.\nfilename=OneDrive.pdf message=\n**********************************************************************\n  Resource \u001B[93mpunkt_tab\u001B[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \u001B[31m>>> import nltk\n  >>> nltk.download('punkt_tab')\n  \u001B[0m\n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load \u001B[93mtokenizers/punkt_tab/english/\u001B[0m\n\n  Searched in:\n    - '/nonexistent/nltk_data'\n    - '/usr/local/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/local/lib/nltk_data'\n    - '/usr/share/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/lib/nltk_data'\n    - '/usr/local/lib/nltk_data'\n**********************************************************************\n",
    "_ab_source_file_last_modified": "2024-11-14T11:45:15.000000Z",
    "_ab_source_file_url": "OneDrive.pdf"
  }
}
```
Has anyone encountered this issue before? Any guidance on resolving the parsing problem and ensuring files are extracted correctly would be greatly appreciated.

<br>

---

This topic was created from a Slack thread to give it more visibility.
It is read-only here. [Click here](https://airbytehq.slack.com/archives/C021JANJ6TY/p1732720203573769) to access the original thread.

[Join the conversation on Slack](https://slack.airbyte.com)

<sub>
['onedrive', 'data-extraction', 'parsing-error', 'nltk', 'punkt_tab']
</sub>