OneDrive Data Extraction Parsing Error

Summary

A user hits a parsing error while extracting data from a PDF file on OneDrive. The generic message points to a mismatch between the file type and the configuration, or an unparseable record, but the embedded details show that the NLTK resource `punkt_tab` could not be found and suggest downloading it with the NLTK Downloader.


Question

Hello, I’m working with OneDrive and encountered the following error while trying to extract data from a file:

```
{
  "_airbyte_raw_id": "9a1e42a1-f8a7-4aa0-ae5a-89f145858346",
  "_airbyte_extracted_at": 1732719459974,
  "_airbyte_generation_id": 0,
  "_airbyte_meta": {
    "changes": [],
    "sync_id": 22848019
  },
  "_airbyte_data": {
    "content": null,
    "document_key": "OneDrive.pdf",
    "_ab_source_file_parse_error": "Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. Contact Support if you need assistance.\nfilename=OneDrive.pdf message=\n**********************************************************************\n  Resource \u001B[93mpunkt_tab\u001B[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \u001B[31m>>> import nltk\n  >>> nltk.download('punkt_tab')\n  \u001B[0m\n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load \u001B[93mtokenizers/punkt_tab/english/\u001B[0m\n\n  Searched in:\n    - '/nonexistent/nltk_data'\n    - '/usr/local/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/local/lib/nltk_data'\n    - '/usr/share/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/lib/nltk_data'\n    - '/usr/local/lib/nltk_data'\n**********************************************************************\n",
    "_ab_source_file_last_modified": "2024-11-14T11:45:15.000000Z",
    "_ab_source_file_url": "OneDrive.pdf"
  }
}
```
Has anyone encountered this issue before? Any guidance on resolving the parsing problem and ensuring files are extracted correctly would be greatly appreciated.
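
For anyone triaging a record like this, the useful part of `_ab_source_file_parse_error` is buried under ANSI colour codes. Below is a minimal sketch of pulling out the missing resource name and the searched directories. The helper name `summarize_nltk_lookup_error` is made up for illustration, and the sample string is a shortened copy of the message above (after JSON decoding, `\u001B` becomes the ESC character).

```python
import re

# Matches ANSI colour sequences such as ESC[93m and ESC[0m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def summarize_nltk_lookup_error(message: str) -> dict:
    """Extract the missing resource and searched directories from an NLTK LookupError text.

    Hypothetical helper for illustration; not part of Airbyte or NLTK.
    """
    clean = ANSI_ESCAPE.sub("", message)
    resource = re.search(r"Resource (\S+) not found", clean)
    # Each searched directory appears on its own line as:  - '/some/path'
    searched = re.findall(r"^\s*- '([^']+)'", clean, flags=re.MULTILINE)
    return {
        "resource": resource.group(1) if resource else None,
        # dict.fromkeys drops duplicates while keeping the original order.
        "searched": list(dict.fromkeys(searched)),
    }


# Shortened sample taken from the error message in this post.
sample = (
    "  Resource \u001b[93mpunkt_tab\u001b[0m not found.\n"
    "  Searched in:\n"
    "    - '/nonexistent/nltk_data'\n"
    "    - '/usr/local/share/nltk_data'\n"
    "    - '/usr/local/share/nltk_data'\n"
)
print(summarize_nltk_lookup_error(sample))
```

This only makes the message easier to read; it does not change how the connector parses the file.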

---

This topic was created from a Slack thread to give it more visibility.
It is read-only here. [Click here](https://airbytehq.slack.com/archives/C027KKE4BCZ/p1732722082466189) to access the original thread.

<sub>
['onedrive', 'data-extraction', 'parsing-error', 'nltk', 'punkt_tab']
</sub>

I am seeing a similar issue, with NLTK failing to download a resource.

In my case, I am trying to build a connector that uses a `FileBasedSource`. The `unstructured_parser` module tries to download `punkt` to the `/nonexistent` path, fails, and the build terminates.

```
...
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt.zip

  Searched in:
    - '/nonexistent/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
...
File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py", line 49, in <module>
    nltk.download("punkt")
...
File "/usr/local/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/nonexistent'.
```
Any assistance would be appreciated.
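
One thing that might be worth trying, as a sketch and not a confirmed fix: the traceback shows the download failing because `/nonexistent` cannot be created, so pointing NLTK at a writable directory could sidestep it. NLTK documents the `NLTK_DATA` environment variable as an extra data location. Whether the module-level `nltk.download("punkt")` call in the CDK then writes there is an assumption I have not verified. The helper names `first_writable_dir` and `configure_nltk_data` are made up for illustration.

```python
import os
import tempfile
from typing import Iterable, Optional


def first_writable_dir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists (or can be created) and is writable."""
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
        except OSError:
            # For example PermissionError for '/nonexistent', as in the traceback.
            continue
        if os.access(path, os.W_OK):
            return path
    return None


def configure_nltk_data(candidates: Iterable[str]) -> Optional[str]:
    """Point NLTK_DATA at a writable directory.

    Assumption: this runs before `nltk` (or any module importing it) is
    imported, because NLTK reads the variable at import time.
    """
    target = first_writable_dir(candidates)
    if target is not None:
        os.environ["NLTK_DATA"] = target
    return target


# Usage sketch: the candidate path here is only an example.
chosen = configure_nltk_data([os.path.join(tempfile.gettempdir(), "nltk_data")])
print("NLTK_DATA ->", chosen)
```

In a connector image build, the equivalent would be setting the variable in the image itself, so that it is already in place when the parser module is imported.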

<@U067A0KLF33> and <@U0826L71K4K>, please report the issue on GitHub. I’m going to share this with the connector team.
Please also state which platform version and connector version you’re using.