NLTK resource download failure in FileBasedSource connector

Summary

The connector build fails when NLTK cannot find the 'punkt' resource (LookupError) and then tries to download it under /nonexistent, a directory it cannot create (PermissionError).


Question

I am seeing a similar issue where NLTK fails to download a resource.

In my case, I am trying to build a connector that uses a FileBasedSource. On import, the unstructured_parser module calls nltk.download("punkt"), which tries to create /nonexistent, fails, and terminates the build.

```
...
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt.zip

  Searched in:
    - '/nonexistent/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
...
File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py", line 49, in <module>
    nltk.download("punkt")
...
File "/usr/local/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/nonexistent'
```
Any assistance would be appreciated.
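
For anyone trying to reproduce this, here is a rough diagnostic sketch using only the standard library. It assumes, based on the traceback, that the build user's home directory resolves to /nonexistent and that none of the searched directories is writable. The helper names `first_writable_dir` and `prepare_nltk_data_dir` and the temp-directory location are hypothetical, picked only for illustration. As far as I understand, NLTK reads the `NLTK_DATA` environment variable at import time, so pointing it at an existing writable directory before `import nltk` may be enough. I have not confirmed this against the Airbyte CDK build.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def first_writable_dir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists and is writable, else None."""
    for candidate in candidates:
        if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
            return candidate
    return None


def prepare_nltk_data_dir(target: str) -> str:
    """Create `target` and expose it through the NLTK_DATA variable.

    Assumption: this runs before `import nltk`, because NLTK builds its
    search path from NLTK_DATA when the package is imported.
    """
    Path(target).mkdir(parents=True, exist_ok=True)
    os.environ["NLTK_DATA"] = target
    return target


if __name__ == "__main__":
    # Unique directories from the "Searched in" list of the traceback.
    searched = [
        "/nonexistent/nltk_data",
        "/usr/local/nltk_data",
        "/usr/local/share/nltk_data",
        "/usr/local/lib/nltk_data",
        "/usr/share/nltk_data",
        "/usr/lib/nltk_data",
    ]
    print("home resolves to:", os.path.expanduser("~"))
    print("first writable search dir:", first_writable_dir(searched))

    # Hypothetical writable location; any directory the build user owns works.
    data_dir = prepare_nltk_data_dir(
        os.path.join(tempfile.gettempdir(), "nltk_data")
    )
    print("NLTK_DATA set to:", data_dir)
```

If the second line prints `None`, none of the searched directories is usable, which matches the fallback to the home directory seen in the PermissionError. Setting the variable in the build environment itself, rather than in Python, should have the same effect.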

<br>

---

This topic was created from a Slack thread to give it more visibility.
It is in read-only mode here. [Click here](https://airbytehq.slack.com/archives/C027KKE4BCZ/p1733339726564329) if you want
to access the original thread.


<sub>
['nltk', 'punkt', 'filebasedsource', 'connector-builder', 'permission-error']
</sub>