Source S3 - Failed attempts using up disk space. Where is Airbyte storing/caching data?

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: Ubuntu 21.04
  • Memory / Disk: 450gb
  • Deployment: Docker
  • Airbyte Version: 0.39.37-alpha
  • Source name/version: S3
  • Destination name/version: PostgreSQL
  • Step: Sync
  • Description: I am running a sync from an S3 data lake to a PostgreSQL database. A new file with the new data is created in the S3 source and needs to be synced to the PostgreSQL database.
    The first sync is a larger one, at 42 GB. There were quite a few failed attempts because of network issues and time constraints. It takes more than 6 hours, so I had to cancel a few attempts midway.
    I am testing on my laptop. I started with around 60 GB of free space, but across the 6 to 7 failed attempts my space has been completely used up.
    My last sync failed at 21 GB with "no space left on device".
    Where is the data stored during the failed attempts, and how can I clear it without having to uninstall and reinstall the Airbyte Docker deployment?

Hi @programmeddeath1, welcome to the community and thanks for your post!

I’m a bit confused about your setup. Have you allocated 450 GB of memory to the Airbyte instance? Could you run the docker system df command and post the output here?
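
If it is easier to capture the report from a script, here is a minimal Python sketch. It assumes the docker CLI is on your PATH; the function name docker_disk_usage is made up for this example, and the -v flag only adds a per-image, per-container, and per-volume breakdown to the summary.

```python
import shutil
import subprocess

# Docker CLI subcommand that summarizes disk used by images, containers,
# local volumes, and build cache. "-v" adds a per-item breakdown.
DOCKER_DF_CMD = ["docker", "system", "df", "-v"]


def docker_disk_usage(cmd=DOCKER_DF_CMD):
    """Return the command's report as text, or None if the binary is not on PATH.

    Hypothetical helper for this thread, not part of Airbyte or Docker.
    """
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    # On failure (for example, the Docker daemon is not running) return stderr
    # so the error message is still visible to whoever reads the output.
    return result.stdout if result.returncode == 0 else result.stderr
```

Calling print(docker_disk_usage()) should give the same text as running the command in a terminal, which you can paste here.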

Furthermore, here’s an article that may help you manage and clear the cache/storage of the docker container: Cleaning local docker cache | Rene Hernandez's Site

Hi, I am just running the Airbyte Docker deployment. 450 GB is the total drive space available locally. The Docker cache is not the issue; the images are only around 30 GB. My problem is that I had multiple failed sync attempts from the S3 buckets to the local PostgreSQL database. Each failed sync downloaded 20 to 40 GB of data before getting disconnected. Where is the downloaded data stored during the sync process? The final PostgreSQL table is empty.

Hi @programmeddeath1, when a sync fails, no data is written to the final destination table. The data is instead stored in _airbyte_tmp_* tables in the destination database. After a sync succeeds, a macro deletes these tmp tables. You can find more information here: Clean up unused _airbyte_tmp tables · Issue #7011 · airbytehq/airbyte · GitHub

Unfortunately, if a sync fails, these tmp tables are not automatically deleted. Automatic cleanup of tmp tables from failed syncs is in the works, though: automatically clean up temporary tables from failed syncs · Issue #4059 · airbytehq/airbyte · GitHub
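
Until that lands, the leftover tables can be removed by hand. Below is a rough Python sketch of one way to do it, not an official Airbyte procedure. It assumes the tables follow the _airbyte_tmp_* naming mentioned above and that you run the SQL yourself (through psql or any PostgreSQL driver) while no sync is in progress. The names LIST_TMP_TABLES_SQL, quote_ident, and build_drop_statements are invented for this example.

```python
# Prefix taken from the naming pattern mentioned above (_airbyte_tmp_*).
TMP_PREFIX = "_airbyte_tmp_"

# Lists candidate tables with their on-disk size, largest first.
# Underscores are escaped because "_" is a single-character wildcard in LIKE.
LIST_TMP_TABLES_SQL = r"""
SELECT table_schema,
       table_name,
       pg_size_pretty(pg_total_relation_size(
           format('%I.%I', table_schema, table_name)::regclass)) AS size
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_name LIKE '\_airbyte\_tmp\_%'
ORDER BY pg_total_relation_size(
           format('%I.%I', table_schema, table_name)::regclass) DESC;
"""


def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_drop_statements(tables):
    """Turn (schema, table) pairs into DROP TABLE statements.

    Any table whose name does not start with TMP_PREFIX is skipped, so a
    real destination table is never included by mistake.
    """
    return [
        f"DROP TABLE IF EXISTS {quote_ident(schema)}.{quote_ident(table)};"
        for schema, table in tables
        if table.startswith(TMP_PREFIX)
    ]
```

Run LIST_TMP_TABLES_SQL first to see what is there and how much space each table takes, then pass the (schema, table) rows to build_drop_statements and review the generated statements before executing them.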

Hope this helps!