Local ETL breaking because of new Airbyte columns added to the source, and full data transfer issues

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: MacOS
  • Memory / Disk: 4Gb
  • Deployment: Docker
  • Airbyte Version: 0.39.34-alpha
  • Source name/version: Snowflake
  • Destination name/version: BigQuery
  • Step: The issue is happening during sync
  • Description:
    Hi, first of all apologies if this is not the right place to post; if so, please delete the question and forward it, or tell us where to post.
    We are evaluating Airbyte at the moment for some transfers, and we really liked it at the start, but we are now running into some setbacks and would like to ask for advice:
  1. Extending the raw tables with the default Airbyte columns breaks our ETL: we can no longer insert rows because the column lists don't match. Is there a workaround, or would we need to map these new fields in every ETL script? That doesn't sound very user friendly. (See the sketch after this list.)
    Basically, our ETL consists of multiple INSERT statements (and similar) into tables that we then move incrementally, so it's either rewriting everything as MERGE statements that ignore the Airbyte columns, or there is some better setting we are missing.
  2. When trying to move large tables, the job fails after 3 attempts (it even started breaking with tables of 2M rows, so we didn't test with 2G or more; it tries in 50k batches and fails). Does using a larger machine for the container solve this, or are we missing some setup? We use the GCS staging option for the BigQuery upload, as recommended. This should probably be fine on a larger machine, but we just want to check whether this is expected.
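For illustration (not from the original post), here is a minimal Snowflake-style sketch of the breakage in point 1 and the usual workaround, assuming the failures come from inserts without an explicit column list; the table and column names are hypothetical:

```sql
-- Breaks once Airbyte adds its metadata columns to the target table:
-- the SELECT now produces fewer columns than the table expects.
INSERT INTO analytics.orders
SELECT * FROM staging.orders;

-- Keeps working: only the business columns are listed explicitly, so the
-- extra _AIRBYTE_AB_ID, _AIRBYTE_EMITTED_AT, ... columns are left untouched.
INSERT INTO analytics.orders (order_id, customer_id, amount, update_timestamp)
SELECT order_id, customer_id, amount, update_timestamp
FROM staging.orders;
```

The question in point 1 is essentially whether this kind of rewrite can be avoided.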

Any advice on these two things would be appreciated.
Thanks

Hey,

  1. I am not sure I got this right: the Airbyte-generated table has 10 columns and your custom script expects 12 columns? What I am getting at is: what is stopping you from matching the Airbyte columns to what you need?
  2. We will know more if you can share the logs.

  1. It's the other way around: the table has 10 columns, and after the first sync Airbyte added four more (_AIRBYTE_AB_ID, _AIRBYTE_EMITTED_AT, _AIRBYTE_NORMALIZED_AT, _AIRBYTE_TABLE_NAME_HASHID). We can of course map these extra columns, but imagine we are transferring 100 tables: we would need to change every INSERT INTO to include the new columns, or change everything to MERGE statements that do not map them. So the question is: what is the best practice for this?
  2. Interestingly, the logs show that the dbt part finished, but when you check the message it says:
    Last attempt:
    200.01 MB | 134,669 emitted records | no records | 6m 29s | Sync
    Failure Origin: destination, Message: Something went wrong within the destination connector

2022-07-13 15:04:29 - Additional Failure Information: io.airbyte.workers.general.DefaultReplicationWorker$DestinationException: Destination process exited with non-zero exit code 137


but I suspect the problem is the local Docker setup (exit code 137 typically means the container was killed, usually because it ran out of memory)

Hey

  1. Would a custom dbt transformation act as a solution here? (See the sketch below.)
  2. Could you share the complete log file here?
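For context (added here, not part of the original reply), a custom dbt transformation along these lines is one way to expose a table with only the business columns downstream of the Airbyte output. This is a minimal sketch: the source, model, and column names are hypothetical, and it assumes a dbt sources definition pointing at the Airbyte-produced table.

```sql
-- models/orders_clean.sql (hypothetical dbt model)
-- Rebuilds a clean downstream table from the Airbyte output, keeping only
-- the business columns and dropping the _airbyte_* metadata columns.
{{ config(
    materialized = 'incremental',
    unique_key = 'order_id'
) }}

select
    order_id,
    customer_id,
    amount,
    update_timestamp
from {{ source('airbyte_output', 'orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already loaded
  where update_timestamp > (select max(update_timestamp) from {{ this }})
{% endif %}
```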

Hi harshith, first of all thanks a lot for assisting here; let me try to give a better explanation of our use case.
Our DWH is on Snowflake, so our entire ETL is done there, and we have multiple transformations running overnight (sets of CREATE OR REPLACE, INSERT OVERWRITE, INSERT INTO, MERGE, … statements). We have a use case where we move these tables to BigQuery either completely or partially (incrementally, the last 5 days).
Transferring the tables works as expected (full transfer works great, incremental too; we map unique_id and update_timestamp as the reference for new data), and all is great.
The problem comes when these 4 columns are added to the tables in Snowflake: everything that is not a CREATE OR REPLACE breaks our overnight ETL. We can of course change the MERGE statements, or introduce dbt as the ETL and do it from there, but you have to admit that all of that is a lot of extra work and should be a bit simpler, so we are worried that we are missing something obvious here. (See the merge sketch below for the kind of rewrite involved.)
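To make the rework concrete, each incremental statement would have to become something like the following: a MERGE keyed on the unique id that only touches the business columns, leaving the Airbyte metadata columns alone. This is a Snowflake-style sketch with hypothetical table and column names, not the thread's actual ETL:

```sql
-- Upsert the last 5 days of data without referencing the _AIRBYTE_* columns,
-- so the statement keeps working after the metadata columns are added.
MERGE INTO analytics.orders AS tgt
USING (
    SELECT order_id, customer_id, amount, update_timestamp
    FROM staging.orders
    WHERE update_timestamp >= DATEADD(day, -5, CURRENT_DATE)
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
    customer_id      = src.customer_id,
    amount           = src.amount,
    update_timestamp = src.update_timestamp
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, update_timestamp)
    VALUES (src.order_id, src.customer_id, src.amount, src.update_timestamp);
```

Multiplied across the 100 tables mentioned above, this is the rewriting effort the question is about.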

Hey, could you create a GitHub issue about this so that the team can help/discuss there?