Local ETL breaking because of new Airbyte columns added to the source, and full data transfer issues

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: MacOS
  • Memory / Disk: 4Gb
  • Deployment: Docker
  • Airbyte Version: 0.39.34-alpha
  • Source name/version: Snowflake
  • Destination name/version: BigQuery
  • Step: The issue is happening during sync
  • Description:
    Hi, first of all apologies if this is not the right place to post; if so, please delete the question and forward it, or tell us where to post.
    We are evaluating Airbyte at the moment for some transfers, and we really liked it at the start, but we are now running into some setbacks and would like to ask for advice:
  1. Extending the raw tables with the default Airbyte columns breaks our ETL: we can no longer insert rows because the column lists don't match. Is there a workaround, or would we need to map these new fields in every ETL script? That doesn't sound very user friendly. (See the sketch after this list.)
    Basically, our ETL consists of multiple INSERT statements (and similar) into tables that we then move incrementally, so it's either rewriting everything as MERGE statements that ignore the Airbyte columns, or there is some better setting we are missing.
  2. When trying to move large tables, the job fails after 3 attempts (it even started breaking with tables of 2M rows, so we didn't test with 2G or more; it tries in 50k batches and fails). Does using a larger machine for the container solve this, or are we missing some setup? We use the GCS staging option for the BigQuery upload, as recommended. This should probably be fine on a larger machine, but we just want to check whether this is expected.
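For illustration (not from the original post), here is a minimal Snowflake-style sketch of the breakage in point 1 and the usual workaround, assuming the failures come from inserts without an explicit column list; the table and column names are hypothetical:

```sql
-- Breaks once Airbyte adds its metadata columns to the target table:
-- the SELECT now produces fewer columns than the table expects.
INSERT INTO analytics.orders
SELECT * FROM staging.orders;

-- Keeps working: only the business columns are listed explicitly, so the
-- extra _AIRBYTE_AB_ID, _AIRBYTE_EMITTED_AT, ... columns are left untouched.
INSERT INTO analytics.orders (order_id, customer_id, amount, update_timestamp)
SELECT order_id, customer_id, amount, update_timestamp
FROM staging.orders;
```

The question in point 1 is essentially whether this kind of rewrite can be avoided.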

Any advice on these two things would be appreciated.
Thanks

Hey,

  1. I am not sure I got this right: the Airbyte-generated table has 10 columns and your custom script expects 12 columns? What I am getting at is: what is stopping you from matching the Airbyte columns to what you need?
  2. We will know more if you can share the logs.

  1. It's the other way around: the table has 10 columns, and after the first sync Airbyte added four more (_AIRBYTE_AB_ID, _AIRBYTE_EMITTED_AT, _AIRBYTE_NORMALIZED_AT, _AIRBYTE_TABLE_NAME_HASHID). We can of course map these extra columns, but imagine we are transferring 100 tables: we would need to change every INSERT INTO to include the new columns, or change everything to MERGE statements that do not map them. So the question is: what is the best practice for this?
  2. Interestingly, the logs show that the dbt part finished, but when you check the message it says:
    Last attempt:
    200.01 MB | 134,669 emitted records | no records | 6m 29s | Sync
    Failure Origin: destination, Message: Something went wrong within the destination connector

2022-07-13 15:04:29 - Additional Failure Information: io.airbyte.workers.general.DefaultReplicationWorker$DestinationException: Destination process exited with non-zero exit code 137


but I suspect the problem is the local Docker setup (exit code 137 typically means the container was killed, usually because it ran out of memory)

Hey

  1. Would a custom dbt transformation act as a solution here? (See the sketch below.)
  2. Could you share the complete log file here?
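For context (added here, not part of the original reply), a custom dbt transformation along these lines is one way to expose a table with only the business columns downstream of the Airbyte output. This is a minimal sketch: the source, model, and column names are hypothetical, and it assumes a dbt sources definition pointing at the Airbyte-produced table.

```sql
-- models/orders_clean.sql (hypothetical dbt model)
-- Rebuilds a clean downstream table from the Airbyte output, keeping only
-- the business columns and dropping the _airbyte_* metadata columns.
{{ config(
    materialized = 'incremental',
    unique_key = 'order_id'
) }}

select
    order_id,
    customer_id,
    amount,
    update_timestamp
from {{ source('airbyte_output', 'orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already loaded
  where update_timestamp > (select max(update_timestamp) from {{ this }})
{% endif %}
```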

Hi harshith, first of all thanks a lot for assisting here; let me try to give a better explanation of our use case.
Our DWH is on Snowflake, so our entire ETL is done there, and we have multiple transformations running overnight (sets of CREATE OR REPLACE, INSERT OVERWRITE, INSERT INTO, MERGE, … statements). We have a use case where we move these tables to BigQuery either completely or partially (incrementally, the last 5 days).
Transferring the tables works as expected (full transfer works great, incremental too; we map unique_id and update_timestamp as the reference for new data), and all is great.
The problem comes when these 4 columns are added to the tables in Snowflake: everything that is not a CREATE OR REPLACE breaks our overnight ETL. We can of course change the MERGE statements, or introduce dbt as the ETL and do it from there, but you have to admit that all of that is a lot of extra work and should be a bit simpler, so we are worried that we are missing something obvious here. (See the merge sketch below for the kind of rewrite involved.)
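To make the rework concrete, each incremental statement would have to become something like the following: a MERGE keyed on the unique id that only touches the business columns, leaving the Airbyte metadata columns alone. This is a Snowflake-style sketch with hypothetical table and column names, not the thread's actual ETL:

```sql
-- Upsert the last 5 days of data without referencing the _AIRBYTE_* columns,
-- so the statement keeps working after the metadata columns are added.
MERGE INTO analytics.orders AS tgt
USING (
    SELECT order_id, customer_id, amount, update_timestamp
    FROM staging.orders
    WHERE update_timestamp >= DATEADD(day, -5, CURRENT_DATE)
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
    customer_id      = src.customer_id,
    amount           = src.amount,
    update_timestamp = src.update_timestamp
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, update_timestamp)
    VALUES (src.order_id, src.customer_id, src.amount, src.update_timestamp);
```

Multiplied across the 100 tables mentioned above, this is the rewriting effort the question is about.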

Hey, could you create a GitHub issue about this so that the team can help/discuss there?