Destination Bigquery sync process never starts

Is this your first time deploying Airbyte: No
OS Version / Instance: Debian 10 (buster) on AWS EC2, 8 cores, 16GB RAM
Deployment: Docker
Airbyte Version: 0.35.45-alpha
Source name: MySQL (0.4.13) - Incremental
Destination: BigQuery (0.5.0) - Deduped + history

Step: Writing data to BigQuery

Hello! My connection runs well on the source side, reading 8 million rows from 10 tables in MySQL, with the latest log being: 2022-03-25 14:36:32 source > 2022-03-25 14:36:32 INFO i.a.i.s.m.MySqlSource(main):200 - {} - completed source: class io.airbyte.integrations.source.mysql.MySqlSource.

However, after the source succeeds there is no output from the destination-bigquery worker: the latest log line is still the one about the source completing. The sync has stayed in the running state for 20 hours (so far) with no further logs to investigate, whether in the UI, the Docker containers, or the BigQuery console. After 20 hours or so, the connector simply fails with a deadline exceeded error.

Are there any logs I can check? Is the Deduped + history mode recommended for such data volumes (8 million rows across 10 tables)?
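A minimal way to check whether the destination container ever produced anything is to grep the downloaded per-attempt sync log for destination-side lines (each line is prefixed with `source >` or `destination >`). A sketch using a hypothetical sample log file:

```shell
# Sketch: check whether the destination side ever logged anything in a
# downloaded sync log. The file path and contents below are hypothetical.
cat > /tmp/sample_sync.log <<'EOF'
2022-03-25 14:36:32 source > completed source: class io.airbyte.integrations.source.mysql.MySqlSource
2022-03-25 14:36:40 destination > Uploading records to staging
EOF
# Lines emitted by the destination container carry the "destination >" prefix:
dest_lines=$(grep -c 'destination >' /tmp/sample_sync.log)
echo "destination log lines: $dest_lines"
```

If the count stays at zero long after the source completes, the destination worker is stalled before emitting any output.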

Thanks,


@NahidOulmi if you sync only one small table the sync works? The suggestion is to check if the problem is happening because of the amount of data or there is something fishy in the sync workflow.

@marcosmarxm I synced one table with ~6 million rows and it failed, then synced a smaller table (~300,000 rows) and BigQuery responded, so it’s probably related to the amount of data to load. Are there any data volume limits on the BigQuery destination? How would you suggest I proceed with further debugging?

For large syncs it’s recommended to use GCS Staging @NahidOulmi. Can you confirm you’re using it already? Maybe tweak the block size to a bigger value.
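For context on why the block (part) size matters: GCS staging uploads each staged file in fixed-size parts, and, assuming the usual 10,000-part cap of S3-compatible multipart uploads (an assumption, not something stated in this thread), the part size bounds how large a single staged file can get. A rough sketch of the arithmetic:

```shell
# Sketch: rough upper bound on a single staged file, assuming a
# 10,000-part limit per multipart upload (S3-compatible API behaviour).
max_parts=10000
for part_size_mb in 5 10; do
  echo "part size ${part_size_mb} MB -> max ~$((max_parts * part_size_mb / 1024)) GB per staged file"
done
```

So for very large tables a 5 MB part size can become the limiting factor, and raising it raises the ceiling.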

@marcosmarxm yes, I am using GCS Staging. The logs state that Avro chunks are being written to GCS; however, I don’t see these temp files in my GCS bucket. The connector does have access to the bucket (a reset command works fine and writes an Avro file to it). Also, I updated the chunk size to 10 MB instead of 5 MB, as recommended.

Also, it failed again today on syncing 30 smaller MySQL tables (less than 10 MB each).

Hey @NahidOulmi could you please upgrade your destination bigquery connector to version 1.0.1? 0.5.0 is pretty old :smile: .

If you still encounter an error could you please share:

  • your full sync logs
  • your server logs (downloadable from the settings page in the UI)

Could you also check which containers are running when no logs are displayed, and what their memory consumption is? You could try upsizing the EC2 instance to get more memory.
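The checks above can be sketched from the EC2 host roughly like this; the container name patterns are assumptions based on Airbyte's Docker naming, so adjust them to whatever `docker ps` actually shows on your instance:

```shell
# Sketch: list running job containers and take a one-shot memory snapshot.
# Guarded so it degrades cleanly on hosts without Docker.
if command -v docker >/dev/null 2>&1; then
  # Containers spawned for the running sync (names vary by Airbyte version):
  docker ps --format '{{.Names}}' | grep -Ei 'source|destination|normalization' || true
  # Point-in-time memory usage per container:
  docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}' || true
  checked='docker'
else
  checked='no-docker'
  echo "docker CLI not available on this host"
fi
```

A destination container sitting near its memory limit while emitting no logs would point to an out-of-memory stall rather than a connector bug.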

I suspected the GCS staging was at fault, so I tested with the GCS destination (version 0.1.24, JSONL format) instead of the BigQuery destination, and I get the same issue: no files are loaded at the destination.

I have other connections (for example, MongoDB connections) that use the same GCS destination, but with the MySQL source it fails for some reason.

@alafanechere yes I will send the logs :slight_smile:

Thanks,

Here are the logs:

logs-1934.txt (2.6 MB)

Syncing the source in Full refresh mode instead of Incremental mode seems to fix the issue, so the problem is specific to the combination of an Incremental MySQL source with the GCS/BigQuery destination. However, I need to work in Incremental mode.


Hi!

Has anyone figured out what’s at fault here?

I’m setting up my first connection (MySQL CDC → BigQuery via GCS Staging) and it seems to be hanging at the same stage: 2022-04-22 14:22:06 source > 2022-04-22 14:22:05 INFO i.a.i.s.m.MySqlSource(main):208 - completed source: class io.airbyte.integrations.source.mysql.MySqlSource.

I’m seeing 3 Avro files in GCS, but my tmp tables in BigQuery are empty and none of the final tables have been created…
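One way to compare the staged files against the BigQuery side from the command line is sketched below; the bucket, path, and dataset names are placeholders, and the `tmp` naming of Airbyte's temporary tables may differ by connector version:

```shell
# Sketch: compare staged files in GCS with Airbyte's temporary tables in
# BigQuery. Bucket, path, and dataset names below are placeholders.
if command -v gsutil >/dev/null 2>&1 && command -v bq >/dev/null 2>&1; then
  gsutil ls 'gs://my-staging-bucket/airbyte/**' || true  # staged Avro parts
  bq ls my_dataset | grep -i 'tmp' || true               # temporary tables
  checked='cloud-sdk'
else
  checked='no-cloud-sdk'
  echo "Google Cloud SDK (gsutil/bq) not installed"
fi
```

Staged files with empty tmp tables would suggest the load-from-GCS step never ran, rather than the upload failing.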

Other than setting the sync in full refresh and not incremental, is there a way to fix this?

Thanks!

Hi @NahidOulmi ,
Could you please also try upgrading your source connector to its latest version? (You are running source-mysql 0.4.9; 0.5.7 is the latest.)
According to your logs, the read behaves as expected.
It looks like there might be an issue with your destination setup:

2022-03-31 07:42:11 INFO i.a.v.j.JsonSchemaValidator(test):56 - JSON schema validation failed. 
errors: $.compression_codec: is missing but it is required, $.format_type: does not have a value in the enumeration [Avro]
2022-03-31 07:42:11 INFO i.a.v.j.JsonSchemaValidator(test):56 - JSON schema validation failed. 
errors: $.flattening: is missing but it is required, $.format_type: does not have a value in the enumeration [CSV]

Could you double-check what values you’ve set for these fields? (It might be a red herring.)
I think you chose JSONL. Do you mind trying Avro?

My goal with these suggestions is to determine if the problem is on the source or destination side…

I’m seeing the same issue, using MySQL source 0.5.6 and destination BigQuery 1.1.1. I have not tried full refresh, as I also want to use incremental. If I remove GCS staging, smaller tables sync to BQ fine, but on our larger tables the sync throws a 404, which, from reading related Airbyte GitHub issues, points to problems with the BigQuery SDK; the workaround is to enable GCS staging, so here I am :wink:

2022-04-25 15:13:18 INFO i.a.v.j.JsonSchemaValidator(test):56 - JSON schema validation failed. 
errors: $.credential: is not defined in the schema and the schema does not allow additional properties, $.part_size_mb: is not defined in the schema and the schema does not allow additional properties, $.gcs_bucket_name: is not defined in the schema and the schema does not allow additional properties, $.gcs_bucket_path: is not defined in the schema and the schema does not allow additional properties, $.keep_files_in_gcs-bucket: is not defined in the schema and the schema does not allow additional properties, $.method: must be a constant value Standard
2022-04-25 15:13:18 INFO i.a.w.p.KubePodProcess(close):713 - (pod: airbyte / normalization-sync-22-2-qxlfb) - Closed all resources for pod

Hey @NahidOulmi, do you still encounter this problem after upgrading the MySQL source?
@Ufou I’m not sure you’re encountering the exact same problem. Do you mind opening a second topic, sharing your full logs and the related GitHub issue you found? Thanks!