MySQL to ClickHouse sync fails with a memory exception during incremental dedup normalization of a 1 GB table

I am trying to sync a 1 GB table from MySQL to ClickHouse. The source read completes, but normalization on the ClickHouse side fails with a memory exception.

   Code: 241. DB::Exception: Memory limit (total) exceeded: would use 14.00 GiB (attempt to allocate chunk of 4259840 bytes), maximum: 13.94 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.: While executing Mer
  compiled Code at ../build/run/airbyte_utils/models/generated/airbyte_incremental/scd/fm_revive/rv_data_summary_Dev_ad_hourly_scd.sql,retryable=<null>,timestamp=1670421696119], io.airbyte.config.FailureReason@7c411496[failureOrigin=normalization,failureType=system_error,internalMessage=:HTTPDriver url http://172.31.45.109:8123 returned response code 500),externalMessage=Normalization failed during the dbt run. This may indicate a problem with the data itself.,metadata=io.airbyte.config.Metadata@212b1f9f[additionalProperties={attemptNumber=0, jobId=13, from_trace_message=true}],stacktrace=AirbyteDbtError: 
38 of 72 ERROR creating sql incremental model fm_revive.rv_data_summary_Dev_ad_hourly_scd .............................. [ERROR in 284.88s]
Database Error in model rv_data_summary_Dev_ad_hourly_scd (models/generated/airbyte_incremental/scd/fm_revive/rv_data_summary_Dev_ad_hourly_scd.sql)
  :HTTPDriver url http://172.31.45.109:8123 returned response code 500)
   Code: 241. DB::Exception: Memory limit (total) exceeded: would use 14.00 GiB (attempt to allocate chunk of 4259840 bytes), maximum: 13.94 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.: While executing Mer
  compiled Code at ../build/run/airbyte_utils/models/generated/airbyte_incremental/scd/fm_revive/rv_data_summary_Dev_ad_hourly_scd.sql
38 of 72 ERROR creating sql incremental model fm_revive.rv_data_summary_Dev_ad_hourly_scd .............................. [ERROR in 284.88s]
Database Error in model rv_data_summary_Dev_ad_hourly_scd (models/generated/airbyte_incremental/scd/fm_revive/rv_data_summary_Dev_ad_hourly_scd.sql)
  :HTTPDriver url http://172.31.45.109:8123 returned response code 500)
   Code: 241. DB::Exception: Memory limit (total) exceeded: would use 14.00 GiB (attempt to allocate chunk of 4259840 bytes), maximum: 13.94 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.: While executing Mer
  compiled Code at ../build/run/airbyte_utils/models/generated/airbyte_incremental/scd/fm_revive/rv_data_summary_Dev_ad_hourly_scd.sql,retryable=<null>,timestamp=1670421696119]]]
2022-12-07 14:01:36 normalization > HTTPDriver url http://172.31.45.109:8123 returned response code 500)
2022-12-07 14:01:36 normalization > Code: 241. DB::Exception: Memory limit (total) exceeded: would use 14.00 GiB (attempt to allocate chunk of 4259840 bytes), maximum: 13.94 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.: While executing MergeSortingTransform. (MEMORY_LIMIT_EXCEEDED) (version 22.11.1.1058 (official build))
2022-12-07 14:01:36 normalization > 
2022-12-07 14:01:36 INFO i.a.w.t.TemporalAttemptExecution(get):162 - Stopping cancellation check scheduling...
2022-12-07 14:01:36 INFO i.a.c.t.TemporalUtils(withBackgroundHeartbeat):283 - Stopping temporal heartbeating...
2022-12-07 14:01:36 INFO i.a.c.i.LineGobbler(voidCall):114 - 
2022-12-07 14:01:36 INFO i.a.c.i.LineGobbler(voidCall):114 - ----- END DEFAULT NORMALIZATION -----
2022-12-07 14:01:36 INFO i.a.c.i.LineGobbler(voidCall):114 - 

The Airbyte server has 8 GB of RAM; the ClickHouse server has 16 GB.
Is it possible to reduce the batch size of the data being pushed into the destination so the sync can complete within 16 GB of RAM? How much server RAM would be needed to sync tables of 10 GB, 50 GB, and so on?
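For reference, here is a sketch of the kind of ClickHouse-side tuning that might let the sort spill to disk instead of aborting; the setting names are standard ClickHouse settings, but the values are untested guesses for a 16 GB server, not something verified against this workload:

    -- Session-level settings (they could also live in a users.xml profile).
    -- Let large ORDER BY / merge sorts spill to disk once they pass ~4 GB
    -- instead of failing with MEMORY_LIMIT_EXCEEDED.
    SET max_bytes_before_external_sort = 4000000000;
    SET max_bytes_before_external_group_by = 4000000000;
    -- Cap a single query below the server total so one dbt model cannot take it all.
    SET max_memory_usage = 10000000000;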

Hello there! You are receiving this message because none of your fellow community members has stepped in to respond to your topic post. (If you are a community member and you are reading this response, feel free to jump in if you have the answer!) As a result, the Community Assistance Team has been made aware of this topic and will be investigating and responding as quickly as possible.
Some important considerations that will help you get your issue solved faster:

  • It is best to use our topic creation template; if you haven’t yet, we recommend posting a follow-up with the requested information. With that information the team will be able to search for similar issues with connectors and the platform and troubleshoot your specific question or problem more quickly.
  • Make sure to upload the complete log file; a common investigation roadblock is that sometimes the error for the issue happens well before the problem is surfaced to the user, and so having the tail of the log is less useful than having the whole log to scan through.
  • Be as descriptive and specific as possible; when investigating it is extremely valuable to know what steps were taken to encounter the issue, what version of connector / platform / Java / Python / docker / k8s was used, etc. The more context supplied, the quicker the investigation can start on your topic and the faster we can drive towards an answer.
  • We in the Community Assistance Team are glad you’ve made yourself part of our community, and we’ll do our best to answer your questions and resolve the problems as quickly as possible. Expect to hear from a specific team member as soon as possible.

Thank you for your time and attention.
Best,
The Community Assistance Team

Hey! I’ve seen a few normalization issues pop up. Could you please give more details on your instance and the connector versions?

Source connector: MySQL - latest
Destination connector: ClickHouse - latest
Airbyte - latest

I pulled everything fresh that same day (Dec 7) onto a new AWS m6 instance with 8 GB of RAM, added those connectors, and ran this sync.
It was a 1 GB table with a datetime column as the cursor and two primary keys. (I tried each of the keys as the cursor, and then the datetime column, which is updated for each record.)
It also failed on a 4 GB table that had a single primary key.
It usually fails with a memory error during merge sorting (MergeSortingTransform), which is what shows up in the logs.
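In case it helps anyone reproducing this, the limits ClickHouse is actually enforcing can be checked with a query along these lines (system.settings is a standard ClickHouse table; which settings matter most here is just my guess):

    SELECT name, value
    FROM system.settings
    WHERE name IN ('max_memory_usage',
                   'max_bytes_before_external_sort',
                   'max_bytes_before_external_group_by');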

Here is a doc on scaling:
https://docs.airbyte.com/operator-guides/scaling-airbyte

The memory section should be helpful:
Memory
As mentioned above, we are mainly concerned with scaling Sync jobs. Within a Sync job, the main memory culprit is the Source worker.
This is because the Source worker reads up to 10,000 records into memory. This can present problems for database sources with tables that have large row sizes; e.g. a table with an average row size of 0.5 MB will require 0.5 * 10,000 / 1,000 = 5 GB of RAM. See this issue for more information.
Our Java connectors currently follow Java’s default behaviour with container memory and will only use up to 1/4 of the host’s allocated memory; e.g. on a Docker agent with 8 GB of RAM configured, a Java connector limits itself to 2 GB of RAM and will see Out-of-Memory exceptions if this goes higher. The same applies to Kubernetes pods. You may want to customize this by setting the JOB_MAIN_CONTAINER_MEMORY_REQUEST and JOB_MAIN_CONTAINER_MEMORY_LIMIT environment variables to custom values.
Note that all Source database connectors are Java connectors. This means that users currently need to over-specify memory resources for Java connectors.
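As a rough sketch of what that last paragraph means in practice for a Docker Compose deployment (the variable names come from the doc above; the values are illustrative and the accepted format may differ between Docker and Kubernetes):

    # Airbyte .env — raise memory for job containers (source / destination / normalization)
    JOB_MAIN_CONTAINER_MEMORY_REQUEST=2g
    JOB_MAIN_CONTAINER_MEMORY_LIMIT=4g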

And a doc on configuring connector resources:
https://docs.airbyte.com/operator-guides/configuring-connector-resources

Let me know if this clears things up for you!