Summary
How to configure Airbyte to run multiple parallel streams and start sync for specific streams within a connection
Question
Hi folks, how can I make Airbyte run multiple streams in parallel? I have Airbyte OSS running, deployed via the Airbyte Helm charts with default values on AWS EKS. I configured 2 streams and they run one after the other, not in parallel.
Also, is there any way to start a sync for just one of the streams in a connection rather than all of them at once? We have some API endpoints that need to be refreshed every 24 hours and some every 2 hours. Is there any way to achieve this?
This topic has been created from a Slack thread to give it more visibility.
["airbyte", "multiple-parallel-streams", "sync", "specific-streams", "api-endpoints", "refresh", "aws-eks", "helm-charts"]
Airbyte allows you to configure the maximum number of each type of worker allowed to run in parallel. This can be done by setting environment variables. For your case, you might want to adjust the MAX_SYNC_WORKERS environment variable, which controls the maximum number of Sync workers allowed to run in parallel. The default value for this variable is currently set to 5. Here are the environment variables you can tweak to increase the workload of your Airbyte instance:
- MAX_SPEC_WORKERS: Maximum number of Spec workers allowed to run in parallel.
- MAX_CHECK_WORKERS: Maximum number of Check connection workers allowed to run in parallel.
- MAX_DISCOVERY_WORKERS: Maximum number of Discovery workers allowed to run in parallel.
- MAX_SYNC_WORKERS: Maximum number of Sync workers allowed to run in parallel.
You can find more details on configuring jobs & workers in the Airbyte documentation. Please note that these configurations should be done carefully, as increasing the number of workers might put more load on your system. Make sure your system can handle the increased load before making these changes.
For now, configuring separate connections, each with only the selected streams, might be the way to go.
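As a rough illustration of that idea, the sketch below groups streams by how often they need refreshing, so each group could become its own connection with its own schedule. The stream names and the `plan_connections` helper are hypothetical, and the output is only a planning structure, not an Airbyte API payload.

```python
from collections import defaultdict

# Hypothetical stream names mapped to refresh intervals in hours.
STREAM_REFRESH_HOURS = {
    "orders": 2,
    "inventory": 2,
    "customers": 24,
    "products": 24,
}


def plan_connections(stream_hours):
    """Group streams by refresh interval: one planned connection per interval."""
    groups = defaultdict(list)
    for stream, hours in stream_hours.items():
        groups[hours].append(stream)
    # Sort by interval so the plan is stable and easy to read.
    return [
        {"name": f"every-{hours}h", "interval_hours": hours, "streams": sorted(streams)}
        for hours, streams in sorted(groups.items())
    ]


if __name__ == "__main__":
    for plan in plan_connections(STREAM_REFRESH_HOURS):
        print(plan)
```

Each planned group would then be set up as a separate connection that selects only those streams and uses the matching schedule.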
Thanks for the quick response. We have hundreds of partners, each with different API endpoints (min 5, max 20) for our data ingestion requirement, currently across 6 products. So we will have at least 100 x 5 (min endpoints) x 6 = 3000 connections. Can Airbyte handle this scale? It is only going to keep increasing.
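The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes exactly 100 partners as the lower bound and one connection per partner, endpoint, and product combination, and the `connection_count` helper is hypothetical.

```python
def connection_count(partners, endpoints_per_partner, products):
    """Connections needed if every partner/endpoint/product combination gets its own."""
    return partners * endpoints_per_partner * products


# Figures from the message: 100 partners (lower bound), 5 to 20 endpoints, 6 products.
minimum = connection_count(100, 5, 6)
maximum = connection_count(100, 20, 6)

if __name__ == "__main__":
    print(f"minimum: {minimum}, maximum: {maximum}")
```

Under the same assumptions, the upper end (20 endpoints per partner) works out to 12000 connections.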
You need to check this on your own.
In my project I have only ~60 connections, but I have a few tips for you:
• check architecture diagram https://docs.airbyte.com/understanding-airbyte/high-level-view
• you need to configure enough workers for your workload (there is no easy answer for how many; you need to observe memory/CPU usage and adjust accordingly)
• spread all connections throughout the day, so you will have a lower number of simultaneous connections
• use cluster autoscaling
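The tip about spreading connections throughout the day can be sketched as follows. This is a minimal example, assuming daily syncs and evenly spaced start times; the connection names and the `staggered_start_times` helper are hypothetical, and turning the resulting times into actual connection schedules is left out.

```python
def staggered_start_times(connection_names, window_minutes=24 * 60):
    """Assign each connection an evenly spaced start time (HH:MM) within the window."""
    if not connection_names:
        return {}
    # Gap between consecutive start times, in minutes.
    step = window_minutes / len(connection_names)
    schedule = {}
    for index, name in enumerate(connection_names):
        offset = int(index * step)  # minutes after the start of the window
        schedule[name] = f"{offset // 60:02d}:{offset % 60:02d}"
    return schedule


if __name__ == "__main__":
    # Hypothetical connection names.
    names = ["partner-a", "partner-b", "partner-c", "partner-d"]
    for name, start in staggered_start_times(names).items():
        print(name, start)
```

With four connections this yields starts at 00:00, 06:00, 12:00, and 18:00, so no two syncs begin at the same moment.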
Hi all, I have scaled up my deployments, Temporal from 1 to 4 and workers from 1 to 4, but there is still no parallelization within one connection. One connection has 3 streams which are not interdependent, and I was hoping we could run them in parallel. By the way, I tried looking for the MAX_SPEC_WORKERS/MAX_SYNC_WORKERS parameters but can't find them in the values file.
Laks, connectors today don't run streams in parallel. If you need parallelization, you need to create separate connections so that each stream/table/endpoint runs as its own connection.
Oh, OK, thanks. Just curious: if I have to make every endpoint a new connection, I might end up with a few thousand connections. Can Airbyte handle such a load?