Running Multiple Parallel Streams and Starting Sync for Specific Streams in Airbyte

slack-user-airbyte · May 14, 2024, 4:25pm

Summary

How to configure Airbyte to run multiple parallel streams and start sync for specific streams within a connection

Question

Hi folks… how can I make Airbyte run multiple streams in parallel? I have Airbyte OSS running, which we deployed to AWS EKS via the Airbyte Helm charts with default values. I configured 2 streams and they run one after the other, not in parallel.

Also, is there any way to start a sync for just one of the streams in a connection rather than all of them at once? We have some API endpoints which need to be refreshed every 24 hours and some every 2 hours. Is there any way we can achieve this?

This topic has been created from a Slack thread to give it more visibility.

slack-user-airbyte · May 14, 2024, 7:09pm

Airbyte allows you to configure the maximum number of each type of worker allowed to run in parallel. This can be done by setting environment variables. For your case, you might want to adjust the MAX_SYNC_WORKERS environment variable, which controls the maximum number of Sync workers allowed to run in parallel. The default value for this variable is currently set to 5.

Here are the environment variables you can tweak to increase the workload of your Airbyte instance:

- MAX_SPEC_WORKERS: Maximum number of Spec workers allowed to run in parallel.
- MAX_CHECK_WORKERS: Maximum number of Check connection workers allowed to run in parallel.
- MAX_DISCOVERY_WORKERS: Maximum number of Discovery workers allowed to run in parallel.
- MAX_SYNC_WORKERS: Maximum number of Sync workers allowed to run in parallel.

You can find more details on configuring jobs & workers in the Airbyte documentation. Please note that these configurations should be done carefully, as increasing the number of workers might put more load on your system. Make sure your system can handle the increased load before making these changes.

slack-user-airbyte · May 14, 2024, 7:09pm

For now, configuring separate connections with selected streams might be a way to go
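
One way to plan the split suggested above is to group streams by how often they need to refresh, so that each group becomes its own connection with its own schedule. The following is only a planning sketch in plain Python: the stream names and the `plan_connections` helper are hypothetical, and it does not call any Airbyte API.

```python
from collections import defaultdict

# Hypothetical stream names and refresh intervals in hours (not from the thread).
STREAM_INTERVALS_HOURS = {
    "orders": 2,
    "inventory": 2,
    "customers": 24,
    "products": 24,
}


def plan_connections(stream_intervals):
    """Group streams that share a refresh interval into one planned connection."""
    groups = defaultdict(list)
    for stream, hours in stream_intervals.items():
        groups[hours].append(stream)
    # One planned connection per distinct interval, shortest interval first.
    return [
        {"name": f"every-{hours}h", "interval_hours": hours, "streams": sorted(streams)}
        for hours, streams in sorted(groups.items())
    ]


plan = plan_connections(STREAM_INTERVALS_HOURS)
for connection in plan:
    print(connection)
```

Grouping by interval keeps the number of connections equal to the number of distinct schedules rather than the number of streams. If the streams also need to run in parallel with each other, each one would need its own connection instead.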

slack-user-airbyte · May 14, 2024, 7:09pm

Thanks for the quick response <@U05JENRCF7C>… we have 100s of partners, each with different API endpoints (min 5, max 20) for data ingestion, currently across 6 products. So we will have at least 100 x 5 (min endpoints) x 6 = 3000 connections at minimum scale. Can Airbyte handle this scale? I mean, it's going to keep increasing.
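
The estimate in the post above can be reproduced with a few lines of Python. This sketch assumes exactly 100 partners (the post says "100s", so this is a lower bound) and one connection per endpoint per product, which is how the post multiplies the figures. The `connection_count` helper is just illustrative.

```python
def connection_count(partners, endpoints_per_partner, products):
    """One connection per partner endpoint per product."""
    return partners * endpoints_per_partner * products


PARTNERS = 100  # assumption: "100s of partners" taken as exactly 100
PRODUCTS = 6

low = connection_count(PARTNERS, 5, PRODUCTS)    # every partner at the 5-endpoint minimum
high = connection_count(PARTNERS, 20, PRODUCTS)  # every partner at the 20-endpoint maximum

print(f"connections: {low} to {high}")
```

Under the same assumptions, the upper end of the range is four times the 3000 figure quoted in the post.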

slack-user-airbyte · May 14, 2024, 7:09pm

You need to check this on your own.
In my project I have only ~60 connections, but I have a few tips for you:

• check the architecture diagram: https://docs.airbyte.com/understanding-airbyte/high-level-view
• you need to configure enough workers for your workload (there is no easy answer for how many; you need to observe memory/CPU use and adjust accordingly)
• spread all connections throughout the day, so you will have a lower number of simultaneous connections
• use cluster autoscaling
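
The tip about spreading connections throughout the day can be sketched as evenly spaced start offsets within one scheduling interval. The numbers below are assumptions for illustration: 60 connections (matching the ~60 mentioned above), a 24-hour interval, and a hypothetical 30-minute sync duration. Both helper functions are made up for this example and are not part of Airbyte.

```python
import math
from datetime import timedelta


def staggered_offsets(num_connections, interval):
    """Return evenly spaced start offsets within one scheduling interval."""
    step = interval / num_connections
    return [step * i for i in range(num_connections)]


def estimated_concurrency(num_connections, interval, sync_duration):
    """Rough count of syncs running at once when starts are evenly spaced."""
    step = interval / num_connections
    return min(num_connections, math.ceil(sync_duration / step))


# Assumed figures: 60 connections, daily schedule, 30-minute syncs.
offsets = staggered_offsets(60, timedelta(hours=24))
peak = estimated_concurrency(60, timedelta(hours=24), timedelta(minutes=30))

print("first start offsets:", offsets[:3])
print("estimated syncs running at once:", peak)
```

Comparing the estimated concurrency with the number of sync workers available gives a first guess at whether the workers are sized sensibly. Real sync durations vary, so observed memory/CPU use should still drive the final numbers.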

slack-user-airbyte · May 14, 2024, 7:10pm

Hi All / <@U05JENRCF7C>… I have scaled up my deployments (temporal from 1 to 4 and workers from 1 to 4), but there is still no parallelization within one connection… One connection has 3 streams which are not interdependent, and I am hoping that we can run them in parallel. By the way, I tried looking for the MAX_SPEC_WORKERS/MAX_SYNC_WORKERS parameters but can’t find them in the values file…

slack-user-airbyte · May 14, 2024, 7:10pm

Laks, connectors today don’t run in parallel. If you need parallelization, you need to create separate connections so that each stream/table/endpoint runs as its own.

slack-user-airbyte · May 14, 2024, 7:10pm

Oh… ok… thanks <@U01MMSDJGC9>… just curious: if I have to make every endpoint a new connection, I might end up with a few thousand connections. Can/will Airbyte handle such a load?
