Syncing multiple streams in parallel within the same connection

Summary

The user is looking for a way to sync multiple streams in parallel within the same connection to speed up syncing ads data from marketing platforms like FB, Google, and Bing.


Question

Hi Folks!

I need some help at a fundamental level. I have a solution in which I fetch a lot of ads data from marketing platforms like FB, Google, and Bing. However, when I set up a connection and started syncing it, I realized that all the streams get fetched sequentially. Because of this, even a small ad account takes multiple hours to sync (date range: 6 months).

Can anyone help me make this faster while keeping resource consumption about the same? One workaround I have figured out is to set up a separate connection for each week/month and then run them in parallel. Is there a way to sync multiple streams in parallel inside the same connection?
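A rough sketch of that multi-connection workaround, assuming the platform here is self-hosted Airbyte (the thread's terminology suggests it) and its Configuration API; the base URL and connection IDs are placeholders:

```python
# Sketch: fan out syncs across several connections (e.g. one per month of the
# 6-month range) and collect the job IDs so completion can be checked later.
# Assumes self-hosted Airbyte's Configuration API (POST /api/v1/connections/sync);
# the base URL and connection IDs below are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import requests

AIRBYTE_URL = "http://localhost:8000/api/v1"
CONNECTION_IDS = [
    "conn-fb-2024-01",  # hypothetical IDs, one connection per date slice
    "conn-fb-2024-02",
    "conn-fb-2024-03",
]

def trigger_sync(connection_id: str) -> int:
    """Start a sync for one connection and return its job ID."""
    resp = requests.post(
        f"{AIRBYTE_URL}/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job"]["id"]

# Cap the worker count so the source API's rate limits aren't hammered.
with ThreadPoolExecutor(max_workers=3) as pool:
    job_ids = list(pool.map(trigger_sync, CONNECTION_IDS))

print("started jobs:", job_ids)
```

Splitting by date range also keeps each job small, so a failed slice can be retried cheaply without redoing the whole six months.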



This topic was created from a Slack thread to give it more visibility and is read-only here.

["syncing-multiple-streams", "parallel-sync", "marketing-platforms", "data-syncing", "connection"]

There isn’t anything within the platform that allows for this currently. You could check whether there’s already a feature request for it on GitHub, or file one yourself, as this comes up pretty regularly.

One thing to keep in mind is that you have to be careful when parallelizing syncs like this: some APIs may not play nicely with it in terms of throttling/rate limiting (not to mention the potentially very high resource usage of handling all the streams at once). Even for APIs that do tolerate it, you need to make sure you don’t treat the data as final the moment the first stream finishes, because the others haven’t. Imagine having campaigns but not the ads layer of data, or costs but no performance metrics; that could really skew your reporting. So even if you split the streams up, you’ll need a way to confirm that all of those jobs have completed before you build the data, as in the sketch below.
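For example, a minimal polling loop, again assuming Airbyte's Configuration API (POST /api/v1/jobs/get); the status strings and job IDs are assumptions/placeholders:

```python
# Sketch: poll every job until all have succeeded before building reports, so the
# warehouse never sees campaigns without ads, or costs without performance metrics.
import time

import requests

AIRBYTE_URL = "http://localhost:8000/api/v1"

def wait_for_jobs(job_ids: list[int], poll_seconds: int = 60) -> None:
    """Block until every job succeeds; raise if any fails or is cancelled."""
    pending = set(job_ids)
    while pending:
        for job_id in sorted(pending):
            resp = requests.post(
                f"{AIRBYTE_URL}/jobs/get", json={"id": job_id}, timeout=30
            )
            resp.raise_for_status()
            status = resp.json()["job"]["status"]
            if status == "succeeded":
                pending.discard(job_id)
            elif status in ("failed", "cancelled"):
                raise RuntimeError(f"job {job_id} ended with status {status!r}")
        if pending:
            time.sleep(poll_seconds)

wait_for_jobs([101, 102, 103])  # hypothetical job IDs from the triggering step
print("all jobs done; safe to build reporting tables")
```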

I personally think there’s a use case here, especially when you’re pulling multiple objects that aren’t interrelated, or when you’re using the job-status webhook to trigger your workflows (so you aren’t building your data until all streams are complete anyway; see the sketch below). Just be aware that this is non-trivial from resource-consumption, atomicity, and correctness standpoints.
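A minimal sketch of that webhook-driven approach. The payload shape here is an assumption; check what your platform’s job-status webhook actually sends, and treat the connection IDs and route as placeholders:

```python
# Sketch: a webhook receiver that triggers the downstream build only after every
# expected connection reports a successful sync.
from flask import Flask, request

app = Flask(__name__)

EXPECTED = {"conn-facebook", "conn-google", "conn-bing"}  # hypothetical connection IDs
completed: set[str] = set()

def trigger_downstream_build() -> None:
    """Placeholder for kicking off the reporting build (e.g. a dbt run)."""
    print("all streams landed; building reporting tables")

@app.post("/sync-webhook")
def on_sync_finished():
    payload = request.get_json(force=True)
    # Assumed payload shape: {"connectionId": "...", "status": "succeeded"}
    if payload.get("status") == "succeeded":
        completed.add(payload.get("connectionId"))
    if EXPECTED <= completed:
        completed.clear()
        trigger_downstream_build()
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```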