Recommendations for speeding up HubSpot data pull in Airbyte

Summary

The user is seeking advice on how to reduce the sync duration for HubSpot data pull in Airbyte after experiencing an increase in sync time despite upgrading Airbyte and the HubSpot source version. The user wants to achieve a sync frequency of every 5 minutes and is looking for tips to optimize the process.


Question

Does anyone who uses the HubSpot source have recommendations on how to speed up the data pull? In our staging environment, where we have almost no activity, an incremental sync used to take about 3 min 7 s consistently. After upgrading Airbyte to 1.0 and updating the HubSpot source from v18 to v21, the incremental sync time actually increased to 4 min 37 s, an almost 50% increase, even though the number and size of records synced did not change. In production, a sync takes 30+ hours and does not seem to be getting any faster. I would like to run this sync every 5 min, which should be possible since Fivetran is doing that currently. Are there any tips on reducing sync duration? I don’t think hardware is the issue, as there’s enough RAM/CPU, so the bottleneck seems to be somewhere else. Thank you.



This topic has been created from a Slack thread to give it more visibility.

["hubspot-source", "data-pull", "speed-up", "incremental-sync", "sync-duration", "optimization", "sync-frequency"]

Is there a particular stream that’s slow? Do you have any streams in the sync that are non-incremental (e.g. email_subscriptions)?

The one that takes the longest is contacts_property_history. It seems to run first in every sync, and nothing else even starts until it finishes. In the latest run, it has taken 3 hours to extract 300k records. I think it could potentially go faster than that?

You might need to check the source code to be sure, but I wonder if that is being read as a child stream of Contacts (forcing it to actually read contacts, then the history for each contact).

That can get very slow very fast, and for non-updated contacts it still has to send all those requests.
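To make the cost of that pattern concrete, here is a minimal Python sketch. It is not the HubSpot connector's code: the function names, the page size, and the record counts are all hypothetical, and it only illustrates how a parent/child read scales with the total number of contacts rather than with the number of contacts that changed.

```python
import math


def incremental_requests(changed_records: int, page_size: int) -> int:
    """Requests needed when the API can return only records updated since a cursor."""
    return max(1, math.ceil(changed_records / page_size))


def child_stream_requests(total_parents: int, page_size: int) -> int:
    """Requests needed when every parent is listed, then queried once for its history."""
    parent_pages = math.ceil(total_parents / page_size)
    return parent_pages + total_parents


# Hypothetical numbers, chosen only for illustration.
total_contacts = 300_000
changed_contacts = 500
page_size = 100

print(incremental_requests(changed_contacts, page_size))   # 5
print(child_stream_requests(total_contacts, page_size))    # 303000
```

Under these assumptions, the child-stream read issues the same 303,000 requests on every run, no matter how few contacts changed since the last sync.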

That’s my initial guess, but we’re not currently using history objects for high-volume streams like that.

It also eats up a ton of API requests, meaning you’re more likely to get throttled or rate limited.
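As a rough illustration of why request volume matters, the sketch below computes a lower bound on sync time from a request count and a sustained request rate. Both inputs are assumptions: the request count is hypothetical, and the rate of 10 requests per second is a made-up figure, not HubSpot's actual limit.

```python
def min_sync_seconds(requests_needed: int, requests_per_second: float) -> float:
    """Lower bound on sync duration if requests are capped at a fixed rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return requests_needed / requests_per_second


# Hypothetical inputs: 303,000 requests at an assumed cap of 10 requests/second.
seconds = min_sync_seconds(303_000, 10)
print(round(seconds / 3600, 1))  # 8.4 (hours)
```

This is only a floor: retries and backoff after throttled responses would add to it.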

It shouldn’t be wildly different from Fivetran though, unless the stream is configured differently.

And the more frequently you’re running, the faster it should go (since there won’t be nearly as many changes since the last sync).

Yes, that’s what I was hoping. I was hoping that it would catch up and get faster and faster, but that doesn’t seem to be the case.

Our biggest HubSpot accounts (15 million+ contacts) take a long time, partly due to the number of updates being issued to the contact records by various workflows. But more than that, it’s because we pull the email_events stream, which can be extremely noisy.

What order of magnitude of contact counts are you talking about?

Wow. Whoa. So, first off, thank you for asking! I’ve pinged our Python Sources team to take a look. We don’t like performance regressions.

Secondly, we have some tricks up our sleeve: once we push HubSpot to v5 of the CDK, we should get a lot of performance gains.

I’ve been thinking about how to respond to Justin’s question without revealing too much, but I think I can say that at this point, Fivetran is syncing from the same source every 4 min, whereas Airbyte is still taking 30 hours with no sign of that time coming down. I see a lot of syncs showing ‘Starting…’ under the ‘Latest sync’ column, but they don’t really start for hours. Really hoping that this can be much faster. Currently, my Airbyte instance only runs the HubSpot and Mixpanel syncs, and Mixpanel only runs once an hour, so it shouldn’t impact HubSpot performance.

Given the amount of contacts/data we have, I wouldn’t expect the syncs to take 30 hours.