OOM on Pubsub destination using HubSpot source

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: GKE, n2-standard-16
  • Memory / Disk: 100 GB
  • Deployment: Kubernetes
  • Airbyte Version: 0.35.66-alpha
  • Source name/version: source-hubspot 0.1.52
  • Destination name/version: destination-pubsub 0.1.4
  • Step: Initial/First sync failure on very large dataset
  • Description:
  • We are trying to sync a very large HubSpot account with over 3 million contacts (other streams are also big, but contacts is the largest).
  • The source syncs the companies stream fine and moves on to contacts; however, at some point the destination appears to run out of memory and exit, making the sync fail.
  • We have tried bumping the worker’s resource allocation all the way up to 32 GB with the same result (another run with 54 GB is in progress).
  • We are trying to work out whether it’s a leak somewhere or a code path accumulating all the data before pushing it out to the destination. Should the Pub/Sub destination extend the CommitOnState consumer instead of the FailureTracking consumer? Or should the source be sending periodic state messages so the data gets flushed out instead of accumulating? If it’s neither, there is definitely a leak, and any pointers on where to look/fix will be appreciated. Happy to send a PR; a rough sketch of the flush-on-state idea we have in mind is below.
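
For illustration, a minimal sketch of what we mean by flushing on state, assuming simplified stand-in message types and a hypothetical publish() call in place of the real Pub/Sub client (this is not the actual Airbyte CDK API, just the shape of the behaviour): buffer records, flush the buffer when a STATE message arrives or when a size cap is hit, and only emit the state once its records are published.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a commit-on-state consumer; names and types are illustrative only.
public class CommitOnStateSketch {

  // Simplified stand-ins for Airbyte record/state messages.
  record RecordMessage(String streamName, String jsonData) {}
  record StateMessage(String jsonState) {}

  private static final int MAX_BUFFER = 10_000; // size cap keeps memory bounded even without STATE messages

  private final List<RecordMessage> buffer = new ArrayList<>();

  // Called for every record the source emits.
  public void acceptRecord(RecordMessage msg) {
    buffer.add(msg);
    if (buffer.size() >= MAX_BUFFER) {
      flush();
    }
  }

  // Called whenever the source emits a STATE message.
  public void acceptState(StateMessage state) {
    flush();          // everything before this state has now been published
    emitState(state); // safe to let the platform persist the checkpoint
  }

  private void flush() {
    for (RecordMessage msg : buffer) {
      publish(msg.streamName(), msg.jsonData()); // stand-in for the real Pub/Sub publish call
    }
    buffer.clear();
  }

  // Hypothetical stubs standing in for the Pub/Sub client and the platform's state output.
  private void publish(String stream, String data) {}

  private void emitState(StateMessage state) {}
}
```

The key point is that the buffer is bounded: records are written out either when a state message arrives or when the cap is hit, so memory use stays flat regardless of how many contacts the stream holds.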

If you sync a smaller stream, or less data over a shorter period, does the sync work?

Yeah, it finished companies (around 200k) before moving on to contacts. Since this is the first sync we can’t really reduce it further; we need those contacts processed as well. To make sure the initial sync succeeds, we disabled the schedule and are running it manually until it completes.

Splitting the connection between the larger streams is an option, Prasanna. Also, if you want further investigation, you can open a GitHub issue adding more info about how many records, etc.

With the 54 GB configuration, it managed to commit a new state, even though all 3 sync attempts ended in failure. This is not a problem anymore because incremental syncs deal with a significantly smaller number of records, but it could happen again if we needed to sync another large HubSpot account.

For other sources, we have split the connection to sync big streams in isolation, but we have seen rate-limit issues, as the split connections may end up running in parallel. Marcos, do you know if there’s any way of working around that?

I’ll ask the connector team, Luis.