OOM on Pubsub destination using HubSpot source

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: GKE, n2-standard-16
  • Memory / Disk: 100 GB
  • Deployment: Kubernetes
  • Airbyte Version: 0.35.66-alpha
  • Source name/version: source-hubspot 0.1.52
  • Destination name/version: destination-pubsub 0.1.4
  • Step: Initial/First sync failure on very large dataset
  • Description:
  • We are trying to sync a very large HubSpot account with over 3 million contacts (other streams are also big, but contacts is the largest).
  • The source syncs the companies stream fine and moves on to contacts; however, at some point the destination appears to run out of memory and exit, making the sync fail.
  • We have tried bumping the worker’s resource allocation all the way up to 32 GB with the same result (another run with 54 GB is in progress).
  • We are trying to work out whether it’s a leak somewhere or a code path accumulating all the data before pushing it out to the destination. Should the Pub/Sub destination extend the CommitOnState consumer instead of the FailureTracking consumer? Or should the source be sending periodic state messages so the data gets flushed out instead of accumulating? If it’s neither, there is definitely a leak, and any pointers on where to look/fix will be appreciated. Happy to send a PR; a rough sketch of the flush-on-state idea we have in mind is below.
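
For illustration, a minimal sketch of what we mean by flushing on state, assuming simplified stand-in message types and a hypothetical publish() call in place of the real Pub/Sub client (this is not the actual Airbyte CDK API, just the shape of the behaviour): buffer records, flush the buffer when a STATE message arrives or when a size cap is hit, and only emit the state once its records are published.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a commit-on-state consumer; names and types are illustrative only.
public class CommitOnStateSketch {

  // Simplified stand-ins for Airbyte record/state messages.
  record RecordMessage(String streamName, String jsonData) {}
  record StateMessage(String jsonState) {}

  private static final int MAX_BUFFER = 10_000; // size cap keeps memory bounded even without STATE messages

  private final List<RecordMessage> buffer = new ArrayList<>();

  // Called for every record the source emits.
  public void acceptRecord(RecordMessage msg) {
    buffer.add(msg);
    if (buffer.size() >= MAX_BUFFER) {
      flush();
    }
  }

  // Called whenever the source emits a STATE message.
  public void acceptState(StateMessage state) {
    flush();          // everything before this state has now been published
    emitState(state); // safe to let the platform persist the checkpoint
  }

  private void flush() {
    for (RecordMessage msg : buffer) {
      publish(msg.streamName(), msg.jsonData()); // stand-in for the real Pub/Sub publish call
    }
    buffer.clear();
  }

  // Hypothetical stubs standing in for the Pub/Sub client and the platform's state output.
  private void publish(String stream, String data) {}

  private void emitState(StateMessage state) {}
}
```

The key point is that the buffer is bounded: records are written out either when a state message arrives or when the cap is hit, so memory use stays flat regardless of how many contacts the stream holds.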

If you sync a smaller stream, or less data over a shorter period, does the sync work?

Yeah, it finished companies (around 200k) before moving on to contacts. Since this is the first sync we can’t really reduce it further; we need those contacts processed as well. To make sure the initial sync succeeds, we disabled the schedule and are running it manually until it completes.

Splitting the connection between the larger streams is an option, Prasanna. Also, if you want further investigation, you can open a GitHub issue adding more info about how many records, etc.

With the 54 GB configuration, it managed to commit a new state, even though all 3 sync attempts ended in failure. This is not a problem anymore because incremental syncs deal with a significantly smaller number of records, but it could happen again if we needed to sync another large HubSpot account.

For other sources, we have split the connection to sync big streams in isolation, but we have seen rate-limit issues, as the split connections may end up running in parallel. Marcos, do you know if there’s any way of working around that?

I’ll ask the connector team, Luis.