Memory / Disk: 128 GB RAM / 100 GB SSD / 8 CPU (I started with 32 GB and increased it since that was never enough for Airbyte)
Deployment: Clean server on AWS & Git clone the repo & docker-compose up
Airbyte Version: 0.40.25
Source name/version: Shopify 0.3.0
Destination name/version: Postgres 0.3.26
Step: During sync
We have 2 Shopify stores that we sync to Postgres. One store has ~7,000 orders and ~7,000 customers. The second store has ~150,000 orders and ~150,000 customers.
During the sync of some specific streams, memory consumption would max out and kill the server.
We increased the memory enough to solve the issue for the smaller shop. But for our bigger store, I realized this approach just doesn't make sense, and it also doesn't really work (unless I provision a crazy amount of RAM, which would cost a fortune and would not be justifiable just for syncing two stores).
The problematic streams seem to be all the streams that run per object.
For example: transactions, order_refunds, order_risks, fulfillment_orders
For the bigger shop, as I understand it, this means the sync job will iterate over all ~150,000 orders.
The same is true for customer-related streams such as metafield_customers.
I tried syncing only one stream per job, the transactions stream. The server's entire 128 GB of RAM filled up by the time Airbyte had read about 100,000 records from that stream. Memory usage always builds up until it kills the server.
So I could possibly solve this by increasing the memory even more. But again, that does not make sense to me, and it is very expensive!
Am I missing something?
I understand that Airbyte reads 10,000 records into memory before it flushes, and I think I see that working well for the orders and customers streams. But why doesn't it do the same for the other streams I mentioned?
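To illustrate my suspicion about why the per-object streams behave differently: if a child stream materializes the whole parent stream before yielding anything, memory grows with the number of parents, whereas a generator that yields one slice at a time stays flat. This is a hypothetical sketch, not actual connector code — the function names are mine:

```python
# Hypothetical sketch: how a substream *could* produce one slice per
# parent record (e.g. one transactions request per order), contrasting
# a buffered implementation with a streaming one. Not connector code.

def parent_records(n):
    """Stand-in for paginated reads of a parent stream like orders."""
    for i in range(n):
        yield {"id": i}

def slices_buffered(n):
    """Accumulates every parent record first: memory grows with n."""
    parents = [p for p in parent_records(n)]  # whole parent stream in RAM
    for p in parents:
        yield {"order_id": p["id"]}

def slices_streamed(n):
    """Yields one slice per parent as it arrives: constant memory."""
    for p in parent_records(n):
        yield {"order_id": p["id"]}
```

Both produce identical slices; only the peak memory differs, which would match what I'm seeing: orders and customers sync fine, but the per-order streams blow up on the 150,000-order store.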
Hello there! You are receiving this message because none of your fellow community members has stepped in to respond to your topic post. (If you are a community member and you are reading this response, feel free to jump in if you have the answer!) As a result, the Community Assistance Team has been made aware of this topic and will be investigating and responding as quickly as possible.
Some important considerations that will help you get your issue solved faster:
It is best to use our topic creation template; if you haven't yet, we recommend posting a followup with the requested information. With that information the team will be able to more quickly search for similar issues with connectors and the platform, and more quickly troubleshoot your specific question or problem.
Make sure to upload the complete log file; a common investigation roadblock is that sometimes the error for the issue happens well before the problem is surfaced to the user, and so having the tail of the log is less useful than having the whole log to scan through.
Be as descriptive and specific as possible; when investigating it is extremely valuable to know what steps were taken to encounter the issue, what version of connector / platform / Java / Python / docker / k8s was used, etc. The more context supplied, the quicker the investigation can start on your topic and the faster we can drive towards an answer.
We in the Community Assistance Team are glad you’ve made yourself part of our community, and we’ll do our best to answer your questions and resolve the problems as quickly as possible. Expect to hear from a specific team member as soon as possible.
Thank you for your time and attention.
The Community Assistance Team
Hey, I actually see this problem too. My deployment is on k8s, and it looks like the source sync images are not memory efficient. I assume the objects are accumulated into some in-memory buffer before being flushed to the destination, instead of using streaming behaviour. This is not ideal. I think the best thing I can do is look at the source code of the Shopify connector and see whether this is a Shopify-specific problem that can possibly be fixed, or a global issue with how Airbyte works…
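Before diving into the connector source, the buffering hypothesis is easy to sanity-check locally with the standard library's tracemalloc: compare the peak traced memory of a consumer that materializes all records against one that streams them. Everything here is illustrative, not connector code:

```python
# Quick local check of the "in-memory buffer" hypothesis: peak traced
# memory of a buffered consumer vs a streaming one over the same data.
import tracemalloc

def records(n):
    """Stand-in for records read from a source stream."""
    for i in range(n):
        yield {"id": i, "payload": "x" * 100}

def peak_bytes(consume, n):
    """Run consume(n) and return peak traced memory in bytes."""
    tracemalloc.start()
    consume(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def buffered(n):
    all_records = list(records(n))  # every record alive at once
    for r in all_records:
        pass

def streamed(n):
    for r in records(n):  # one record alive at a time
        pass

buf_peak = peak_bytes(buffered, 50_000)
stream_peak = peak_bytes(streamed, 50_000)
```

On my machine the buffered peak is orders of magnitude larger than the streaming peak, which is the shape of growth I'd expect to see in the source pod if records are being accumulated rather than streamed.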
Encountering the same issue when trying to do an initial sync of a Shopify store. Investigating the source code leads me to wonder whether the streams that utilize a parent stream effectively cause the parent stream to stay loaded in memory. If that's the case, one could work around this issue by syncing streams grouped by their parent stream — e.g. Orders, Order Refunds, Order Risks, Transactions, and MetaField Orders all appear to have Orders as a parent stream.
For what it's worth, I pared my sync down to abandoned checkouts, orders, and transactions. Abandoned Checkouts and Orders both executed with minimal memory usage. Then Transactions executed, and I watched the memory usage in the source pod increase linearly until it ran out of memory. It was also surprisingly slow: logs were showing just 1-2 orders processed a second, despite (from what I can tell) not needing to hit the Shopify API for each one. I wonder if the ShopifySubstream class is inefficient from both a memory and a CPU standpoint, but I cannot tell at first glance why.
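One pattern that would explain both symptoms at once (linear memory growth *and* slow per-record throughput) is a substream that appends every parent record to a growing list and also does per-record work against that list. This is purely a speculative sketch of such an anti-pattern, with made-up names, to show why I want to look at ShopifySubstream more closely:

```python
# Speculative anti-pattern sketch (made-up names, not connector code):
# a substream that both retains all parent records (O(n) memory) and
# scans the retained list per record (O(n^2) time overall).

def read_substream_bad(parents):
    seen = []                      # grows for the whole sync
    out = []
    for p in parents:
        seen.append(p)
        # per-record scan over everything seen so far: gets slower
        # and slower as the sync progresses
        dup = sum(1 for s in seen if s["id"] == p["id"])
        out.append({"parent_id": p["id"], "dup": dup})
    return out

def read_substream_ok(parents):
    out = []
    seen_ids = set()               # constant-ish memory per id
    for p in parents:
        dup = 2 if p["id"] in seen_ids else 1
        seen_ids.add(p["id"])
        out.append({"parent_id": p["id"], "dup": dup})
    return out
```

If something like the first shape exists in the real code, throughput would degrade as the order count grows, which would match the 1-2 records/second I observed on a large store.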