Scaling Applications on Kubernetes

Summary

User seeks guidance on designing a scalable application on Kubernetes that can handle parallel processing tasks triggered by multiple users, and inquires about the capability of Kubernetes to manage parallel loops across multiple pods.


Question

Hi Community.

I have been tasked to build an application that can run on Kubernetes and be as scalable as possible.

Basically the challenge is that the application should receive requests to perform certain processing tasks, and these tasks are basically a group of queries that run against a database.

Now there are two things: these queries should run in parallel as much as possible, and different users might trigger many processes at the same time.

Considering the bottleneck is not the database…

What is the best design to ensure that the application scales in Kubernetes by managing new pods etc?

If I am running a parallel loop, let's say in Python, can Kubernetes break it down into multiple pods?
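For concreteness, here is roughly the kind of parallel loop meant here, sketched with Python's standard library (`run_query` is a placeholder for one of the real database queries):

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(query):
    # Placeholder for a real database query; here it just echoes the input.
    return f"result of {query}"

queries = ["q1", "q2", "q3", "q4"]

# Run the queries in parallel -- but within a single process (a single pod).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_query, queries))

print(results)
```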

Thank you for the help as I try to learn these new concepts in app design.



This topic has been created from a Slack thread to give it more visibility.
It is in read-only mode here.
['kubernetes', 'scalability', 'parallel-processing', 'pods-management', 'application-design']

Since this community is dedicated to the Airbyte ETL platform, I’m not sure you’ll get a lot of feedback here as it’s not product-related. You may want to ask something like this in one of the more Kubernetes-focused communities or forums.

With that said, I think you’re thinking about it wrong. When working on a task-oriented workflow, you likely don’t want the overhead of idle infrastructure from a cost standpoint, and the workers likely don’t need a lot of shared components.

But also, and probably more importantly: solve the first problems first and know them inside and out: how are you getting and queueing these jobs? How will you ensure that a job is only processed once? How will you know how much concurrency to allow? What happens with the results of the queries? How quickly do they need to run? Can they be batch workloads? Are there realtime/streaming workloads? What are the cost constraints for the business on this system? Is cost more important than performance? What about reliability? (these are all just hypothetical questions, but things you need to think through and that someone would need an idea of to make robust recommendations)

These are pretty well-defined issue areas in data, so I’d do some research before jumping to solutioneering.

If something like Google Cloud Run or Cloud Functions meets your requirements, it would likely be more cost-effective because it can scale to zero and you don't need to manage any of the infrastructure. My rule of thumb is to decide what problem I'm solving and keep the infrastructure footprint as low as possible. So favor the simplest solution. In this case, if you can get by with the limitations of Cloud Functions (which most can), use that. If you need things like custom Docker images or more in the way of persistent disk, use Cloud Run (technically these are in the same family now; it's about how much customization you have, with the tradeoff of cost and complexity). If you need even more control but want to automate scaling, look for something like GKE Autopilot. Eliminate the management of what you can :slightly_smiling_face:

You may find that all you need is a Pub/Sub queue and a Cloud Function that automatically grabs each new job; you can set your concurrency limits and call it a day without deploying or managing any infrastructure.
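As a rough local sketch of that pattern (using Python's standard library as a stand-in for a managed queue; `handle_job` and the job payloads are made up), the key idea is a fixed number of workers pulling from a shared queue, which caps concurrency regardless of how many jobs arrive:

```python
import queue
import threading

MAX_CONCURRENCY = 3   # the concurrency limit you'd configure on the service
jobs = queue.Queue()

def handle_job(job):
    # Placeholder for the real work: run the job's queries, store results.
    return job * 2

results = []
results_lock = threading.Lock()

def worker():
    while True:
        job = jobs.get()
        if job is None:          # sentinel: no more work for this worker
            jobs.task_done()
            return
        out = handle_job(job)
        with results_lock:       # results list is shared across workers
            results.append(out)
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(MAX_CONCURRENCY)]
for t in threads:
    t.start()

for job in range(10):            # enqueue ten jobs
    jobs.put(job)
for _ in threads:
    jobs.put(None)               # one sentinel per worker to shut down cleanly

jobs.join()
print(sorted(results))           # each job processed exactly once
```

A managed service like Pub/Sub plays the role of `jobs` here, and the subscriber's flow-control settings play the role of `MAX_CONCURRENCY`.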

But there are many types of task queues, frequent needs for dependencies between jobs, and other criteria you'll need to weigh when choosing your tooling. You need to get those baselines together before anyone can really help you effectively, because they can only understand your problem as well as you do.

The right solution also depends on where your data is hosted. If it’s Google Cloud Platform (e.g. BigQuery), my recommendations above stand. If the data is S3/Redshift, it may look more like AWS Lambda. If it’s not in the cloud, things get a bit trickier because Cloud-based IaaS and PaaS offerings will likely reduce your costs, but may have undesirable trade-offs in terms of performance because of the latency between those platforms and your data.

I love Kubernetes, but management of k8s is not for the faint of heart, and if you’re not already pretty deeply knowledgeable it can be a lot of work to maintain and secure (especially when there’s PII and compliance requirements involved). So don’t let that discourage you, but make sure you count the cost of your time.

I’ve literally watched at least a dozen engineers burn months of their time creating fragile systems that effectively reproduced things like Pub/Sub queues, or building a query task runner when their cloud data warehouse already had one built in.

Work smarter, not harder :slightly_smiling_face:

I am quite familiar with cloud and could solve it there easily. The problem is that this solution is meant to work only on-premises, though it should work similarly to a SaaS because the license management etc. will be in the cloud.

I guess I will try to use RabbitMQ and see how it works. Many of these components are new to me, so it will take some time to get there, but eventually I will :slight_smile:
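A minimal RabbitMQ worker along those lines might look like the following. It assumes the `pika` client library and a broker on localhost, and `process_task` is a stand-in for the real group of queries; none of these names come from the thread:

```python
import json

def process_task(payload):
    # Stand-in for the real work: run the task's queries against the database.
    return {"task_id": payload["task_id"], "status": "done"}

def main():
    # Requires a running RabbitMQ broker and the pika client library.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tasks", durable=True)

    # Deliver at most one unacknowledged message per worker, so scaling out
    # workers (or pods) spreads tasks evenly instead of flooding one consumer.
    channel.basic_qos(prefetch_count=1)

    def on_message(ch, method, properties, body):
        result = process_task(json.loads(body))
        print(result)
        # Ack only after success; an unacked message is redelivered if the
        # worker dies, which is how you avoid losing tasks.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="tasks", on_message_callback=on_message)
    channel.start_consuming()

# main()  # uncomment to start consuming (needs a live broker)
```

Scaling this on Kubernetes then becomes a matter of running more replicas of this worker; the queue, not the loop, is what gets split across pods.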

Apologies for posting this question in the Airbyte group. I thought I was in the Kubernetes one :slight_smile:

Thanks for your answer and your dedication to help.
