Optimizing Data Pipeline Ingestion


Determining the impact of data volume versus number of tables on data pipeline ingestion performance.



We are ingesting data from a MySQL database in AWS to BigQuery, and some of our pipelines were failing. After splitting them into smaller pipelines with fewer tables, all pipelines worked fine. Therefore my question is:

• What impacts data pipeline ingestion more: the amount of data or the number of tables? I.e., when a pipeline of 50 tables is failing, should we split it into two pipelines of 25 tables each, or should we split it so that each pipeline ingests roughly the same amount of data across its tables?
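If the split ends up being driven by data volume rather than table count, one option is to balance estimated table sizes across pipelines rather than splitting 50/50 by count. Below is a minimal sketch of a greedy balancing approach; the table names and sizes are hypothetical placeholders (in practice, sizes could be estimated from MySQL's `information_schema.TABLES`), and this is not tied to any specific pipeline tool.

```python
def split_by_size(table_sizes, num_pipelines):
    """Greedily assign tables (largest first) to the currently lightest
    pipeline, so each pipeline ingests a similar amount of data."""
    pipelines = [{"tables": [], "bytes": 0} for _ in range(num_pipelines)]
    for name, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        lightest = min(pipelines, key=lambda p: p["bytes"])
        lightest["tables"].append(name)
        lightest["bytes"] += size
    return pipelines

# Hypothetical sizes in bytes, e.g. estimated from information_schema.TABLES
sizes = {"orders": 50_000, "users": 20_000, "events": 80_000, "logs": 10_000}
result = split_by_size(sizes, 2)
```

With these example sizes, splitting by count alone could put the two largest tables in the same pipeline, while the greedy pass keeps the per-pipeline byte totals close.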

This topic has been created from a Slack thread to give it more visibility.


["data-pipeline-ingestion", "mysql-db", "aws", "bigquery", "pipeline-optimization"]