Source E2E-Test: Problem scaling EC2

stellachung · June 22, 2022, 11:01pm

Is this your first time deploying Airbyte?: No
OS Version / Instance: Ubuntu
Memory / Disk: c5.9xlarge
Deployment: EC2
Airbyte Version: 0.39.9-alpha
Source name/version: airbyte/source-e2e-test:2.1.0
Destination name/version: airbyte/destination-e2e-test:0.2.2
Step: Running E2E syncs
Description:

We are running ~250 E2E connections on our airbyte EC2 instance and ~190 of the jobs are hanging. We have our env set to:

MAX_SYNC_WORKERS=15
MAX_SPEC_WORKERS=15
MAX_CHECK_WORKERS=15
MAX_DISCOVER_WORKERS=15

Here is an example of the logs for a hanging job:

2022-06-22 19:56:37 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):108 - Docker volume job log path: /tmp/workspace/606/0/logs.log
2022-06-22 19:56:38 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):113 - Executing worker wrapper. Airbyte version: 0.39.9-alpha
2022-06-22 19:56:41 e[32mINFOe[m i.a.c.i.LineGobbler(voidCall):82 - Checking if airbyte/source-e2e-test:2.1.0 exists...
2022-06-22 19:56:41 e[32mINFOe[m i.a.c.i.LineGobbler(voidCall):82 - airbyte/source-e2e-test:2.1.0 was found locally.
2022-06-22 19:56:41 e[32mINFOe[m i.a.w.p.DockerProcessFactory(create):108 - Creating docker job ID: 606
2022-06-22 19:56:41 e[32mINFOe[m i.a.w.p.DockerProcessFactory(create):163 - Preparing command: docker run --rm --init -i -w /data/606/0 --log-driver none --name source-e2e-test-check-606-0-lymui --network host -v airbyte_workspace:/data -v /tmp/airbyte_local:/local -e WORKER_CONNECTOR_IMAGE=airbyte/source-e2e-test:2.1.0 -e WORKER_JOB_ATTEMPT=0 -e AIRBYTE_ROLE= -e WORKER_ENVIRONMENT=DOCKER -e AIRBYTE_VERSION=0.39.9-alpha -e WORKER_JOB_ID=606 airbyte/source-e2e-test:2.1.0 check --config source_config.json
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):194 - Running sync worker cancellation...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):198 - Interrupting worker thread...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):201 - Cancelling completable future...
2022-06-22 19:57:11 e[33mWARNe[m i.a.w.t.CancellationHandler$TemporalCancellationHandler(checkAndHandleCancellation):53 - Job either timed out or was cancelled.
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):134 - Stopping cancellation check scheduling...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getWorkerThread$2):161 - Completing future exceptionally...
io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:86) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.exception.WorkerException: Error checking connection, status: Optional.empty, exit code: 143
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:82) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	... 3 more
2022-06-22 19:57:11 e[33mWARNe[m i.t.i.a.POJOActivityTaskHandler(activityFailureToResult):307 - Activity failure. ActivityId=9404a889-fd1b-361e-93b3-53ce628318df, activityType=Run, attempt=1
java.util.concurrent.CancellationException: null
	at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478) ~[?:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$3(TemporalAttemptExecution.java:204) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at io.airbyte.workers.temporal.CancellationHandler$TemporalCancellationHandler.checkAndHandleCancellation(CancellationHandler.java:52) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$4(TemporalAttemptExecution.java:207) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

What do the logs mean?
Is there any way to increase job parallelism on EC2? (For context we are also experimenting with EKS but we are trying EC2 because it is more familiar to us from an infrastructure standpoint). From our metrics it looks like > 90% of memory and CPU of the instance is unused

sajarin · June 23, 2022, 8:50pm

Hi @stellachung, thanks for the post. Increasing the MAX_SYNC_WORKERS value would most likely increase the job parallelism. Try setting the MAX_SYNC_WORKERS value to the maximum cap of jobs you’d want to have running in parallel and let me know if that helps.

danieldiamond · June 24, 2022, 1:29am

What do the logs mean?

I think this is a reoccurring theme and wonder if we either need better logging or a documentation piece e.g. Debug errors | dbt Developer Hub

davinchia · June 24, 2022, 5:26am

The crucial part is this. We check whether we connect as part of the sync job to better differentiate between bad configuration failures and actual sync failures. There was an error at this stage.

Docker parallelism is determined by the number of workers. Right now there are 15 workers for each type of job, which means 15 parallels jobs for each operation at once i.e. 15 max parallel sync jobs. I’d expect the number of running jobs to increase up till this max number of workers for that type.

marcosmarxm · July 13, 2022, 12:00am

Hi there from the Community Assistance team.
We’re letting you know about an issue we discovered with the back-end process we use to handle topics and responses on the forum. If you experienced a situation where you posted the last message in a topic that did not receive any further replies, please open a new topic to continue the discussion. In addition, if you’re having a problem and find a closed topic on the subject, go ahead and open a new topic on it and we’ll follow up with you. We apologize for the inconvenience, and appreciate your willingness to work with us to provide a supportive community.

Topic		Replies	Views
How to configure to have a larger number of concurrency? Q&A kubernetes , deploy	25	7227	June 12, 2023
Syncs taking almost an hour for <20kb of data Platform, Deploy & Infra Issues data-loading , connectors , kubernetes , deploy	5	493	February 2, 2023
Temporal Scheduler Causing Multiple Syncs to Hang Connector Questions & Issues data-loading	14	1200	July 14, 2022
Running sync worker cancellation Connector Questions & Issues getting-started , connectors	2	934	November 11, 2022
Sync appears to be getting cancelled on it's own when handling large amounts of data Connector Questions & Issues destination-redshift , data-loading , connectors , kubernetes	2	883	October 17, 2022

Source E2E-Test: Problem scaling EC2

Related topics