Source-E2E-Test: Problem scaling EC2

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu
  • Memory / Disk: c5.9xlarge
  • Deployment: EC2
  • Airbyte Version: 0.39.9-alpha
  • Source name/version: airbyte/source-e2e-test:2.1.0
  • Destination name/version: airbyte/destination-e2e-test:0.2.2
  • Step: Running E2E syncs
  • Description:

We are running ~250 E2E connections on our airbyte EC2 instance and ~190 of the jobs are hanging. We have our env set to:

MAX_SYNC_WORKERS=15
MAX_SPEC_WORKERS=15
MAX_CHECK_WORKERS=15
MAX_DISCOVER_WORKERS=15

Here is an example of the logs for a hanging job:

2022-06-22 19:56:37 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):108 - Docker volume job log path: /tmp/workspace/606/0/logs.log
2022-06-22 19:56:38 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):113 - Executing worker wrapper. Airbyte version: 0.39.9-alpha
2022-06-22 19:56:41 e[32mINFOe[m i.a.c.i.LineGobbler(voidCall):82 - Checking if airbyte/source-e2e-test:2.1.0 exists...
2022-06-22 19:56:41 e[32mINFOe[m i.a.c.i.LineGobbler(voidCall):82 - airbyte/source-e2e-test:2.1.0 was found locally.
2022-06-22 19:56:41 e[32mINFOe[m i.a.w.p.DockerProcessFactory(create):108 - Creating docker job ID: 606
2022-06-22 19:56:41 e[32mINFOe[m i.a.w.p.DockerProcessFactory(create):163 - Preparing command: docker run --rm --init -i -w /data/606/0 --log-driver none --name source-e2e-test-check-606-0-lymui --network host -v airbyte_workspace:/data -v /tmp/airbyte_local:/local -e WORKER_CONNECTOR_IMAGE=airbyte/source-e2e-test:2.1.0 -e WORKER_JOB_ATTEMPT=0 -e AIRBYTE_ROLE= -e WORKER_ENVIRONMENT=DOCKER -e AIRBYTE_VERSION=0.39.9-alpha -e WORKER_JOB_ID=606 airbyte/source-e2e-test:2.1.0 check --config source_config.json
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):194 - Running sync worker cancellation...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):198 - Interrupting worker thread...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):201 - Cancelling completable future...
2022-06-22 19:57:11 e[33mWARNe[m i.a.w.t.CancellationHandler$TemporalCancellationHandler(checkAndHandleCancellation):53 - Job either timed out or was cancelled.
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):134 - Stopping cancellation check scheduling...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getWorkerThread$2):161 - Completing future exceptionally...
io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:86) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.exception.WorkerException: Error checking connection, status: Optional.empty, exit code: 143
	at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:82) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	... 3 more
2022-06-22 19:57:11 e[33mWARNe[m i.t.i.a.POJOActivityTaskHandler(activityFailureToResult):307 - Activity failure. ActivityId=9404a889-fd1b-361e-93b3-53ce628318df, activityType=Run, attempt=1
java.util.concurrent.CancellationException: null
	at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478) ~[?:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$3(TemporalAttemptExecution.java:204) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at io.airbyte.workers.temporal.CancellationHandler$TemporalCancellationHandler.checkAndHandleCancellation(CancellationHandler.java:52) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$4(TemporalAttemptExecution.java:207) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
  • What do the logs mean?
  • Is there any way to increase job parallelism on EC2? (For context we are also experimenting with EKS but we are trying EC2 because it is more familiar to us from an infrastructure standpoint). From our metrics it looks like > 90% of memory and CPU of the instance is unused

Hi @stellachung, thanks for the post. Increasing the MAX_SYNC_WORKERS value would most likely increase the job parallelism. Try setting the MAX_SYNC_WORKERS value to the maximum cap of jobs you’d want to have running in parallel and let me know if that helps.

  • What do the logs mean?

I think this is a reoccurring theme and wonder if we either need better logging or a documentation piece e.g. Debugging errors | dbt Docs

The crucial part is this. We check whether we connect as part of the sync job to better differentiate between bad configuration failures and actual sync failures. There was an error at this stage.

Docker parallelism is determined by the number of workers. Right now there are 15 workers for each type of job, which means 15 parallels jobs for each operation at once i.e. 15 max parallel sync jobs. I’d expect the number of running jobs to increase up till this max number of workers for that type.