- Is this your first time deploying Airbyte?: No
- OS Version / Instance: Ubuntu
- Memory / Disk: c5.9xlarge
- Deployment: EC2
- Airbyte Version: 0.39.9-alpha
- Source name/version: airbyte/source-e2e-test:2.1.0
- Destination name/version: airbyte/destination-e2e-test:0.2.2
- Step: Running E2E syncs
- Description:
We are running ~250 E2E connections on our airbyte EC2 instance and ~190 of the jobs are hanging. We have our env set to:
MAX_SYNC_WORKERS=15
MAX_SPEC_WORKERS=15
MAX_CHECK_WORKERS=15
MAX_DISCOVER_WORKERS=15
Here is an example of the logs for a hanging job:
2022-06-22 19:56:37 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):108 - Docker volume job log path: /tmp/workspace/606/0/logs.log
2022-06-22 19:56:38 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):113 - Executing worker wrapper. Airbyte version: 0.39.9-alpha
2022-06-22 19:56:41 e[32mINFOe[m i.a.c.i.LineGobbler(voidCall):82 - Checking if airbyte/source-e2e-test:2.1.0 exists...
2022-06-22 19:56:41 e[32mINFOe[m i.a.c.i.LineGobbler(voidCall):82 - airbyte/source-e2e-test:2.1.0 was found locally.
2022-06-22 19:56:41 e[32mINFOe[m i.a.w.p.DockerProcessFactory(create):108 - Creating docker job ID: 606
2022-06-22 19:56:41 e[32mINFOe[m i.a.w.p.DockerProcessFactory(create):163 - Preparing command: docker run --rm --init -i -w /data/606/0 --log-driver none --name source-e2e-test-check-606-0-lymui --network host -v airbyte_workspace:/data -v /tmp/airbyte_local:/local -e WORKER_CONNECTOR_IMAGE=airbyte/source-e2e-test:2.1.0 -e WORKER_JOB_ATTEMPT=0 -e AIRBYTE_ROLE= -e WORKER_ENVIRONMENT=DOCKER -e AIRBYTE_VERSION=0.39.9-alpha -e WORKER_JOB_ID=606 airbyte/source-e2e-test:2.1.0 check --config source_config.json
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):194 - Running sync worker cancellation...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):198 - Interrupting worker thread...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):201 - Cancelling completable future...
2022-06-22 19:57:11 e[33mWARNe[m i.a.w.t.CancellationHandler$TemporalCancellationHandler(checkAndHandleCancellation):53 - Job either timed out or was cancelled.
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(get):134 - Stopping cancellation check scheduling...
2022-06-22 19:57:11 e[32mINFOe[m i.a.w.t.TemporalAttemptExecution(lambda$getWorkerThread$2):161 - Completing future exceptionally...
io.airbyte.workers.exception.WorkerException: Error while getting checking connection.
at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:86) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:29) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:158) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.airbyte.workers.exception.WorkerException: Error checking connection, status: Optional.empty, exit code: 143
at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:82) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
... 3 more
2022-06-22 19:57:11 e[33mWARNe[m i.t.i.a.POJOActivityTaskHandler(activityFailureToResult):307 - Activity failure. ActivityId=9404a889-fd1b-361e-93b3-53ce628318df, activityType=Run, attempt=1
java.util.concurrent.CancellationException: null
at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478) ~[?:?]
at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$3(TemporalAttemptExecution.java:204) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
at io.airbyte.workers.temporal.CancellationHandler$TemporalCancellationHandler.checkAndHandleCancellation(CancellationHandler.java:52) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$4(TemporalAttemptExecution.java:207) ~[io.airbyte-airbyte-workers-0.39.0-alpha.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
- What do the logs mean?
- Is there any way to increase job parallelism on EC2? (For context we are also experimenting with EKS but we are trying EC2 because it is more familiar to us from an infrastructure standpoint). From our metrics it looks like > 90% of memory and CPU of the instance is unused