c.*,
string_agg(co.operation_id::text,',') as operation_ids_agg,
sm.auto_propagation_status
FROM
connection c
LEFT JOIN connection_operation co ON co.connection_id = c.id
LEFT JOIN schema_management sm ON sm.connection_id = c.id
JOIN actor a ON c.source_id = a.id
WHERE
a.workspace_id = '47d889b1-8b87-42ee-8082-f04e13765c7a'::uuid
AND c.status <> 'deprecated'
GROUP BY
c.id, sm.auto_propagation_status;```
The query should be something like this in Postgres.
My EXPLAIN plan for this ended up like this:
Group Key: c.id, sm.auto_propagation_status
-> Sort (cost=1924.96..1925.27 rows=124 width=1841) (actual time=1.584..1.595 rows=122 loops=1)
Sort Key: c.id, sm.auto_propagation_status
Sort Method: quicksort Memory: 240kB
-> Nested Loop Left Join (cost=6.86..1920.65 rows=124 width=1841) (actual time=0.070..1.432 rows=122 loops=1)
-> Nested Loop Left Join (cost=6.58..1873.01 rows=124 width=1837) (actual time=0.064..1.051 rows=122 loops=1)
-> Nested Loop (cost=6.43..1840.10 rows=124 width=1821) (actual time=0.061..0.945 rows=122 loops=1)
-> Bitmap Heap Scan on actor a (cost=6.14..717.10 rows=240 width=16) (actual time=0.052..0.312 rows=240 loops=1)
Recheck Cond: (workspace_id = '47d889b1-8b87-42ee-8082-f04e13765c7a'::uuid)
Heap Blocks: exact=226
-> Bitmap Index Scan on actor_workspace_id_idx (cost=0.00..6.08 rows=240 width=0) (actual time=0.024..0.024 rows=264 loops=1)
Index Cond: (workspace_id = '47d889b1-8b87-42ee-8082-f04e13765c7a'::uuid)
-> Index Scan using connection_source_id_idx on connection c (cost=0.28..4.67 rows=1 width=1821) (actual time=0.002..0.002 rows=1 loops=240)
Index Cond: (source_id = a.id)
Filter: (status <> 'deprecated'::status_type)
-> Index Scan using connection_operation_connection_id_idx on connection_operation co (cost=0.15..0.23 rows=4 width=32) (actual time=0.000..0.000 rows=0 loops=122)
Index Cond: (connection_id = c.id)
-> Index Scan using connection_idx on schema_management sm (cost=0.28..0.37 rows=1 width=20) (actual time=0.003..0.003 rows=1 loops=122)
Index Cond: (connection_id = c.id)
Planning Time: 0.577 ms
Execution Time: 1.794 ms```
Most of the cost/time is spent joining `actor` to `connection` – probably not a lot that can be done there given the current table architecture. Possibly, if the connections table already had `workspace_id` as a column, this would be faster?
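A rough sketch of that denormalization idea, purely hypothetical: the column, the index name, and the backfill all assume that `connection.source_id` always points at an `actor` row.

```sql
-- Hypothetical: denormalize workspace_id onto connection so the WHERE clause
-- can filter connection directly instead of going through actor first.
ALTER TABLE connection ADD COLUMN workspace_id uuid;

-- Backfill from the owning actor (assumes source_id always references actor.id).
UPDATE connection c
SET    workspace_id = a.workspace_id
FROM   actor a
WHERE  a.id = c.source_id;

-- Partial index so workspace lookups skip deprecated rows entirely.
CREATE INDEX connection_workspace_id_idx
    ON connection (workspace_id)
    WHERE status <> 'deprecated';
```

The real cost of this approach is that the application (or a trigger) would have to keep the new column in sync on every write.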
So yeah, I guess the only way to speed this up at the moment is to have fewer rows to scan – i.e. delete the connections and their associated source/dest configs.
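If pruning ends up being the route, the rough shape would be something like this (illustrative only: it assumes there is no ON DELETE CASCADE on the child tables, uses already-deprecated rows as the example target, and leaves the source/destination actor rows alone):

```sql
-- Illustrative cleanup: remove unwanted connections (here, ones already
-- marked deprecated) together with their child rows, inside one transaction.
BEGIN;

DELETE FROM connection_operation co
USING  connection c
WHERE  co.connection_id = c.id
  AND  c.status = 'deprecated';

DELETE FROM schema_management sm
USING  connection c
WHERE  sm.connection_id = c.id
  AND  c.status = 'deprecated';

DELETE FROM connection
WHERE  status = 'deprecated';

COMMIT;
```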
Yep, but the weird thing is: the query runs fast on the database. So maybe it's not the problem.
Group Key: c.id, sm.auto_propagation_status
-> Sort (cost=468.53..468.79 rows=102 width=1376) (actual time=20.420..20.436 rows=102 loops=1)
Sort Key: c.id, sm.auto_propagation_status
Sort Method: quicksort Memory: 194kB
-> Nested Loop (cost=314.28..465.13 rows=102 width=1376) (actual time=16.508..20.199 rows=102 loops=1)
-> Hash Right Join (cost=314.01..343.48 rows=102 width=1376) (actual time=16.490..16.608 rows=102 loops=1)
Hash Cond: (co.connection_id = c.id)
-> Seq Scan on connection_operation co (cost=0.00..24.91 rows=991 width=32) (actual time=0.816..1.680 rows=991 loops=1)
-> Hash (cost=312.73..312.73 rows=102 width=1360) (actual time=14.693..14.695 rows=102 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 141kB
-> Hash Right Join (cost=282.20..312.73 rows=102 width=1360) (actual time=11.782..14.606 rows=102 loops=1)
Hash Cond: (sm.connection_id = c.id)
-> Seq Scan on schema_management sm (cost=0.00..27.71 rows=1071 width=20) (actual time=0.009..2.663 rows=1071 loops=1)
-> Hash (cost=280.93..280.93 rows=102 width=1356) (actual time=11.765..11.766 rows=102 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 140kB
-> Seq Scan on connection c (cost=0.00..280.93 rows=102 width=1356) (actual time=0.337..11.614 rows=102 loops=1)
Filter: (status <> 'deprecated'::status_type)
Rows Removed by Filter: 1092
-> Index Scan using actor_pkey on actor a (cost=0.28..1.19 rows=1 width=16) (actual time=0.035..0.035 rows=1 loops=102)
Index Cond: (id = c.source_id)
Filter: (workspace_id = '26204905-8ef0-47c4-a86a-a3e8cc3950aa'::uuid)
Planning Time: 8.428 ms
Execution Time: 20.961 ms```
Yeah, possibly not the DB; the time could be tied up inside the app then. Good thing you tested it; we would have been going down a rabbit hole for minimal gain.
I kind of still suspect an N+1 query problem.
The web endpoint responds with a lot more data than exists in the results of this query, so it has to hydrate it from somewhere.
Also, the sheer size of the data it responds with can contribute to slowness.
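One way to test the N+1 suspicion, assuming the pg_stat_statements extension is available on the config database, is to reset its counters, load the slow page once, and then look for a cheap statement with a huge call count (column names below are the PostgreSQL 13+ ones):

```sql
-- An N+1 pattern usually shows up as one small per-row query executed
-- hundreds of times while a single page renders.
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 120)                   AS query
FROM   pg_stat_statements
ORDER  BY calls DESC
LIMIT  20;
```

Running `SELECT pg_stat_statements_reset();` just before loading the page makes the counts much easier to attribute.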
I've also noticed that sometimes the DB queries return quickly, and other times they are painfully slow (which suggests contention with a write lock).
you may want to try running that query periodically and seeing what kind of variance you get in the execution times
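For that periodic check, a plain psql session is probably enough; this is just a sketch using a trimmed-down variant of the query above, re-run every 60 seconds:

```sql
-- \timing prints client-side elapsed time; \watch re-runs the buffered query
-- on an interval, so latency spikes show up as outliers across iterations.
\timing on

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.id
FROM   connection c
JOIN   actor a ON c.source_id = a.id
WHERE  a.workspace_id = '47d889b1-8b87-42ee-8082-f04e13765c7a'::uuid
  AND  c.status <> 'deprecated'
\watch 60
```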
Sure, I'll let you know!
There is a related issue about the number of connections: https://github.com/airbytehq/airbyte/issues/27364
And another interesting discovery: when I hit this endpoint, https://airbyte.loadsmart.io/api/v1/connections/list, the response is fast, which supports the conclusion that the database answers fast (right?)
However, this endpoint is slow (~2 min) to list the connections:
https://airbyte.loadsmart.io/api/v1/web_backend/connections/list
It seems that when the frontend is not involved, the performance is good.
Yeah, the web backend hydrates connections with more info. I think the response embeds SVG images too.
Still, I think the DB does have latency spikes.
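If those spikes turn out to be lock-related, this shows who is waiting and who is holding the lock at the moment a slow run happens (PostgreSQL 9.6+, which added pg_blocking_pids):

```sql
-- Sessions currently waiting on a lock, paired with the sessions blocking them.
SELECT waiting.pid              AS waiting_pid,
       left(waiting.query, 80)  AS waiting_query,
       blocking.pid             AS blocking_pid,
       blocking.state           AS blocking_state,
       left(blocking.query, 80) AS blocking_query
FROM   pg_stat_activity AS waiting
JOIN   LATERAL unnest(pg_blocking_pids(waiting.pid)) AS b(pid) ON true
JOIN   pg_stat_activity AS blocking ON blocking.pid = b.pid
WHERE  waiting.wait_event_type = 'Lock';
```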
Yeah, I will run that query periodically
Thanks a lot for now