Improving Connector Speed

Hi All,

We are trying to improve sync speed for the Facebook connector by tuning the parameters in the .env file (airbyte/.env at master · airbytehq/airbyte · GitHub).

There are two areas of focus.

  1. Sync speed. For Facebook Marketing, it takes 5 minutes to download even small amounts of data.

  2. Normalization takes many minutes, even with just kilobytes of data.

Can you please help us improve both?

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: GCP - Debian GNU/Linux 11
  • Memory / Disk: 16 GB
  • Deployment: Docker
  • Airbyte Version: 0.40.2
  • Source name/version: Facebook/0.2.63
  • Destination name/version: Postgres/0.3.20
  • Step: Sync and Normalization
  • Description: Goal is to get sync done in under 1 min

There are no log entries between 18:51:54 and 18:54:25 in the attached screenshot.

Can you help us understand what it is doing during that time? Is there scope for improvement there?

Hi @Vijay_Alilaghatta, could you please give me a few more details: what Airbyte & connector versions you’re using, what’s the memory/disk allocation?

Thank you!

Also note, I don’t see that delay in my Airbyte instance running locally during the interval mentioned above.

I have added the details @natalyjazzviolin

Here is another instance where it seems like Airbyte is stuck.

2022-09-29 21:30:32 normalization > Found 212 models, 0 tests, 0 snapshots, 0 analyses, 558 macros, 0 operations, 0 seed files, 6 sources, 0 exposures, 0 metrics

2022-09-29 21:46:16 normalization > Concurrency: 8 threads (target='prod')

2022-09-29 21:46:28 normalization > 1 of 53 START incremental model public.ip5pxydtrvafvsh_ads… [RUN]

2022-09-29 21:46:30 normalization > 2 of 53 START incremental model public.ip5pxydtrvafvsh_campaigns… [RUN]

Could you please tell me:

  1. Exactly how much data you’re syncing.
  2. If you were using previous versions - what they were
  3. If you were using previous versions - what the sync time was

Thanks!

Hi @natalyjazzviolin :

  1. It's just around 4 MB of data.
  2. It has been the same for many versions; we are only now looking at it closely because we want to reduce sync time from 5 mins to 1 min.
  3. The main thing that prompted this is that on my local machine the exact same configuration takes less than 1 min.

I will attach two log files, one from GCP and one from localhost, that show a significant difference in sync times.

My initial thought is that this is normal due to connection times and such, but let me double check.

gcp-googleanalytics-logs-1074.txt (56.4 KB)
localhost-googleanalytics-logs-163.txt (55.5 KB)

There are three parts:

  1. Sync from source
  2. Wait time before normalization starts
  3. Normalization time

GCP:
sync: 30 s
wait time: 2 min 21 s
normalization time: 7 s

This shows wait time:

2022-09-28 14:00:38 normalization > Found 24 models, 0 tests, 0 snapshots, 0 analyses, 558 macros, 0 operations, 0 seed files, 6 sources, 0 exposures, 0 metrics
2022-09-28 14:02:59 normalization > Concurrency: 8 threads (target='prod')

localhost:

Does not have wait time!

2022-09-28 18:00:41 normalization > Found 24 models, 0 tests, 0 snapshots, 0 analyses, 558 macros, 0 operations, 0 seed files, 6 sources, 0 exposures, 0 metrics
2022-09-28 18:00:42 normalization > Concurrency: 8 threads (target='prod')
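
To find silent periods like this in a full log file without reading it by eye, something along these lines may help. It is only a rough sketch, assuming every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpts above. The function name `find_gaps` and the 60-second threshold are hypothetical choices for illustration, not anything provided by Airbyte.

```python
import re
from datetime import datetime

# Assumption: each log line begins with "YYYY-MM-DD HH:MM:SS".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def find_gaps(lines, threshold_seconds=60):
    """Return (gap_seconds, previous_line, next_line) for every pair of
    consecutive timestamped lines more than threshold_seconds apart."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip blank lines and lines without a timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if prev_time is not None:
            delta = (current - prev_time).total_seconds()
            if delta > threshold_seconds:
                gaps.append((delta, prev_line, line))
        prev_time, prev_line = current, line
    return gaps


# Log excerpts from this thread, shortened after the first few words.
gcp_log = [
    "2022-09-28 14:00:38 normalization > Found 24 models",
    "2022-09-28 14:02:59 normalization > Concurrency: 8 threads",
]
local_log = [
    "2022-09-28 18:00:41 normalization > Found 24 models",
    "2022-09-28 18:00:42 normalization > Concurrency: 8 threads",
]

for seconds, before, after in find_gaps(gcp_log):
    print(f"{seconds:.0f}s of silence after: {before}")
# prints: 141s of silence after: 2022-09-28 14:00:38 normalization > Found 24 models

print(find_gaps(local_log))  # prints: []
```

The same function accepts an open file object, so the attached log files could be passed in directly to list every gap above the threshold, not just the one before normalization starts.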

If we can resolve just that one part, we will be much ahead.

There is also a discrepancy in normalization times, but that can wait.

It should be something pretty simple, probably related to the environment, since we are using the same deployment and the same sync configuration, and even the same account the data is coming from.

And what is the normalization time for the local deployment?