Optimizing ETL times for Facebook connections in Airbyte on AWS EC2 instances

Summary

The user is experiencing long ETL times for Facebook connections in Airbyte on AWS EC2 instances, specifically for large accounts. They are seeking optimization tips beyond the general ones available on the site.


Question

Hi everyone!

I have two self-hosted instances (AWS EC2 m6a.4xlarge, each running the Airbyte Docker setup [v0.63.10]) with around 500 connections each - one for Google and the other for Facebook. All going to BigQuery (using the Google Cloud Storage bucket).

The Google side does fine, but Facebook suffers from some very long ETL times. Obviously for smaller accounts it is OK (much better on subsequent runs), but first runs can take a while; for large accounts it can really become an issue, with multi-day runs and many attempts. Do you have any optimization tips (other than what's on the site, which is more general and not as helpful specifically for Facebook)?

Thanks in advance for any assistance/insights!



This topic has been created from a Slack thread to give it more visibility. It is in read-only mode here.


The ingress part looks a bit off - I'd put it under webapp (webapp and global are at the same level of indentation).
I'd also avoid curly braces and commas for the annotations:

```
  ingress:
    enabled: true
    hosts:
      - host: "airbyte.example.com"
        paths:
          - path: "/"
            pathType: ImplementationSpecific
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "16m"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"
```

<@U05JENRCF7C> Back again - facing this issue:
```
Internal Server Error: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(sections=[])), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : [SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., EnvironmentVariableCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., WebIdentityTokenCredentialsProvider(): Either the environment variable AWS_WEB_IDENTITY_TOKEN_FILE or the javaproperty aws.webIdentityTokenFile must be set., ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(sections=[])): Profile file contained no credentials for profile 'default': ProfileFile(sections=[]), ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set., InstanceProfileCredentialsProvider(): Failed to load credentials from IMDS.]
```

Here is my values.yaml

```
global:
  storage:
    type: "S3"
    storageSecretName: airbyte-config-secrets
    bucket:
      log: dicer-airbyte-bucket
      state: dicer-airbyte-bucket
      workloadOutput: dicer-airbyte-bucket
    s3:
      region: "us-east-1"
      authenticationType: instanceProfile
  secretsManager:
    type: awsSecretManager
    awsSecretManager:
      region: "us-east-1"
      authenticationType: instanceProfile
  auth:
    cookieSecureSetting: "false"

webapp:
  ingress:
    enabled: true
    hosts:
      - host: "airbyte.dicer.ai"
        paths:
          - path: "/"
            pathType: ImplementationSpecific
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "16m"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"
```
The only note here is that `authenticationType: credentials` did not work for me (I assume because I don't have the AWS CLI set up on the server - but the EC2 does have the proper instance role attached).

and my secret.yaml
```
apiVersion: v1
kind: Secret
metadata:
  name: airbyte-config-secrets
type: Opaque
stringData:
  s3-access-key-id: REDACTED
  s3-secret-access-key: REDACTED
  aws-secret-manager-access-key-id: REDACTED
  aws-secret-manager-secret-access-key: REDACTED
```
`abctl local install --secret secret.yaml --values values.yaml`
works fine - but the connections fail with the error above

(and of course not seeing anything in the s3 bucket)

For now, please create an IAM user and create an access key for that user.
The error shows that it tried a few different strategies and couldn't obtain any credentials.
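
For reference, a minimal sketch of that with the AWS CLI - the user name and the attached managed policies are illustrative, and you would normally scope permissions down to the specific bucket and secrets:

```
# Create a dedicated IAM user for Airbyte (name is illustrative).
aws iam create-user --user-name airbyte-app

# Grant access to S3 (logs/state bucket) and Secrets Manager; scope these down in practice.
aws iam attach-user-policy --user-name airbyte-app \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-user-policy --user-name airbyte-app \
  --policy-arn arn:aws:iam::aws:policy/SecretsManagerReadWrite

# Returns an AccessKeyId/SecretAccessKey pair to put into the airbyte-config-secrets secret.
aws iam create-access-key --user-name airbyte-app
```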

abctl provisions a kind cluster (https://kind.sigs.k8s.io/) that runs on Docker. Inside the kind cluster, pods are started for synchronizations. It looks like, with all those layers, it cannot reach the instance profile and generate temporary credentials.
It would require some extra debugging to check whether it is possible to communicate with http://169.254.169.254
https://github.com/aws/aws-sdk-java/blob/81f205cb12d72132b0099089fc16ee755c2ced63/aws-java-sdk-core/src/main/java/com/amazonaws/util/EC2MetadataUtils.java#L81
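
One rough way to check that: exec into one of the Airbyte pods inside the kind cluster and try to reach the metadata service. The namespace and pod name below are assumptions - adjust them to whatever `kubectl get pods -A` shows, and note that curl may not be present in every image:

```
# List the Airbyte pods running inside the kind cluster (namespace is an assumption).
kubectl get pods -n airbyte-abctl

# Try to fetch an IMDSv2 token from inside a pod; replace <pod-name> with a real pod.
kubectl exec -n airbyte-abctl <pod-name> -- \
  curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"
```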

adding this is “working” for me

```
global:
  storage:
    type: "S3"
    storageSecretName: airbyte-config-secrets
    bucket:
      log: dicer-airbyte-bucket
      state: dicer-airbyte-bucket
      workloadOutput: dicer-airbyte-bucket
    s3:
      region: "us-east-1"
      authenticationType: instanceProfile
  logs:
    accessKey:
      password: ""
      existingSecret: airbyte-config-secrets
      existingSecretKey: s3-access-key-id
    secretKey:
      password: ""
      existingSecret: airbyte-config-secrets
      existingSecretKey: s3-secret-access-key
  secretsManager:
    type: awsSecretManager
    awsSecretManager:
      region: "us-east-1"
      authenticationType: instanceProfile
  auth:
    cookieSecureSetting: "false"

worker:
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-secret-access-key
    - name: STATE_STORAGE_S3_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-access-key-id
    - name: STATE_STORAGE_S3_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-secret-access-key
    - name: STATE_STORAGE_S3_BUCKET_NAME
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-bucket
    - name: STATE_STORAGE_S3_REGION
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: aws-region

server:
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-secret-access-key
    - name: STATE_STORAGE_S3_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-access-key-id
    - name: STATE_STORAGE_S3_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-secret-access-key
    - name: STATE_STORAGE_S3_BUCKET_NAME
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-bucket
    - name: STATE_STORAGE_S3_REGION
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: aws-region
```

but I think I need to add more

like

```
    accessKey:
      password: ""
      existingSecret: airbyte-config-secrets
      existingSecretKey: s3-access-key-id
    secretKey:
      password: ""
      existingSecret: airbyte-config-secrets
      existingSecretKey: s3-secret-access-key
```
but for state and workloadOutput?

I have a different configuration, and it works for me, so you are on your own here :wink: Good luck finding what will work for you!

haha - totally understand and thanks! Maybe the issue could be that http://169.254.169.254 requires a token now?

```
# Request an IMDSv2 session token, then use it for the metadata query.
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

curl http://169.254.169.254/latest/meta-data/profile -H "X-aws-ec2-metadata-token: $TOKEN"
```

Check how your synchronizations are distributed during the day. Maybe you can run them at different times with a cron schedule. That way you will have fewer containers running at the same time and fewer resources consumed (memory/CPU).
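
For illustration only (assuming Airbyte's cron schedules accept Quartz-style expressions with a leading seconds field), staggering groups of connections across the night could look like this:

```
# Quartz-style cron expressions (seconds, minutes, hours, day-of-month, month, day-of-week)
# spreading three groups of connections across the night:
0 0 1 * * ?   # group 1 starts at 01:00
0 0 3 * * ?   # group 2 starts at 03:00
0 0 5 * * ?   # group 3 starts at 05:00
```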

Check with ctop how many resources are consumed by containers https://github.com/bcicen/ctop
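
If you don't want to install the binary, the ctop README also documents running it as a container - a sketch, with the image reference as documented there:

```
# Run ctop itself as a container with read-only access to the Docker socket.
docker run --rm -ti \
  --name=ctop \
  --volume /var/run/docker.sock:/var/run/docker.sock:ro \
  quay.io/vektorlab/ctop:latest
```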

Check network metrics for the EC2 instance if you are on an instance type that uses a network I/O credit mechanism. It might be a bottleneck for 500 connections.

https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html#gp_network
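
As a rough sketch (the instance ID and region are placeholders), you can pull the basic network metrics from CloudWatch via the CLI and look for sustained throughput that lines up with your sync windows:

```
# NetworkOut bytes for the instance over the last day, in 5-minute buckets
# (instance ID is a placeholder; GNU date syntax assumed, as on the EC2 host).
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average Maximum \
  --region us-east-1
```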

You can also consider using EKS and horizontal scaling, but deployment and tweaking are a bit tricky at the beginning.

<@U05JENRCF7C> Thank you - all great advice!

per ctop: I see that the airbyte-abctl-control-plane is consistently running over 100% CPU. I am going to switch the instance to a c6a.8xlarge to test both CPU and eliminate the network I/O credit variable.

Next, I am going to make calls to the API to spread out the connections. I am sure this will help regardless.

Finally, I really would love to use EKS; my concern here is transferring my existing data (not the BigQuery data), but it would be nice to truly migrate my existing setup. Since my initial message I did move to abctl - do you have any paths for that?

Do you use the built-in database for Airbyte or do you have an external PostgreSQL?

The first thing I'd do is a sample EKS setup with S3 (https://docs.airbyte.com/deploying-airbyte/integrations/storage) and external PostgreSQL (https://docs.airbyte.com/deploying-airbyte/integrations/database),
just to be sure that you can configure a working setup. Tweaking it is another story.
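
A minimal sketch of that deployment step, assuming the official Airbyte Helm chart repository and an existing EKS cluster reachable via kubectl (release name, namespace, and values file are placeholders):

```
# Add the Airbyte chart repo and install into the cluster using your values.yaml,
# which would point at S3 storage and the external PostgreSQL per the docs above.
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm install airbyte airbyte/airbyte \
  --namespace airbyte --create-namespace \
  --values values.yaml
```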

> I am going to make calls to the API to spread out the connections
Do you have the configuration of sources/destinations/connections in code, or are you changing it manually or with the API?

with the API like so:

```
        # Body of an async helper that PATCHes a connection's cron schedule via the Airbyte public API.
        headers = await self._create_headers()
        url = f"{self.base_url}/api/public/v1/connections/{connection_id}"
        payload = {
            "schedule": {
                "scheduleType": "cron",
                "cronExpression": cron_expression
            },
            "namespaceFormat": None
        }
        response = await self._client.patch(url, json=payload, headers=headers)
        response.raise_for_status()
        return response.json()
```

I recommend checking the Terraform provider for storing configuration as code: https://github.com/airbytehq/terraform-provider-airbyte
Also, Pulumi has introduced support for any Terraform provider: https://www.pulumi.com/blog/any-terraform-provider/

Oh, while talking about adding --values values.yaml to the abctl command: I noticed abctl local install does not persist existing settings. For example, if I run

abctl local install --host airbyte.example.com --insecure-cookies

and then later I want to change values with
abctl local install --values values.yml

it will undo the host and cookies. I probably have a fundamental misunderstanding, but I wanted to throw that out there. Also, I was having issues running something like abctl local install --host airbyte.example.com --insecure-cookies --values values.yml when the values interacted with the ingress (i.e. increasing the proxy timeouts).

Apart from that, the most important part of changing deployments would be copying data between databases.
You can use https://dbeaver.io, pg_dump, or whatever works for you to have the same data in both databases. Before doing that, I'd advise stopping all connections and, if any are running, waiting for them to finish.
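
For example, a minimal pg_dump/pg_restore sketch - the hostnames are placeholders, and the database/user name "airbyte" is an assumption about your setup:

```
# Dump the Airbyte config database from the old instance (custom format keeps it compact).
pg_dump -h old-db-host -U airbyte -d airbyte -Fc -f airbyte.dump

# Restore into the external PostgreSQL used by the new deployment.
pg_restore -h new-db-host -U airbyte -d airbyte --clean --if-exists airbyte.dump
```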