Summary
The user is experiencing long ETL times for Facebook connections in Airbyte on AWS EC2 instances, specifically for large accounts. They are seeking optimization tips beyond the general ones available on the site.
Question
Hi everyone!
I have two self-hosted instances (AWS EC2 m6a.4xlarge, each running the Airbyte Docker setup [v0.63.10]) with around 500 connections each - one for Google and the other for Facebook. All going to BigQuery (using a Google Cloud Storage bucket).
The Google side does fine, but Facebook suffers from some very long ETL times. For smaller accounts it is OK (much better on subsequent runs), but first runs can take a while - and for large accounts it can really become an issue, with multi-day runs and many attempts. Do you have any optimization tips (other than what's on the site - that is more general and not as helpful specifically for Facebook)?
Thanks in advance for any assistance/insights!
This topic has been created from a Slack thread to give it more visibility.
The part with `ingress` looks a bit off; I'd put it under `webapp` (`webapp` and `global` are on the same level of indentation). I'd also avoid curly braces and commas for annotations:
```
ingress:
  enabled: true
  hosts:
    - host: airbyte.example.com
      paths:
        - path: "/"
          pathType: ImplementationSpecific
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "16m"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"
```
<@U05JENRCF7C> back again - facing this issue:
```
Internal Server Error: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(sections=[])), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : [SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., EnvironmentVariableCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., WebIdentityTokenCredentialsProvider(): Either the environment variable AWS_WEB_IDENTITY_TOKEN_FILE or the javaproperty aws.webIdentityTokenFile must be set., ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(sections=[])): Profile file contained no credentials for profile 'default': ProfileFile(sections=[]), ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set., InstanceProfileCredentialsProvider(): Failed to load credentials from IMDS.]
```
Here is my values.yaml:
```
storage:
  type: "S3"
  storageSecretName: airbyte-config-secrets
  bucket:
    log: dicer-airbyte-bucket
    state: dicer-airbyte-bucket
    workloadOutput: dicer-airbyte-bucket
  s3:
    region: "us-east-1"
    authenticationType: instanceProfile
secretsManager:
  type: awsSecretManager
  awsSecretManager:
    region: "us-east-1"
    authenticationType: instanceProfile
auth:
  cookieSecureSetting: "false"
webapp:
  ingress:
    enabled: true
    hosts:
      - host: airbyte.dicer.ai
        paths:
          - path: "/"
            pathType: ImplementationSpecific
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "16m"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"
```
The only note here is that `authenticationType: credentials` did not work for me (I'm just assuming that's because I don't have the AWS CLI set up on the server - but the EC2 instance does have the proper instance role attached).
and my secret.yaml
```
apiVersion: v1
kind: Secret
metadata:
  name: airbyte-config-secrets
type: Opaque
stringData:
  s3-access-key-id: REDACTED
  s3-secret-access-key: REDACTED
  aws-secret-manager-access-key-id: REDACTED
  aws-secret-manager-secret-access-key: REDACTED
```
`abctl local install --secret secret.yaml --values values.yaml` works fine - but the connections fail with the error above (and of course I'm not seeing anything in the S3 bucket).
For now, please create an IAM user and an access key for that user.
The error shows that it tried a few different strategies and couldn't get any credentials.
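Something along these lines with the AWS CLI should do it (the user name and the attached policy are just placeholders - scope the policy down to your bucket and secrets in practice):
```
# Hypothetical user name; attach a policy appropriate for your bucket/secrets
aws iam create-user --user-name airbyte-s3-user
aws iam attach-user-policy --user-name airbyte-s3-user \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam create-access-key --user-name airbyte-s3-user
```
The access key ID and secret access key printed by the last command are what go into `airbyte-config-secrets`.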
`abctl` provisions a kind cluster (https://kind.sigs.k8s.io/) that runs on Docker. Inside the kind cluster, pods are started for synchronizations. It looks like, with all those layers, it cannot reach the instance profile and generate temporary credentials.
It would require some extra debugging to check if it is possible to communicate with http://169.254.169.254
https://github.com/aws/aws-sdk-java/blob/81f205cb12d72132b0099089fc16ee755c2ced63/aws-java-sdk-core/src/main/java/com/amazonaws/util/EC2MetadataUtils.java#L81
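For example, a rough reachability check from a throwaway pod inside the kind cluster could look like this (pod name and image are arbitrary, and it assumes kubectl is pointed at the cluster abctl created):
```
# If this times out, pods cannot reach IMDS at all; an HTTP 401 means IMDSv2 wants a token first.
kubectl run imds-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s -m 5 http://169.254.169.254/latest/meta-data/
# With the extra Docker/kind network hops, the instance's IMDS hop limit (default 1) is a common
# culprit; it can be raised with `aws ec2 modify-instance-metadata-options --http-put-response-hop-limit`.
```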
Adding this is "working" for me:
```
storage:
  type: "S3"
  storageSecretName: airbyte-config-secrets
  bucket:
    log: dicer-airbyte-bucket
    state: dicer-airbyte-bucket
    workloadOutput: dicer-airbyte-bucket
  s3:
    region: "us-east-1"
    authenticationType: instanceProfile
logs:
  accessKey:
    password: ""
    existingSecret: airbyte-config-secrets
    existingSecretKey: s3-access-key-id
  secretKey:
    password: ""
    existingSecret: airbyte-config-secrets
    existingSecretKey: s3-secret-access-key
secretsManager:
  type: awsSecretManager
  awsSecretManager:
    region: "us-east-1"
    authenticationType: instanceProfile
auth:
  cookieSecureSetting: "false"
worker:
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-secret-access-key
    - name: STATE_STORAGE_S3_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-access-key-id
    - name: STATE_STORAGE_S3_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-secret-access-key
    - name: STATE_STORAGE_S3_BUCKET_NAME
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-bucket
    - name: STATE_STORAGE_S3_REGION
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: aws-region
server:
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-secret-access-key
    - name: STATE_STORAGE_S3_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-access-key-id
    - name: STATE_STORAGE_S3_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-secret-access-key
    - name: STATE_STORAGE_S3_BUCKET_NAME
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: s3-bucket
    - name: STATE_STORAGE_S3_REGION
      valueFrom:
        secretKeyRef:
          name: airbyte-config-secrets
          key: aws-region
```
but I think I need to add more, like
```
accessKey:
  password: ""
  existingSecret: airbyte-config-secrets
  existingSecretKey: s3-access-key-id
secretKey:
  password: ""
  existingSecret: airbyte-config-secrets
  existingSecretKey: s3-secret-access-key
```
but for state and workloadOutput?
I have a different configuration and it works for me, so you are on your own here. Good luck with finding what will work for you!
haha - totally understand and thanks! Maybe the issue could be that http://169.254.169.254 requires a token now (IMDSv2)?
```
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl http://169.254.169.254/latest/meta-data/profile -H "X-aws-ec2-metadata-token: $TOKEN"
```
Check how your synchronizations are distributed during the day. Maybe you can run them at different times with a cron schedule. That way you will have fewer containers running at the same time and less resources consumed (memory/CPU).
Check with ctop how many resources are consumed by containers: https://github.com/bcicen/ctop
Check the network metrics for the EC2 instance if you are using a network I/O credit mechanism. It might be a bottleneck for 500 connections.
https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html#gp_network
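One quick way to see whether you are actually hitting those allowances is the ENA driver statistics on the instance itself (the interface name may differ, e.g. ens5 instead of eth0):
```
# Non-zero *_allowance_exceeded counters mean traffic was queued or dropped
# because the instance exceeded its network allowance
ethtool -S eth0 | grep allowance_exceeded
```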
You can also consider using EKS and horizontal scaling, but deployment and tweaking are a bit tricky at the beginning.
<@U05JENRCF7C> Thank you - all great advice!
Per ctop, I see that the airbyte-abctl-control-plane container is consistently running over 100% CPU. I am going to switch the instance to a c6a.8xlarge to test both the CPU and eliminate the network I/O credit variable.
Next, I am going to make calls to the API to spread out the connections. I am sure this will help regardless.
Finally, I really would love to use EKS; my concern is transferring my existing data (not the BigQuery data), but it would be nice to truly migrate my existing setup. Since my initial message I did move to abctl - do you have any paths for that?
Do you use the built-in database for Airbyte or do you have an external PostgreSQL?
The first thing that I'd do is a sample EKS setup with S3 (https://docs.airbyte.com/deploying-airbyte/integrations/storage) and an external PostgreSQL (https://docs.airbyte.com/deploying-airbyte/integrations/database).
Just to be sure that you can configure a working setup. Tweaking it is another story.
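The rough shape of that first attempt is just the Airbyte Helm chart pointed at your values file; something like this (release and namespace names are placeholders, and the docs linked above have the full details):
```
# Assumes kubectl/helm are already pointed at the EKS cluster
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm install airbyte airbyte/airbyte \
  --namespace airbyte --create-namespace \
  --values values.yaml
```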
> I am going to make calls to the API to spread out the connections
Do you have the configuration of sources/destinations/connections in code, or are you changing it manually or with the API?
with the API, like so:
```
headers = await self._create_headers()
url = f"{self.base_url}/api/public/v1/connections/{connection_id}"
payload = {
    "schedule": {
        "scheduleType": "cron",
        "cronExpression": cron_expression
    },
    "namespaceFormat": None
}
response = await self._client.patch(url, json=payload, headers=headers)
response.raise_for_status()
return response.json()
```
I recommend checking the Terraform provider for storing configuration as code: https://github.com/airbytehq/terraform-provider-airbyte
Also, Pulumi has introduced support for any Terraform provider: https://www.pulumi.com/blog/any-terraform-provider/
Oh, while talking about adding `--values values.yaml` to the `abctl` command: I noticed `abctl local install` does not persist existing setups. For example, if I run `abctl local install --host airbyte.example.com --insecure-cookies` and then later want to change values with `abctl local install --values values.yaml`, it will undo the host and cookies. I probably have a fundamental misunderstanding, but wanted to throw that out there. Also, I was having issues running something like `abctl local install --host airbyte.example.com --insecure-cookies --values values.yaml` when the values interacted with the ingress (i.e. increasing the proxy timeouts).
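For reference, the combined invocation in question (only flags already mentioned above) looks roughly like this:
```
# Everything passed together in a single run, so nothing previously set gets dropped
abctl local install \
  --host airbyte.example.com \
  --insecure-cookies \
  --values values.yaml
```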
Apart from that, the most important part of changing deployments would be copying data between databases.
You can use https://dbeaver.io, pg_dump, or whatever works for you to get the same data into both databases. Before doing that, I'd advise stopping all connections and, if any are running, waiting for them to finish.
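If you go the pg_dump route, a minimal sketch could look like the following (host names, credentials, and database name are placeholders; the built-in database would first need to be reachable, e.g. via a kubectl port-forward):
```
# Dump the current Airbyte database and restore it into the external PostgreSQL
pg_dump -h localhost -p 5432 -U airbyte -d db-airbyte -Fc -f airbyte.dump
pg_restore -h external-postgres.example.com -p 5432 -U airbyte -d db-airbyte \
  --no-owner --clean --if-exists airbyte.dump
```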