Exploring the possibility of utilizing state from an existing connection to establish a new connection in API-based extractions to avoid full sync.
Question
Can I use the state from an existing connection to create a new connection, so that I do not have to do a full sync? This is for API-based extractions.
This topic has been created from a Slack thread to give it more visibility.
It is in read-only mode here; the original discussion took place in the Slack thread.
Yes, you can use the state from an existing connection to create a new connection and avoid a full sync. This is particularly useful for incremental syncs. Here are the steps you can follow:
1. Navigate to the existing connection you are duplicating, and go to the Settings tab. Open the Advanced settings to see the connection state, which manages incremental syncs. Copy the state to your clipboard[^1^].
2. Go to your newly created connection, replace the state with the copied contents from the previous step, then click Update State. This will ensure historical data is not replicated again[^1^].
3. Enabling the connection will now provide a parallel copy of all streams in the updated format[^1^].
4. You can move your dashboards to rely on the new tables, then pause the out-of-date connection[^1^].
Please note that updates to the connection state should be handled with extreme care. Updates may break your syncs, requiring a reset to fix. Make changes only as directed by the Airbyte team[^5^].
[^1^]: Source
[^5^]: Source
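Since a malformed state can break syncs, it may be worth sanity-checking the copied text before pasting it into the new connection. The following is a minimal sketch, not part of Airbyte: the function name `streams_in_state` is hypothetical, and the per-stream layout (a JSON list of entries with `streamDescriptor` and `streamState` keys) is an assumption about what the Advanced settings show, so compare it against your own copied state first.

```python
import json
from typing import List


def streams_in_state(raw_state: str) -> List[str]:
    """Parse copied connection state and return the stream names it covers.

    Assumed layout (verify against your own state):
    [{"streamDescriptor": {"name": "..."}, "streamState": {...}}, ...]
    """
    # json.loads raises json.JSONDecodeError if the paste was truncated.
    state = json.loads(raw_state)
    if not isinstance(state, list):
        raise ValueError("expected a JSON list of per-stream state entries")

    names = []
    for entry in state:
        if not isinstance(entry, dict):
            raise ValueError(f"unexpected state entry: {entry!r}")
        descriptor = entry.get("streamDescriptor", {})
        if "name" not in descriptor or "streamState" not in entry:
            raise ValueError(f"unexpected state entry: {entry!r}")
        names.append(descriptor["name"])
    return names


if __name__ == "__main__":
    copied = (
        '[{"streamDescriptor": {"name": "users"},'
        ' "streamState": {"updated_at": "2024-01-01"}}]'
    )
    # Lists the streams that would resume incrementally in the new connection.
    print(streams_in_state(copied))
```

If a stream you expect is missing from the returned names, the new connection would have no saved cursor for it and would presumably sync that stream from the beginning.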
Hmm, I don’t know enough to give you a confident answer yet. The extracts provided do not contain specific information about Airbyte using parallel burst when extracting data using the low-code approach. However, there is an ongoing discussion about parallel data replication in Airbyte on their Github page. For more detailed information, you might want to check out the Airbyte Documentation.
Yes, Airbyte does use parallelization for data extraction and processing. This is achieved through the configuration of workers. Airbyte exposes several environment variables to change the maximum number of each type of worker allowed to run in parallel. These include:
- MAX_SPEC_WORKERS: Maximum number of Spec workers allowed to run in parallel.
- MAX_CHECK_WORKERS: Maximum number of Check connection workers allowed to run in parallel.
- MAX_DISCOVERY_WORKERS: Maximum number of Discovery workers allowed to run in parallel.
- MAX_SYNC_WORKERS: Maximum number of Sync workers allowed to run in parallel.
The current default value for each of these environment variables is 5. Tweaking these values might help you run more jobs in parallel and increase the workload your Airbyte instance can handle. In addition, there is an ongoing project around parallel data replication in Airbyte, which is focused on increasing replication performance. It covers two cases: a source with many streams where each individual stream is a manageable size, and a source that has one huge stream. For more information, you can refer to the Airbyte Documentation and the GitHub issue on parallel data replication.
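To make the lookup-with-default behaviour concrete, here is a small sketch of how the four limits resolve from an environment. It is an illustration only, not Airbyte's own code: the helper `worker_limits` is hypothetical, and the default of 5 is simply the value quoted in the answer above.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names as listed in the answer above.
WORKER_VARS = (
    "MAX_SPEC_WORKERS",
    "MAX_CHECK_WORKERS",
    "MAX_DISCOVERY_WORKERS",
    "MAX_SYNC_WORKERS",
)
DEFAULT_LIMIT = 5  # default quoted in the answer; check your Airbyte version


def worker_limits(env: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Return the effective limit per worker type (hypothetical helper)."""
    env = os.environ if env is None else env
    limits = {}
    for name in WORKER_VARS:
        raw = env.get(name)
        # Fall back to the default when the variable is unset or empty.
        limits[name] = int(raw) if raw else DEFAULT_LIMIT
    return limits


if __name__ == "__main__":
    # Example: only the sync limit is raised; the others keep the default.
    print(worker_limits({"MAX_SYNC_WORKERS": "10"}))
```

In this example only sync jobs get a higher ceiling, which is usually the variable of interest when the goal is to run more connections at once.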
Yes, Octavia CLI is still used to manage Airbyte configurations. It provides commands to import, edit, and apply Airbyte resource configurations such as sources, destinations, and connections. It is particularly useful for version controlling resource configurations, reviewing resource configuration changes, and copying resource configurations between Airbyte instances. Octavia CLI also allows you to programmatically create resources when you have more connections or your connections contain several Airbyte streams. You can define configurations in YAML programmatically, bootstrap an Octavia CLI project, import the configuration of an existing instance, create source and destination configurations in YAML, create connection configurations in YAML, edit YAML configuration files, and apply configurations to a different instance. For more details, you can check the Octavia CLI documentation and join the octavia-cli channel in the Airbyte community slack. Please note that Octavia CLI manages Airbyte Configurations as Code (CaC) and not Airbyte Infrastructure as Code (IaC). Thus, Octavia CLI cannot provision an Airbyte instance.
Sources:
- Version control Airbyte configurations with Octavia CLI
- Octavia CLI use cases
- Programmatically create resources
Based on the provided extracts, there is no explicit mention of plans to include Datadog as a destination in Airbyte. Currently, Airbyte has integrated Datadog as a source and for monitoring purposes. For monitoring, Airbyte provides extensive logs for each connector and has an integration with Datadog that allows you to monitor and analyze your data pipelines within your Datadog dashboards at no additional cost. This integration is still in development and is expected to be available only for Airbyte Enterprise customers in the future (source). As a source, the Datadog connector in Airbyte collects data from the Datadog API. It supports full refresh sync, incremental sync, and SSL connection (source). For the most accurate and up-to-date information, you may want to check the Airbyte Github page or the Airbyte Documentation.
Based on the provided extracts, there is no explicit mention of a feature in the Airbyte UI Builder that allows marking a stream for execution in parallel calls. The UI Builder is designed to help you create and configure connectors, but it doesn’t specify options for parallel execution of streams. However, there is an ongoing effort to implement a concurrency feature for API sources in Airbyte, as mentioned in the Support QA. This feature is being tested with the Stripe connector and may be extended to other connectors in the future. For the most accurate and up-to-date information, you may want to check the Airbyte Github page or the Airbyte Documentation.
Yes, parallel stream execution is possible in Airbyte. However, as of the current information provided, API sources aren’t parallel. There is ongoing work to implement a concurrency feature, which is being tested with the Stripe connector and may later be extended to other connectors. This means that in the future, multiple streams could potentially be run in parallel, increasing the efficiency of data ingestion. In terms of worker configuration, Airbyte allows you to set the maximum number of each type of worker that can run in parallel. This includes MAX_SPEC_WORKERS, MAX_CHECK_WORKERS, MAX_DISCOVERY_WORKERS, and MAX_SYNC_WORKERS. Adjusting these values can help you run more jobs in parallel and increase the workload of your Airbyte instance. For more information, you can refer to the Support QA and the Airbyte Documentation on Configuring Jobs & Workers.
Is low-code using concurrent execution? No, not yet. We will add it at some point, but there is no target date yet. We do have concurrency support in the Python CDK, though.